Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an essential step in data science: it helps you understand the structure and contents of a dataset and derive insights from it. Python is a popular language for data analysis, and it offers several libraries for EDA. In this article, we will perform EDA in Python using Pandas, Matplotlib, and Seaborn.

Importing the Libraries

The first step is to import the necessary libraries. We will be using the following:

  • Pandas: for data manipulation and analysis.
  • Matplotlib: for data visualization.
  • Seaborn: for statistical data visualization.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Note:

The usage of Pandas is covered in detail in the following blog: link.

Loading the Dataset

Next, we need to load the dataset that we want to explore. For this example, we will be using the ‘tips’ dataset from the Seaborn library. The ‘tips’ dataset contains information about the tips given by customers in a restaurant.

tips = sns.load_dataset('tips')

Understanding the Dataset

Before we start exploring the dataset, it’s important to understand its structure and contents. We can do this by using the following methods:

  • head(): displays the first five rows of the dataset.
  • shape: displays the number of rows and columns in the dataset.
  • describe(): provides summary statistics for each numerical column in the dataset.
  • info(): displays the data type and non-null count for each column in the dataset.

# Display the first five rows of the dataset
print(tips.head())

Output:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

# Display the number of rows and columns in the dataset
print(tips.shape)

Output:
(244, 7)

# Display the summary statistics for each numerical column in the dataset
print(tips.describe())

Output:
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000

# Display the data type and non-null count for each column in the dataset
# (info() prints its report directly and returns None, so print() is not needed)
tips.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
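The four methods above focus on shape and numerical summaries. As a complementary sketch, we can also count missing values and look at the categorical columns. To keep the example runnable without a download, it rebuilds only the five rows shown in the head() output; the names sample, missing_per_column, categorical and sex_counts are ours, chosen for illustration.

```python
import pandas as pd

# First five rows of 'tips', copied from the head() output above,
# so this sketch runs without downloading the full dataset.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": pd.Categorical(["Female", "Male", "Male", "Male", "Female"]),
    "smoker": pd.Categorical(["No"] * 5),
    "day": pd.Categorical(["Sun"] * 5),
    "time": pd.Categorical(["Dinner"] * 5),
    "size": [2, 3, 3, 2, 4],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of distinct values in each categorical column
categorical = sample.select_dtypes(include="category")
print(categorical.nunique())

# How often each category occurs in a single column
sex_counts = sample["sex"].value_counts()
print(sex_counts)
```

On the full dataset, replacing sample with tips should work the same way; the info() output already tells us that every column has 244 non-null values, so the missing-value counts would all be zero.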

Visualizing the Dataset

Next, we can start visualizing the dataset to gain a better understanding of its distribution and relationships between variables. We can use the following plots for visualization:

  • Histograms: to visualize the distribution of a numerical variable.
  • Box plots: to visualize the distribution of a numerical variable across different categories.
  • Scatter plots: to visualize the relationship between two numerical variables.

# Histogram of the 'total_bill' variable
sns.histplot(tips['total_bill'], bins=10)
plt.show()
plt.show()

Output: [Figure: histogram of total_bill with 10 bins]
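A histogram is just a table of counts per bin, so it can help to look at those counts as numbers too. The sketch below uses NumPy's histogram function with ten equal-width bins between the minimum and maximum, which should mirror what bins=10 does in the plot. It runs on the five total_bill values from the head() output only, so the counts are illustrative rather than the real distribution.

```python
import numpy as np
import pandas as pd

# The five 'total_bill' values shown in the head() output
bills = pd.Series([16.99, 10.34, 21.01, 23.68, 24.59], name="total_bill")

# Ten equal-width bins spanning the observed minimum and maximum
counts, edges = np.histogram(bills, bins=10)

# Print each bin's range next to the number of values that fall in it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count}")
```

For the full dataset, the describe() output already hints at the shape we should expect: the mean of total_bill (19.79) is above the median (17.80), which is consistent with a longer tail of large bills on the right.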

# Box plot of the 'total_bill' variable by 'day'
sns.boxplot(x='day', y='total_bill', data=tips)
plt.show()

Output: [Figure: box plot of total_bill by day]
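By default, a box plot draws the box from the first to the third quartile and marks points lying more than 1.5 times the interquartile range (IQR) beyond the box as outliers. The following sketch reproduces that rule per group. The helper iqr_fences and the demo data are hypothetical, invented for illustration, and are not taken from the 'tips' dataset.

```python
import pandas as pd

def iqr_fences(values, whis=1.5):
    """Return the (lower, upper) limits beyond which a point counts as an outlier."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# Hypothetical bills, invented for illustration (not taken from 'tips')
demo = pd.DataFrame({
    "day": ["Sat"] * 5 + ["Sun"] * 5,
    "total_bill": [10, 12, 14, 16, 60, 15, 17, 19, 21, 23],
})

# Apply the rule separately to each day, as the box plot does
outliers_by_day = {}
for day, bills in demo.groupby("day")["total_bill"]:
    low, high = iqr_fences(bills)
    outliers_by_day[day] = bills[(bills < low) | (bills > high)].tolist()

print(outliers_by_day)
```

Applied to the pooled quartiles from describe() (13.3475 and 24.1275), the upper limit is 24.1275 + 1.5 × 10.78 = 40.2975, so the maximum bill of 50.81 would be flagged. The per-day limits in the plot will differ, since each day has its own quartiles.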

# Scatter plot of the 'total_bill' and 'tip' variables
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.show()

Output: [Figure: scatter plot of tip against total_bill]
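One simple way to summarize what a scatter plot shows is to fit a straight line through the points. The sketch below does this with NumPy's polyfit on the five rows from the head() output; it is an illustration of the idea under that small-sample assumption, not a fitted model for the full dataset, and the variable names are ours.

```python
import numpy as np

# 'total_bill' and 'tip' values from the five rows shown in the head() output
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

# Least-squares straight line: tip ≈ slope * total_bill + intercept
slope, intercept = np.polyfit(total_bill, tip, 1)

print(f"slope: {slope:.3f}, intercept: {intercept:.3f}")
```

A positive slope means that larger bills tend to come with larger tips. To draw the fitted line on top of the scatter plot, Seaborn's regplot accepts the same x, y and data arguments as scatterplot.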

Analyzing Relationships

After visualizing the dataset, we can start analyzing the relationships between variables. We can use the following methods for analysis:

  • Correlation: to measure the strength of the linear relationship between two numerical variables.
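  • Crosstab: to analyze the relationship between two categorical variables.

# Correlation matrix of the numerical variables
# (numeric_only=True skips the categorical columns, which pandas 2.0+ would otherwise reject)
print(tips.corr(numeric_only=True))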

Output:
            total_bill       tip      size
total_bill    1.000000  0.675734  0.598315
tip           0.675734  1.000000  0.489299
size          0.598315  0.489299  1.000000
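The values in this matrix are Pearson correlation coefficients, which is the default method in pandas. To make the number less of a black box, the sketch below computes the coefficient by hand and compares it with the pandas result. It uses only the five rows from the head() output, so the value it prints will not match the 0.675734 reported for the full dataset.

```python
import numpy as np
import pandas as pd

# 'total_bill' and 'tip' from the five rows shown in the head() output
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})

x = sample["total_bill"].to_numpy()
y = sample["tip"].to_numpy()

# Deviations of each value from its column mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: the sum of products of deviations, scaled by both spreads
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as computed by pandas
r_pandas = sample["total_bill"].corr(sample["tip"])

print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1. Values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little linear relationship.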

# Crosstab of the 'sex' and 'smoker' variables
print(pd.crosstab(tips['sex'], tips['smoker']))

Output:
smoker  Yes  No
sex            
Male     60  97
Female   33  54
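Raw counts are hard to compare when the groups differ in size (157 male and 87 female customers here), so it is often useful to convert each row to proportions. The sketch below rebuilds the table from the counts printed above and divides each row by its total; the names counts and row_share are ours.

```python
import pandas as pd

# Counts copied from the crosstab output above
counts = pd.DataFrame(
    {"Yes": [60, 33], "No": [97, 54]},
    index=pd.Index(["Male", "Female"], name="sex"),
)
counts.columns.name = "smoker"

# Divide every row by its own total to get the share of smokers per sex
row_share = counts.div(counts.sum(axis=1), axis=0)

print(row_share.round(3))
```

With the full dataset loaded, pd.crosstab(tips['sex'], tips['smoker'], normalize='index') should give the same proportions in one step. Here the share of smokers is almost identical for both sexes (about 38%), which suggests little association between the two variables.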
