Exploratory Data Analysis (EDA) is an essential step in data science that helps you understand a dataset and derive insights from it. Python is a popular language for data science, and it offers several libraries for EDA. In this article, we will explore how to perform EDA with Pandas, Matplotlib, and Seaborn.
Importing the Libraries
The first step is to import the necessary libraries. We will be using the following libraries for Exploratory Data Analysis (EDA):
- Pandas: for data manipulation and analysis.
- Matplotlib: for data visualization.
- Seaborn: for statistical data visualization.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Note: Usage of Pandas is covered in detail in the following blog: link.
Loading the Dataset
Next, we need to load the dataset that we want to explore. For this example, we will be using the ‘tips’ dataset from the Seaborn library. The ‘tips’ dataset contains information about the tips given by customers in a restaurant.
tips = sns.load_dataset('tips')
Understanding the Dataset
Before we start exploring the dataset, it’s important to understand its structure and contents. We can do this with the following DataFrame methods and attributes:
- head(): displays the first five rows of the dataset.
- shape: gives the number of rows and columns in the dataset (an attribute, so it is used without parentheses).
- describe(): provides summary statistics for each numerical column in the dataset.
- info(): displays the data type and non-null count for each column in the dataset.
# Display the first five rows of the dataset
print(tips.head())
Output:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
# Display the number of rows and columns in the dataset
print(tips.shape)
Output:
(244, 7)
# Display the summary statistics for each numerical column in the dataset
print(tips.describe())
Output:
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000
# Display the data type and non-null count for each column in the dataset
# info() prints its report directly and returns None, so it is called without print()
tips.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
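Two further checks are often useful at this stage: counting missing values per column and counting how often each category occurs. The following is a minimal sketch rather than part of the original walkthrough. So that it runs without downloading anything, it rebuilds only the five rows shown in the head() output; the name `sample` is introduced here purely for illustration, and with the full dataset you would apply the same calls to `tips`.

```python
import pandas as pd

# The five rows printed by tips.head(), rebuilt by hand so the example
# runs without downloading the full dataset (illustrative subset only).
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "No", "No", "No"],
    "day": ["Sun", "Sun", "Sun", "Sun", "Sun"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 3, 3, 2, 4],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# How often each category occurs in a categorical column
sex_counts = sample["sex"].value_counts()
print(sex_counts)
```

On the full dataset, info() already reports 244 non-null values in every column, so the missing-value count would be zero throughout; the check matters more for datasets that are less clean.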
Visualizing the Dataset
Next, we can start visualizing the dataset to see how its variables are distributed and how they relate to one another. We can use the following plots for visualization:
- Histograms: to visualize the distribution of a numerical variable.
- Box plots: to visualize the distribution of a numerical variable across different categories.
- Scatter plots: to visualize the relationship between two numerical variables.
# Histogram of the 'total_bill' variable
sns.histplot(tips['total_bill'], bins=10)
plt.show()
Output:

# Box plot of the 'total_bill' variable by 'day'
sns.boxplot(x='day', y='total_bill', data=tips)
plt.show()
Output:

# Scatter plot of the 'total_bill' and 'tip' variables
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.show()
Output:

Analyzing Relationships
After visualizing the dataset, we can start analyzing the relationships between variables. We can use the following methods for analysis:
- Correlation: to measure the strength of the linear relationship between two numerical variables.
- Crosstab: to analyze the relationship between two categorical variables.
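# Correlation matrix of the numerical variables
# numeric_only=True restricts the calculation to numerical columns
# (required in pandas 2.0 and later, where it is no longer the default)
print(tips.corr(numeric_only=True))
Output:
            total_bill       tip      size
total_bill    1.000000  0.675734  0.598315
tip           0.675734  1.000000  0.489299
size          0.598315  0.489299  1.000000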
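To make the correlation coefficient less of a black box, the sketch below computes Pearson's r by hand and compares it with the value pandas returns. It is an illustration under stated assumptions: it uses only the five rows from the head() output (the `sample` DataFrame is introduced here for the example), so the resulting number will differ from the 0.675734 reported for the full dataset.

```python
import numpy as np
import pandas as pd

# Illustrative subset: 'total_bill' and 'tip' from the five rows of tips.head()
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})

x = sample["total_bill"].to_numpy()
y = sample["tip"].to_numpy()

# Pearson's r: the sum of products of the centered values, divided by the
# square root of the product of their sums of squares.
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same coefficient as computed by pandas (Pearson is the default method)
r_pandas = sample["total_bill"].corr(sample["tip"])

print(r_manual, r_pandas)
```

The two values agree up to floating-point rounding. A coefficient near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.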
# Crosstab of the 'sex' and 'smoker' variables
print(pd.crosstab(tips['sex'], tips['smoker']))
Output:
smoker Yes No
sex
Male 60 97
Female 33 54
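Raw counts are hard to compare when the groups differ in size (157 male and 87 female customers here), so it can help to convert each row to proportions. The following sketch rebuilds the counts from the crosstab output above, so it runs without the dataset; the names `counts` and `row_shares` are introduced for illustration. With the full dataset, passing `normalize='index'` to pd.crosstab should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the crosstab output above
counts = pd.DataFrame(
    {"Yes": [60, 33], "No": [97, 54]},
    index=pd.Index(["Male", "Female"], name="sex"),
)
counts.columns.name = "smoker"

# Divide each row by its total to get the share of smokers within each sex
row_shares = counts.div(counts.sum(axis=1), axis=0)
print(row_shares.round(3))
```

Seen this way, the share of smokers is roughly 38% for both male and female customers, which the raw counts alone do not make obvious.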