Fast and effective EDA with the Pandas Profiling Library
Exploratory Data Analysis (EDA) is an essential part of the data science and machine learning workflow. It allows us to become familiar with our data by exploring it from multiple angles, through statistics, data visualisations, and data summaries. This helps us discover patterns in the data, spot outliers, and gain a solid understanding of the data we are working with.
In this short Python EDA tutorial, we will cover the use of an excellent Python library called Pandas Profiling. This library helps us carry out fast and automatic EDA on our dataset with minimal lines of code.
Within this article we will cover:
- What is Exploratory Data Analysis (EDA)?
- What is the Python Pandas Profiling library?
- How to use the Pandas Profiling library for Exploratory Data Analysis
There is also a video version of this tutorial on my YouTube channel.
What is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis, EDA for short, is simply a ‘first look at the data’. It forms a critical part of the machine learning workflow, and it is at this stage that we start to understand the data we are working with and what it contains. In essence, it allows us to make sense of the data before applying advanced analytics and machine learning.
During EDA we can begin to identify patterns within the data, understand relationships between features, identify possible outliers that may exist within the dataset and identify if we have missing values. Once we have gained an understanding of the data, we can then check whether further processing or data cleaning is necessary.
When working with Python, or if you are working through a Python training course, you will typically carry out EDA on your data using pandas and matplotlib. Pandas has a number of functions, including df.info(), which help summarise the dataset, and matplotlib has a number of plots, such as bar plots, scatter plots and histograms, to allow us to visualise our data.
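As a rough illustration of this manual approach, the sketch below runs a few common pandas summary calls on a small made-up dataframe. The column names and values are invented for the example and do not come from the tutorial dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({
    "depth": [1000.0, 1000.5, 1001.0, 1001.5, 1002.0],
    "gamma": [55.2, 60.1, np.nan, 72.4, 68.9],
    "lithology": ["shale", "shale", "sand", "sand", None],
})

# Column names, dtypes and non-null counts
df_demo.info()

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
summary = df_demo.describe()
print(summary)

# Number of missing values in each column
missing_per_column = df_demo.isna().sum()
print(missing_per_column)
```

Each call answers one question at a time, which is why this approach becomes slow on wide or messy datasets.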
When working with machine learning or data science training datasets, the above methods may be satisfactory, as much of the data has already been cleaned and engineered to make it easier to work with. Real-world datasets, however, are often dirty and require cleaning, and checking them using the methods above can be time-consuming. This is where auto EDA can come to the rescue and help us speed up this part of the workflow without compromising on quality.
What is the Pandas Profiling Python Library?
Pandas Profiling is a Python library that allows us to generate a very detailed report on a pandas dataframe without much input from the user.
According to PyPI Stats, the library has over 1,000,000 downloads each month, which suggests it is a very popular library within data science.
Installing Pandas Profiling
To install Pandas Profiling you can use the following commands:
If using pip:
pip install pandas-profiling
If using Anaconda’s Conda Package Manager:
conda create -n pandas-profiling
conda activate pandas-profiling
conda install -c conda-forge pandas-profiling
The dataset we are using for this tutorial comes from the Australian Government’s National Offshore Petroleum Information Management System (NOPIMS).
It contains a series of well log measurements that have been acquired by scientific instruments that are used to evaluate and characterise the geology and petrophysical nature of the subsurface.
Do not worry about the nature of the data as the techniques described below can be applied to any dataset.
The first step is to import the libraries we are going to be working with (Pandas and Pandas Profiling) like so:
import pandas as pd
from pandas_profiling import ProfileReport
Loading the Dataset
Next, we load in the data we are going to explore. In this case our data is stored within a CSV file, which needs to be read in using pd.read_csv like so:
df = pd.read_csv('data/NOPIMS_Australia/Ironbank-1.csv', na_values=-999)
As our data contains null / missing values represented by -999, we can tell pandas to set these values to Not a Number (NaN).
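To see what the na_values argument does, here is a minimal sketch that reads the same CSV text twice, once without and once with na_values=-999. The CSV content and column names are made up for the example and stand in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file; -999 marks missing readings
csv_text = "DEPTH,GR,RHOB\n100.0,45.3,2.45\n100.5,-999,2.47\n101.0,50.1,-999\n"

# Without na_values, -999 is read as an ordinary number
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=-999, those entries are converted to NaN
clean = pd.read_csv(io.StringIO(csv_text), na_values=-999)

print(raw.isna().sum().sum())    # total missing values without na_values
print(clean.isna().sum().sum())  # total missing values with na_values=-999
print(clean)
```

Leaving the placeholder in place would distort statistics such as the mean and minimum, so it is worth handling at load time.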
Running Pandas Profiling
To generate the report, we first create a variable called report and assign ProfileReport() to it. Within the parentheses we pass in the dataframe, in this case df. We can then call upon report and begin the process.
report = ProfileReport(df)
When we run this cell, the report process is kicked off and all of the data within the dataframe is analysed.
The length of time this takes depends on the size of your data, so larger datasets will take longer to complete.
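If the run time becomes a problem, one option is to profile a random sample of the rows rather than the full dataframe. The sketch below is one possible way to do that, not part of the original workflow: the helper names sample_for_profiling and build_quick_report, the 10,000-row limit and the demo data are all assumptions made for illustration. The minimal=True argument is a pandas-profiling option intended to switch off the more expensive computations; check that your installed version supports it.

```python
import pandas as pd


def sample_for_profiling(df: pd.DataFrame, max_rows: int = 10_000, seed: int = 42) -> pd.DataFrame:
    """Return df unchanged if it is small, otherwise a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def build_quick_report(df: pd.DataFrame, max_rows: int = 10_000):
    """Profile a sample of df. Assumes pandas-profiling is installed."""
    # Imported here so the sampling helper can be used without the library
    from pandas_profiling import ProfileReport

    return ProfileReport(sample_for_profiling(df, max_rows), minimal=True)


# Example with made-up data: 50,000 rows reduced to 10,000 before profiling
big = pd.DataFrame({"value": range(50_000)})
small = sample_for_profiling(big)
print(len(big), len(small))
```

A report built this way can then be written to disk with the report's to_file method, for example build_quick_report(df).to_file("report.html"), where the file name is only an example. Bear in mind that statistics from a sample are estimates, so rare values and outliers may be missed.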
Understanding Pandas Profiling Results
The overview section contains three tabs: Overview, Warnings and Reproduction.
The Overview tab provides statistical information about your dataset including the number of variables (columns in the dataframe), number of observations (total number of rows), how many values are missing along with the percentage, how many duplicates there are, and the file size.
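To get a feel for where these numbers come from, the following sketch computes similar figures by hand with pandas on a made-up dataframe. It is only an approximation of what the report shows, not how the library calculates it, and the dictionary keys are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one duplicated row and two missing cells
df_demo = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, np.nan],
    "b": ["x", "y", "y", None],
})

n_rows, n_cols = df_demo.shape
missing_cells = int(df_demo.isna().sum().sum())

overview = {
    "variables": n_cols,                                          # columns
    "observations": n_rows,                                       # rows
    "missing_cells": missing_cells,
    "missing_cells_pct": 100 * missing_cells / (n_rows * n_cols),
    "duplicate_rows": int(df_demo.duplicated().sum()),
    "memory_bytes": int(df_demo.memory_usage(deep=True).sum()),   # size in memory
}
print(overview)
```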
Within the variables section of the report, we can view the detailed statistics of each of the columns contained within the dataframe. This includes how many missing values there are, the statistics of the data (mean, minimum and maximum), and more.
On the right-hand side of each section, we can see a histogram of the data distribution. This gives us an indication of the skewness of the data, as well as its spread.
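For a single column, roughly the same information can be pulled out manually. The sketch below uses an invented, right-skewed series to show the summary statistics, the skewness and the bin counts that sit behind a histogram; the choice of five bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values: many small values and one large one
values = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 5, 20], name="demo")

# Count, mean, std, min, quartiles and max
print(values.describe())

# Skewness: positive when the distribution has a long right-hand tail
skewness = values.skew()
print(skewness)

# Bin counts and bin edges behind a 5-bin histogram
counts, bin_edges = np.histogram(values, bins=5)
print(counts)
print(bin_edges)
```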
The interactions section of the report allows you to plot one variable against another in order to understand how they relate to each other.
The correlations section allows us to understand the degree to which two variables are correlated with one another. Within the pandas-profiling report, we can view different methods of correlation:
- Spearman’s ρ
- Pearson’s r
- Kendall’s τ
- Phik (φk)
If you are unsure what each method is, you can click on the button “Toggle Correlation Descriptions” and it will provide details of the meaning of each method.
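To make the difference between two of these methods concrete, here is a small sketch using the pandas corr function on made-up data where y always increases with x, but not in a straight line. Pearson’s r measures the linear relationship, whereas Spearman’s ρ works on ranks and so captures any monotonic relationship.

```python
import pandas as pd

# Made-up data: y increases with x, but not linearly
df_demo = pd.DataFrame({"x": range(1, 11)})
df_demo["y"] = df_demo["x"] ** 3

pearson = df_demo.corr(method="pearson")    # strength of the linear relationship
spearman = df_demo.corr(method="spearman")  # rank-based, monotonic relationship

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```

Here the Spearman value is a perfect correlation because the ranks of x and y match exactly, while the Pearson value is lower because the relationship is curved. pandas also accepts method="kendall", which as far as I am aware relies on SciPy being installed. Phik is not built into pandas and comes from a separate package.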
In the missing values section, we can also gain a good understanding of how complete our dataset is. This is similar to the functionality provided by the missingno Python library.
We can view the data using four types of plot:
- The count plot provides a count of the total values present.
- The matrix plot gives an indication of where the missing values are within the dataframe.
- The heatmap plot gives us an indication of how the null values correlate between variables.
- The dendrogram is a tree-like graph, which shows how strongly null values are correlated between the variables. Groups that are closer together indicate a strong correlation in nullity.
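The numbers behind the count and heatmap plots can be approximated with plain pandas. The sketch below is a simplified stand-in for what the report draws, using a made-up dataframe in which two columns are always missing together.

```python
import numpy as np
import pandas as pd

# Made-up data: "a" and "b" are always missing together, "c" is complete
df_demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [10.0, np.nan, 30.0, np.nan, 50.0],
    "c": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Count plot equivalent: number of values present in each column
present_counts = df_demo.notna().sum()
print(present_counts)

# Heatmap equivalent: correlation between the missing-value patterns of columns.
# Columns that are fully present or fully missing have no variation, so drop them first.
is_missing = df_demo.isna()
varying = is_missing.loc[:, is_missing.any() & ~is_missing.all()]
nullity_corr = varying.astype(int).corr()
print(nullity_corr)
```

A nullity correlation close to 1 means that when one column is missing the other tends to be missing too, which often points to a shared cause such as a tool that was not run over a particular interval.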
Finally, the sample section allows us to view the raw values of the dataset for the first 10 rows and last 10 rows. This is the equivalent of running df.head(10) and df.tail(10).
The pandas-profiling Python library is a great tool for quickly analysing your dataset without the need to spend significant time remembering and writing code with pandas and matplotlib. Definitely check it out for your next project.