Table of Contents
Data exploration and pre-processing are important steps within any data science or machine learning workflow. When working on tutorial or training datasets it can be the case that they have been engineered in a way to make it easy to work with and allow the algorithm being discussed to be run successfully. However, in the real world, data is messy! It can have erroneous values, incorrect labels and parts of it can be missing.
Missing data is probably one of the most common issues when working with real datasets. Data can be missing for a multitude of reasons, including sensor failure, data vintage, improper data management, and even human error. Missing data can occur as single values, multiple values within one feature, or entire features may be missing.
It is important that missing data is identified and handled appropriately prior to further data analysis or machine learning. Many machine learning algorithms can’t handle missing data and require entire rows, where a single missing value is present, to be deleted or replaced (imputed) with a new value.
Depending on the source of the data missing values may be represented in different ways. The most common is NaN (Not a Number), however, other variations can include “NA”, “None”, “-999”, “0”, “ ”, “-”. If the missing data is represented by something other than NaN in your dataframe, then it should be converted to NaN using np.NaN as seen below.
The missingno Library
Missingno is an excellent and simple to use Python library that provides a series of visualisations to understand the presence and distribution of missing data within a pandas dataframe. This can be in the form of either a barplot, matrix plot, heatmap, or a dendrogram. The original publication for the library can be found here.
From these plots, we can identify where missing values occur, the extent of the missingness and whether any of the missing values are correlated with each other. Often, missing values may be seen as not contributing any information, but if analysed closely there may be an underlying story.
The missingno library can be installed using the pip command:
pip install missingno
For this tutorial, we will use a subset of the publicly available dataset from the Machine Learning competition run by Xeek and FORCE 2020. The objective of the competition was to predict lithology from existing labeled data. The dataset consists of 118 wells from the Norwegian Sea.
The data contains a series of electrical measurements that have been acquired by well logging tools. The measurements are used to characterise the subsurface geology and identify suitable hydrocarbon reservoirs.
The data and notebook for this article can be found on my GitHub repository at https://github.com/andymcdgeo/missingno_tutorial
Importing Libraries and Loading the Data
import pandas as pd import missingno as msno df = pd.read_csv('xeek_train_subset.csv')
Quick Analysis with Pandas
Before we use the missingno library, there are a few features within the pandas library that can give us an initial insight into how much missing data.
The first is using the
.describe() method. This returns a table containing summary statistics about the dataframe such as the mean, maximum and minimum values. At the top of the table is a row called counts. In the example below, we can see that we having varying counts for each of the features within the dataframe. This provides an initial indication that not all values are present.
We can take this one step further and use the
.info() method. This will return back a summary of the dataframe as well as a count of the non-null values.
We can see from the example above that we have a more concise summary of the state of the data and the extent of data missingness.
Another quick method we can use is:
This returns a summary of how many missing values are contained within the dataframe. The isna() part detects missing values within the dataframe and returns a Boolean value for each element in the dataframe. The sum() part sums up the number of True values.
The following information is returned by this line.
From this summary, we can see that a number of columns, namely WELL, DEPTH_MD, GROUP, GR, and LITHOFACIES have no null values. All others have a large and varying degree of missing values.
Using missingno to Identify Missing Data
Within the missingno library, there are four types of plots for visualising data completeness: the barplot, the matrix plot, the heatmap, and the dendrogram plot. Each has its own advantages for identifying missing data.
Let’s take a look at each of these in turn.
The barplot provides a simple plot where each bar represents a column within the dataframe. The height of the bar indicates how complete that column is, i.e, how many non-null values are present. It can be generated by calling upon:
On the left side of the plot, the y-axis scale ranges from 0.0 to 1.0, where 1.0 represents 100% data completeness. If the bar is less than this, it indicates that we have missing values within that column.
On the right side of the plot, the scale is measured in index values. With the top right representing the maximum number of rows within the dataframe.
Along the top of the plot, there are a series of numbers that represent the total count of the non-null values within that column.
In this example we can see that a number of the columns (DTS, DCAL and RSHA) have a large amount of missing values. Other columns (e.g. WELL, DEPTH_MD and GR) are complete and have the maximum number of values.
The matrix plot is a great tool if you are working with depth-related data or time-series data. It provides a colour fill for each column. When data is present, the plot is shaded in grey (or your colour of choice), and when it is absent the plot is displayed in white.
The matrix plot can be generated by calling upon:
As seen in the resultant plot, the columns DTS, DCAL, and RSHA show large portions of missing data. This was identified in the bar plot, but the added benefit is you can view how that missing data is distributed in the dataframe.
On the right side of the plot is a sparkline that ranges from 0 on the left to the total number of columns in the dataframe on the right. A closeup can be seen above. When a row has a value in each column, the line will be at the maximum right position. As missing values start to increase within that row the line will move towards the left.
The heatmap is used to identify correlations of the nullity between each of the different columns. In other words, it can be used to identify if there is a relationship in the presence of null values between each of the columns.
Values close to positive 1 indicate that the presence of null values in one column is correlated with the presence of null values in another column.
Values close to negative 1 indicate that the presence of null values in one column is anti-correlated with the presence of null values in another column. In other words, when null values are present in one column, there are data values present in the other column, and vice versa.
Values close to 0, indicate there is little to no relationship between the presence of null values in one column compared to another.
There are a number of values that show as <-1. This indicates that the correlation is very close to being 100% negative.
The heatmap can be generated by the following code:
Here we can see that the ROP column is slightly negatively correlated with the RHOB, NPHI and PEF columns, and slightly positively correlated with RSHA. If we take a look at DRHO, its absence is highly correlated with missing values in the RHOB, NPHI and PEF columns.
The heatmap approach is more suitable for smaller datasets.
The dendrogram plot provides a tree-like graph generated through hierarchical clustering and groups together columns that have strong correlations in nullity.
If a number of columns are grouped together at level zero, then the presence of nulls in one of those columns is directly related to the presence or absence of nulls in the others columns. The more separated the columns in the tree, the less likely the null values can be correlated between the columns.
The dendrogram can be generated by:
In the dendrogram plot above, we can see we have two distinct groups. The first is on the right side (DTS, RSHA, and DCAL) which all have a high degree of null values. The second is on the left, with the remainder of the columns which are more complete.
LITHOFACIES, GR, GROUP, WELL, and DEPTH_MD are all grouped together at zero indicating that they are complete.
RDEP, Z_LOC, X_LOC, and Y_LOC are grouped together close to zero. RMED is in the same larger branch suggesting that some of the missing values present within that column can be correlated with these four columns.
Identifying missing prior to applying machine learning is a key component of the data quality workflow. This can be achieved using the missingno library and a series of visualisations to understand how much missing data is present, where it occurs, and how the occurrence of missing values is related between the different data columns.
Thanks for reading!
If you have found this article useful, please feel free to check out my other articles looking at various aspects of Python and well log data. You can also find my code used in this article and others at GitHub.
Interested in learning more about python and well log data or petrophysics? Follow me on Medium.
If you have enjoyed this article or any others and want to show your appreciation you are welcome to Buy Me a Coffee
Bilogur, (2018). Missingno: a missing data visualization suite. Journal of Open Source Software, 3(22), 547, https://doi.org/10.21105/joss.00547
Bormann, Peter, Aursand, Peder, Dilib, Fahad, Manral, Surrender, & Dischington, Peter. (2020). FORCE 2020 Well well log and lithofacies dataset for machine learning competition [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4351156