Enhance Your Plotly Express Scatter Plot With Marginal Plots
Scatter plots are a commonly used data visualisation tool within data science. They allow us to plot two numerical variables, as points, on a two dimensional graph. From these plots, we can understand if there is a relationship between the two variables, and what the strength of that relationship is.
Within this short tutorial, we are going to use the excellent Plotly library to visualise a data set, and we are going to see how to add marginal plots along the x and y axes to enhance our visualisation and understanding of the data.
I have previously covered creating scatter plots with Plotly and matplotlib in the following articles:
- Creating Scatter Plots of Well Log Data Using matplotlib in Python
- Using Plotly Express to Create Interactive Scatter Plots
Part of this tutorial is also covered in my Plotly Scatter Plots video.
The Plotly Library
Plotly is a web-based toolkit that is used to generate powerful and interactive data visualisations. Plots can be generated with very few lines of code. It is a popular library that supports a wide range of chart types, including statistical and financial charts, maps, machine learning visualisations, and much more.
The Plotly library can be used in two main ways:
- Plotly Graph Objects, which is a low-level interface for creating figures, traces, and layouts
- Plotly Express, which is a high-level wrapper around Plotly Graph Objects. Plotly Express allows users to write much simpler code to generate the same plot.
Creating a Scatter Plot With Plotly Express
Loading Libraries & Data
The first step is to import pandas, which will be used for loading our data, and plotly.express for viewing the data.
import pandas as pd
import plotly.express as px
Once the libraries have been imported, we can import our data.
The dataset we will be using for this article comes from a Machine Learning competition for lithology prediction that was run by Xeek and FORCE (https://xeek.ai/challenges/force-well-logs/overview). The objective of the competition was to predict lithology (lithofacies) from log measurements, using a dataset consisting of 98 training wells, each with varying degrees of log completeness. To download the file, navigate to the Data section of the link above. The original data source can be downloaded at: https://github.com/bolgebrygg/Force-2020-Machine-Learning-competition
df = pd.read_csv('xeek_subset_example.csv')
We can then call df to view the first five and last five rows of the dataframe.
What we get back is a preview of the dataframe. Our dataset contains two well log measurements (RHOB, the bulk density, and NPHI, the neutron porosity), a depth curve and a geologically interpreted lithology (LITH).
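Before plotting, it can help to check the data a little more closely. The sketch below uses a small stand-in dataframe, because the tutorial file is not bundled here: the values, the lithology names and the DEPTH column name are all assumptions made for illustration.

```python
import pandas as pd

# Stand-in rows with the same kind of columns as the tutorial file.
# Values, lithology names and the DEPTH column name are assumptions.
sample = pd.DataFrame({
    "DEPTH": [1500.0, 1500.5, 1501.0, 1501.5],
    "RHOB": [2.45, 2.50, None, 2.31],
    "NPHI": [0.28, 0.25, 0.30, None],
    "LITH": ["Shale", "Shale", "Sandstone", "Sandstone"],
})

first_rows = sample.head(2)                       # first two rows
last_rows = sample.tail(2)                        # last two rows
missing = sample[["RHOB", "NPHI"]].isna().sum()   # missing values per curve
lith_counts = sample["LITH"].value_counts()       # samples per lithology

print(first_rows)
print(missing)
print(lith_counts)
```

Knowing how many samples are missing a value, and how many samples each lithology has, makes it easier to judge how representative the scatter plot is.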
Creating the Scatter Plot
Creating scatter plots with Plotly Express is very simple: we specify the dataframe and the columns we want to plot. Passing range_y=[3, 1] reverses the y-axis, so that bulk density increases downwards as is common on density-neutron crossplots.
px.scatter(data_frame=df, x='NPHI', y='RHOB', range_x=[0, 1], range_y=[3, 1], color='LITH')
This returns a scatter plot coloured by lithology. At the moment it looks a little messy as many of the lithologies have overlapping values. This occurs because the interpreted lithology would have been derived from a number of different logging measurements and cuttings descriptions, not just these two curves.
Individual LITH groups can be hidden by clicking on the name of the LITH in the legend.
Adding Marginal Plots to a Plotly Express Scatter Plot
Marginal plots are mini plots that can be attached to the margins of the x and y axes. There are four different types of marginal plots available within Plotly Express: box plots, rug plots, histograms and violin plots.
Box Plots
A boxplot is a standardised graphical way to display the distribution of data based on five key numbers: the “minimum”, the 1st quartile (Q1, 25th percentile), the median (2nd quartile, 50th percentile), the 3rd quartile (Q3, 75th percentile), and the “maximum”. The “minimum” and “maximum” here are not the extremes of the data but limits set at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR respectively, where IQR = Q3 - Q1 is the interquartile range; the whiskers extend to the most extreme data points inside these limits. Any points that fall outside of these limits are referred to as outliers.
Marginal boxplots can be added to a single axis like so:
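The numbers behind a boxplot can be worked out by hand. The sketch below uses only the Python standard library and invented density values; note that different tools use slightly different quartile conventions, so the exact quartiles Plotly draws may differ a little from these.

```python
import statistics

def box_summary(values):
    """Return quartiles, the 1.5 * IQR limits, whisker ends and outliers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [v for v in data if lower_limit <= v <= upper_limit]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0],     # most extreme point inside the limits
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_limit or v > upper_limit],
    }

# Invented bulk density values (g/cc); 1.20 is a deliberately bad reading
rhob = [2.20, 2.30, 2.35, 2.40, 2.45, 2.50, 2.55, 2.60, 1.20]
summary = box_summary(rhob)
print(summary)
```

Here the single low reading falls below the lower limit and is flagged as an outlier, while the whiskers stop at the lowest and highest values that remain inside the limits.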
px.scatter(data_frame=df, x='NPHI', y='RHOB', range_x=[0, 1], range_y=[3, 1], color='LITH',
           marginal_y='box')
Or to both axes by specifying values for both the marginal_y and marginal_x keyword arguments.
px.scatter(data_frame=df, x='NPHI', y='RHOB', range_x=[0, 1], range_y=[3, 1], color='LITH',
           marginal_y='box', marginal_x='box')
Rug Plot
Rug plots draw a small tick along the axis for every data point, which shows where the values are concentrated and where they are sparse. They can be added as follows:
px.scatter(data_frame=df, x='NPHI', y='RHOB', range_x=[0, 1], range_y=[3, 1], color='LITH',
           marginal_y='rug', marginal_x='rug')
Histograms
Histograms are an excellent data visualisation tool and appear similar to bar charts. However, histograms allow us to gain insights about the distribution of the values within a set of data and allow us to display a large range of data in a concise plot. Within the petrophysics and geoscience domains, we can use histograms to identify outliers and also to pick key interpretation parameters, for example clay volume or shale volume end points from a gamma ray log.
To change the marginal plots to histograms, we pass 'histogram' to both arguments:
px.scatter(data_frame=df, x='NPHI', y='RHOB', range_x=[0, 1], range_y=[3, 1], color='LITH',
           marginal_y='histogram', marginal_x='histogram')
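Under the hood, a histogram simply counts how many values fall into each bin. The following standard-library sketch shows that counting step with invented porosity values and hand-picked bins; Plotly chooses its own bins automatically, so this is only meant to illustrate the idea.

```python
def histogram_counts(values, start, stop, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    width = (stop - start) / n_bins
    counts = [0] * n_bins
    for v in values:
        if start <= v <= stop:
            # A value exactly on the top edge is placed in the last bin
            index = min(int((v - start) / width), n_bins - 1)
            counts[index] += 1
    return counts

# Invented neutron porosity values (fraction)
nphi = [0.05, 0.12, 0.18, 0.22, 0.25, 0.27, 0.31, 0.33, 0.45, 0.85]
counts = histogram_counts(nphi, start=0.0, stop=1.0, n_bins=5)
print(counts)
```

Each count becomes the height of one bar, so a lone value far from the rest shows up as an isolated bar, which is how histograms help with spotting outliers.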
Violin Plot
Violin plots are similar to boxplots, but they also combine the power of kernel density estimation plots. In addition to illustrating the key statistical points that a boxplot shows, they also allow us to gain an insight into the shape of the distribution of the data.
px.scatter(data_frame=df, x='NPHI', y='RHOB', range_x=[0, 1], range_y=[3, 1], color='LITH',
           marginal_y='violin', marginal_x='violin')
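The outline of a violin comes from a kernel density estimate. As a rough sketch of that idea, the code below places a Gaussian bump on every data point and adds them up. The data values and the bandwidth are invented, and Plotly selects its own bandwidth, so the curve it draws will not match this one exactly.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density of values at a point x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

# Invented bulk density values (g/cc) forming two clusters
rhob = [2.20, 2.22, 2.25, 2.28, 2.55, 2.58, 2.60, 2.62, 2.65]
density = gaussian_kde(rhob, bandwidth=0.05)   # bandwidth chosen by hand

# Evaluate on a grid from 2.00 to 2.90; a violin mirrors this curve
grid = [2.0 + 0.01 * i for i in range(91)]
curve = [density(x) for x in grid]
print(max(curve))
```

The estimated density is high near each cluster and low in the gap between them, which is exactly the kind of detail a boxplot on its own would hide.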
Mixing Marginal Plots
You don’t have to have the same plot on both axes. For example, you can use a histogram on the x-axis and a violin plot on the y-axis.
px.scatter(data_frame=df, x='NPHI', y='RHOB', range_x=[0, 1], range_y=[3, 1], color='LITH',
           marginal_y='violin', marginal_x='histogram')
Summary
In this short tutorial, we have seen how to display a variety of marginal plots on a Plotly Express scatter plot using well log data. These plots can enhance our data visualisations and provide us with further information about the data distribution.