Exploring Well Log Data Using Pandas, Matplotlib, and Seaborn

An example of exploring petrophysical and well log measurements using a number of plots from Seaborn and Matplotlib

(Cover photo by Markus Spiske on Unsplash)

Machine learning and Artificial Intelligence are becoming popular within the geoscience and petrophysics domains, especially over the past decade. Machine learning is a subdivision of Artificial Intelligence and is the process by which computers can learn and make predictions from data without being explicitly programmed to do so. We can use machine learning in a number of ways within petrophysics, including automating outlier detection, property prediction, facies classification, and more.

This series of articles will take a dataset from basic well log measurements through to petrophysical property prediction. The material was originally presented at a Machine Learning and AI Workshop at the SPWLA 2021 Conference and has since been expanded and updated. The series will consist of the following, and links will be included once each article has been released.

1. Initial Data Exploration of selected wells from the Volve field dataset
2. Identification of missing data
3. Detection of outliers / anomalous data points using manual and automated methods
4. Prediction of key reservoir properties using Machine Learning

Data

In 2018, Equinor released the entire contents of the Volve Field to the public domain to foster research and learning. The released data includes:

  • Well Logs
  • Petrophysical Interpretations
  • Reports (geological, completion, petrophysical, core, etc.)
  • Core Measurements
  • Seismic Data
  • Geological Models
  • and more…

The Volve Field is located some 200 km west of Stavanger in the Norwegian sector of the North Sea. Hydrocarbons were discovered within the Jurassic-aged Hugin Formation in 1993. Oil production began in 2008 and lasted for 8 years (twice as long as planned) until it ceased in 2016. In total, 63 MMBO were produced over the field’s lifetime, with production reaching a plateau of 56,000 B/D.

Further details about the Volve Field and the entire dataset can be found at: https://www.equinor.com/en/what-we-do/norwegian-continental-shelf-platforms/volve.html

The data is licensed under the Equinor Open Data Licence.

Selected Data for Analysis

The Volve dataset consists of 24 wells containing a variety of well log data and other associated measurements. For this small tutorial series, we are going to take a selection of five wells. These are:

  • 15/9-F-1 A
  • 15/9-F-1 B
  • 15/9-F-1 C
  • 15/9-F-11 A
  • 15/9-F-11 B

From these wells, a standard set of well logging measurements (features) has been selected. Their names, units, and descriptions are detailed in the table below.

(Table of the selected well log curves, with their names, units, and descriptions)

The goal over the series of notebooks will be to predict three commonly derived petrophysical measurements: Porosity (PHIF), Water Saturation (SW), and Shale Volume (VSH). Traditionally, these are calculated through a number of empirically derived equations.

Data Exploration

Exploratory Data Analysis (EDA) is an important step within a data science workflow. It allows you to become familiar with your data and understand its contents, extent, quality, and variation. It is within this stage that you can identify patterns within the data, as well as relationships between the features (well logs).

I have covered a number of EDA processes and plots in my previous Medium articles:

  • Exploratory Data Analysis with Well Log Data
  • Visualising Well Data Coverage Using Matplotlib
  • How to use Unsupervised Learning to Cluster Well Log Data Using Python

As petrophysicists and geoscientists, we commonly use well log plots (line plots of data vs depth), histograms, and crossplots (scatter plots) to analyse and explore well log data. Python provides a great toolset for visualising the data from different perspectives in a quick and easy way.

In this tutorial, we will cover:

  • Reading in data from a CSV file
  • Viewing data on a log plot
  • Viewing data on a crossplot / scatter plot
  • Viewing data on a histogram
  • Visualising all well log curves on a crossplot and histogram using a pairplot

Importing Libraries & Data

The first step is to import the libraries that we require: pandas for loading and storing the data, and matplotlib and seaborn for visualising it.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

After importing the libraries, we will load the data using the pandas read_csv function and assign it to the variable df.

df = pd.read_csv('data/spwla_volve_data.csv')

Pandas .describe Function

Once the data has been loaded, it will be stored within a structured object, similar to a table, known as a dataframe. We can check the contents of the dataframe in a number of ways. First, we can check the summary statistics of the numeric columns using the .describe() function. From this, we can find out the number of data points per feature, along with the mean, standard deviation, minimum, maximum, and percentile values.

To make the table easier to read, we will append the .transpose() function. This puts the column names in the rows and the statistical measurements in the columns.

df.describe().transpose()
(Output: transposed summary statistics for each numeric column)

Pandas .info Function

The next method we can call upon is .info(). This provides a list of all of the columns within the dataframe, their data type (e.g. float, integer, string), and the number of non-null values contained within each column. We can see below that we have a column called wellName, which was not included in the summary statistics above as it is non-numeric.

df.info()

RangeIndex: 27845 entries, 0 to 27844
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   wellName  27845 non-null  object
 1   MD        27845 non-null  float64
 2   BS        27845 non-null  float64
 3   CALI      27845 non-null  float64
 4   DT        5493 non-null   float64
 5   DTS       5420 non-null   float64
 6   GR        27845 non-null  float64
 7   NPHI      27845 non-null  float64
 8   RACEHM    27845 non-null  float64
 9   RACELM    27845 non-null  float64
 10  RHOB      27845 non-null  float64
 11  RPCEHM    27845 non-null  float64
 12  RPCELM    27600 non-null  float64
 13  PHIF      27736 non-null  float64
 14  SW        27736 non-null  float64
 15  VSH       27844 non-null  float64
dtypes: float64(15), object(1)
memory usage: 3.4+ MB

Pandas .head and .tail Functions

The next useful pair of methods available to us is .head() and .tail(). These return the first and last five rows of the dataframe, respectively.

df.head()
(Output: the first five rows of the dataframe)
df.tail()
(Output: the last five rows of the dataframe)

Finding the Name of the Wells Using the .unique Function

We know from the introduction that we should have 5 wells within this dataset. We can confirm this by calling upon the wellName column and using the .unique() method. This will return an array listing all of the unique values within that column.

df['wellName'].unique()
array(['15/9-F-1 A', '15/9-F-1 B', '15/9-F-1 C', '15/9-F-11 A',
       '15/9-F-11 B'], dtype=object)

As seen above, we can call upon specific columns within the dataframe by name. If we do this for a numeric column, such as CALI, we get back a pandas Series; when displayed, it shows the first 5 values, the last 5 values, and details about that column.

df['CALI']
0        8.6718
1        8.6250
2        8.6250
3        8.6250
4        8.6250
          ...
27840    8.8750
27841    8.8510
27842    8.8040
27843    8.7260
27844    8.6720
Name: CALI, Length: 27845, dtype: float64

Data Visualisation

Well Log Plots

Log plots are one of the bread-and-butter tools that we use to analyse well log data. They consist of several columns called tracks. Each track can contain one or more logging curves, plotted against depth. They help us visualise the subsurface and allow us to identify potential hydrocarbon intervals.

As we are going to be creating multiple log plots, we can create a simple function that can be called upon multiple times. Functions allow us to break down our code into manageable chunks and save us from repeating code.

This create_plot function takes a number of arguments (inputs):

  • wellname: the wellname as a string
  • dataframe: the dataframe for the selected well
  • curves_to_plot: a list of logging curves / dataframe columns we are wanting to plot
  • depth_curve: the depth curve we are wanting to plot against
  • log_curves: a list of curves that need to be displayed on a logarithmic scale
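A minimal sketch of such a function is shown below, assuming matplotlib has been imported as plt. The figure sizing and styling choices here are illustrative assumptions and can be adjusted to taste.

import matplotlib.pyplot as plt

def create_plot(wellname, dataframe, curves_to_plot,
                depth_curve, log_curves=None):
    log_curves = log_curves or []
    num_tracks = len(curves_to_plot)

    # One track (subplot) per curve, all sharing the same depth axis
    fig, axes = plt.subplots(nrows=1, ncols=num_tracks,
                             figsize=(num_tracks * 1.5, 10), sharey=True)

    for ax, curve in zip(axes, curves_to_plot):
        ax.plot(dataframe[curve], depth_curve, linewidth=0.5)
        ax.set_title(curve, fontsize=9)
        ax.grid(which='major', color='lightgrey', linestyle='-')

        # Resistivity curves are conventionally shown on a logarithmic scale
        if curve in log_curves:
            ax.set_xscale('log')

    # Depth increases downwards on a well log plot
    axes[0].set_ylim(depth_curve.max(), depth_curve.min())
    axes[0].set_ylabel('Depth (m)')

    fig.suptitle(f'Well: {wellname}', y=1.02)
    plt.tight_layout()
    plt.show()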

As there are 5 wells within the dataframe, if we try to plot all of the data in one go, we will have mixed measurements from all of the wells. To resolve this, we can group the data by the well name.

grouped = df.groupby('wellName')

When we call upon the .head() function of this grouped object, we will get the first 5 rows of each well.

grouped.head()
(Output: the first five rows of each well)

To have more control over the well we are wanting to plot, we can split the grouped dataframe into single dataframes and store them within a list. This will allow us to access specific wells by passing in a list index value.

Additionally, it will allow us to use all available pandas dataframe functions on the data, something that is limited and changes when working with a grouped dataframe.

# Create empty lists
dfs_wells = []
wellnames = []

# Split up the data by well
for well, data in grouped:
    dfs_wells.append(data)
    wellnames.append(well)

If we loop through the wellnames list we can get the index number and the associated wellname.

for i, well in enumerate(wellnames):
    print(f'Index: {i} - {well}')
Index: 0 - 15/9-F-1 A
Index: 1 - 15/9-F-1 B
Index: 2 - 15/9-F-1 C
Index: 3 - 15/9-F-11 A
Index: 4 - 15/9-F-11 B

Before we plot the data, we need to specify the curves we want to plot, as well as which of those curves should be logarithmically scaled.

curves_to_plot = ['BS', 'CALI', 'DT', 'DTS', 'GR',
                  'NPHI', 'RACEHM', 'RACELM', 'RHOB',
                  'RPCEHM', 'RPCELM', 'PHIF', 'SW', 'VSH']

logarithmic_curves = ['RACEHM', 'RACELM', 'RPCEHM', 'RPCELM']

Let’s call upon the first well and make a plot.

Note that Python lists are indexed from 0, therefore the first well in the list will be at position 0.

well = 0
create_plot(wellnames[well], dfs_wells[well],
            curves_to_plot, dfs_wells[well]['MD'],
            logarithmic_curves)

When we execute this code, we generate the following plot for 15/9-F-1 A. We have all of our well logging measurements on a single plot, and the resistivity curves are displayed logarithmically, as we would expect them to be.

(Well log plot for 15/9-F-1 A)

We can do the same with the second well:

well = 1
create_plot(wellnames[well], dfs_wells[well],
            curves_to_plot, dfs_wells[well]['MD'],
            logarithmic_curves)
(Well log plot for 15/9-F-1 B)

Standard Crossplots (Scatter Plots) using Seaborn

Crossplots (also known as scatter plots) are another common data visualisation tool we use during a petrophysical analysis. More information on working with crossplots and well log data can be found in my previous articles.

Similar to the log plots section above, we will create a simple function that we can use to generate multiple crossplots. This function utilises Seaborn's FacetGrid and allows us to map plots directly onto the grid. This is a much easier way to plot data compared to subplot2grid in matplotlib.

The arguments (inputs) to this function are:

  • x — X-axis variable as a string, e.g. ‘NPHI’
  • y — Y-axis variable as a string, e.g. ‘RHOB’
  • c — A third variable used for applying colour to the crossplot, e.g. ‘GR’
  • dataframe — The grouped dataframe created using .groupby('wellName')
  • columns — The number of columns to display on the figure
  • xscale — The X-axis scale
  • yscale — The Y-axis scale
  • vmin — The minimum value for the colour shading
  • vmax — The maximum value for the colour shading
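A minimal sketch of such a function is shown below. Note that this sketch takes the full dataframe and lets FacetGrid split the panels by wellName; the small scatter helper, the viridis colour map, and the point size are assumptions that can be changed as needed.

import matplotlib.pyplot as plt
import seaborn as sns

def facet_scatter(x, y, c, **kwargs):
    # FacetGrid injects a fixed colour; remove it so we can colour by `c`
    kwargs.pop('color', None)
    plt.scatter(x, y, c=c, **kwargs)

def create_crossplot(x, y, c, dataframe, columns=3,
                     xscale='linear', yscale='linear',
                     vmin=None, vmax=None):
    # One panel per well, wrapping to a new row after `columns` panels
    g = sns.FacetGrid(dataframe, col='wellName', col_wrap=columns)
    g.map(facet_scatter, x, y, c, s=5, cmap='viridis',
          vmin=vmin, vmax=vmax)
    g.set(xscale=xscale, yscale=yscale)
    plt.show()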

Evaluating Neutron Porosity & Bulk Density Data Quality Using Borehole Caliper

We can now use our function to create density-neutron crossplots coloured by caliper. Caliper provides an indication of the size of the borehole. During drilling, the walls of the borehole can collapse, resulting in the borehole becoming larger (washed out).
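Using the sketched function above, a call along these lines generates the plots; the caliper colour range of 8 to 12 inches is an assumed choice.

# Colour range (vmin/vmax) in inches is an assumed, illustrative choice
create_crossplot('NPHI', 'RHOB', 'CALI', df,
                 columns=3, vmin=8, vmax=12)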

(Density-neutron crossplots coloured by caliper, one panel per well)

From the plot above, we can see that most of the wells are in good shape and not too significantly washed out, although well 15/9-F-11 B contains some borehole enlargement, as indicated by the redder colours.

Acoustic Compressional vs Shear Crossplot with Gamma Ray Colouring

The next crossplot we will look at is acoustic compressional (DT) versus acoustic shear (DTS).
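Again using the sketched crossplot function; the gamma ray colour range of 0 to 150 API is an assumed choice.

# Colour range (vmin/vmax) in API units is an assumed, illustrative choice
create_crossplot('DT', 'DTS', 'GR', df,
                 columns=3, vmin=0, vmax=150)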

(DT vs DTS crossplots coloured by gamma ray, one panel per well)

When we view this data, we can see that two of the charts are blank. This lets us know right away that we may have missing data within our dataset. We will explore this in the next article in the series.

Histograms

matplotlib histograms

Histograms are a commonly used tool within exploratory data analysis and data science. They are an excellent data visualisation tool and appear similar to bar charts. However, histograms allow us to gain insights into the distribution of the values within a set of data and allow us to display a large range of data in a concise plot. Within the petrophysics and geoscience domains, we can use histograms to identify outliers and to pick key interpretation parameters, for example, clay or shale volume endpoints from the gamma ray.

Histograms allow us to view the distribution, shape, and range of numerical data. The data is split up into a number of bins, which are represented by individual bars on the plot.

You can find out more about working with histograms and well log data in this article:

Creating Histograms of Well Log Data Using Matplotlib in Python

We can create a simple histogram from our main dataframe by appending .hist() with the column name onto the end of the dataframe object.

df.hist('GR')
(Histogram of GR for the entire dataframe)

Right away, we can see we have a few issues: all wells are grouped together, the number of bins is too few, and the plot does not look great. So we can change it up a bit by first increasing the number of bins and removing the grid lines.

df.hist('GR', bins=40)
plt.grid(False)
(Histogram of GR with 40 bins and no grid lines)

The above generates an instant improvement to the plot. We can see the distribution of the data much more clearly now; however, all of the data is still combined.

Seaborn Histograms

We can also call upon the Seaborn plotting library, which gives us much more control over the aesthetics of the plot. In the first example, we can add a Kernel Density Estimate (KDE).
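One way to do this is with Seaborn's histplot, which accepts a kde argument:

sns.histplot(data=df, x='GR', kde=True)
plt.show()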

(Histogram of GR with a KDE overlay)

From the plot above, we can see that the labels are automatically generated for us, and we have the KDE line plotted as well.

To split the data into the different wells, we can supply another argument, hue, which allows us to use a third variable to split out the data.

If we pass in the wellName for the hue, we can generate separate histograms for each well.
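For example:

sns.histplot(data=df, x='GR', hue='wellName')
plt.show()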

(Histograms of GR split by well using the hue argument)

We can do the same with Bulk Density (RHOB). We can also specify the number of bins that we want to display, as shown below.
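For example, with an assumed bin count of 40:

# bins=40 is an assumed, illustrative choice
sns.histplot(data=df, x='RHOB', hue='wellName', bins=40)
plt.show()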

(Histograms of RHOB split by well)

FacetGrid

If we want to split the data up into individual histograms per well, we need to use a FacetGrid and map the required histogram plot to it.

For the FacetGrid, we specify the dataframe and the column we wish to split the data by. hue, as mentioned in the crossplot section, controls the colour of the data in each column, and col_wrap specifies the maximum number of columns before the plot wraps to a new row.
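A minimal sketch, assuming three columns per row:

# col_wrap=3 is an assumed, illustrative choice
g = sns.FacetGrid(df, col='wellName', hue='wellName', col_wrap=3)
g.map(sns.histplot, 'GR')
plt.show()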

(FacetGrid of GR histograms, one panel per well)

KDEPlot

If we want to view the distribution of the data as a line, we can use the Kernel Density Estimation plot (kdeplot). This is useful if we are looking to see whether the data requires normalisation.
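For example:

# common_norm=False scales each well's curve independently (an assumed choice)
sns.kdeplot(data=df, x='GR', hue='wellName', common_norm=False)
plt.show()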

(KDE plot of GR split by well)

Seaborn Pairplot

Rather than looking at a limited number of variables each time, we can quickly create a grid containing a mixture of crossplots and histograms using a simple line of code from the seaborn library. This is known as a pair plot.

We pass in the dataframe, along with the variables that we want to analyse. Along the diagonal of the pair plot, we could have a histogram, but in this example we will use the KDE plot. Additionally, we can specify colours, point sizes, etc. using the plot_kws argument.
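A minimal sketch, with an assumed subset of the curves; the exact variables and styling may differ from the original figure.

# The vars list and plot_kws values are assumed, illustrative choices
sns.pairplot(df, vars=['GR', 'NPHI', 'RHOB', 'DT'],
             diag_kind='kde',
             plot_kws={'s': 5, 'color': 'blue'})
plt.show()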

(Pairplot of the selected well log measurements)

We can now easily see the relationship between each of the well logging measurements without creating individual plots. This can be further enhanced by splitting out the data by well through the hue argument, as in the example below.
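For example, again with an assumed subset of curves:

sns.pairplot(df, vars=['GR', 'NPHI', 'RHOB', 'DT'],
             hue='wellName', diag_kind='kde')
plt.show()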

(Pairplot split by well using the hue argument)

Summary

In this tutorial, we have used a number of tools to explore the dataset and gain some initial insights into it. This has been achieved through well log plots, crossplots (scatter plots), histograms, and pairplots. These tools allow us to get an initial feel for the data and its contents.

The next step is to identify whether there is any missing data present within the dataset. That article will be published soon.

This notebook was originally published for the SPWLA Machine Learning Workshop at the 2021 SPWLA Conference.

Thanks for reading!

If you have found this article useful, please feel free to check out my other articles looking at various aspects of Python and well log data. You can also find my code used in this article and others at GitHub.

If you want to get in touch you can find me on LinkedIn or at my website.

Interested in learning more about python and well log data or petrophysics? Follow me on Medium.

If you enjoy reading these tutorials and want to support me as a writer and creator, then please consider signing up to become a Medium member. It’s $5 a month and you get unlimited access to many thousands of articles on a wide range of topics. If you sign up using my link, I will earn a small commission with no extra cost to you!
