How to Use Unsupervised Clustering on Well Log Data with Python
How to Use Unsupervised Clustering on Well Log Data with Python
Subdividing the subsurface using Python
Understanding the subsurface lithology is an important task in geoscience and petrophysics. Using a variety of electrical measurements generated from well logging technology we are able to make inferences about the underlying geology, such as the lithology, facies, porosity, and permeability.
Machine Learning algorithms have routinely been adopted to group well log measurements into distinct lithological groupings, known as facies. This process can be achieved using either unsupervised learning or supervised learning algorithms.
Supervised learning is the most common and practical of machine learning tasks and it is designed to learn from example using input data that has been mapped to the “correct” output. Alternatively, we can run the modelling using Unsupervised Learning, where we let the algorithms identify underlying patterns within the data that may not be easily visible during data exploration.
In this tutorial, we will be carrying out unsupervised learning classification using two clustering methods (K Means Clustering and Gaussian Mixture Modelling ) and comparing the results with an established Lithofacies curve.
What is Clustering / Cluster Analysis?
Clustering of data is a common form of exploratory data analysis (EDA) which is used to divide up the data into different groups based on shared characteristics or properties. Data points that are similar to each other are grouped together in the same cluster, and those that are different are placed in another cluster.
K-Means clustering is a very commonly used unsupervised machine learning algorithm. It is used to group data into K number of clusters by minimising the distance between the data point and the centroid.
The centroid is initialised at k random points in the data space and all points around it are assigned to the relevant cluster based on the distance to the centroid. The centroid is then adjusted to the central point of the cluster and the points surrounding it are reassigned. This continues until either there is no change in the centroids or the points remain in the same cluster or until a maximum number of iterations is reached.
K-Means is a hard clustering method where a data point either belongs to a cluster or it does not. It also carries out clustering by applying a circle (or hyper-sphere in multi-dimensional datasets)to the data.
Gaussian Mixture Modelling
The GMM method also allows data points to be clustered, except that it accounts for data variance, results in a softer classification, and rather than being distance-based it is distribution-based.
Also, the data point being classified has a probability of being one cluster or another.
While K-Means clustering works great if the data clusters are circular, however, in petrophysical and geological situations data rarely forms nice circular patterns. GMM modelling uses eliptical shaped cluster/decision boundaries and is, therefore, more flexible.
An excellent article looking at the differences between the two methods can be found at https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/
The dataset we are using for this tutorial forms part of a Machine Learning competition run by Xeek and FORCE 2020. The objective of the competition was to predict lithology from existing labelled data. The dataset consists of 118 wells from the Norwegian Sea.
The data and notebook for this article can be found on my GitHub repository at https://github.com/andymcdgeo/Petrophysics-Python-Series.
Importing Libraries & Data Loading
The first step of the project is to import the libraries that we require. For this example, we will be using NumPy for working with arrays, pandas for storing data, seaborn and matplotlib for displaying the data.
We will then read in the data using
pd.read_csv and then view the data details using the
As it is quite a big table, we can view the columns by calling upon
This will return an array of the columns within the dataframe:
Index(['WELL', 'DEPTH_MD', 'X_LOC', 'Y_LOC', 'Z_LOC', 'GROUP', 'FORMATION', 'CALI', 'RSHA', 'RMED', 'RDEP', 'RHOB', 'GR', 'SGR', 'NPHI', 'PEF', 'DTC', 'SP', 'BS', 'ROP', 'DTS', 'DCAL', 'DRHO', 'MUDWEIGHT', 'RMIC', 'ROPA', 'RXO', 'FORCE_2020_LITHOFACIES_LITHOLOGY', 'FORCE_2020_LITHOFACIES_CONFIDENCE'], dtype='object')
We don’t need all of the columns for this example, so we will take a copy of the dataframe with the required logging measurements, including the well name and the depth curve.
We will also rename the
FORCE_2020_LITHOFACIES_LITHOLOGY column to something simpler like
When we call upon the dataframe we have a much smaller number of columns to work with.
Column Remapping / Renaming
To make things simpler with plotting, and understanding what the lithology numbers supplied with the data mean, we can create two dictionaries and map the
FACIES column to two new columns.
The first is creating a dictionary for the string representations of the numbers.
The second dictionary is simplifying the lithology number to range from 1 to 12 instead of the large range of numbers used in the original data. This will help when it comes to making a log plot.
To create new columns in our dataframe, we start with defining the column name
workingdf['LITH'] and then we assign the mapped values using the
We do this for both the string and simple number representations of the facies.
When we view the dataframe, we can see that we now have our two new columns at the end.
Visualising the Data
As there is already a FACIES column with this data, we can take a quick look to see how the data is distributed across each lithofacies.
To do this we can use Seaborn’s
FacetGrid method to plot individual density-neutron crossplots (scatterplots) for each lithology.
The FacetGrid is used to create an underlying structure for the plot. In this example, the FacetGrid has been passed the dataframe we are working with (
workingdf), the column we want to split the plots up by (
col) and the point at which we want to wrap to a new row (
col_wrap). In this instance, once there are 4 columns, then the data will wrap.
We can then map a density neutron crossplot on top of that
Create Plot Function
Before we plot any data we need to create a few functions. The first is a create plot function, which will take a number of arguments and our facies curve, and will generate a conventional log plot.
Splitting Data by Well Function
The second method we will create will be used to split up our dataframe by wells. This is done using the
groupby function, and will allow us to store each dataframe within a list for easy access later.
We can then call upon the function:
This then returns a nice listing of the index positions and the associated well names
4 16/1-6 A
9 16/11-1 ST3
10 16/2-11 A
In this section, we are going to set up our clustering models and run them on our dataset.
First, we will import our clustering models from the sklearn library.
Finding the Optimum Number of Clusters
To make sure that K-Means and Gaussian Mixture Modelling models are working efficiently we need to provide them with a starting number of clusters. If the number of clusters is incorrectly selected, the algorithms may not perform well or could take longer to resolve (especially if the number is too high).
We can attempt to identify the optimum number of clusters using an elbow plot, where the goal is to select a number for the clusters based on the ‘elbow’ or inflection formed in the results. There are other methods such as the silhouette method for picking the number of clusters.
For this example, we will use the elbow plot. To do this we evaluate the model performance over a given range of clusters, and then from the plot identify the most suitable number.
For clustering to work, we need to remove any missing values. This is achieved using the
We can then use the
describe() function to make sure our data is still good after the missing data values have been removed. In this example, we have gone from 133,198 to 82,732 depth levels.
To keep our model simple we will work with four logging measurements (columns): Gamma Ray (GR), Bulk Density (RHOB), Neutron Porosity (NPHI), and Acoustic Compressional Slowness (DTC).
In the plot above, we can see that the inertia (sum of the squared distances to the nearest cluster center) decreases as we increase the number of clusters. There is no clear defined break within this dataset, however, we can see that the slope changes from about 5 clusters onwards. The picking of this value will be dependent on the interpreter and could range from 4 to 10.
So for this example we will take 5 as the optimum number of clusters.
Fitting the Clustering Models
To make the comparison simple, we will use the same number of clusters in the Gaussian Mixture Model. For this model the number of clusters parameter is known as n_components.
Plotting the Results
Now that the clusters have been computed using KMeans and GMM methods, we can plot the data to see how well the predicted in relation to the labeled lithologies. Note that these methods are unsupervised and do not use the labeled data for training. We are comparing here how well unsupervised methods perform with well log data.
As we predicted into the main
workingdf dataframe, we need to split the data up again into individual wells. We can do this by calling upon the simple function created earlier.
3 16/1-6 A
8 16/2-11 A
The first plot we will look at is the logplot. We will pass in the original lithofacies (LITH_SI) column and the newly computed KMeans and GMM results.
In the plot above we have the original Lithology and our computed KMeans and GMM cluster results in the last three subplots.
The first thing to note is that the colours do not necessarily mean the data is from the same group across each method. There are ways to map these so that the colouring is consistent, but for the purposes of this tutorial, we will not go into this.
Looking at well 16/10–1 (index 4), we have 10 separate facies/groups displayed and we can see that these mostly tie-up with the changes in the logging measurements. For example the decrease in Gamma Ray (GR) from around 2300m to around 2775m ties in nicely with the blue and light blue grouping. In the KMeans and GMM models, this section has also been highlighted as being in the same cluster in both methods, however, there is no variation in this section. As both of these methods were set to a max of 5 clusters, we will not be able to capture the same degree of variation.
To resolve this, we could increase the number of clusters.
Viewing Results on Scatterplots / Crossplots
Another way to view the performance of the clustering is through scatter plots. We can do this using the common density-neutron scatterplots / crossplots and using matplotlib.
Even though there is mixing of the clusters in each method, the interval that was discussed in the log plot section can be identified in the lower left of the plot, where we have higher density values and lower neutron porosity values.
In the KMeans grouping, this cluster shows as one complete cluster, however, in the GMM method we can see it matches closer to the supplied lithology.
Viewing Results on a Pairplot
As we used four input curves for our model, we should look at all of these to see how the clusters vary. The best way to do this is to use the excellent pairplot from the seaborn library. This plot displays the relationships between the data in the dataset on a grid. This allows a quick and easy way to identify and visualise the data. Along the diagonal, the distribution of the data split by cluster is also plotted.
As we are looking at well number 4 we need to pass in that dataframe to the pairplot (dfs_wells).
This provides us a much nicer plot to look at and also allows us to see how the data is clustered in the other logging curves. We can see that the GMM model provides some improvement in defining the clusters, especially in the DTC vs RHOB plot.
In this article, we have covered the basics for carrying out unsupervised cluster analysis using two popular algorithms — KMeans Clustering and Gaussian Mixture Modelling. Using an optimisation method we have determined, by eye, that the optimum number of clusters was five, however, it is worth experimenting with more clusters to see if this provides a better match.
Once the clustering was complete, we saw multiple ways to visualise the results: a standard log plot setup, scatter plots, and seaborn’s pairplot.
As K-Means clustering utilises spherical clusters, it may not always be appropriate to well log data and the subsurface. However, Gaussian Mixture Modelling does appear to provide a slight improvement in clustering.
Thanks for reading!
If you have found this article useful, please feel free to check out my other articles looking at various aspects of Python and well log data. You can also find my code used in this article and others at GitHub.
If you want to get in touch you can find me on LinkedIn or at my website.
Interested in learning more about python and well log data or petrophysics? Follow me on Medium.
If you have enjoyed this article or any others and want to show your appreciation you are welcome to Buy Me a Coffee
Bormann, Peter, Aursand, Peder, Dilib, Fahad, Manral, Surrender, & Dischington, Peter. (2020). FORCE 2020 Well well log and lithofacies dataset for machine learning competition [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4351156
The Most Comprehensive Guide to K-Means Clustering You’ll Ever Need https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/