Well Log Data Outlier Detection With Machine Learning and Python
Identification of outliers is an essential step in the machine learning workflow
Outliers are anomalous points within a dataset. They are points that don’t fit within the normal or expected statistical distribution of the dataset and can occur for a variety of reasons such as sensor and measurement errors, poor data sampling techniques, and unexpected events.
Within well log measurements and petrophysics data, outliers can occur due to washed-out boreholes, tool and sensor issues, rare geological features, and issues in the data acquisition process. It is essential that these outliers are identified and investigated early on in the workflow as they can result in inaccurate predictions by machine learning models.
The example in the figure below (from McDonald, 2021) illustrates core porosity versus core permeability. The majority of the data points form a coherent cluster, however, the point marked by the red square lies outside of this main group of points, and therefore could be considered an outlier. To confirm if it is indeed an outlier, further investigation into the reports and original data would be required.
Identifying Outliers
There are a number of ways to identify outliers within a dataset, some of these involve visual techniques such as scatterplots (e.g. crossplots) and boxplots, whilst others rely on univariate statistical methods (e.g. Z-score) or even unsupervised machine learning algorithms (e.g. K Nearest Neighbours).
The following methods for outlier detection will be covered within this article:
- Manual Removal Based on Domain Knowledge
- Box Plot and IQR
- Using a Caliper Curve
- Automated Outlier Detection
Petrophysical Machine Learning Series
This article forms the third part of an ongoing series that looks at taking a dataset from basic well log measurements through to petrophysical property prediction with machine learning.
These articles were originally presented as interactive notebooks at the SPWLA 2021 Conference during a Machine Learning and AI Workshop. They have since been expanded and updated to form these articles. The series consists of:
- Exploratory Data Analysis: Exploring Well Log Data Using Pandas, Matplotlib, and Seaborn
- Identification and Handling of Missing Well Log Data Prior to Petrophysical Machine Learning
- Well Log Data Outlier Detection — This Article
- Prediction of Key Reservoir Properties Using Machine Learning
**Not Completed Yet**
Data
In 2018, Equinor released the entire contents of the Volve Field to the public domain to foster research and learning. The released data includes:
- Well Logs
- Petrophysical Interpretations
- Reports (geological, completion, petrophysical, core etc)
- Core Measurements
- Seismic Data
- Geological Models
- and more…
The Volve Field is located some 200 km west of Stavanger in the Norwegian Sector of the North Sea. Hydrocarbons were discovered within the Jurassic aged Hugin Formation in 1993. Oil production began in 2008 and lasted for 8 years (twice as long as planned) until 2016, when production ceased. In total, 63 MMBO were produced over the field's lifetime, with production reaching a plateau of 56,000 B/D.
Further details about the Volve Field and the entire dataset can be found at: https://www.equinor.com/en/what-we-do/norwegian-continental-shelf-platforms/volve.html
The data is licensed under the Equinor Open Data Licence.
Selected Data for Analysis
The Volve dataset consists of 24 wells containing a variety of well log data and other associated measurements. For this small tutorial series, we are going to take a selection of five wells. These are:
- 15/9-F-1 A
- 15/9-F-1 B
- 15/9-F-1 C
- 15/9-F-11 A
- 15/9-F-11 B
From these wells, a standard set of well logging measurements (features) have been selected. Their names, units, and descriptions are detailed in the table below.
The goal over the series of notebooks will be to predict three commonly derived petrophysical measurements: Porosity (PHIF), Water Saturation (SW), and Shale Volume (VSH). Traditionally, these are calculated through a number of empirically derived equations.
Importing Libraries & Data
The first step in this part of the project is to import the libraries and the data that we are working with.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy
df = pd.read_csv('data/spwla_volve_data.csv')
An initial look with the describe() function of pandas allows us to see the data range (min and max) for each measurement.
df.describe()
Dealing With Outliers Using Manual Methods
Removing Extreme Resistivities
Resistivity measurements provide an indication of how easy electrical current can pass through the formation. In simple terms, if the formation contains saline water, then the resistivity will be low, whereas if oil is present or there is very little pore space, the resistivity will be high.
There are a number of ways that the resistivity measurement can be affected, for example by nearby casing, tool/sensor issues, or even very highly resistive formations. Additionally, depending upon the type of tool making the measurement (induction, electromagnetic propagation, laterolog) there may be limitations on how accurate the readings are at high resistivity values.
The dataset we are working with contains electromagnetic propagation resistivity measurements, and we will apply the following cutoffs. These limits are not specific to any tool, and will vary depending on the data and technology used.
- RACEHM > 60
- RACELM > 100
- RPCEHM > 100
- RPCELM > 200
Any rows that contain resistivity measurements above these values will be removed. If we only removed the data values, we would have issues with missing data later on.
We can do this by:
df = df.loc[~((df.RACEHM > 60) | (df.RACELM > 100) | (df.RPCEHM > 100) | (df.RPCELM > 200)),:]
df.describe()
When we return the dataframe summary, we can see that the number of measurements for these resistivity curves has reduced from 27,845 to 23,371.
Converting Resistivity Curves Using Logarithms
As the resistivity curves have a large range of values, from 10s of ohmm to 1000s of ohmm, and often exhibit a skewed distribution, it is best to transform the measurements towards a more normal distribution by taking the log base 10 (log10) of the values.
# Select all resistivity curves
res_curves = ['RACEHM', 'RACELM', 'RPCEHM', 'RPCELM']

# Loop through each curve, transform it, and drop the original column
for res in res_curves:
    df[f'{res}_l10'] = np.log10(df[res])
    df.drop(columns=[res], inplace=True)

df.head()
From the above table, we can now see that the new columns have been added to the dataframe and the old ones have been removed.
Identifying Outliers with Boxplots
A boxplot is a graphical and standardised way to display the distribution of data based on five key numbers: the “minimum”, the 1st Quartile (25th percentile), the median (2nd Quartile / 50th percentile), the 3rd Quartile (75th percentile), and the “maximum”. The minimum and maximum values are defined as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR respectively. Any points that fall outside of these limits are referred to as outliers.
The following video on my YouTube channel covers the background on boxplots and how to generate them in python.
The simplest method for calling a boxplot is using .plot(kind='box')
on the dataframe. However, as you will see, this will plot all of the columns on one scale, and if the curves have very different ranges of values, such as from 0 to 1 and from 3000 to 5000, then the smaller measurements will be harder to distinguish.
df.plot(kind='box')
plt.show()
To make the plots easier to view and understand, we can call upon the boxplot from the seaborn library and loop over each of the columns in the dataframe.
Once we have made the function we can create a list of the columns and “pop-off” or remove the well name column, which contains string data. We can then run the function for each of the log curves within the dataframe.
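The helper function referenced above is not reproduced in the text; a minimal sketch of this approach is shown below, assuming a `WELL` column holds the well names and using a small synthetic dataframe as a stand-in for the loaded Volve data:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the Volve dataframe loaded earlier in the article
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'WELL': ['15/9-F-1 A'] * 100,
    'RHOB': rng.normal(2.4, 0.1, 100),
    'NPHI': rng.normal(0.25, 0.05, 100),
    'GR': rng.normal(60, 15, 100),
    'CALI': rng.normal(8.7, 0.2, 100),
})

def create_boxplot(dataframe, column, ax):
    """Draw a seaborn boxplot for one log curve, with outliers shown in green."""
    sns.boxplot(y=dataframe[column], ax=ax,
                flierprops={'markerfacecolor': 'green', 'marker': 'o'})
    ax.set_title(column)

# List the curves and "pop off" the well name column, which contains strings
curve_list = list(df.columns)
curve_list.remove('WELL')

fig, axes = plt.subplots(nrows=1, ncols=len(curve_list), figsize=(15, 4))
for ax, curve in zip(axes, curve_list):
    create_boxplot(df, curve, ax)
plt.tight_layout()
```

The column names here are illustrative; in practice the loop runs over whichever log curves remain in your dataframe.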
From the generated plot, we can see that a few of the measurements may contain outliers, which are highlighted in green. Based on the boxplot alone, it is not easy to tell if these points are real outliers. This is where we need to rely on domain knowledge and other methods.
Identifying Outliers with Crossplots
When working with multiple variables, we can use crossplots to identify potential outliers. These involve plotting one logging measurement against another. A third logging measurement can be used to colour the points and make outliers easier to identify.
In this example, we will use a function to make a scatter plot (crossplot) of density vs neutron porosity data, which will be coloured by caliper.
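The plotting function itself is not shown in the text; a hedged sketch follows, with synthetic data standing in for the article's dataframe and the column names (NPHI, RHOB, CALI) assumed from the dataset description:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Volve dataframe
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'NPHI': rng.normal(0.25, 0.05, 200),
    'RHOB': rng.normal(2.4, 0.15, 200),
    'CALI': rng.normal(8.7, 0.4, 200),
})

def density_neutron_crossplot(dataframe, colour_by='CALI'):
    """Density vs neutron porosity crossplot, coloured by a third curve."""
    fig, ax = plt.subplots(figsize=(8, 6))
    scatter = ax.scatter(dataframe['NPHI'], dataframe['RHOB'],
                         c=dataframe[colour_by], cmap='jet', s=10)
    ax.set_xlabel('NPHI (v/v)')
    ax.set_ylabel('RHOB (g/cc)')
    ax.invert_yaxis()  # density conventionally increases down the y-axis
    fig.colorbar(scatter, ax=ax, label=colour_by)
    return fig, ax

fig, ax = density_neutron_crossplot(df)
```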
We can see on the returned plot that there are a few points highlighted in orange/red, which indicate possible washout if we assume an 8.5 inch bit size.
This indicates that a few measurements may be impacted by badhole conditions. Many logging tools are capable of compensating for a certain degree of washout and rugosity.
For the purposes of this example, and to illustrate the process of dealing with bad data points impacted by washout, we will remove any points where the caliper reads over 9″, which is 0.5″ over gauge. Any points that are less than 8.5″ will also be removed.
We can do this by:
df = df[(df['CALI'] >= 8.5) & (df['CALI'] <= 9)]
df.describe()
We can now see that we have reduced our dataset further to 22,402 depth samples, which is down from 27,845 in the initial dataset.
Identifying Outliers Using Unsupervised Machine Learning Methods
Both supervised and unsupervised machine learning methods can be used to identify outliers within well log data and petrophysical data. For the purposes of this tutorial, we will focus on a few unsupervised learning techniques.
Unsupervised machine learning models attempt to identify underlying relationships within the data without the need for labeled categories. There are a number of unsupervised machine learning methods that can be used to identify anomalies / outliers within a dataset.
In this article we will look at three common methods:
- Isolation Forest (IF)
- One Class SVM (SVM)
- Local Outlier Factor (LOF)
Isolation Forest
The isolation forest method is based upon decision trees. This method selects a feature/measurement and makes a random split in the data between the minimum and maximum values. This process then carries on down the decision tree until all possible splits have been made in the data. Any anomalies/outliers will be split off early in the process making them easy to identify and isolate from the rest of the data.
The image below illustrates a very simple example using a single variable — bulk density (RHOB).
One Class SVM
Support vector machines are a common machine learning tool for classification, which means they are good for splitting data up into different groups based on the data characteristics. This is achieved by identifying the maximum margin hyperplane between the groups of data as seen in the simple multi-class example below.
In a conventional SVM classification, we have more than one class or facies. However, when we only have one class of data we can use what is known as a One Class SVM.
In the case of outlier detection, what we want to do is find the boundary that separates the data points from the origin (see the left-hand graph in the figure below); in this case, we treat the origin as the second class. Any points that are outside of the boundary line are considered outliers. We can control the position of the line by providing a value for the proportion of outliers we expect to detect within our dataset. This parameter is known as the contamination level.
But in reality, outliers can exist on any side in relation to the main data cloud. In this case, we want to find a non-linear hyperplane that separates the outliers from the main data points (see right-hand graph in figure below). We can use an RBF kernel (Radial Basis Function Kernel) that finds the non-linear boundary between the points.
We can control how many outliers are allowed to be identified by specifying the contamination level. If we only want a small number of outliers, we can set this to a smaller number. Similarly, if we want a higher number of outliers to be detected, we can provide the algorithm with a higher number for contamination.
Local Outlier Factor
This method assesses the density of data points around a given point. Points that have a low density compared to others will be considered outliers. For further information on this method check out this link:
https://medium.com/mlpoint/local-outlier-factor-a-way-to-detect-outliers-dde335d77e1a
Creating the Models
We can create the models very simply using the code below, which uses the scikit-learn library.
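A sketch of the model creation step is shown below. The contamination value of 0.1 and the flag column names are assumptions; note that scikit-learn's One Class SVM uses the analogous `nu` parameter rather than `contamination`. Synthetic data stands in for the cleaned dataframe:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the cleaned Volve dataframe
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'NPHI': rng.normal(0.25, 0.05, 300),
    'RHOB': rng.normal(2.4, 0.15, 300),
})

anomaly_inputs = ['NPHI', 'RHOB']  # curves fed to the detectors

model_IF = IsolationForest(contamination=0.1, random_state=42)
model_SVM = OneClassSVM(nu=0.1, kernel='rbf')  # nu plays the role of contamination
model_LOF = LocalOutlierFactor(contamination=0.1)

# Each model labels inliers as 1 and outliers as -1
df['anomaly_IF'] = model_IF.fit_predict(df[anomaly_inputs])
df['anomaly_SVM'] = model_SVM.fit_predict(df[anomaly_inputs])
df['anomaly_LOF'] = model_LOF.fit_predict(df[anomaly_inputs])
```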
We can now check the performance of each of the models using density-neutron crossplots. This is achieved using Seaborn’s FacetGrid and mapping a scatter plot to it as follows.
This returns the following plots for each method and the number of anomalous data points that were identified.
It appears that the IF method is providing a better result, followed by LOF and then SVM. Most of the outliers on the right-hand side of the plot are removed in the first two methods.
We can look at the data in more detail for each of the wells and for each of the methods.
Which returns the following plots.
This gives us a better idea of how the anomalies have been identified in each of the wells using the selected methods. We can see that the LOF method has highlighted a number of points within the centre of the data points, which may not be true outliers. This is the point where domain expert knowledge comes into play in confirming if the points are indeed outliers.
Displaying Outlier Points on Log Plots
To identify where the outliers have been detected, we can generate simple log plots for each method and for each well using the code below.
Before plotting the data, we can make things easier by splitting up the dataframe into multiple dataframes based on the well name.
Once the dataframe has been split up by well, we can use the index number of the list to access the required data.
If we loop through the well names list we can get the index number and the associated well name.
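One possible sketch of the split-and-loop step, using pandas groupby (the `WELL` column name is an assumption, and the miniature dataframe stands in for the real one):

```python
import pandas as pd

# Synthetic stand-in: a dataframe with a well name column
df = pd.DataFrame({
    'WELL': ['15/9-F-1 A', '15/9-F-1 A', '15/9-F-1 B', '15/9-F-1 C'],
    'RHOB': [2.4, 2.5, 2.3, 2.6],
})

# Split the dataframe into one dataframe per well
grouped = df.groupby('WELL')
dfs_wells = [grouped.get_group(well) for well in grouped.groups]
well_names = list(grouped.groups)

# Loop through the list to pair each index number with its well name
print('Wellname \t Index')
for i, well in enumerate(well_names):
    print(f'{well} \t {i}')
```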
Which returns:
Wellname Index
15/9-F-1 A 0
15/9-F-1 B 1
15/9-F-1 C 2
15/9-F-11 A 3
15/9-F-11 B 4
We can then use this to select the required well
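A sketch of a simple log plot for one selected well, overlaying the flagged outliers in red; the depth and flag column names are assumed, and synthetic data stands in for the selected well's dataframe:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for one well's data with a depth curve and anomaly flag
rng = np.random.default_rng(3)
well_df = pd.DataFrame({
    'DEPTH': np.linspace(3500, 4000, 200),
    'RHOB': rng.normal(2.4, 0.1, 200),
    'anomaly_IF': rng.choice([1, -1], 200, p=[0.95, 0.05]),
})

fig, ax = plt.subplots(figsize=(3, 8))
# Plot the curve against depth, then overlay the flagged outliers in red
ax.plot(well_df['RHOB'], well_df['DEPTH'], lw=0.5)
outliers = well_df[well_df['anomaly_IF'] == -1]
ax.scatter(outliers['RHOB'], outliers['DEPTH'], c='red', s=10)
ax.set_xlabel('RHOB (g/cc)')
ax.set_ylabel('Depth (m)')
ax.invert_yaxis()  # depth increases downwards
```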
The plots above show us where the anomalies/potential outliers exist on a conventional log plot. The interval from 3700 to 3875 contains missing values. As we are using a line plot, the points on either side of this gap are interpolated to create a continuous line.
The highlighted intervals will need to be evaluated in closer detail and with domain knowledge, but for the purposes of this tutorial, we will go with the results of the Isolation Forest method.
Exporting the Results
Now that we have a clean dataset, we can export it to the required files for our machine learning stage.
To do this, we first need to create our temporary dataframe where we only use data that has been identified as inliers by the Isolation Forest algorithm. Once these points have been removed, we can then create our output dataframe.
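The filtering step could be sketched as follows, assuming the Isolation Forest flags were stored in a column named `anomaly_IF` (1 = inlier, -1 = outlier); the output filename is illustrative and the export line is left commented:

```python
import pandas as pd

# Synthetic stand-in: dataframe with an Isolation Forest flag (1 = inlier)
df = pd.DataFrame({
    'RHOB': [2.3, 2.4, 3.1, 2.5],
    'anomaly_IF': [1, 1, -1, 1],
})

# Keep only the rows flagged as inliers, then drop the flag column
dataout = df.loc[df['anomaly_IF'] == 1].drop(columns=['anomaly_IF'])

# dataout.to_csv('data/spwla_volve_data_cleaned.csv', index=False)  # filename assumed
```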
Creating Supervised Learning Files
To prepare the Supervised Learning file, we need to carry out a few steps.
Train, Validate, and Test Split
Before exporting our data, we first need to split the data into training, validation, and test datasets.
The training dataset is used for training the model, the validation dataset is used to tune the model parameters, and the test dataset is used to verify the model on unseen data. It should be noted that these terms are often used interchangeably, which can cause confusion.
Test Data Separation
First, we will split off one well (15/9-F-11 B) for our testing dataset; the remainder of the wells and data will be assigned to training_val_wells.
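This step can be sketched with simple boolean indexing on the well name column (a `WELL` column is assumed, with a miniature dataframe standing in for the real one):

```python
import pandas as pd

# Synthetic stand-in with a well name column
df = pd.DataFrame({
    'WELL': ['15/9-F-1 A', '15/9-F-11 B', '15/9-F-1 B', '15/9-F-11 B'],
    'RHOB': [2.4, 2.5, 2.3, 2.6],
})

# Hold out one well for blind testing; keep the rest for training/validation
test_well = df.loc[df['WELL'] == '15/9-F-11 B']
training_val_wells = df.loc[df['WELL'] != '15/9-F-11 B']
```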
Training and Validation Datasets
Next, we split our training_val_wells
data up into the training features and target features (SW, PHIF & VSH). In situations where we are using Sklearn for prediction, we would typically only specify one feature for prediction. However, for this example, we will be setting three target features.
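The split into X and y can be sketched as below; the non-target column names are illustrative, and a miniature dataframe stands in for training_val_wells:

```python
import pandas as pd

# Synthetic stand-in for the training/validation wells
training_val_wells = pd.DataFrame({
    'RHOB': [2.4, 2.5, 2.3],
    'GR': [55.0, 60.0, 70.0],
    'PHIF': [0.21, 0.18, 0.25],
    'SW': [0.3, 0.5, 0.2],
    'VSH': [0.1, 0.2, 0.3],
})

target_features = ['PHIF', 'SW', 'VSH']
X = training_val_wells.drop(columns=target_features)  # training features
y = training_val_wells[target_features]               # target features
```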
When we check the training features (X), we can see we have all of the required columns.
X.head()
We can also check the head of the target features (y):
y.head()
Train-Test Split
To split the data, we can use Sklearn's train_test_split function. For this example, we will use a training size of 70%, which leaves 30% for validating and tuning the model. random_state
has been set to a fixed value to allow reproducible results for this workshop. If you want to randomise the values each time, you can remove this method input.
It should be noted that the method used here is called train_test_split. But what we are actually doing is splitting off the training and validation datasets.
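The call itself is a one-liner; a minimal sketch with small stand-in dataframes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature/target dataframes
X = pd.DataFrame({'RHOB': range(10), 'GR': range(10)})
y = pd.DataFrame({'PHIF': range(10)})

# 70% training, 30% validation; fixed random_state for reproducibility
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.7, random_state=42)
```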
Summary
In this article we have covered what outliers are and methods for identifying them using plots and unsupervised learning algorithms. Once the outliers have been identified they can be removed. This is done using a combination of the methods and domain expertise. If points are thrown away blindly, you could be throwing away valuable data. So always check that your identified outliers are really outliers.
This notebook was originally published for the SPWLA Machine Learning Workshop at the 2021 SPWLA Conference, and has since been modified.
Thanks for reading!
If you have found this article useful, please feel free to check out my other articles looking at various aspects of Python and well log data. You can also find my code used in this article and others at GitHub.
If you want to get in touch you can find me on LinkedIn or at my website.
Interested in learning more about python and well log data or petrophysics? Follow me on Medium.
If you enjoy reading these tutorials and want to support me as a writer and creator, then please consider signing up to become a Medium member. It’s $5 a month and you get unlimited access to many thousands of articles on a wide range of topics. If you sign up using my link, I will earn a small commission with no extra cost to you!