Well Log Data Outlier Detection With Machine Learning and Python

Identification of outliers is an essential step in the machine learning workflow

png
Photo by Will Myers on Unsplash

Outliers are anomalous points within a dataset. They are points that don’t fit within the normal or expected statistical distribution of the dataset and can occur for a variety of reasons such as sensor and measurement errors, poor data sampling techniques, and unexpected events.

Within well log measurements and petrophysics data, outliers can occur due to washed-out boreholes, tool and sensor issues, rare geological features, and issues in the data acquisition process. It is essential that these outliers are identified and investigated early on in the workflow as they can result in inaccurate predictions by machine learning models.

The example in the figure below (from McDonald, 2021) illustrates core porosity versus core permeability. The majority of the data points form a coherent cluster, however, the point marked by the red square lies outside of this main group of points, and therefore could be considered an outlier. To confirm if it is indeed an outlier, further investigation into the reports and original data would be required.

Figure: Core porosity vs core permeability crossplot (after McDonald, 2021), with the possible outlier marked by a red square

Identifying Outliers

There are a number of ways to identify outliers within a dataset, some of these involve visual techniques such as scatterplots (e.g. crossplots) and boxplots, whilst others rely on univariate statistical methods (e.g. Z-score) or even unsupervised machine learning algorithms (e.g. K Nearest Neighbours).

The following methods for outlier detection will be covered within this article:

  • Manual Removal Based on Domain Knowledge
  • Box Plot and IQR
  • Using a Caliper Curve
  • Automated Outlier Detection

Petrophysical Machine Learning Series

This article forms the third part of an ongoing series that looks at taking a dataset from basic well log measurements through to petrophysical property prediction with machine learning.

These articles were originally presented as interactive notebooks at the SPWLA 2021 Conference during a Machine Learning and AI Workshop. They have since been expanded and updated to form these articles. The series consists of:

  1. Exploratory Data Analysis: Exploring Well Log Data Using Pandas, Matplotlib, and Seaborn
  2. Identification and Handling of Missing Well Log Data Prior to Petrophysical Machine Learning
  3. Well Log Data Outlier Detection — This Article
  4. Prediction of Key Reservoir Properties Using Machine Learning 
    **Not Completed Yet**

Data

In 2018, Equinor released the entire contents of the Volve Field to the public domain to foster research and learning. The released data includes:

  • Well Logs
  • Petrophysical Interpretations
  • Reports (geological, completion, petrophysical, core etc)
  • Core Measurements
  • Seismic Data
  • Geological Models
  • and more…

The Volve Field is located some 200 km west of Stavanger in the Norwegian sector of the North Sea. Hydrocarbons were discovered within the Jurassic-aged Hugin Formation in 1993. Oil production began in 2008 and lasted for 8 years (twice as long as planned) until production ceased in 2016. In total, 63 MMBO were produced over the field’s lifetime, with production reaching a plateau of 56,000 B/D.

Further details about the Volve Field and the entire dataset can be found at: https://www.equinor.com/en/what-we-do/norwegian-continental-shelf-platforms/volve.html

The data is licensed under the Equinor Open Data Licence.

Selected Data for Analysis

The Volve dataset consists of 24 wells containing a variety of well log data and other associated measurements. For this small tutorial series, we are going to take a selection of five wells. These are:

  • 15/9-F-1 A
  • 15/9-F-1 B
  • 15/9-F-1 C
  • 15/9-F-11 A
  • 15/9-F-11 B

From these wells, a standard set of well logging measurements (features) has been selected. Their names, units, and descriptions are detailed in the table below.

Table: Names, units, and descriptions of the selected well log measurements

The goal over the series of notebooks will be to predict three commonly derived petrophysical measurements: Porosity (PHIF), Water Saturation (SW), and Shale Volume (VSH). Traditionally, these are calculated through a number of empirically derived equations.

Importing Libraries & Data

The first step in this part of the project is to import the libraries and the data that we are working with.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy
df = pd.read_csv('data/spwla_volve_data.csv')

An initial look with the pandas describe() function allows us to see the data range (min and max) for each column.

df.describe()
Table: Summary statistics of the dataframe returned by df.describe()

Dealing With Outliers Using Manual Methods

Removing Extreme Resistivities

Resistivity measurements provide an indication of how easy electrical current can pass through the formation. In simple terms, if the formation contains saline water, then the resistivity will be low, whereas if oil is present or there is very little pore space, the resistivity will be high.

There are a number of ways that the resistivity measurement can be affected, for example by nearby casing, tool/sensor issues, or even very highly resistive formations. Additionally, depending upon the type of tool making the measurement (induction, electromagnetic propagation, laterolog) there may be limitations on how accurate the readings are at high resistivity values.

The dataset we are working with contains electromagnetic propagation resistivity measurements, and we will apply the following cutoffs. These limits are not specific to any tool, and will vary depending on the data and technology used.

  • RACEHM > 60
  • RACELM > 100
  • RPCEHM > 100
  • RPCELM > 200

Any rows that contain resistivity measurements above these values will be removed. If we only removed the data values, we would have issues with missing data later on.

We can do this by:

df = df.loc[~((df.RACEHM > 60) | (df.RACELM > 100) | (df.RPCEHM > 100) | (df.RPCELM > 200)),:]
df.describe()

When we return the dataframe summary, we can see that the number of measurements for these resistivity curves has reduced from 27,845 to 23,371.

Table: Dataframe summary after removing the high resistivity values

Converting Resistivity Curves to a Normal Distribution

As the resistivity curves have a large range of values, from 10s of ohmm to 1000s of ohmm, and often exhibit a skewed distribution, it is best to transform the measurements towards a more normal distribution by taking the log base 10 (log10) of the values.

# Select all resistivity curves
res_curves = ['RACEHM', 'RACELM', 'RPCEHM', 'RPCELM']

# Loop through each curve, transform it, and drop the original column
for res in res_curves:
    df[f'{res}_l10'] = np.log10(df[res])
    df.drop(columns=[res], inplace=True)

df.head()
Table: First five rows of the dataframe showing the new log10 resistivity columns

From the above table, we can now see that the new columns have been added to the dataframe and the old ones have been removed.

Identifying Outliers with Boxplots

A boxplot is a graphical and standardised way to display the distribution of data based on five key numbers: the “minimum”, the 1st quartile (25th percentile), the median (2nd quartile / 50th percentile), the 3rd quartile (75th percentile), and the “maximum”. The minimum and maximum values are defined as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR respectively. Any points that fall outside of these limits are referred to as outliers.

Figure: Anatomy of a boxplot showing the quartiles, interquartile range (IQR), and outlier limits

A video on my YouTube channel covers the background on boxplots and how to generate them in Python.

The simplest way to call a boxplot is using .plot(kind='box') on the dataframe. However, this plots all of the columns on one scale, and if the curves have very different ranges of values, such as from 0 to 1 and from 3000 to 5000, the smaller measurements will be hard to distinguish.

df.plot(kind='box')
plt.show()
Figure: Default pandas boxplot with all columns drawn on a single scale

To make the plots easier to view and understand, we can call upon the boxplot from the seaborn library and loop over each of the columns in the dataframe.


Once we have made the function we can create a list of the columns and “pop-off” or remove the well name column, which contains string data. We can then run the function for each of the log curves within the dataframe.

Figure: Seaborn boxplots generated for each well log curve

From the generated plot, we can see that a few of the measurements may contain outliers, which are highlighted in green. Based on the boxplot alone, it is not easy to tell if these points are real outliers. This is where we need to rely on domain knowledge and other methods.

Identifying Outliers with Crossplots

When working with multiple variables, we can use crossplots to identify potential outliers. These involve plotting one logging measurement against another. A third logging measurement can be used to colour the plot and make outliers easier to identify.

In this example, we will use a function to make a scatter plot (crossplot) of density vs neutron porosity data, which will be coloured by caliper.

Figure: Density-neutron crossplot coloured by the caliper curve

On the returned plot, we can see a few points highlighted in orange/red, which indicate possible washout if we assume an 8.5 inch bit size.

This indicates that a few measurements may be impacted by badhole conditions. Many logging tools are capable of compensating for a certain degree of washout and rugosity.

For the purposes of this example, and to illustrate the process of dealing with data points impacted by washout, we will remove any points where the caliper reads over 9″, which is 0.5″ over gauge. Any points less than 8.5″ will also be removed.

We can do this by:

df = df[(df['CALI'] >= 8.5) & (df['CALI'] <= 9)]
df.describe()
Table: Dataframe summary after removing the washed-out intervals

We can now see that we have reduced our dataset further to 22,402 depth samples, which is down from 27,845 in the initial dataset.

Identifying Outliers Using Unsupervised Machine Learning Methods

Both supervised and unsupervised machine learning methods can be used to identify outliers within well log data and petrophysical data. For the purposes of this tutorial, we will focus on a few unsupervised learning techniques.

Unsupervised machine learning models attempt to identify underlying relationships within the data without the need for labeled categories. There are a number of unsupervised machine learning methods that can be used to identify anomalies / outliers within a dataset.

In this article we will look at three common methods:

  • Isolation Forest (IF)
  • One Class SVM (SVM)
  • Local Outlier Factor (LOF)

Isolation Forest

The isolation forest method is based upon decision trees. This method selects a feature/measurement and makes a random split in the data between the minimum and maximum values. This process then carries on down the decision tree until all possible splits have been made in the data. Any anomalies/outliers will be split off early in the process making them easy to identify and isolate from the rest of the data.

The image below illustrates a very simple example using a single variable — bulk density (RHOB).

Figure: Simple isolation forest example splitting a single variable, bulk density (RHOB)

One Class SVM

Support vector machines are a common machine learning tool for classification, which means they are good for splitting data up into different groups based on the data characteristics. This is achieved by identifying the maximum margin hyperplane between the groups of data as seen in the simple multi-class example below.

Figure: Maximum margin hyperplane separating multiple classes of data

In a conventional SVM classification, we have more than one class or facies. However, when we only have one class of data we can use what is known as a One Class SVM.

In the case of outlier detection, we want to find the boundary that separates the data points from the origin (see the left-hand graph in the figure below), treating the origin as the second class. Any points that fall outside of the boundary line are considered outliers. We can control the position of the line by specifying how many outliers we expect within our dataset; this parameter is known as the contamination level.

But in reality, outliers can exist on any side in relation to the main data cloud. In this case, we want to find a non-linear hyperplane that separates the outliers from the main data points (see right-hand graph in figure below). We can use an RBF kernel (Radial Basis Function Kernel) that finds the non-linear boundary between the points.

We can control how many outliers are identified by adjusting the contamination level: a smaller value flags fewer points, while a larger value allows more points to be flagged.

Figure: One Class SVM boundaries, showing a linear separation from the origin (left) and a non-linear RBF boundary (right)

Local Outlier Factor

This method assesses the density of data points around a given point. Points that have a low density compared to others will be considered outliers. For further information on this method check out this link:

https://medium.com/mlpoint/local-outlier-factor-a-way-to-detect-outliers-dde335d77e1a

Figure: Local Outlier Factor comparing the density of points around each sample

Creating the Models

We can create the models very simply using the code below, which is done using the scikit learn library.


We can now check the performance of each of the models using density-neutron crossplots. This is achieved using Seaborn’s FacetGrid and mapping a scatter plot to it as follows.


This returns the following plots for each method and the number of anomalous data points that were identified.

Figures: Density-neutron crossplots showing the inliers and outliers identified by each method

It appears that the IF method provides the best result, followed by LOF and then SVM. Most of the outliers on the right-hand side of the plot are removed by the first two methods.

We can look at the data in more detail for each of the wells and each of the methods.


Which returns the following plots.

Figure: Density-neutron crossplots split by well for each outlier detection method

This gives us a better idea of how the anomalies have been identified in each of the wells using the selected methods. We can see that the LOF method has highlighted a number of points within the centre of the data points, which may not be true outliers. This is the point where domain expert knowledge comes into play in confirming if the points are indeed outliers.

Displaying Outlier Points on Log Plots

To identify where the outliers have been detected, we can generate simple log plots for each method and for each well using the code below.


Before plotting the data, we can make things easier by splitting the dataframe into multiple dataframes based on the well name.


Once the dataframe has been split up by well, we can use the index number of the list to access the required data.

If we loop through the well names list we can get the index number and the associated well name.


Which returns:

Wellname        Index
15/9-F-1 A      0
15/9-F-1 B      1
15/9-F-1 C      2
15/9-F-11 A     3
15/9-F-11 B     4

We can then use this to select the required well.

Figure: Log plots for the selected well with the detected anomalies highlighted

The plots above show us where the anomalies/potential outliers exist on a conventional log plot. The interval from 3700 to 3875 is an interval that contains missing values. As we are using a line plot the points on either side of this gap are interpolated to create a line.

The highlighted intervals will need to be evaluated in closer detail and with domain knowledge, but for the purposes of this tutorial, we will go with the results of the Isolation Forest method.

Exporting the Results

Now that we have a clean dataset, we can export it to the required files for our machine learning stage.

To do this, we first need to create our temporary dataframe where we only use data that has been identified as inliers by the Isolation Forest algorithm. Once these points have been removed, we can then create our output dataframe.


Creating Supervised Learning Files

To prepare the Supervised Learning file, we need to carry out a few steps.

Train, Validate, and Test Split

Before exporting our data, we first need to split the data into training, validation, and test datasets.

The training dataset is used for training the model, the validation dataset is used to tune the model parameters, and the test dataset is used to verify the model on unseen data. It should be noted that these terms are sometimes used interchangeably, which can cause confusion.

Test Data Separation

First, we will split off one well (15/9-F-11 B) for our testing dataset; the remainder of the wells and data will be assigned to training_val_wells.


Training and Validation Datasets

Next, we split our training_val_wells data up into the training features and target features (SW, PHIF & VSH). In situations where we are using Sklearn for prediction, we would typically only specify one feature for prediction. However, for the example, we will be setting three target features.


When we check the training features (X), we can see we have all of the required columns.

X.head()
Table: First five rows of the training features (X)

We can also check the head of the target features (y):

y.head()
Table: First five rows of the target features (y)

Train-Test Split

To carry out the split, we can use Sklearn’s train_test_split function. For this example, we will use a training size of 70%, which leaves 30% for validating and tuning the model. random_state has been set to a fixed value to allow reproducible results. If you want to randomise the split each time, you can remove this argument.

It should be noted that the method used here is called train_test_split. But what we are actually doing is splitting off the training and validation datasets.


Summary

In this article we have covered what outliers are and methods for identifying them using plots and unsupervised learning algorithms. Once the outliers have been identified they can be removed. This is done using a combination of the methods and domain expertise. If points are thrown away blindly, you could be throwing away valuable data. So always check that your identified outliers are really outliers.

This notebook was originally published for the SPWLA Machine Learning Workshop at the 2021 SPWLA Conference, and has since been modified.

Thanks for reading!

If you have found this article useful, please feel free to check out my other articles looking at various aspects of Python and well log data. You can also find my code used in this article and others at GitHub.

If you want to get in touch you can find me on LinkedIn or at my website.

Interested in learning more about python and well log data or petrophysics? Follow me on Medium.

If you enjoy reading these tutorials and want to support me as a writer and creator, then please consider signing up to become a Medium member. It’s $5 a month and you get unlimited access to many thousands of articles on a wide range of topics. If you sign up using my link, I will earn a small commission with no extra cost to you!
