
How to Create a Simple Neural Network Model in Python

Neural Networks are a popular (mostly) supervised machine learning algorithm. They can be used for modelling a variety of complicated tasks such as image processing, fraud detection, speech processing, and more. These algorithms can be applied to regression-based problems as well as classification-based problems.

Within petrophysics and geoscience, we can use Neural Networks to predict missing log measurements, create synthetic curves or create continuous curves from discretely sampled data.

In this article, I will show you how to create a simple Artificial Neural Network model using scikit-learn. We will be applying the model to the task of predicting a logging measurement that is commonly absent from well measurements.

What is an Artificial Neural Network?

Neural Networks, or Artificial Neural Networks (ANNs) as they are sometimes called, are formed from a series of functions which have been inspired by the way the human brain solves problems.

They “learn”, or rather are trained, to identify patterns within the data, given a known target variable and a series of known inputs. ANNs are composed of multiple layers containing nodes.

Typically, there is:

  • A single input layer, which contains the features that the model is trained on and applied to
  • One or more hidden layers, which exist between the input and output layers
  • A single output layer, which contains our target variable(s)
A single layer neural network model that takes in multiple logging measurements and predicts a single continuous target variable.
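To make the layer structure described above more concrete, here is a minimal NumPy sketch of a forward pass through a network with one hidden layer. The weights are random and untrained, and the layer sizes and input values are made up purely for illustration. This is not how scikit-learn implements its networks internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input layer: 5 samples with 4 features each (made-up values standing in
# for measurements such as RHOB, GR, NPHI and PEF)
X = rng.normal(size=(5, 4))

# Hypothetical, randomly initialised weights and biases (untrained)
W_hidden = rng.normal(size=(4, 3))  # input layer (4 nodes) -> hidden layer (3 nodes)
b_hidden = np.zeros(3)
W_out = rng.normal(size=(3, 1))     # hidden layer (3 nodes) -> output layer (1 node)
b_out = np.zeros(1)


def relu(z):
    """Rectified linear unit: negative values become zero."""
    return np.maximum(z, 0.0)


# Hidden layer: weighted sum of the inputs followed by the activation function
hidden = relu(X @ W_hidden + b_hidden)

# Output layer: a single continuous value per sample
prediction = (hidden @ W_out + b_out).ravel()

print(hidden.shape)      # (5, 3)
print(prediction.shape)  # (5,)
```

Training a network amounts to adjusting the weights and biases so that the predictions move closer to the known target values.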

If you want to find out more about how Artificial Neural Networks work, I would recommend exploring the article below.

Implementing an Artificial Neural Network in Python using Scikit-Learn

Importing Python Libraries

Before we begin our Artificial Neural Network Python tutorial, we first need to import the libraries and modules that we are going to require.

  • pandas: used to load data in from a CSV file
  • matplotlib: used to create graphs of the data

Then, from Scikit-Learn, we will be importing the following modules:

  • train_test_split from model_selection: used to split our data into training and validation datasets
  • MLPRegressor from neural_network: this is the Neural Network algorithm we will be using
  • StandardScaler from preprocessing: used to standardise our data so that they are similarly scaled
  • metrics: used to assess our model performance
import pandas as pd
import matplotlib.pyplot as plt

# scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

Loading Well Log Data

Data Source

The data used within this tutorial is a subset of the Volve Dataset that was released by Equinor in 2018. Full details of the dataset, including the licence, can be found at the link below.

https://www.equinor.com/energy/volve-data-sharing

The Volve data licence is based on the CC BY 4.0 licence. Full details of the licence agreement can be found here:

https://cdn.sanity.io/files/h61q9gi9/global/de6532f6134b9a953f6c41bac47a0c055a3712d3.pdf?equinor-hrs-terms-and-conditions-for-licence-to-data-volve.pdf

Using Pandas to Load the Well Log Data

Once the libraries have been imported we can move on to importing our data. This is done by calling pd.read_csv() and passing in the location of our raw data file.

As the CSV file contains numerous columns, we can pass in a list of names to the usecols parameter, as we only want to use a small selection for this tutorial.

df = pd.read_csv('Data/Volve/volve_wells.csv', 
                usecols=['WELL', 'DEPTH', 'RHOB', 'GR', 'NPHI', 'PEF', 'DT'])
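As a quick illustration of how usecols behaves, the sketch below reads a small in-memory CSV rather than the real Volve file. The rows and the extra CALI column are made up for the example.

```python
import io

import pandas as pd

# Made-up CSV content standing in for the real file; the CALI column is
# hypothetical and is included only to show that usecols filters it out
csv_text = """WELL,DEPTH,RHOB,GR,NPHI,PEF,DT,CALI
15/9-F-1 A,3000.0,2.45,55.1,0.21,6.2,82.3,8.5
15/9-F-1 A,3000.1,2.47,,0.20,6.1,81.9,8.5
"""

wanted = ['WELL', 'DEPTH', 'RHOB', 'GR', 'NPHI', 'PEF', 'DT']
demo_df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# Only the requested columns are loaded; the empty GR cell becomes NaN
print(demo_df.columns.tolist())
print(demo_df.dtypes)
```

Checking the column names and data types straight after loading is a cheap way to catch misspelt column names or numeric columns that have been read in as text.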

Data Preprocessing

The workflow for preprocessing data prior to running it through a machine learning algorithm will vary. For this tutorial we are going to:

  • remove missing values
  • split data into training, validation and testing datasets
  • standardise the range of values for each measurement

Dropping Missing Values

Missing data is one of the most common issues we face when working with real-world data. It can be missing for a variety of reasons including:

  • sensor errors
  • human error
  • processing errors

For more information on identifying and dealing with missing data, see the following article:

https://towardsdatascience.com/identifying-and-handling-missing-well-log-data-prior-to-machine-learning-5fa1a3d0eb73

For this tutorial, we are going to remove the rows that contain missing values. This is known as listwise deletion and is the quickest way to deal with the missing values. However, doing this reduces the size of the available dataset, and the cause and extent of the missing values should be fully understood before carrying on with a machine-learning model.

To drop the missing values, we can use the pandas dropna() method and assign the result back to the df (dataframe) variable.

df = df.dropna()
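Before dropping rows, it can be worth checking how much data will be lost. The sketch below uses a small made-up dataframe to count missing values per column and to report how many rows listwise deletion removes.

```python
import numpy as np
import pandas as pd

# Small made-up dataframe with gaps in two of the columns
demo_df = pd.DataFrame({
    'RHOB': [2.45, np.nan, 2.50, 2.48],
    'GR': [55.0, 60.0, np.nan, 58.0],
    'DT': [82.0, 85.0, 80.0, 81.0],
})

# Number of missing values in each column
missing_per_column = demo_df.isna().sum()

# Listwise deletion: any row with at least one missing value is removed
rows_before = len(demo_df)
cleaned = demo_df.dropna()
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Here two of the four rows are lost even though each column is only missing a single value, which shows how quickly listwise deletion can shrink a dataset.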

Splitting Data into Training, Testing and Validation Datasets

When carrying out machine learning, we often split our data into multiple subsets for training, validation and testing.

One thing to note is that the terminology for testing and validation datasets can vary between articles, websites and videos. The definitions used here are illustrated and described as follows:

Examples of splitting data into training, validation and testing subsets. Image by author and from McDonald, 2021.

Training Dataset: Data used for training the model

Validation Dataset: Data used for validating the model and tuning the parameters.

Testing Dataset: Data set aside to test the final model on unseen data. This subset allows us to understand how well our model can generalise to new data.

For this tutorial, our dataset contains three separate wells, so we will split one off (15/9-F-1 B) as our testing dataset. This is often referred to as a blind test well. The other two wells will be used to train, validate and tune our model.

We first define lists containing the well names for each subset. We can then create two new dataframes by using isin() to check whether the well name on each row of the main dataframe (df) appears in the relevant list.

# Training Wells
training_wells = ['15/9-F-11 A', '15/9-F-1 A']

# Test Well
test_well = ['15/9-F-1 B']

# Create training and testing dataframes
train_val_df = df[df['WELL'].isin(training_wells)].copy()
test_df = df[df['WELL'].isin(test_well)].copy()

Once we have run the above code, we can view the statistics of the subsets using the describe() method.

train_val_df.describe()
Dataframe statistics of the training and validation subset containing two wells’ worth of data from the Volve field.

We can see that we have 21,688 rows of data to train and validate our model with.

We can repeat this with the testing dataset:

test_df.describe()
Dataframe statistics of the testing subset containing one well’s worth of data from the Volve field.
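As a sanity check on a well-based split, we can confirm that no well appears in both subsets and that no rows have been lost. The sketch below does this on a tiny made-up dataframe that reuses the well names from this tutorial.

```python
import pandas as pd

# Made-up rows; only the well names match the tutorial
demo_df = pd.DataFrame({
    'WELL': ['15/9-F-11 A', '15/9-F-1 A', '15/9-F-1 B', '15/9-F-1 B'],
    'DT': [80.0, 82.0, 90.0, 91.0],
})

training_wells = ['15/9-F-11 A', '15/9-F-1 A']
test_well = ['15/9-F-1 B']

train_val_demo = demo_df[demo_df['WELL'].isin(training_wells)].copy()
test_demo = demo_df[demo_df['WELL'].isin(test_well)].copy()

# No well should appear in both subsets, otherwise the test is not blind
shared_wells = set(train_val_demo['WELL']) & set(test_demo['WELL'])
print(f"Wells shared between subsets: {len(shared_wells)}")

# Every row should end up in exactly one subset
print(len(train_val_demo) + len(test_demo) == len(demo_df))

# Row counts per well
print(demo_df['WELL'].value_counts())
```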

Creating the Training and Validation Subsets

The next step is to further subdivide our train_val_df into the training and validation subsets.

To do this we first split our data up into features we are going to use for training (X) and our target feature (y). We then call upon the train_test_split() function to split our data.

Within this function, we pass in our X and y variables, along with the test_size parameter, which sets the fraction of the data to hold out. This is entered as a decimal value between 0 and 1. Although the parameter is named test_size, the held-out portion here serves as our validation dataset.

In this case, we have used 0.2, which means our validation dataset will be 20% and our training dataset will be 80% of train_val_df.

# Setup the columns for training and target features
X = train_val_df[['RHOB', 'GR', 'NPHI', 'PEF']]
y = train_val_df['DT']

# Split the data into training and validation datasets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
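The call above produces a different random split each time it is run. If a repeatable split is wanted, train_test_split accepts a random_state argument. The sketch below uses made-up arrays to show the resulting shapes and that the same seed reproduces the same split. The seed value of 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 4 features each
X_demo = np.arange(40).reshape(10, 4)
y_demo = np.arange(10)

# An 80/20 split with a fixed seed
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_va.shape)  # (8, 4) (2, 4)

# Repeating the call with the same seed gives an identical split
X_tr2, X_va2, y_tr2, y_va2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_va, X_va2))  # True
```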

Standardising Values

When working with measurements that have different scales and ranges, it is important to standardise them. This helps to reduce model training times and stops features with large numeric ranges from dominating models that rely on gradient-based or distance-based calculations.

Standardising the data essentially involves calculating the mean of a feature, subtracting it from each data point and then dividing by the feature’s standard deviation.

Within scikit-learn we can use the StandardScaler class to transform our data.

First, we fit the StandardScaler to the training data and transform that data in a single step using the fit_transform method.

When it comes to the validation data, we don’t want to refit the StandardScaler, as that would let information from the validation data leak into the preprocessing. Instead, we just apply the scaler that was fitted on the training data. This is done using the transform method.

scaler = StandardScaler()

# Fit the StandardScaler to the training data
X_train = scaler.fit_transform(X_train)

# Apply the StandardScaler, but not fit, to the validation data
X_val = scaler.transform(X_val)
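To show what the scaler is doing, the sketch below repeats the calculation by hand on a few made-up values and compares the result with StandardScaler. Note that the validation data is scaled using the mean and standard deviation of the training data, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features with very different ranges
X_train_demo = np.array([[2.40, 50.0], [2.50, 70.0], [2.60, 90.0]])
X_val_demo = np.array([[2.55, 60.0]])

scaler_demo = StandardScaler()
scaled_train = scaler_demo.fit_transform(X_train_demo)
scaled_val = scaler_demo.transform(X_val_demo)

# Manual equivalent: subtract the training mean, divide by the training
# standard deviation (population standard deviation, ddof=0)
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)
manual_train = (X_train_demo - mean) / std
manual_val = (X_val_demo - mean) / std

print(np.allclose(scaled_train, manual_train))  # True
print(np.allclose(scaled_val, manual_val))      # True
```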

Building the Neural Network Model

Training the Model

To begin the Neural Network training process, we first have to create an instance of the MLPRegressor that we imported at the start.

When we call upon the MLPRegressor we can specify numerous parameters. You can find out more about these in the scikit-learn documentation. For this tutorial we will be using:

  • hidden_layer_sizes: Controls the architecture of the network. Each value in the tuple is the number of nodes in one hidden layer.
  • activation: Hidden layer activation function.
  • random_state: When an integer is used, this allows the model to create reproducible results, as it controls the random number generation for the weights and biases.
  • max_iter: Controls the maximum number of iterations that the model will run for if convergence is not reached beforehand.
model = MLPRegressor(hidden_layer_sizes=(64, 64, 64),
                     activation="relu",
                     random_state=42, max_iter=2000)

After initialising the model, we can train it with the training data using fit(), and then use the predict() method to make predictions on the validation data.

model.fit(X_train, y_train)

# Predict on the validation data
y_pred = model.predict(X_val)
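An alternative way to organise the scaling and training steps is to chain them in a scikit-learn Pipeline, so that the scaler is always fitted on the training data only. The sketch below is not the approach used in this tutorial. It runs on made-up data with a smaller, arbitrarily chosen network so that it finishes quickly, and it also shows how to check how many iterations the training took.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 features and a target that depends on two of them
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 2] + rng.normal(scale=0.1, size=300)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling and the network are chained into a single estimator
pipe = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16, 16), activation="relu",
                 random_state=42, max_iter=2000),
)
pipe.fit(X_tr, y_tr)
y_va_pred = pipe.predict(X_va)

# Inspect the fitted network: iterations used and the training loss history
mlp = pipe.named_steps["mlpregressor"]
print(f"Iterations run: {mlp.n_iter_}")
print(f"Final training loss: {mlp.loss_curve_[-1]:.4f}")
```

If the number of iterations run equals max_iter, the model stopped before converging, and increasing max_iter or revisiting the other parameters may be worthwhile.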

Validating Model Results

Now that our model has been trained, we can begin evaluating the model’s performance on our validation dataset.

It is at this stage we can tweak our model and optimise it.

There are multiple statistical measurements we can use to measure how well our model has performed. For this tutorial we will be using the following three metrics:

Mean Absolute Error (MAE): Provides a measure of the absolute differences between the predicted value and the actual value.

Mean Absolute Error formula for analysing machine learning results.

Root Mean Square Error (RMSE): Indicates the magnitude of the prediction error.

To calculate RMSE using scikit-learn we first need to calculate the mean squared error and then take the square root of it, which can be achieved by raising the mse to the power of 0.5.

Root Mean Square Error Formula for assessing machine learning prediction results

Coefficient of Determination (R2): Indicates the proportion of the variance in the target variable that is explained by the model. The closer the value is to 1, the better the predictions match the actual values.

We can calculate the above metrics as follows:

mae = metrics.mean_absolute_error(y_val, y_pred)

mse = metrics.mean_squared_error(y_val, y_pred)
rmse = mse**0.5 

r2 = metrics.r2_score(y_val, y_pred)

print(f"""
MAE: \t{mae:.2f}
RMSE: \t{rmse:.2f}
r2: \t{r2:.2f}
""")

When we execute the above code we get the following results back. Based on these numbers we can determine if our model is performing well or if it needs tweaking.

Metric values for validation data prediction.
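To see what the three metrics are calculating, the sketch below works them out by hand with NumPy on a handful of made-up values and checks the results against scikit-learn. MAE is the mean of the absolute errors, RMSE is the square root of the mean of the squared errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values
y_true = np.array([80.0, 85.0, 90.0, 100.0])
y_hat = np.array([82.0, 84.0, 93.0, 96.0])

errors = y_true - y_hat  # [-2, 1, -3, 4]

# Mean Absolute Error: (2 + 1 + 3 + 4) / 4 = 2.5
mae_manual = np.mean(np.abs(errors))

# Root Mean Square Error: sqrt((4 + 1 + 9 + 16) / 4) = sqrt(7.5)
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Compare with the scikit-learn results
print(np.isclose(mae_manual, metrics.mean_absolute_error(y_true, y_hat)))
print(np.isclose(rmse_manual, metrics.mean_squared_error(y_true, y_hat) ** 0.5))
print(np.isclose(r2_manual, metrics.r2_score(y_true, y_hat)))
```

MAE and RMSE are in the same units as the target, so for DT they can be read directly as an error in slowness units, whereas R2 is unitless.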

Going Beyond Metrics

Simple metrics like the above are a nice way to see how a model has performed, but you should always check the actual data.

One way to do this is to use a scatter plot with the validation data on the x-axis, and the predicted data on the y-axis. To help with the visualisation we can add a 1-to-1 relationship line.

The code to do this is as follows.

plt.scatter(y_val, y_pred)
plt.xlim(40, 140)
plt.ylim(40, 140)
plt.ylabel('Predicted DT')
plt.xlabel('Actual DT')
plt.plot([40, 140], [40, 140], 'black')  # 1-to-1 line
plt.show()

When we run the code above, we get back the following plot which shows us we have a reasonably good trend between the actual measurement and the predicted result.

Actual acoustic compressional slowness values versus predicted values for the Volve dataset.

Testing the Model on Unseen Data

Once we have finalised our model, we can finally test it out on the data we set aside for blind testing.

First, we will create the features we will use for applying the model. Then we will apply the StandardScaler we fitted earlier in order to standardise our values.

Next, we will assign a new column to our dataframe for our predicted data.

test_well_x = test_df[['RHOB', 'GR', 'NPHI', 'PEF']]

test_well_x = scaler.transform(test_well_x)

test_df['TEST_DT'] = model.predict(test_well_x)

Once the prediction has been made, we can view the same scatter plot as above.

plt.scatter(test_df['DT'], test_df['TEST_DT'])
plt.xlim(40, 140)
plt.ylim(40, 140)
plt.ylabel('Predicted DT')
plt.xlabel('Actual DT')
plt.plot([40, 140], [40, 140], 'black')  # 1-to-1 line
plt.show()

Within petrophysics and geoscience we often look at data on log plots, where measurements are plotted against depth. We can create a simple log plot of our predicted result and actual measurement within the test well like so.


plt.figure(figsize=(12, 4))
plt.plot(test_df['DEPTH'], test_df['DT'], label='Actual DT')
plt.plot(test_df['DEPTH'], test_df['TEST_DT'], label='Predicted DT')

plt.xlabel('Depth (m)', fontsize=14, fontweight='bold')
plt.ylabel('DT', fontsize=14,fontweight='bold')

plt.ylim(40, 140)
plt.legend(fontsize=14)
plt.grid()
plt.show()

This returns the following plot.

We can see our model has performed well on the unseen data. However, there are a few areas where the result does not match the true measurement, notably between 3100 and 3250 m.

This tells us that our model may not have enough training data covering these intervals, and as a result, we may need to acquire more data if it is available.

Line plot (log plot) of our predicted measurement against our actual measurement.
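Mismatched intervals like this can also be located numerically rather than by eye. The sketch below is one possible approach: it bins the absolute error by depth and averages it within each bin. The data is made up, with an artificial 8-unit offset added between 3100 and 3250 m, and the 50 m bin width is an arbitrary choice. The column names mirror those used in test_df.

```python
import numpy as np
import pandas as pd

# Made-up depth, actual and predicted values
depth = np.arange(3000.0, 3300.0, 10.0)
actual = np.full(depth.shape, 80.0)
predicted = actual.copy()

# Artificial mismatch between 3100 and 3250 m
predicted[(depth >= 3100) & (depth < 3250)] += 8.0

demo = pd.DataFrame({'DEPTH': depth, 'DT': actual, 'TEST_DT': predicted})
demo['ABS_ERROR'] = (demo['DT'] - demo['TEST_DT']).abs()

# Group the samples into 50 m depth intervals: [3000, 3050), [3050, 3100), ...
bins = np.arange(3000, 3301, 50)
demo['INTERVAL'] = pd.cut(demo['DEPTH'], bins=bins, right=False)

# Mean absolute error within each interval
error_by_interval = demo.groupby('INTERVAL', observed=True)['ABS_ERROR'].mean()
print(error_by_interval)
```

Intervals with a noticeably higher mean error are candidates for closer inspection, for example by comparing the input measurements there with the ranges seen in the training wells.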

If you want to see how this model compares to the results of a Random Forest model, check out the article below:

https://towardsdatascience.com/random-forest-regression-for-continuous-well-log-prediction-61d3ec1c683a

Summary

Artificial Neural Networks are a popular machine learning technique. Within this tutorial, we have covered a very quick and easy way to implement a model for predicting acoustic compressional slowness that yields reasonable results. We have also seen how to validate and test our model, which is an important part of the process.

There are many other ways to build a neural network within Python, such as TensorFlow and Keras. However, scikit-learn provides a quick and easy-to-use tool to get started right away.
