k-Nearest Neighbors (kNN) is a popular non-parametric supervised machine learning algorithm that can be applied to both classification and regression problems. It is easy to understand and easy to implement in Python, which makes it a great algorithm to learn at the start of your machine-learning journey.
Within this article, we will cover how the kNN algorithm works and how to apply it to well log data using Python’s Scikit-Learn library.
How does the kNN Algorithm Work?
Classifying data is one of the main applications of machine learning. As a result, there are numerous algorithms available. The kNN algorithm is just one of these.
The idea behind kNN is pretty simple. Points that are near each other are assumed to be similar.
When a new data point is introduced to a trained dataset, the following steps occur:
- Determine a value for k, the number of neighboring points used to classify the new data point
- Calculate the distance (Euclidean or Manhattan) between the data point to be classified and every point in the training data
- Identify the k nearest neighbors, i.e. the k training points with the smallest distances
- Amongst these k nearest neighbors, count the number of data points in each class
- Using majority voting, assign the new data point to the class that occurs the most
The simple example below shows this process where we assume k is 3 and the nearest points are all a single class.
In the case where the k-nearest neighbors are a mixture of classes, we can use majority voting as illustrated below.
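The steps listed above can also be sketched in a few lines of Python. This is only a minimal illustration, not how Scikit-Learn implements kNN internally: the function name knn_predict and the toy points are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the class labels among those neighbours and take the most common
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters (values are made up for illustration)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # A
print(knn_predict(X_train, y_train, np.array([4.0, 4.0]), k=3))  # B
```

Note that every prediction computes the distance to all training points, which is why classification can become slow on large datasets.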
Applications of k-Nearest Neighbors (kNN)
- Recommender Systems
- Pattern detection, e.g. fraud detection
- Text mining
- Climate forecasting
- Credit rating analysis
- Medical Classification
- Lithology prediction
Advantages of k-Nearest Neighbors (kNN)
- Simple and easy to understand
- Easy to implement with Python using Scikit-Learn
- Can be fast to work on small datasets
- No need to tune multiple parameters
- No need to make assumptions about the data
- Can be applied to binary and multi-class problems
Disadvantages of k-Nearest Neighbors (kNN)
- Classification with large datasets can be slow
- Impacted by the curse of dimensionality — as the number of features increases the algorithm may struggle to make accurate predictions
- Can be sensitive to the scale of the data, i.e. features measured using different units
- Impacted by noise and outliers
- Sensitive to imbalanced datasets
- Missing values need to be handled prior to using the algorithm
kNN Implementation with Scikit-Learn to Classify Facies
Importing the Required Libraries
For this tutorial, we require a number of Python libraries and modules.
First, we will import pandas as pd. This library allows us to load data from csv files and store that data in memory for later use.
Then we have a number of modules from the Scikit-Learn library:
- KNeighborsClassifier for carrying out the kNN classification
- train_test_split for splitting up our data into training and testing datasets
- StandardScaler for standardising the scales of the features
- accuracy_score, classification_report and confusion_matrix for assessing model performance
Finally, to visualise our data we will be using a mixture of matplotlib and seaborn.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
Importing the Required Data
The next step is to load our data.
The dataset we are using for this tutorial is a subset of a training dataset used as part of a Machine Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020). It is released under a NLOD 2.0 licence from the Norwegian Government, details of which can be found here: Norwegian Licence for Open Government Data (NLOD) 2.0.
The full dataset can be accessed at the following link: https://doi.org/10.5281/zenodo.4351155.
To read the data, we can call upon pd.read_csv() and pass in the relative location of the training file.
df = pd.read_csv('Data/Xeek_train_subset_clean.csv')
Once the data has been loaded, we can call upon the describe() method to view summary statistics (count, mean, standard deviation, minimum, maximum and quartiles) for the numeric columns within the dataset. This provides us with an overview of the features.
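As a small sketch of what this summary looks like, the snippet below calls describe() on a tiny made-up DataFrame. The column names mirror well log curves but the values are hypothetical and are not taken from the competition dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the well log data: two curves, one missing value
demo = pd.DataFrame({
    'GR': [45.0, 80.0, 120.0, 60.0],
    'RHOB': [2.65, 2.40, np.nan, 2.55],
    'LITH': ['Sandstone', 'Shale', 'Shale', 'Sandstone'],
})

# By default only the numeric columns are summarised
summary = demo.describe()
print(summary)

# 'count' ignores NaN, so a lower count flags a column with missing values
print(summary.loc['count'])
```

Comparing the count row across columns is a quick first check for missing data before any rows are removed.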
Dealing With Missing Data
Before we proceed with the kNN algorithm, we first need to carry out some data preparation.
As the kNN algorithm doesn’t handle missing values, we need to deal with these first. The simplest way to do that is to carry out listwise deletion, which deletes a row if any of the features within that row has a missing value.
Even though this method seems a quick solution, it can reduce the size of your dataset significantly.
It is highly recommended that you carry out a full analysis of your dataset to understand the cause of the missing data and whether it can be repaired.
df = df.dropna()
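Before committing to listwise deletion, it can be worth checking how many rows it would remove. The sketch below does this on a small hypothetical DataFrame; the values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two different curves
demo = pd.DataFrame({
    'GR': [45.0, np.nan, 120.0, 60.0, 75.0],
    'RHOB': [2.65, 2.40, np.nan, 2.55, 2.50],
    'NPHI': [0.20, 0.35, 0.30, 0.25, 0.28],
})

# Missing values per column
print(demo.isna().sum())

# Listwise deletion drops a row if any column in it is missing
cleaned = demo.dropna()
lost = len(demo) - len(cleaned)
print(f"Rows before: {len(demo)}, after: {len(cleaned)}, lost: {lost}")
```

Here only one value is missing in each of two columns, yet two of the five rows are lost, which shows how quickly scattered gaps can shrink a dataset.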
Selecting Training and Test Features
Next, we need to select what features will be used to build the kNN model and what feature will be our target feature.
For this example, I am using a series of well logging measurements for building the model, and a lithology description as the target feature.
# Select inputs and target
X = df[['RDEP', 'RHOB', 'GR', 'NPHI', 'PEF', 'DTC']]
y = df['LITH']
As with any machine learning model, we need to split our data into a training set, which is used to train/build our model, and a test set, which is used to validate the performance of our model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
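The call above produces a different random split on every run. As an optional variation, and not something the original workflow uses, train_test_split also accepts random_state to make the split repeatable and stratify to keep the class proportions the same in both sets. The sketch below uses made-up labels to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 'Shale' and 20 'Sandstone'
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array(['Shale'] * 80 + ['Sandstone'] * 20)

# random_state makes the split repeatable; stratify keeps class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.unique(y_te, return_counts=True))
```

Stratifying can be helpful when some classes are rare, as it prevents a rare class from ending up almost entirely in one of the two sets.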
Standardising Feature Values
When working with measurements that have different scales and ranges, it is important to standardise them. This can help reduce training times for some models and, for models that rely on distance-based calculations such as kNN, it stops features with large numeric ranges from dominating the distance.
Standardising the data essentially involves calculating the mean of a feature, subtracting it from each data point and then dividing by the feature’s standard deviation.
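In other words, each value x becomes z = (x - mean) / std for its feature. The short sketch below checks this manual calculation against StandardScaler on a few made-up values; the numbers are hypothetical and chosen only so the two features have very different ranges.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature sample with very different ranges
X_demo = np.array([[20.0, 2.30],
                   [60.0, 2.50],
                   [100.0, 2.70]])

# Manual z-score: subtract the column mean, divide by the column std
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# StandardScaler uses the same population standard deviation (ddof=0)
scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both features have a mean of 0 and a standard deviation of 1, so neither dominates a distance calculation.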
First we fit the StandardScaler to the training data and transform that data in a single step using the fit_transform method.
When it comes to the test data, we don’t want to fit the StandardScaler to that data as we have already done it. Instead, we just want to apply it. This is done using the transform method.
It is important to note that the StandardScaler is being applied after the train test split and it is only being fitted to the training dataset. Once the Scaler model has been fitted, it is then applied to the test dataset. This helps prevent the leakage of data from the test dataset into the kNN model.
scaler = StandardScaler()
# Fit the StandardScaler to the training data and transform it
X_train = scaler.fit_transform(X_train)
# Apply the already fitted StandardScaler to the test data without refitting
X_test = scaler.transform(X_test)
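An alternative way to get the same protection against leakage, shown here only as an optional sketch, is to wrap the scaler and the classifier in a Scikit-Learn pipeline. The example uses synthetic data from make_classification rather than the well log dataset, and the variable names are chosen for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the well log dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

# The pipeline fits the scaler on the training data only, then the classifier
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_tr, y_tr)

# At prediction time the already fitted scaler is applied, never refitted
print(model.score(X_te, y_te))
```

With a pipeline the fit and transform steps cannot be applied to the wrong dataset by accident, which becomes more useful once cross-validation is involved.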
Building the kNN Classifier
When creating the KNeighborsClassifier we can specify a few parameters. Full details of these can be found in the Scikit-Learn documentation. Of course, we don’t have to supply anything and the default parameters will be used.
By default, the number of points used to classify new data points is set to 5. This means that the classes of the 5 closest points will be used to classify that new point.
clf = KNeighborsClassifier()
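To see what those defaults are, and how they could be changed, the sketch below inspects the parameters with get_params(). The alternative settings are arbitrary choices for illustration, not recommendations for this dataset.

```python
from sklearn.neighbors import KNeighborsClassifier

# Defaults: 5 neighbours, uniform votes, Minkowski metric with p=2 (Euclidean)
default_clf = KNeighborsClassifier()
params = default_clf.get_params()
print(params['n_neighbors'], params['weights'], params['metric'], params['p'])

# An alternative configuration: 7 neighbours, closer points count for more,
# and Manhattan distance (p=1)
custom_clf = KNeighborsClassifier(n_neighbors=7, weights='distance', p=1)
print(custom_clf.get_params()['n_neighbors'])
```

The value of n_neighbors corresponds to k in the description of the algorithm, and p switches between the Manhattan (p=1) and Euclidean (p=2) distances mentioned earlier.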
Once the classifier has been initialised, we next need to train the model using our training data (X_train and y_train). To do this, we call the fit method on clf and pass in our training data: clf.fit(X_train, y_train).
Making Predictions with the kNN Model
Once the model has been trained, we can now make predictions on our test data by calling upon the predict method from the classifier.
y_pred = clf.predict(X_test)
Assessing Model Performance
Using Model Accuracy
To understand how our model has performed on the test data we can use a number of metrics and tools.
If we want a quick assessment of how well our model has performed, we can call the accuracy_score function and pass in y_test and y_pred. This returns the fraction of predictions that were correct out of the total number of predictions.
This returns a value of 0.8918532439941167 and tells us that our model has predicted 89.2% of our labels correctly.
Be aware that this value may be misleading, especially if we are dealing with an imbalanced dataset. If there is a class that dominates, then this class has a higher chance of being predicted correctly compared to a minority class. The class that dominates will influence the accuracy score by making it higher and thus giving a false impression that our model has done a good job.
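The effect can be shown with a deliberately extreme, made-up example: a "model" that always predicts the dominant class. The labels below are hypothetical, and balanced_accuracy_score (the mean of the per-class recall values) is included only as one possible comparison.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical test labels: 90 'Shale' and 10 'Halite'
y_true = np.array(['Shale'] * 90 + ['Halite'] * 10)

# A useless "model" that always predicts the dominant class
y_always_shale = np.array(['Shale'] * 100)

acc = accuracy_score(y_true, y_always_shale)
bal_acc = balanced_accuracy_score(y_true, y_always_shale)

print(acc)      # 0.9
print(bal_acc)  # 0.5
```

An accuracy of 90% looks respectable even though not a single Halite sample was identified, which is why per-class metrics are worth checking as well.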
Using Classification Report
We can take our assessment further and look at the classification report. This provides additional metrics as well as an indication of how well each class was predicted.
The additional metrics are:
- precision: Of all the points predicted as a given class, the fraction that actually belong to that class. Values are between 0.0 and 1.0, with 1 being the best and 0 being the worst.
- recall: Of all the points that actually belong to a given class, the fraction that the classifier found. Values are also between 0.0 and 1.0.
- f1-score: Harmonic mean of precision and recall, with values between 1.0 (which is good) and 0.0 (which is poor).
- support: The total number of instances of that class within the true labels passed to the report (here, the test set).
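To make these definitions concrete, here is a small example that can be checked by hand. The six labels are invented for illustration, and precision_recall_fscore_support is used alongside classification_report simply to expose the raw numbers.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Hypothetical labels, small enough to check by hand
y_true = ['Shale', 'Shale', 'Shale', 'Shale', 'Sandstone', 'Sandstone']
y_hat = ['Shale', 'Shale', 'Shale', 'Sandstone', 'Sandstone', 'Shale']

# For 'Sandstone': 2 predicted, 1 correct -> precision 1/2
#                  2 actual, 1 found      -> recall 1/2
# For 'Shale':     4 predicted, 3 correct -> precision 3/4
#                  4 actual, 3 found      -> recall 3/4
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_hat, labels=['Sandstone', 'Shale']
)
print(precision, recall, f1, support)

# The same numbers, formatted as a per-class table
print(classification_report(y_true, y_hat))
```

Because precision and recall are equal for each class in this toy case, the f1-score takes the same value (0.5 for Sandstone and 0.75 for Shale).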
To view the classification report we can call the classification_report function and pass in y_test and y_pred.
If we look at the results closely, we can see we are dealing with an imbalanced dataset. The Shale, Sandstone and Limestone classes dominate and, as a result, have relatively high precision and recall scores, whereas Halite, Tuff and Dolomite have relatively low precision and recall.
At this point, I would consider going back to the original dataset and identifying ways that I could deal with that imbalance. Doing so may improve the model’s performance, particularly for the minority classes.
Using a Confusion Matrix
We can use another tool to look at how well our model has performed and that is the confusion matrix. This tool provides a summary of how well our classification model has performed when making predictions for each class.
The generated confusion matrix has two axes. One axis contains the class that the model predicted, and the other axis contains the actual class label.
We can generate two versions of this within Python. The first is a simple printed readout of the confusion matrix, which can be hard to read or present to others. The second is a heatmap version generated using seaborn.
# Simple printed confusion matrix (rows are actual classes, columns are predicted)
cf_matrix = confusion_matrix(y_test, y_pred)
print(cf_matrix)
# Graphical version using seaborn and matplotlib
# Prepare the labels for the axes
# Note: confusion_matrix sorts classes by label value unless labels= is
# passed, so check that this list matches the row/column order
labels = ['Shale', 'Sandstone', 'Sandstone/Shale',
          'Limestone', 'Tuff', 'Marl', 'Anhydrite',
          'Dolomite', 'Chalk', 'Coal', 'Halite']
# Set up the figure
fig = plt.figure(figsize=(10,10))
ax = sns.heatmap(cf_matrix, annot=True, cmap='Reds', fmt='.0f',
                 xticklabels=labels,
                 yticklabels=labels)
ax.set_title('Seaborn Confusion Matrix with labels\n\n')
ax.set_xlabel('Predicted Values')
ax.set_ylabel('Actual Values')
When we run the above code we get the following printed table and plot.
The resulting confusion matrix provides us with an indication of what classes the model predicted correctly and incorrectly. We can start to identify any patterns where the model may be mispredicting lithologies.
For example, if we look at the Limestone class, we can see 2,613 points were predicted correctly; however, 185 were predicted as Chalk and 135 as Marl. Both of these lithologies are calcitic in nature and share similar properties with limestone. Therefore, we could go back and look at our features to determine if other features are required or if some need to be removed.
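The same kind of reading can be practised on a toy example. The sketch below uses ten made-up labels, not the tutorial's results, and passes labels= explicitly so that the row and column order is known.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for three carbonate-related classes
classes = ['Limestone', 'Chalk', 'Marl']
y_true = ['Limestone'] * 5 + ['Chalk'] * 3 + ['Marl'] * 2
y_hat = (['Limestone'] * 3 + ['Chalk'] + ['Marl']   # 5 true Limestone
         + ['Chalk'] * 3                            # 3 true Chalk
         + ['Marl'] + ['Limestone'])                # 2 true Marl

# labels= fixes the row/column order; rows are actual, columns are predicted
cm = confusion_matrix(y_true, y_hat, labels=classes)
print(cm)

# Row 0 is the actual Limestone class: diagonal = correct, rest = confusions
for name, count in zip(classes, cm[0]):
    print(f"Actual Limestone predicted as {name}: {count}")
```

Reading along a row shows where samples of one actual class ended up, while reading down a column shows which actual classes were predicted as that class.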
Summary
The k-Nearest Neighbors algorithm is a powerful, yet easy-to-understand, supervised machine learning algorithm that can be applied to classification-based problems, especially within the geoscience domain.
This tutorial has shown how we can take a series of pre-classified well log measurements and make predictions about new data. However, care should be taken when preprocessing the data and dealing with imbalanced datasets, which are common in subsurface applications.