How to Level up Your Pandas Skills in 5 Easy Ways

Pandas is a powerful and versatile Python library for data science. It is often one of the first libraries you come across when learning how to use Python for data science applications. It can be used to load data, visualise it and manipulate it to suit the objectives of the project you are working on.

However, many people don’t go beyond the basics of how to use the library and don’t take advantage of some of the more advanced and interesting features.

Within this article, we will go through 5 features that you may not have come across before that will make you more efficient when using the library.

Loading Libraries and Data

For all of the examples used within this article, the data that we will be using is a subset of well log data that was used as part of a Machine Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020). This data is publicly available here and licensed under Norwegian Licence for Open Government Data (NLOD) 2.0.

However, all methods and examples shown here can be applied to any dataset loaded into pandas.

To begin, we first need to import the pandas library, which is commonly shortened to pd to make life easier.

We will then read the csv file into a dataframe using the read_csv() function.

import pandas as pd
df = pd.read_csv('data/Xeek_train_subset_clean.csv')
df

When we view the dataframe we can see we have 12 columns of data containing a mixture of text and numeric values.

Dataframe of well log data from the Xeek Force 2020 Machine Learning competition. Image by the author.

With the data loaded, we can move on to the 5 ways you can level up your data science skills with pandas.

Changing the Default Pandas Plotting Library

When carrying out Exploratory Data Analysis, we often want to generate a quick plot of our data. You could build the plot directly with matplotlib; however, pandas lets us do it in a line or two of code.

Once you have the data loaded you can call upon the dataframe followed by the .plot method.

For example, we can take the GR column and plot a histogram like so:

df.GR.plot(kind='hist', bins=30)

This will return the following plot:

Gamma Ray histogram generated by pandas using the .plot() function. Image by Andy McDonald.

The image that is returned is very basic and lacks interactivity.

From pandas version 0.25 it is possible to change which plotting library is used. For example, instead of matplotlib, you can switch to plotly. Plotly allows the generation of very powerful and interactive data visualisations in an efficient way.

To change the default plotting library for your current session, you need to change the plotting.backend option for pandas. The plotly package must be installed for this to work.

pd.options.plotting.backend = "plotly"
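
If the requested backend cannot be loaded, setting the option raises an error. The following is a minimal sketch of one way to guard against that. The helper name use_backend and its fallback behaviour are illustrative assumptions and not part of pandas.

```python
import pandas as pd


def use_backend(name: str, fallback: str = "matplotlib") -> str:
    """Try to activate a plotting backend (hypothetical helper, not a pandas API)."""
    try:
        pd.set_option("plotting.backend", name)
    except (ImportError, ValueError):
        # pandas could not load the requested backend, so use the fallback.
        pd.set_option("plotting.backend", fallback)
    return pd.get_option("plotting.backend")


active_backend = use_backend("plotly")
print(f"Plotting backend in use: {active_backend}")
```

A single plot can also be switched without touching the session-wide option by passing backend="plotly" to the .plot call.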

Once you have done that, you can call the same code as before:

df.GR.plot(kind='hist', bins=30)

And the plot that is generated is now interactive and provides a much better user experience when exploring datasets.

Gamma Ray Histogram after changing the pandas backend plotting option to plotly. Image by Andy McDonald.

You can also use other types of plots, including scatter plots by changing the code around slightly like so:

df.plot(kind='scatter', x='NPHI', y='RHOB', color='GR')

When the code above is executed, you get back the following plot:

Interactive plotly scatterplot created after changing the pandas backend plotting option. Image by Andy McDonald

Chaining Operations

Chaining or joining multiple methods together is a long-practised programming technique that can improve code readability.

It is the process of calling methods on an object one after the other on a single line rather than applying the methods on the object separately. This can help with multiple stages of your process.

For example, if you want to load data, change the columns to lowercase and then drop missing values (NaNs) it could be done like this:

df = pd.read_csv('data/Xeek_train_subset_clean.csv')
df = df.rename(columns=str.lower)
df = df.dropna()

However, a more concise way would be to chain the operations like this:

df = pd.read_csv('data/Xeek_train_subset_clean.csv').rename(columns=str.lower).dropna()

Sometimes the line may become very long so you may want to make it more readable by splitting it over multiple lines.

This can be done using the line continuation character ( \ ) as suggested in this StackOverflow question.

df = pd.read_csv('data/Xeek_train_subset_clean.csv')\
.rename(columns=str.lower)\
.dropna()

Or by using parentheses as suggested in this other StackOverflow question, which removes the need for the line continuation character.

df = (pd.read_csv('data/Xeek_train_subset_clean.csv')
.rename(columns=str.lower)
.dropna())
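
As a rough illustration of the parenthesised style, here is a sketch on a small made-up dataframe. The values are invented stand-ins, not the competition data, and reset_index is an extra step added only to show how the chain grows one line per operation.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the well log data, with a few missing values.
raw = pd.DataFrame({
    "WELL": ["15/9-13", "15/9-13", "15/9-15", "15/9-15"],
    "GR": [85.0, np.nan, 120.5, 40.2],
    "RHOB": [2.45, 2.50, np.nan, 2.65],
})

clean = (
    raw
    .rename(columns=str.lower)   # lowercase the column names
    .dropna()                    # drop any row containing a NaN
    .reset_index(drop=True)      # renumber the remaining rows from 0
)
print(clean)
```

Each step sits on its own line, so a step can be commented out or reordered without rewriting the whole statement.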

query()

A common task that we often do when working with data is to filter it based on single or multiple conditions. You can do this using the following code:

df[df.GR > 100]

However, using the pandas query function produces a more readable piece of code, especially when things become a little more complex.

For example, if you want to find all rows where the GR (Gamma Ray) column contains values > 100 you can call upon the query method like so:

df.query("GR > 100")

Which returns the following dataframe object.

Dataframe after using the pandas query method to filter for rows where gamma ray is greater than 100 API. Image by Andy McDonald

You can also use logic if you want to combine multiple conditions.

df.query("GR > 100 and GR < 110")

Which returns an even smaller dataframe with just 7,763 rows.

Dataframe after using the pandas query method to filter for rows where gamma ray is greater than 100 API but less than 110 API. Image by Andy McDonald
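
The limits do not have to be typed into the string. As a minimal sketch on invented GR values, local Python variables can be referenced inside the expression with the @ prefix. The names gr_min and gr_max are illustrative.

```python
import pandas as pd

# Invented gamma ray readings for illustration only.
logs = pd.DataFrame({"GR": [45.0, 101.5, 109.9, 110.0, 150.2]})

gr_min = 100
gr_max = 110

# @name pulls the value of a local variable into the query expression.
by_query = logs.query("GR > @gr_min and GR < @gr_max")

# The equivalent boolean-mask version, for comparison.
by_mask = logs[(logs.GR > gr_min) & (logs.GR < gr_max)]

print(by_query)
```

Both approaches select the same rows; the query version keeps the condition readable as the number of clauses grows.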

If you want to look for a specific string value within this dataset, such as Anhydrite, you need to modify the query and call the string accessor methods inside the expression.

df.query("LITH.str.contains('Anhydrite')")

This will return any row where the LITH column contains the word Anhydrite

Dataframe after using the pandas query method to filter for rows where LITH contains Anhydrite. Image by Andy McDonald

This can also be used if the string contains special characters, such as the WELL column within our dataframe, which contains forward slashes and dashes:

df.query("WELL.str.contains('15/9-13') and GROUP.str.contains('ZECHSTEIN GP.')")
Dataframe after using the pandas query method to filter for rows where GROUP contains Zechstein Gp. and the well name is 15/9–13. Image by Andy McDonald
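
By default str.contains treats the pattern as a regular expression, so the full stop in 'ZECHSTEIN GP.' matches any character, and missing values in a column can cause the filter to fail. The following is a hedged sketch on invented rows. It passes regex=False for a literal match, na=False to treat missing values as non-matches, and engine="python", which some pandas versions need for string methods inside query.

```python
import pandas as pd

# Invented rows; the last one has a missing GROUP value.
logs = pd.DataFrame({
    "WELL": ["15/9-13", "15/9-13", "15/9-15", "16/1-2"],
    "GROUP": ["ZECHSTEIN GP.", "HORDALAND GP.", "ZECHSTEIN GP.", None],
    "LITH": ["Anhydrite", "Shale", "Halite", "Sandstone"],
})

matches = logs.query(
    "WELL.str.contains('15/9-13', regex=False) "
    "and GROUP.str.contains('ZECHSTEIN GP.', regex=False, na=False)",
    engine="python",
)
print(matches)
```

Only the first row satisfies both conditions, and the row with the missing GROUP value is skipped rather than raising an error.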

eval()

The eval() method within pandas is a powerful tool for evaluating Python-style string expressions on columns within the same dataframe.

This means you can take columns from a dataframe and carry out arithmetic calculations by providing a string expression.

For example, you could subtract a value of 100 from the GR column:

df.eval('GR-100')

Which returns:

Results after using the eval method to carry out a simple calculation. Image by the author.

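If you want to place this calculation in a new column, you can call upon pd.eval and pass in an assignment expression along with the target dataframe. By default this returns a copy of the dataframe containing the new column; pass inplace=True to modify the target itself.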

pd.eval('GR_TEST = df.GR - 100', target=df)
Dataframe after using pd.eval to add a new column based on a simple expression. Image by the author.
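
As a small check of how target and inplace interact, here is a sketch on a made-up three-row dataframe. The name logs and its values are illustrative.

```python
import pandas as pd

# Invented gamma ray readings for illustration only.
logs = pd.DataFrame({"GR": [80.0, 120.0, 150.0]})

# Without inplace=True, pd.eval returns a copy and leaves logs untouched.
with_new_column = pd.eval("GR_TEST = logs.GR - 100", target=logs)
column_added_to_original = "GR_TEST" in logs.columns

# With inplace=True, the new column is written onto logs itself.
pd.eval("GR_TEST = logs.GR - 100", target=logs, inplace=True)
print(logs)
```

If the new column needs to persist, either assign the returned dataframe back to a variable or pass inplace=True.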

A common calculation within petrophysics is to calculate the volume of clay present within a formation. To do this sort of calculation you just extend the expression:

pd.eval('CLAYVOL = ((df.GR - 20)/(200-20))', target=df)

Which creates a new column called CLAYVOL.

Dataframe after using the pd.eval method to calculate a clay volume based on the Gamma Ray column. Image by the author.

If you were doing a proper petrophysical analysis you would need to consider the selected parameters based on multiple formations or depth ranges. The above illustrates a quick method of carrying out the calculation.
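
One way to make the parameters explicit is to hold them in variables and reference them with the @ prefix inside df.eval. The sketch below assumes a linear gamma ray index with hypothetical clean and clay end points of 20 and 200 API. These values and the sample GR readings are illustrative only.

```python
import pandas as pd

# Invented gamma ray readings for illustration only.
logs = pd.DataFrame({"GR": [20.0, 65.0, 110.0, 200.0]})

# Hypothetical end points; real values depend on the formation or depth range.
gr_clean = 20.0
gr_clay = 200.0

# df.eval returns a new dataframe containing the assigned column.
result = logs.eval("CLAYVOL = (GR - @gr_clean) / (@gr_clay - @gr_clean)")
print(result)
```

Readings outside the chosen end points give values below 0 or above 1, so you may want to limit the result afterwards, for example with .clip(0, 1).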

map()

If you need to match up values from an object such as a dictionary, or substitute values within a dataframe with other values, you can use the map function.

This function can only be applied along a single column within a dataframe or on a Pandas Series.

With this data example, we can create a new column containing a numeric code based on the text string. This can be achieved using a pandas Series or a dictionary as the lookup.

If using the dictionary as a reference you first need to create one or load one. This example uses a simple one that has been quickly created.

lith_dict = {'Shale': 1,
             'Sandstone': 2,
             'Sandstone/Shale': 3,
             'Limestone': 4,
             'Tuff': 5,
             'Marl': 6,
             'Anhydrite': 7,
             'Dolomite': 8,
             'Chalk': 9,
             'Coal': 10,
             'Halite': 11}

Next, you would create a new column, for example, LITH_CODE.

You then call upon the LITH column and apply the .map function, passing in the dictionary created above.

df['LITH_CODE'] = df['LITH'].map(lith_dict)

When you call upon the dataframe you now have the new column with the lithology code mapped to the correct lithology.

Dataframe after using the map function to create a new lithology code column based on the contents of string. Image by Andy McDonald
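
One thing to watch for is that any value missing from the dictionary is mapped to NaN. Here is a minimal sketch on made-up values showing how to spot and handle such gaps. Gypsum is deliberately left out of the hypothetical dictionary, and -1 is an arbitrary placeholder code.

```python
import pandas as pd

# Invented lithology labels; "Gypsum" has no entry in the dictionary below.
lith = pd.Series(["Shale", "Sandstone", "Gypsum", "Coal"], name="LITH")
codes = {"Shale": 1, "Sandstone": 2, "Coal": 10}

mapped = lith.map(codes)

# Labels that were not found in the dictionary come back as NaN.
unmapped = lith[mapped.isna()].unique().tolist()

# Replace the gaps with a placeholder and restore an integer dtype.
lith_code = mapped.fillna(-1).astype(int)
print(unmapped)
print(lith_code)
```

Checking the unmapped labels before filling them helps catch spelling differences between the data and the dictionary.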

Summary

Pandas is an incredible library that allows users to visualise, transform and analyse data in a very intuitive way. This article has covered some of the lesser-known features and methods, many of which may be new to you if you are just starting out on your data science journey. Understanding these will help you leverage the power of pandas to improve your data analysis skills.
