How to Level Up Your Pandas Skills in 5 Easy Ways
Pandas is a powerful and versatile Python library for data science. It is often one of the first libraries you come across when learning how to use Python for data science applications. It can be used to load data, visualise it and manipulate it to suit the objectives of the project you are working on.
However, many people don’t go beyond the basics of how to use the library and don’t take advantage of some of the more advanced and interesting features.
Within this article, we will go through 5 features that you may not have come across before that will make you more efficient when using the library.
Loading Libraries and Data
For all of the examples used within this article, the data that we will be using is a subset of well log data that was used as part of a Machine Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020). This data is publicly available here and licensed under Norwegian Licence for Open Government Data (NLOD) 2.0.
However, all methods and examples shown here can be applied to any dataset loaded into pandas.
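If you don't have the dataset to hand, a small synthetic stand-in with the same column names will work for every example that follows. This is a minimal sketch with randomly generated values that are purely illustrative, not real well log measurements:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: column names mirror the well log dataset,
# but the values are randomly generated for illustration only
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    'WELL': rng.choice(['15/9-13', '16/10-1'], size=n),
    'GR': rng.uniform(0, 200, size=n),        # gamma ray
    'NPHI': rng.uniform(0, 0.6, size=n),      # neutron porosity
    'RHOB': rng.uniform(1.9, 2.9, size=n),    # bulk density
    'LITH': rng.choice(['Shale', 'Sandstone', 'Limestone'], size=n),
})
print(df.shape)  # (1000, 5)
```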
To begin, we first need to import the pandas library, which is commonly shortened to pd to make life easier.
We will then read the CSV file into a dataframe using the read_csv() function.
import pandas as pd
df = pd.read_csv('data/Xeek_train_subset_clean.csv')
df
When we view the dataframe we can see we have 12 columns of data containing a mixture of text and numeric values.
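To confirm that mixture of text and numeric values yourself, you can inspect the dtypes attribute. This is sketched here against a tiny hypothetical frame rather than the full dataset:

```python
import pandas as pd

# A tiny hypothetical frame standing in for the well log data
df = pd.DataFrame({
    'WELL': ['15/9-13', '15/9-13'],   # text column
    'GR': [55.2, 112.7],              # numeric column
    'LITH': ['Shale', 'Sandstone'],   # text column
})

# Text columns show up as object, numeric columns as float64/int64
print(df.dtypes)
# WELL     object
# GR      float64
# LITH     object
```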
Now that is done, we can move on to the 5 ways you can level up your data science skills with pandas.
Changing the Default Pandas Plotting Library
When carrying out Exploratory Data Analysis, we often want to generate a quick plot of our data. You could build up a plot using matplotlib; however, pandas lets us do it in just a few lines of code.
Once you have the data loaded, you can call upon the dataframe followed by the .plot method.
For example, if we take the GR column and plot a histogram, like so:
df.GR.plot(kind='hist', bins=30)
This will return the following plot:
The image that is returned is very basic and lacks interactivity.
From pandas version 0.25 it is possible to change which plotting library is used. For example, instead of matplotlib, you can switch to Plotly, which allows the generation of very powerful and interactive data visualisations in an efficient way.
For more info on how to use Plotly to generate another type of plot called a Scatter Plot, you may be interested in exploring my previous articles:
- Using Plotly Express to Create Interactive Scatter Plots
- Enhance Your Plotly Express Scatter Plot With Marginal Plots
To change the default plotting library for your current session, you need to change the plotting.backend option for pandas.
pd.options.plotting.backend = "plotly"
Once you have done that, you can call upon the same code as before:
df.GR.plot(kind='hist', bins=30)
And the plot that is generated is now interactive and provides a much better user experience when exploring datasets.
You can also use other types of plots, including scatter plots by changing the code around slightly like so:
df.plot(kind='scatter', x='NPHI', y='RHOB', color='GR')
When the code above is executed, you get back the following plot:
Chaining Operations
Chaining or joining multiple methods together is a long-practised programming technique that can improve code readability.
It is the process of calling methods on an object one after the other on a single line rather than applying the methods on the object separately. This can help with multiple stages of your process.
For example, if you want to load data, change the columns to lowercase and then drop missing values (NaNs) it could be done like this:
df = pd.read_csv('data/Xeek_train_subset_clean.csv')
df = df.rename(columns=str.lower)
df = df.dropna()
However, a more efficient way would be to chain the operations like this:
df = pd.read_csv('data/Xeek_train_subset_clean.csv').rename(columns=str.lower).dropna()
Sometimes the line may become very long, so you may want to make it more readable by splitting it over multiple lines. This can be done using the line continuation character ( \ ) as suggested in this StackOverflow question.
df = pd.read_csv('data/Xeek_train_subset_clean.csv')\
.rename(columns=str.lower)\
.dropna()
Or by using parentheses as suggested in this other StackOverflow question, which removes the need for the line continuation character.
df = (pd.read_csv('data/Xeek_train_subset_clean.csv')
.rename(columns=str.lower)
.dropna())
query()
A common task that we often do when working with data is to filter it based on single or multiple conditions. You can do this using the following code:
df[df.GR > 100]
However, using the pandas query() method produces a more readable piece of code, especially when things become a little more complex.
For example, if you want to find all rows where the GR (Gamma Ray) column contains values > 100 you can call upon the query method like so:
df.query("GR > 100")
Which returns the following dataframe object.
You can also use logical operators such as and and or if you want to combine multiple conditions.
df.query("GR > 100 and GR < 110")
Which returns an even smaller dataframe with just 7,763 rows.
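query() also supports Python-style chained comparisons, so the two conditions above can be written as a single range check. A minimal sketch on a small, made-up frame:

```python
import pandas as pd

# Hypothetical GR values purely for illustration
df = pd.DataFrame({'GR': [95.0, 102.5, 108.0, 115.0]})

# Two equivalent ways of expressing the same range filter
a = df.query("GR > 100 and GR < 110")
b = df.query("100 < GR < 110")   # chained comparison

print(a.equals(b))  # True
```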
If you want to look for a specific string value, like Anhydrite within this dataset, you need to modify the query and use the str.contains() string method.
df.query("LITH.str.contains('Anhydrite')")
This will return any row where the LITH column contains the word Anhydrite.
This also works when the string contains special characters, such as the WELL column within our dataframe, which contains forward slashes and dashes:
df.query("WELL.str.contains('15/9-13') and GROUP.str.contains('ZECHSTEIN GP.')")
eval()
The eval() method in pandas is a powerful tool for evaluating arbitrary expressions, supplied as strings, on the columns of a dataframe.
This means you can take columns from a dataframe and carry out arithmetic calculations by providing a string expression.
For example, you could subtract a value of 100 from the GR column:
df.eval('GR-100')
Which returns the following result:
If you want to place this calculation in a new column, you need to call upon pd.eval and pass in the expression followed by the target dataframe.
pd.eval('GR_TEST = df.GR - 100', target=df)
A common calculation within petrophysics is to calculate the volume of clay present within a formation. To do this sort of calculation you just extend the expression:
pd.eval('CLAYVOL = ((df.GR - 20)/(200 - 20))', target=df)
Which creates a new column called CLAYVOL
If you were doing a proper petrophysical analysis, you would need to select appropriate parameters for each formation or depth range; the above simply illustrates a quick way of carrying out the calculation.
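One refinement worth sketching: the linear expression can fall below 0 or rise above 1 wherever GR lies outside the chosen endpoints, so a common follow-up is to clip the result into the physically meaningful range. The endpoint values and data below are illustrative only:

```python
import pandas as pd

# Hypothetical GR values spanning beyond the chosen endpoints
df = pd.DataFrame({'GR': [10.0, 20.0, 110.0, 200.0, 250.0]})

# Illustrative endpoints: 20 API for a clean zone, 200 API for clay
df = pd.eval('CLAYVOL = ((df.GR - 20) / (200 - 20))', target=df)

# Bound the result to the physically meaningful range [0, 1]
df['CLAYVOL'] = df['CLAYVOL'].clip(0, 1)

print(df['CLAYVOL'].tolist())  # [0.0, 0.0, 0.5, 1.0, 1.0]
```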
map()
If you have a situation where you need to match up values from an object such as a dictionary, or substitute values within a dataframe with other values, you can use the map function.
This function can only be applied along a single column within a dataframe or on a Pandas Series.
With this dataset, we can create a new column containing a numeric code based on the text string in the LITH column. This can be achieved using another dataframe or a dictionary.
If using the dictionary as a reference you first need to create one or load one. This example uses a simple one that has been quickly created.
lith_dict = {'Shale':1,
'Sandstone':2,
'Sandstone/Shale':3,
'Limestone':4,
'Tuff':5,
'Marl':6,
'Anhydrite':7,
'Dolomite':8,
'Chalk':9,
'Coal':10,
'Halite':11}
Next, you would create a new column, for example, LITH_CODE.
You then call upon the LITH column and apply the .map function, passing in the dictionary created above.
df['LITH_CODE'] = df['LITH'].map(lith_dict)
When you call upon the dataframe you now have the new column with the lithology code mapped to the correct lithology.
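One behaviour worth knowing: any value missing from the dictionary is mapped to NaN rather than raising an error, so a defensive fillna is often worthwhile. A minimal sketch with made-up lithology values:

```python
import pandas as pd

# A deliberately incomplete dictionary for illustration
lith_dict = {'Shale': 1, 'Sandstone': 2}

df = pd.DataFrame({'LITH': ['Shale', 'Sandstone', 'Basalt']})

# 'Basalt' is not in the dictionary, so map() returns NaN for it
df['LITH_CODE'] = df['LITH'].map(lith_dict)
print(df['LITH_CODE'].tolist())  # [1.0, 2.0, nan]

# A common defensive step: give unmapped values a sentinel code
df['LITH_CODE'] = df['LITH_CODE'].fillna(0).astype(int)
print(df['LITH_CODE'].tolist())  # [1, 2, 0]
```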
Summary
Pandas is an incredible library that allows users to visualise, transform and analyse data in a very intuitive way. This article has covered some of the lesser-known features and methods, many of which may be new to you if you are just starting out on your data science journey. Understanding these will help you leverage the power of pandas to improve your data analysis skills.