045 Common Pandas Mistakes#

COM6018

Copyright © 2023, 2024 Jon Barker, University of Sheffield. All rights reserved.

1. Introduction#

This notebook considers some common issues that you may encounter when using Pandas. If you have any suggestions for other common problems to include, please let me know.

Below we will import NumPy and Pandas, which will be used in the examples that follow.

import pandas as pd
import numpy as np

2. The SettingWithCopyWarning#

The ‘SettingWithCopyWarning’ is one of the most commonly encountered problems for new users of Pandas. You will find many people asking about it online. As it is a warning, it is easy to ignore. However, it is important to understand what it means and how to fix your code when you see it because it means that you are doing something that is not guaranteed to work.

In order to explain the warning, we will first make a simple example that generates it. We will start with a simple DataFrame with some age and height data for a few people.

data = {"name": ["Bill", "Jane", "Sue", "Xingyi", "Maryam"],
        "age": np.array([45, 98, 24, 11, 64]),
        "height": np.array([1.73, 1.62, 1.83, 1.11, 1.54])
}
df = pd.DataFrame(data)

Now, let us say that we want to change Bill’s age to be 100. We might try to do this as follows:

df[df.name=='Bill']["age"] = 100

print("\n\n Printing df after trying to set Bill's age to 100")
print(df.head())
 Printing df after trying to set Bill's age to 100
     name  age  height
0    Bill   45    1.73
1    Jane   98    1.62
2     Sue   24    1.83
3  Xingyi   11    1.11
4  Maryam   64    1.54
/tmp/ipykernel_2293/753014212.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df.name=='Bill']["age"] = 100

This seems very natural, but it (probably) hasn’t worked and it has generated a SettingWithCopyWarning.

To understand the origins of the problem, it is necessary to understand that Pandas operations can return either a copy of the data or a view of the data. A copy is a new DataFrame that contains a completely new copy of the original data (or some part of it). A view is a new DataFrame that contains a reference to the original data (or some part of it). If you change the data in a copy, then the original data remains unchanged. If you change the data in a view, then the original data is also changed. (This is similar to the difference between a copy and a reference in Python.)
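The same copy-versus-view distinction exists in NumPy, where the rules are simpler to state, so a small NumPy sketch may help build intuition. It is an illustration only, not part of the DataFrame example; the variable names are made up for this sketch. Basic slicing returns a view, whereas boolean-mask indexing returns a copy.

```python
import numpy as np

original = np.array([45, 98, 24, 11, 64])

# Basic slicing returns a view: it shares memory with the original array.
view = original[1:3]
view[0] = 0
print(original)  # the change made through the view is visible here

# Boolean-mask indexing returns a copy: the original array is unaffected.
copied = original[original > 50]
copied[0] = -1
print(original)  # unchanged by the assignment to the copy

# np.shares_memory reports whether two arrays overlap in memory.
print(np.shares_memory(original, view))    # True
print(np.shares_memory(original, copied))  # False
```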

In our code above, the expression df[df.name=='Bill'] has returned a copy of a slice of the DataFrame df and not a view. This means that when we use ["age"] = 100 to set the age of Bill to 100, we are actually setting it in the copy and not in the original DataFrame. The copy is then discarded, and the original DataFrame is unchanged.

The code above is actually equivalent to the following:

df_copy = df[df.name=='Bill']
df_copy["age"] = 100

print("\n\n Printing df after trying to set Bill's age to 100")
print(df.head())
 Printing df after trying to set Bill's age to 100
     name  age  height
0    Bill   45    1.73
1    Jane   98    1.62
2     Sue   24    1.83
3  Xingyi   11    1.11
4  Maryam   64    1.54
/tmp/ipykernel_2293/2079350034.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_copy["age"] = 100

Written this way, it is not surprising that df has not been changed by df_copy["age"] = 100.

In fact, the situation is worse than this. Whether the operation df[df.name=='Bill'] returns a copy or a view is ‘undetermined’, i.e., it is not guaranteed to return one or the other. The choice depends on a lot of things: the type of data in the DataFrame, the version of Pandas that you are using, etc. So, it is best not to ignore this warning even if the code appears to work. It tells you that you are doing something that is not guaranteed to work.
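One way to make the warning harder to ignore is to ask Python to treat warnings as errors while you are testing. The sketch below uses only the standard warnings module. It assumes that whatever Pandas emits for chained assignment is a subclass of Warning; the exact class, and whether a warning is emitted at all, depends on the Pandas version you have installed.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Bill", "Jane", "Sue", "Xingyi", "Maryam"],
    "age": np.array([45, 98, 24, 11, 64]),
    "height": np.array([1.73, 1.62, 1.83, 1.11, 1.54]),
})

with warnings.catch_warnings():
    # Inside this block, any warning is raised as an exception.
    warnings.simplefilter("error")
    try:
        df[df["name"] == "Bill"]["age"] = 100  # chained assignment
    except Warning as exc:
        # Report the class of the warning that was raised.
        print("Chained assignment flagged:", type(exc).__name__)

# The original DataFrame is left untouched by the chained assignment.
print(df)
```

Escalating every warning like this is a blunt tool, so it is probably best kept to tests or debugging sessions rather than production code.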

As noted in the warning message, the correct way to set Bill’s age to 100 is to index the element that you want to change in a single step using loc as follows:

df.loc[df.name=='Bill', "age"] = 100

print("\n\n Printing df after trying to set Bill's age to 100")
print(df)
 Printing df after trying to set Bill's age to 100
     name  age  height
0    Bill  100    1.73
1    Jane   98    1.62
2     Sue   24    1.83
3  Xingyi   11    1.11
4  Maryam   64    1.54

Note that no warning has appeared and Bill’s age has successfully changed to 100.
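The same single-step pattern works when the condition matches several rows. The following is a minimal, self-contained sketch that rebuilds the DataFrame from scratch; the second update (adding one year to two people's ages) is a made-up illustration and not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Bill", "Jane", "Sue", "Xingyi", "Maryam"],
    "age": np.array([45, 98, 24, 11, 64]),
    "height": np.array([1.73, 1.62, 1.83, 1.11, 1.54]),
})

# One indexing call: the row selector and the column label are given together.
df.loc[df["name"] == "Bill", "age"] = 100

# The same pattern with a condition that matches more than one row.
df.loc[df["name"].isin(["Sue", "Xingyi"]), "age"] += 1

print(df[["name", "age"]])
```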

3. SettingWithCopyWarning - another example#

Let us say that we want to make a new DataFrame containing just the people over 50 years old, and in this new DataFrame we want to change the height so that it is measured in centimetres rather than metres. We might try to do this as follows:

df_centimeter = df[df.age > 50]
df_centimeter["height"] *= 100

print("\n\n Printing the original df")
print(df)
print("\n\n Printing the new df_centimeter")
print(df_centimeter)
 Printing the original df
     name  age  height
0    Bill  100    1.73
1    Jane   98    1.62
2     Sue   24    1.83
3  Xingyi   11    1.11
4  Maryam   64    1.54


 Printing the new df_centimeter
     name  age  height
0    Bill  100   173.0
1    Jane   98   162.0
4  Maryam   64   154.0
/tmp/ipykernel_2293/2966837033.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_centimeter["height"] *= 100

Again, we get the same warning. The problem is again that we cannot be sure whether df_centimeter is a copy or a view of df. So, whether the conversion to centimetres is also applied to the original df is undetermined. Even if it appears to work, it may not work in a different version of Pandas.

We need to rewrite the code so that it is guaranteed to work as expected. There are two cases: either we didn’t want the original DataFrame to be changed or we did.

If we want the original DataFrame to remain unchanged, then we need to explicitly state that df_centimeter is a copy of df. This is done using the copy() method as follows:

df_centimeter = df[df.age > 50].copy()
df_centimeter["height"] *= 100

print("\n\n Printing the original df")
print(df)
print("\n\n Printing the new df_centimeter")
print(df_centimeter)
 Printing the original df
     name  age  height
0    Bill  100    1.73
1    Jane   98    1.62
2     Sue   24    1.83
3  Xingyi   11    1.11
4  Maryam   64    1.54


 Printing the new df_centimeter
     name  age  height
0    Bill  100   173.0
1    Jane   98   162.0
4  Maryam   64   154.0
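Another way to obtain an independent result, not used elsewhere in this notebook, is the DataFrame.assign method, which returns a new DataFrame rather than modifying an existing one. The sketch below assumes the same data as above, with Bill's age already set to 100.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Bill", "Jane", "Sue", "Xingyi", "Maryam"],
    "age": np.array([100, 98, 24, 11, 64]),
    "height": np.array([1.73, 1.62, 1.83, 1.11, 1.54]),
})

# assign() builds a new DataFrame with the height column replaced,
# so the original df is never written to.
df_centimeter = df[df["age"] > 50].assign(height=lambda d: d["height"] * 100)

print(df)
print(df_centimeter)
```

Which form to prefer is largely a matter of taste: copy() followed by an in-place update reads more like the original code, while assign keeps the whole transformation in one expression.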

Alternatively, if we want to apply the conversion to centimetres to the original DataFrame, then we need to use loc as follows:

df.loc[df.age > 50, "height"] *= 100

print("\n\n Printing the original df")
print(df)
 Printing the original df
     name  age  height
0    Bill  100  173.00
1    Jane   98  162.00
2     Sue   24    1.83
3  Xingyi   11    1.11
4  Maryam   64  154.00

Note that neither of the above solutions will generate the SettingWithCopyWarning.

Note

The above example is only for illustrative purposes. Making a DataFrame in which the height of people over 50 is measured in centimetres but the height of everyone else is measured in metres is not a good idea!
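If the goal really were to work in centimetres, a more consistent approach would be to convert every row and store the result under a name that records the unit. The sketch below is one possible way to do this; the column name height_cm is a hypothetical choice made for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Bill", "Jane", "Sue", "Xingyi", "Maryam"],
    "age": np.array([100, 98, 24, 11, 64]),
    "height": np.array([1.73, 1.62, 1.83, 1.11, 1.54]),
})

# Assigning a whole new column directly on df is a single-step operation,
# so there is no intermediate copy or view involved.
df["height_cm"] = df["height"] * 100

print(df)
```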

4. Submit your own#

If you have any suggestions for other common problems to include, please let me know.