040 Processing Data with Pandas#

COM6018

Copyright © 2023, 2024 Jon Barker, University of Sheffield. All rights reserved.

1. Introduction#

1.1 What is Pandas?#

So far in this module, we have seen how data can be processed very flexibly using native Python code and Python dictionary and list data structures. However, this flexibility comes at a cost: Processing data in this way can be slow and is not suitable for large datasets. We have also seen how using NumPy arrays allows us to process numeric data very efficiently. NumPy is excellent for mathematical computation, linear algebra, and a lot of the matrix/vector calculations that occur in machine learning. However, it is not so good for processing more complex structured data, which might have a mixture of data types (e.g., numeric, string, categorical, etc.) and missing values, etc. This is where Pandas comes in.

Pandas is an open-source, third-party library designed for the needs of data scientists. It was originally built on top of NumPy, which allows it to be very fast, but it offers a lot of functionality for processing structured data that is not available in NumPy.

Although Pandas was originally built around NumPy, NumPy was not itself designed for this purpose, and that has led to some inefficiencies. However, Pandas has recently had a major new release (Pandas 2.0), which now allows the NumPy ‘backend’ to be replaced with one built around Apache Arrow – a framework that has been designed from the ground-up for efficient processing of large table-like datasets.

This tutorial is based on Pandas 2.0. The complete documentation can be found at https://pandas.pydata.org/docs/.

1.2 Installing Pandas#

As Pandas is not part of the Python standard library, it must first be installed before it can be used. This can be done easily using the conda package manager. (If you are using the Anaconda distribution of Python, then Pandas will already be installed. Or if you have built a conda environment using the com6018.environment.yml file provided with the module, then Pandas will also already be installed.)

If Pandas is not installed, you can install it by typing the following command in your terminal.

conda install -c conda-forge pandas

This will install the latest version of Pandas.

To use Pandas in your Python code, you simply need to import it at the start of your program, e.g.,

import pandas as pd

Conventional Python practice is to import Pandas using the alias pd. This is not essential, but it is recommended. Following existing conventions will make your code easier to read for other Python programmers.

1.3 Basic Pandas Concepts#

Pandas is designed for processing structured data much like the data that would be stored in a spreadsheet or a CSV file. Pandas organises data using two basic data structures: the Series and the DataFrame. A Series is a one-dimensional array of data that can be thought of as a single column in a spreadsheet. A DataFrame represents a table of data and is stored as a collection of Series objects, one per column. A DataFrame can be thought of as being similar to an entire spreadsheet.

Pandas has its own type system for representing data, which is essentially the same as NumPy's but with a few extensions for representing things like dates and intervals. Every Series (i.e., spreadsheet column) has a single type, but different Series in the same DataFrame can have different types.
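
For example, a Series or a DataFrame can be constructed directly from Python lists and dictionaries. The small sketch below is purely illustrative (the values are made up), but it shows the two structures and their dtypes,

import pandas as pd

# A Series is a single column of values
heights = pd.Series([53, 60, 60], name='height')

# A DataFrame is a table built from one Series per column
df = pd.DataFrame({'id': ['WF1355', 'WF1364', 'WF1356'],
                   'turbines': [13, 3, 12],
                   'height': [53, 60, 60]})

print(heights)
print(df)
print(df.dtypes)  # every column has its own dtype, e.g. object for id, int64 for the numbers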

Pandas offers a lot of functionality for processing data, including operations such as filtering tables, joining tables and grouping data. It also includes functionality for reading and writing data in different formats such as CSV, Excel, JSON and SQL, and some functionality for visualising data, which is generally easier to use than the equivalent functionality in Matplotlib.

2. Reading Data#

Pandas has a range of methods for reading data from different standard data formats. For example, it can read Excel spreadsheets directly with a simple read_excel() method. However, continuing from our previous tutorials, we will focus on the commonly used CSV and JSON formats.

2.1 Reading CSV Data#

Reading a CSV file with Pandas can be done with a single line of code:

import pandas as pd

df = pd.read_csv('data/windfarm.csv')

The read_csv function will parse the file and return the results as a Pandas DataFrame object. (It is common to use the variable name df for a DataFrame object in simple code examples. You may want to give the DataFrame a more descriptive name when writing more complex code.)

We can now print the resulting DataFrame object to see what it contains:

print(df)
       id   "turbines"   "height"   "power"
0  WF1355           13         53   19500.0
1  WF1364            3         60    8250.0
2  WF1356           12         60   24000.0
3  WF1357           36         60   72000.0

Note that the read_csv function has read the names of the fields from the first line of the csv file and used these to name the DataFrame columns. If our csv file did not have column names in the first line, then we would need to specify them using the names parameter, e.g.,

df = pd.read_csv('data/windfarm.no_names.csv', names=['id', 'turbines', 'height', 'power'])
print(df)
       id  turbines  height    power
0  WF1355        13      53  19500.0
1  WF1364         3      60   8250.0
2  WF1356        12      60  24000.0
3  WF1357        36      60  72000.0

Note also how Pandas has added an extra column without a title to the left of the data. This is known as the index column. The index column is used to uniquely identify each row in the DataFrame. Pandas has generated this index for us by giving the rows consecutive integer index values starting from 0. If we want to use a different column as the index, then we can specify this using the index_col parameter, e.g.,

df_id = pd.read_csv('data/windfarm.csv', index_col='id')
print(df_id)
        "turbines"  "height"  "power"
id                                   
WF1355          13        53  19500.0
WF1364           3        60   8250.0
WF1356          12        60  24000.0
WF1357          36        60  72000.0

Tip

You typically want the values in the index column to be unique. If they are not unique, then Pandas will still allow you to use the column as the index, but it will not be able to use some of the more advanced functionality that it offers for indexing. If you are not sure about which column to use as the index, then it is often best not to specify one and to just allow Pandas to create its own index column.
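
A quick way to check whether a column is suitable is the Series is_unique attribute. For example, for the windfarm data loaded above,

df = pd.read_csv('data/windfarm.csv')
print(df['id'].is_unique)  # prints True if every id value is distinct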

2.2 Reading JSON Data#

Reading JSON data is just as easy as reading CSV data. We can read the climate JSON data from our previous tutorial with a single line of code:

import pandas as pd

df = pd.read_json('data/climate.json')

We can now print the resulting DataFrame object to see what it contains:

print(df)
      id            city        country  \
0      1       Amsterdam    Netherlands   
1      2          Athens         Greece   
2      3      Atlanta GA  United States   
3      4        Auckland    New Zealand   
4      5       Austin TX  United States   
..   ...             ...            ...   
100  101  Albuquerque NM  United States   
101  102      Vermont IL  United States   
102  103    Nashville TE  United States   
103  104    St. Louis MO  United States   
104  105  Minneapolis MN  United States   

                                            monthlyAvg  
0    [{'high': 7, 'low': 3, 'dryDays': 19, 'snowDay...  
1    [{'high': 12, 'low': 7, 'dryDays': 21, 'snowDa...  
2    [{'high': 12, 'low': 2, 'dryDays': 18, 'snowDa...  
3    [{'high': 23, 'low': 16, 'dryDays': 24, 'snowD...  
4    [{'high': 18, 'low': 6, 'dryDays': 15, 'snowDa...  
..                                                 ...  
100  [{'high': 10, 'low': -4, 'dryDays': 24, 'snowD...  
101  [{'high': 3, 'low': -8, 'dryDays': 18, 'snowDa...  
102  [{'high': 9, 'low': -1, 'dryDays': 18, 'snowDa...  
103  [{'high': 7, 'low': -4, 'dryDays': 16, 'snowDa...  
104  [{'high': -3, 'low': -13, 'dryDays': 20, 'snow...  

[105 rows x 4 columns]

Note that the monthlyAvg data, which was a list of dictionaries in the JSON file, now appears in a single column of the DataFrame. This is not a very convenient format for processing the data. We can use a method called apply() to convert this column into a DataFrame with one column per dictionary in the list (i.e., one column per month). Using apply() a second time on one of these columns then expands the dictionaries into separate columns.

df_monthly = df['monthlyAvg'].apply(pd.Series)
print(df_monthly[0].apply(pd.Series))
     high   low  dryDays  snowDays  rainfall
0     7.0   3.0     19.0       4.0      68.0
1    12.0   7.0     21.0       1.0      53.0
2    12.0   2.0     18.0       2.0      99.5
3    23.0  16.0     24.0       0.0      25.6
4    18.0   6.0     15.0       1.0      67.8
..    ...   ...      ...       ...       ...
100  10.0  -4.0     24.0       4.0      10.7
101   3.0  -8.0     18.0      11.0      52.1
102   9.0  -1.0     18.0       8.0      84.4
103   7.0  -4.0     16.0      10.0      69.8
104  -3.0 -13.0     20.0      16.0      19.9

[105 rows x 5 columns]

This is a little complicated, but it is typical of the preprocessing steps we might need to take if working with JSON files that have a complex nested structure. Fortunately, most of the data we will be working with will be in CSV format or in a simple JSON format with a flat structure.
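
As an aside, Pandas also provides a json_normalize() function that can flatten this kind of nested structure in a single step. The sketch below assumes that climate.json is a list of records, each containing a monthlyAvg list as shown above,

import json
import pandas as pd

with open('data/climate.json') as f:
    records = json.load(f)

# One row per city per month; the id, city and country fields are repeated on each row
df_flat = pd.json_normalize(records, record_path='monthlyAvg', meta=['id', 'city', 'country'])
print(df_flat.head())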

3. The DataFrame Object#

We will now look in a bit more detail at the DataFrame object. For this section we will be using a dataset that records the extent of sea ice in the Arctic and Antarctic. This data is available from the National Snow and Ice Data Center. The data is available in CSV format and we have already downloaded it and saved it in the data directory as seaice.csv.

We will read the data and print the DataFrame,

import pandas as pd

df = pd.read_csv('data/seaice.csv')
print(df)
       Year   Month   Day       Extent      Missing  \
0      1978      10    26       10.231          0.0   
1      1978      10    28       10.420          0.0   
2      1978      10    30       10.557          0.0   
3      1978      11     1       10.670          0.0   
4      1978      11     3       10.777          0.0   
...     ...     ...   ...          ...          ...   
26349  2019       5    27       10.085          0.0   
26350  2019       5    28       10.078          0.0   
26351  2019       5    29       10.219          0.0   
26352  2019       5    30       10.363          0.0   
26353  2019       5    31       10.436          0.0   

                                             Source Data hemisphere  
0       ['ftp://sidads.colorado.edu/pub/DATASETS/nsid...      north  
1       ['ftp://sidads.colorado.edu/pub/DATASETS/nsid...      north  
2       ['ftp://sidads.colorado.edu/pub/DATASETS/nsid...      north  
3       ['ftp://sidads.colorado.edu/pub/DATASETS/nsid...      north  
4       ['ftp://sidads.colorado.edu/pub/DATASETS/nsid...      north  
...                                                  ...        ...  
26349  ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...      south  
26350  ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...      south  
26351  ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...      south  
26352  ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...      south  
26353  ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...      south  

[26354 rows x 7 columns]

Note that when printing a DataFrame, Pandas will only show the first and last few rows and columns. We can see from the output that the DataFrame has 26354 rows and measurements that date back to 1978. But we cannot see the full set of columns.

To get a list of all the columns in the DataFrame, we can use the columns attribute,

print(df.columns)
Index(['Year', ' Month', ' Day', '     Extent', '    Missing', ' Source Data',
       'hemisphere'],
      dtype='object')

We can now see that the data are composed of seven columns: year, month, day, extent, missing, source data, and hemisphere. The extent and missing columns record the extent of the sea ice and the number of missing measurements for each day. The source data column records the source of the data. The hemisphere column records whether the data are for the Arctic or Antarctic.

Note that space characters have become incorporated into the column names. This is because the columns in the CSV file had spaces after the commas. We can easily skip these spaces by adding the skipinitialspace=True parameter to the read_csv() method,

import pandas as pd

df = pd.read_csv('data/seaice.csv', sep=',', skipinitialspace=True)
print(df.columns)
Index(['Year', 'Month', 'Day', 'Extent', 'Missing', 'Source Data',
       'hemisphere'],
      dtype='object')

3.1 Selecting a Column#

We can select a single column from the DataFrame using the column name as an index.

extent = df['Extent']
print(extent)
0        10.231
1        10.420
2        10.557
3        10.670
4        10.777
          ...  
26349    10.085
26350    10.078
26351    10.219
26352    10.363
26353    10.436
Name: Extent, Length: 26354, dtype: float64

Notice how this looks a bit different from when we printed the DataFrame. This is because the column is not a DataFrame, it is a Series. Remember, Pandas uses the Series object for storing individual columns of data. We can check this by printing the type of the column,

print(type(df))  # This will print <class 'pandas.core.frame.DataFrame'>
extent = df['Extent']
print(type(extent)) # This will print <class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>

The Series object has a specific type (called its dtype). In this case the type is float64 which means that the data is stored as 64-bit floating point numbers. We can check the type of the data in a Series using the dtype attribute. Or, more conveniently, we can retrieve the data type of all the columns in the DataFrame using the DataFrame’s dtypes attribute. Let’s do that now,

print(df.dtypes)
Year             int64
Month            int64
Day              int64
Extent         float64
Missing        float64
Source Data     object
hemisphere      object
dtype: object

We can see that the Year, Month and Day columns have integer values; the Extent and Missing columns have floating-point values; and the Source Data and hemisphere columns have 'object' values.

Pandas has inferred these types from the content of the csv file. It usually guesses correctly, but it can sometimes get it wrong. If it is unsure, it will typically default to the 'object' type. This has happened for the 'Source Data' and 'hemisphere' columns, which should be strings. We can fix this by explicitly converting these columns to strings using the astype() method,

df['Source Data'] = df['Source Data'].astype("string")
df['hemisphere'] = df['hemisphere'].astype("string")
print(df.dtypes)
Year                    int64
Month                   int64
Day                     int64
Extent                float64
Missing               float64
Source Data    string[python]
hemisphere     string[python]
dtype: object

A Series object is much like a list, and we can use indexes and slicing operators to reference elements in the Series. For example, we can get the first element in the Series using,

extent = df['Extent']
print(extent[0])  # print the first element
10.231

or we can retrieve the sequence of elements from the 6th to 10th with,

extent = df['Extent']
print(extent[5:10]) # print a subseries from index 5 to 9
5    10.968
6    11.080
7    11.189
8    11.314
9    11.460
Name: Extent, dtype: float64

(Recall that the first element of a series has index 0, so the 10th element has index 9. The slicing notation a:b means to go from index a to b-1 inclusive.)

3.2 Selecting a Row#

We can select a single row from the DataFrame using the iloc attribute. This attribute takes a single integer index and returns a Series containing the data from the row at that index. For example, to select the first row, we can use,

row = df.iloc[0]
print(row)
Year                                                        1978
Month                                                         10
Day                                                           26
Extent                                                    10.231
Missing                                                      0.0
Source Data    ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...
hemisphere                                                 north
Name: 0, dtype: object

Notice how the results have been returned as a Series object. We can check this by printing the type of the row,

print(type(row)) # This will print <class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

So DataFrames are a collection of Series objects representing the columns, but if we extract a single row, we get a Series object representing the row. The dtype of the row series is ‘object’. An object type can basically store anything, so it is being used here because each element in the row series has a different type.

Alternatively, we can extract a range of rows using slicing,

rows = df.iloc[0:5]
print(rows)
   Year  Month  Day  Extent  Missing  \
0  1978     10   26  10.231      0.0   
1  1978     10   28  10.420      0.0   
2  1978     10   30  10.557      0.0   
3  1978     11    1  10.670      0.0   
4  1978     11    3  10.777      0.0   

                                         Source Data hemisphere  
0  ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...      north  
1  ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...      north  
2  ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...      north  
3  ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...      north  
4  ['ftp://sidads.colorado.edu/pub/DATASETS/nsidc...      north  

Note that a collection of rows is still a DataFrame. We can check this by printing the type of the variable rows,

print(type(rows))
<class 'pandas.core.frame.DataFrame'>

Basically, if we have a single row or column, then we are dealing with a Series, but if we have multiple rows and columns, then we are dealing with a DataFrame.

3.3 Filtering a DataFrame#

Very often in data science, we need to select some subset of our data based on some condition, e.g., all cities where the population is greater than 1 million or all years where the average temperature was less than 15 degrees Celsius. Or, more simply, we may wish to select every row that has an even index value. Alternatively, we may want to remove various columns from the DataFrame, i.e., those representing fields that we are not interested in. We will refer to these operations that take a DataFrame and return another smaller DataFrame as filtering operations. In this section, we will look at a few simple examples.

3.3.1 Removing or Selecting Columns#

After loading our data, we may want to remove columns that we are not interested in. This is easily done using the drop() method. For example, to make it easier to print the data frame, we will remove the Source Data column using,

df = df.drop(columns=['Source Data'])

There are now only 6 columns in the DataFrame and they can all be viewed at once,

print(df)
       Year  Month  Day  Extent  Missing hemisphere
0      1978     10   26  10.231      0.0      north
1      1978     10   28  10.420      0.0      north
2      1978     10   30  10.557      0.0      north
3      1978     11    1  10.670      0.0      north
4      1978     11    3  10.777      0.0      north
...     ...    ...  ...     ...      ...        ...
26349  2019      5   27  10.085      0.0      south
26350  2019      5   28  10.078      0.0      south
26351  2019      5   29  10.219      0.0      south
26352  2019      5   30  10.363      0.0      south
26353  2019      5   31  10.436      0.0      south

[26354 rows x 6 columns]

Rather than removing columns, an alternative is to actively select the subset of columns that we wish to keep. This is done with the following indexing notation,

df = df[['Year', 'Month', 'Day', 'Extent', 'Missing', 'hemisphere']]

3.3.2 Selecting Rows#

Very often when analysing data we want to select the subset of entries that match some criterion, i.e., to select a subset of the rows in our DataFrame. This is very straightforward in Pandas and is done using a Boolean expression as the index value.

For example, lets say we want to select all the rows where the sea ice extent value is less than 5.0. We can do this using,

selector = df['Extent'] < 5.0
df_low_extent = df[selector]

We have done this in two steps. The first line takes the Extent series and compares every value to 5.0. This produces a series of Boolean (i.e., True or False) values, which we have stored in a variable called selector. The second line uses this Boolean series as an index, which has the effect of selecting only the rows where the value is True.
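
Note that these two steps are often collapsed into a single line, with the comparison written directly inside the square brackets,

df_low_extent = df[df['Extent'] < 5.0]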

If we now print the resulting DataFrame we can see that it only contains about 3,000 of the original 26,354 rows in the dataset, i.e., these are the samples where the sea ice extent has fallen below 5.0 million square kilometers.

print(df_low_extent)
       Year  Month  Day  Extent  Missing hemisphere
8874   2007      8   20   4.997      0.0      north
8875   2007      8   21   4.923      0.0      north
8876   2007      8   22   4.901      0.0      north
8877   2007      8   23   4.872      0.0      north
8878   2007      8   24   4.837      0.0      north
...     ...    ...  ...     ...      ...        ...
26296  2019      4    4   4.564      0.0      south
26297  2019      4    5   4.730      0.0      south
26298  2019      4    6   4.856      0.0      south
26299  2019      4    7   4.902      0.0      south
26300  2019      4    8   4.993      0.0      south

[2988 rows x 6 columns]

They are ordered by year. What do you notice about the earliest year of these data? Recall that the original complete dataset had recordings going back to 1978, but all the recordings where the extent of the sea ice has fallen below 5.0 million square kilometers have occurred since 2007. Is this something that might have happened by chance, or is it evidence of a trend in the data? We will see how to answer such questions a little later in the module.

Let us say that we now just want to look at the Arctic data, i.e., we want to select only the rows where the hemisphere is 'north'. We can do this using the eq() method,

filter = df['hemisphere'].eq('north')

If we wanted to combine these two filters to select only the rows where the sea ice extent is less than 5.0 million square kilometers and the hemisphere is 'north', we could do this using the & operator,

filter = (df['hemisphere'].eq('north')) & (df['Extent'] < 5.0)
df_low_extent_north = df[filter]
print(df_low_extent_north)
       Year  Month  Day  Extent  Missing hemisphere
8874   2007      8   20   4.997      0.0      north
8875   2007      8   21   4.923      0.0      north
8876   2007      8   22   4.901      0.0      north
8877   2007      8   23   4.872      0.0      north
8878   2007      8   24   4.837      0.0      north
...     ...    ...  ...     ...      ...        ...
12932  2018      9   29   4.846      0.0      north
12933  2018      9   30   4.875      0.0      north
12934  2018     10    1   4.947      0.0      north
12935  2018     10    2   4.988      0.0      north
12936  2018     10    3   4.975      0.0      north

[339 rows x 6 columns]

This leaves just 339 rows.

Note

Be careful to use parentheses around each filter expression above. The & operator has a higher precedence than the < operator, so if you do not use parentheses the expression may still be valid but it will not work in the way you expected.

Note that we use a very similar indexing syntax, i.e., df[selector], for selecting both rows and columns. The difference is that when selecting rows we use a Boolean series as the index value, whereas when selecting columns we use a list of column names as the index value. This can be a little confusing at first, but you will soon get used to it.

4 Grouping and averaging values#

4.1 Operating on columns#

There are many operations that we might apply over an entire column to compute some 'statistic' of that column. For example, we might want to compute a column's minimum, maximum, mean, median, standard deviation, etc. The Series object provides many methods to make this easy.

For example, to compute the mean of the Extent column we can use,

mean_value = df['Extent'].mean()
print(mean_value)
11.494986301889654

We can also calculate the minimum and maximum values in a similar way,

print(df['Extent'].min(), df['Extent'].max())
2.08 20.201

For convenience, the DataFrame object provides a describe() method that will compute and display the basic statistics for all columns in the DataFrame that have numerical values. For example,

print(df.describe())
               Year         Month           Day        Extent       Missing
count  26354.000000  26354.000000  26354.000000  26354.000000  26354.000000
mean    2000.591941      6.507399     15.740685     11.494986      0.000003
std       10.896821      3.451938      8.801607      4.611734      0.000227
min     1978.000000      1.000000      1.000000      2.080000      0.000000
25%     1992.000000      4.000000      8.000000      7.601000      0.000000
50%     2001.000000      7.000000     16.000000     12.217000      0.000000
75%     2010.000000     10.000000     23.000000     15.114000      0.000000
max     2019.000000     12.000000     31.000000     20.201000      0.024000

The table shows the number of items (count), mean, standard deviation (std), minimum (min), maximum (max) and the 25th, 50th, and 75th percentiles of the data in each column. Note that the 50th percentile is the same as the median. These statistics will be more meaningful for some columns than for others. For example, in the above, the mean and standard deviation of the Month and Day columns are not very meaningful. Nevertheless, the describe() method is very useful for getting a quick overview of the data and is often the first thing we will do when we load a new dataset.
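
For example, we can confirm that the 50th percentile of the Extent column matches its median,

print(df['Extent'].median())       # 12.217
print(df['Extent'].quantile(0.5))  # the 0.5 quantile (50th percentile) gives the same value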

4.2 Grouping data#

In the above, we have computed the statistics of the entire data set. It is more common to want to compute statistics after grouping the data into meaningful subsets. For example, we might be interested in the mean sea ice extent for each year, or for each month, or for the north and south hemispheres, separately. To compute such statistics, we need to group the data into subsets and then compute the statistics for each subset.

Grouping the data is done using the groupby() method. For example, to group the data by month we can use

grouped = df.groupby('Month')

This returns a special DataFrameGroupBy object, which represents the data grouped by the column we specified. We can then compute statistics on the groups by selecting a column and using the same methods that we used earlier. For example, to compute the mean sea ice extent for each month we can use

print(grouped['Extent'].mean())
Month
1      9.587706
2      9.057401
3      9.621926
4     10.678099
5     11.637163
6     12.476158
7     12.563403
8     12.288473
9     12.265863
10    13.044214
11    13.152069
12    11.514823
Name: Extent, dtype: float64

The code above has produced a mean value for each month of the year (1 is January, 2 is February, etc.). Notice how the amount of ice reaches a maximum in November and a minimum in February. This might seem somewhat surprising to you, e.g., the maximum and minimum are not six months apart, as might be expected. Note, however, that when we grouped by month we included both Northern and Southern hemisphere measurements, so this mean will be the average across both hemispheres, and this can lead to an unintuitive outcome.

Let us say that we wanted to look at the sea ice extent for each month but separately for the northern and southern hemisphere. We now want to group by both month and hemisphere. We can do this easily by passing a list of column names to the groupby() method,

grouped = df.groupby(['hemisphere', 'Month'])
print(grouped['Extent'].mean())
hemisphere  Month
north       1        14.176006
            2        15.044512
            3        15.205876
            4        14.481141
            5        13.072172
            6        11.516788
            7         9.090173
            8         6.784248
            9         5.988981
            10        7.929377
            11       10.448120
            12       12.638618
south       1         4.999406
            2         3.070290
            3         4.037976
            4         6.875058
            5        10.202155
            6        13.435529
            7        16.036633
            8        17.792698
            9        18.542745
            10       18.159051
            11       15.856019
            12       10.391028
Name: Extent, dtype: float64

Note that when printing out a result that has been grouped by multiple fields, Pandas produces a table with a hierarchical (multi-level) index: the first index column is the first field in the groupby list, the second index column is the second field, and so on. In this case, the first column is the hemisphere, and the second column is the month.
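
We can use this hierarchical index to pull out one hemisphere's values, or reshape the result into a more conventional table. For example,

monthly_means = grouped['Extent'].mean()
print(monthly_means['north'])          # just the 12 northern hemisphere values
print(monthly_means.unstack(level=0))  # months as rows, one column per hemisphere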

We now see that in the Arctic, sea ice reaches a maximum in March and a minimum in September. In the Antarctic, sea ice reaches a maximum in September and a minimum in February. This is more in line with what we might expect, that is, the maxima and minima are about 6 months apart. The dates might seem a little later than what you would consider to be the peak of winter and summer. This is because of the way water stores heat. The oceans take a long time to cool down and warm up, so the coldest and warmest months are not the same as the months with the shortest and longest days. You can now also see that the Antarctic sea ice extent is much more variable than the Arctic sea ice extent.

Let us now consider how we would examine the Arctic and Antarctic sea ice minimum over the years. We can do this by grouping by hemisphere and year and then selecting the minimum value in each group,

grouped = df.groupby(['hemisphere', 'Year'])
print(grouped['Extent'].min())
hemisphere  Year
north       1978    10.231
            1979     6.895
            1980     7.533
            1981     6.902
            1982     7.160
                     ...  
south       2015     3.544
            2016     2.616
            2017     2.080
            2018     2.150
            2019     2.444
Name: Extent, Length: 84, dtype: float64

Pandas has built-in methods to make plots. Below we are making a plot and then getting the figure from the plot and saving it to a file. The resulting plot is being imported back into these notes.

grouped = df.groupby(['hemisphere', 'Year'])
ice_extent = grouped['Extent'].min()['north']

my_plot = ice_extent.plot()
my_plot.get_figure().savefig("figures/040_1.jpeg")
[Figure: line plot of the minimum Arctic sea ice extent for each year in the dataset]

You can see that the sea ice extent has been broadly trending downwards, but the values at the extremes of the x-axis look a bit odd, i.e., there are very large values for the first and last year in the dataset. This is an 'artifact' caused by the data for these years being incomplete. The data for 1978 only start in November and the data for 2019 only go up to July. The code has computed the minimum over just the few months that are available, which is not representative of the true minimum for the full year. We can fix this problem by simply removing these years from the data before plotting.

To remove the first and last values in a series we can use the familiar slicing notation that we saw when using NumPy, ‘[1:-1]’. So, the code will now look like

grouped = df.groupby(['hemisphere', 'Year'])
ice_extent = grouped['Extent'].min()['north']
ice_extent = ice_extent[1:-1]  # remove first and last years

my_plot = ice_extent.plot(title='Arctic sea ice minimum', ylabel='million square km')
my_plot.get_figure().savefig("figures/040_2.jpeg")
[Figure: 'Arctic sea ice minimum' plotted against year, in million square km, with the incomplete first and last years removed]

Notice how the sea ice minimum fluctuates seemingly at random from year to year. This kind of unexplained variation in data is often referred to as noise. However, ignoring the fluctuations, we can see that there also appears to be a downward trend in the sea ice minimum. If we want to be sure that there is a genuine downward trend and that the low values in recent years are not just part of the usual random fluctuations, then we need to perform some statistical analysis. We will look at how to do this later in the module.

5 Dealing with Missing Data#

Very often we need to work with data sets that have 'missing values', i.e., for one or more fields in the data there is no value available. These are typically encoded in the data file using a special symbol such as 'NaN'. For example, we may have some employee data that records an ID, a surname, an age and a gender, but some employees may not have provided their age, their gender, or both.

We can read and display some example data.

df = pd.read_csv('data/employee.csv')

Note, below, the presence of ‘NaN’ in the age and gender columns.

print(df)
   ID    Surname   Age  Gender
0   1      Smith  32.0    Male
1   2    Johnson  45.0  Female
2   3   Gonzalez   NaN     NaN
3   4         Li  37.0  Female
4   5      Brown  41.0    Male
5   6  Rodriguez  29.0  Female
6   7      Kumar  32.0     NaN
7   8      Lopez  36.0  Female
8   9     Nguyen   NaN    Male
9  10      Patel  40.0    Male

There are some simple functions to test for the presence of these missing values. For example, the isnull() method will produce a new DataFrame of Boolean values with a value of True in every location where an item of data is missing.

print(pd.isnull(df))
      ID  Surname    Age  Gender
0  False    False  False   False
1  False    False  False   False
2  False    False   True    True
3  False    False  False   False
4  False    False  False   False
5  False    False  False   False
6  False    False  False    True
7  False    False  False   False
8  False    False   True   False
9  False    False  False   False

Typically if we have missing values we will need to take some action to deal with them before progressing with our analysis. Depending on the situation, the most appropriate action might be to simply remove the rows with missing values, or to fill in the missing values with some sensible default value, or to use some more sophisticated method for estimating what the missing values would have been if they had been present.
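
The simplest option, dropping rows that contain missing values, can be done with the dropna() method. For example, using the employee data above,

df_complete = df.dropna()                # drop rows with a NaN in any column
df_with_age = df.dropna(subset=['Age'])  # drop only rows where Age is missing
print(df_complete)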

5.1 Filling missing data with a fixed value#

We can fill in missing values using the fillna method. For example, if displaying the data, we may simply want to replace the Gender NaNs with something more readable, such as ‘Gender not provided.’ This is done as follows.

df['Gender'].fillna("Gender not provided", inplace = True)
print(df)
   ID    Surname   Age               Gender
0   1      Smith  32.0                 Male
1   2    Johnson  45.0               Female
2   3   Gonzalez   NaN  Gender not provided
3   4         Li  37.0               Female
4   5      Brown  41.0                 Male
5   6  Rodriguez  29.0               Female
6   7      Kumar  32.0  Gender not provided
7   8      Lopez  36.0               Female
8   9     Nguyen   NaN                 Male
9  10      Patel  40.0                 Male

Note the parameter inplace = True. This tells Pandas to modify the DataFrame ‘in place’ rather than making and returning a modified copy of the original DataFrame (which is what happens when you set inplace = False, or use the default behaviour). This ‘in place’ modification is a common pattern in Pandas and many methods have this inplace parameter. For large DataFrames, it is often desirable to modify the DataFrame in place to avoid the memory overhead of creating a new DataFrame. However, it can be a dangerous style of programming that can lead to subtle errors. For example, if we have a function that modifies a DataFrame in place but which also returns the modified DataFrame, then the user of the function may be unaware that the DataFrame they are passing as a parameter is going to be changed by the function. This unexpected behaviour can easily lead to bugs. It is often better to leave the inplace attribute set to False and to assign the result of the method to a new variable. This is a safer style of programming and should be preferred unless you are sure that it is causing a performance problem.
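
For example, the gender replacement above could instead be written without inplace, by assigning the result back to the column,

df['Gender'] = df['Gender'].fillna("Gender not provided")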

The fillna method has other useful options, such as a method parameter that can fill the missing value with the previous or next valid observation in the Series. This behaviour is useful for sequential data, e.g., data ordered by date. For example, if a weather station has failed to record a temperature on a particular day, then we might opt to fill in the missing value with the temperature recorded on the previous day, as this is likely to be a reasonably good estimate.
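
The dedicated ffill() method provides the same forward-filling behaviour. A small illustrative sketch with made-up temperature values,

temperatures = pd.Series([14.2, 13.8, None, None, 15.1])
print(temperatures.ffill())  # both missing values are filled with the previous value, 13.8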

5.2 Filling missing values using ‘mean imputation’#

Another common strategy for filling missing values in numeric data is to fill the missing values with the mean of the values that are present. In the data science literature, this is known as ‘mean imputation’. This might be a sensible way of treating our missing employee age data and can be easily done using,

df['Age'].fillna(df['Age'].mean(), inplace = True)

This will produce the following.

print(df)
   ID    Surname   Age               Gender
0   1      Smith  32.0                 Male
1   2    Johnson  45.0               Female
2   3   Gonzalez  36.5  Gender not provided
3   4         Li  37.0               Female
4   5      Brown  41.0                 Male
5   6  Rodriguez  29.0               Female
6   7      Kumar  32.0  Gender not provided
7   8      Lopez  36.0               Female
8   9     Nguyen  36.5                 Male
9  10      Patel  40.0                 Male

Note how the missing age values for Gonzalez and Nguyen have now been replaced with 36.5. Obviously, this is unlikely to be their true ages, but it might be a sensible guess to use in the analysis stages that are going to follow.

Treatment of missing data is a big topic and something we will return to later in the module.

6 Collating data#

Very often, as data scientists, we need to combine data from different sources. For example, we might have two separate datasets that are linked by a shared field, such as a customer ID, date, or location. These will typically start life as separate files. They can be read into two separate DataFrames, but in order to understand the data we will generally need to combine these DataFrames into a new common DataFrame in some way.

In the simple example below, we will consider two different sets of atmospheric gas readings, carbon dioxide and methane, that have been measured at the same location over time but are stored separately in the files co2.csv and ch4.csv. (The file names are based on the chemical formulae for carbon dioxide and methane, CO2 and CH4, respectively.)

Let us imagine that the first few rows of these tables look like this

year, month, day, co2
2020, 1, 02, 442
2020, 1, 03, 444
2020, 1, 04, 441
2020, 1, 07, 446

and this,

year, month, day, ch4
2020, 1, 02, 442
2020, 1, 03, 444
2020, 1, 05, 441
2020, 1, 06, 442
2020, 1, 07, 446

We can see that both tables share the same year, month and day columns but each have their own unique column for the gas concentration measurement. We can also see that the data are not recorded on every day: on some days there is both a carbon dioxide and a methane measurement, on some there is only one or the other and on some dates there is neither. We will now consider different ways in which these datasets can be combined.

6.1 Combining DataFrames using the merge method#

To make the task a bit more instructive, the above data have some missing entries, e.g., entries for days 05 and 06 are missing from the CO2 data. The entry for day 04 is missing from the CH4 data. In a well-designed dataset these might have been recorded with NaN values but often there are just gaps in the data.

We can read these tables using,

co2_df = pd.read_csv('data/co2_ex.csv', skipinitialspace=True)
ch4_df = pd.read_csv('data/ch4_ex.csv',  skipinitialspace=True)
print(co2_df)
print(ch4_df)
   year  month  day  co2
0  2020      1    2  442
1  2020      1    3  444
2  2020      1    4  441
3  2020      1    7  446
   year  month  day  ch4
0  2020      1    2  442
1  2020      1    3  444
2  2020      1    5  441
3  2020      1    6  442
4  2020      1    7  446

We can now merge the tables. To do this, we need to pick a column to merge on. This is typically a key-like value that uniquely identifies the data entry (i.e., the row) and appears in both tables. If no one column is unique then we can use a collection of columns to merge on. For example, in our data, neither the year, month nor day would be a unique identifier if taken in isolation, but the three together form a unique set of values (i.e., because there is only one measurement per day, no two rows have an identical year, month and day). So we will use on=['year', 'month', 'day'].

We can then merge in four different ways: inner, outer, left or right. These differ in how they treat missing values: inner will only keep rows that exist in both DataFrames (i.e., if you think of this as merging two sets, then this would be an 'intersection'); outer will include rows that exist in either of the two DataFrames (i.e., in terms of sets, this would be a 'union'); left only includes rows which are in the first DataFrame, and right only those in the second. This can be most easily understood by comparing the outputs below.

  • Merging CO2 and CH4 using an inner merge:

combined_df = co2_df.merge(ch4_df, on=['year', 'month', 'day'], how='inner')
print(combined_df)
   year  month  day  co2  ch4
0  2020      1    2  442  442
1  2020      1    3  444  444
2  2020      1    7  446  446
  • Merging CO2 and CH4 using an outer merge:

combined_df = co2_df.merge(ch4_df, on=['year', 'month', 'day'], how='outer')
print(combined_df)
   year  month  day    co2    ch4
0  2020      1    2  442.0  442.0
1  2020      1    3  444.0  444.0
2  2020      1    4  441.0    NaN
3  2020      1    5    NaN  441.0
4  2020      1    6    NaN  442.0
5  2020      1    7  446.0  446.0
  • Merging CO2 and CH4 using a left merge:

combined_df = co2_df.merge(ch4_df, on=['year', 'month', 'day'], how='left')
print(combined_df)
   year  month  day  co2    ch4
0  2020      1    2  442  442.0
1  2020      1    3  444  444.0
2  2020      1    4  441    NaN
3  2020      1    7  446  446.0
  • Merging CO2 and CH4 using a right merge:

combined_df = co2_df.merge(ch4_df, on=['year', 'month', 'day'], how='right')
print(combined_df)
   year  month  day    co2  ch4
0  2020      1    2  442.0  442
1  2020      1    3  444.0  444
2  2020      1    5    NaN  441
3  2020      1    6    NaN  442
4  2020      1    7  446.0  446

In all cases, in the resulting table, any missing values will be automatically filled in with NaNs.

Warning:

Typically, the join columns act as a unique key. What happens if we choose columns that are not unique? Below we attempt to merge on the 'year' column, which is actually the same for all our rows, i.e.,

combined_df = co2_df.merge(ch4_df, on=['year'], how='inner')
print(combined_df)
    year  month_x  day_x  co2  month_y  day_y  ch4
0   2020        1      2  442        1      2  442
1   2020        1      2  442        1      3  444
2   2020        1      2  442        1      5  441
3   2020        1      2  442        1      6  442
4   2020        1      2  442        1      7  446
5   2020        1      3  444        1      2  442
6   2020        1      3  444        1      3  444
7   2020        1      3  444        1      5  441
8   2020        1      3  444        1      6  442
9   2020        1      3  444        1      7  446
10  2020        1      4  441        1      2  442
11  2020        1      4  441        1      3  444
12  2020        1      4  441        1      5  441
13  2020        1      4  441        1      6  442
14  2020        1      4  441        1      7  446
15  2020        1      7  446        1      2  442
16  2020        1      7  446        1      3  444
17  2020        1      7  446        1      5  441
18  2020        1      7  446        1      6  442
19  2020        1      7  446        1      7  446

This has taken the set of rows from each table that share the same year and made new rows using every possible pairing of rows in these sets. Also, because 'month' and 'day' were not part of the merge key, the new DataFrame now has separate month and day columns coming from the two original DataFrames, which have been automatically distinguished with the suffixes '_x' and '_y'. There are instances when we might want this behaviour, but if your output looks like this, it is more likely that you have set the on parameter incorrectly.
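
If overlapping column names are expected and this behaviour really is what you want, the suffixes parameter of merge() lets you replace the default '_x' and '_y' with more meaningful names, e.g.,

combined_df = co2_df.merge(ch4_df, on=['year'], how='inner', suffixes=('_co2', '_ch4'))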

6.2 Combining DataFrames using the join method#

The join function is very similar to merge except that it uses the DataFrame index to decide which rows to combine. It can be thought of as a special case of merge; however, it is often more convenient and faster to use. If the DataFrame already has a meaningful index, then it is very easy to use; e.g., perhaps a product ID column is being used as the index and we have two DataFrames containing product sales for different product ranges that we wish to join.

In our case, if working with the CO2 and CH4 data from above, in order to use join, we would first need to make an index using the year, month, and day columns. We can do this using the set_index method, like this:

co2_df_indexed = co2_df.set_index(['year', 'month', 'day'])
ch4_df_indexed = ch4_df.set_index(['year', 'month', 'day'])
print(co2_df_indexed)
print(ch4_df_indexed)
                co2
year month day     
2020 1     2    442
           3    444
           4    441
           7    446
                ch4
year month day     
2020 1     2    442
           3    444
           5    441
           6    442
           7    446

We can now perform the join. Again, as with merge, we can join using inner, outer, left or right.

combined_df = co2_df_indexed.join(ch4_df_indexed, how='outer')
print(combined_df)
                  co2    ch4
year month day              
2020 1     2    442.0  442.0
           3    444.0  444.0
           4    441.0    NaN
           5      NaN  441.0
           6      NaN  442.0
           7    446.0  446.0

Comparing merge and join, merge is more versatile because it can merge on any column, whereas join can only merge on the index. So you might wonder why we have a join method at all. The answer is that the join method works internally in a very different way from merge, exploiting special properties of the DataFrame index. For large DataFrames, join is generally much faster than merge and so should be preferred when its use is possible.
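
For reference, the same outer join could also be expressed with merge() by telling it to use the index of both DataFrames, although for large tables the join() version would normally be preferred,

combined_df = co2_df_indexed.merge(ch4_df_indexed, left_index=True, right_index=True, how='outer')
print(combined_df)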

Summary#

Pandas is a large and sophisticated package for working with data. In this tutorial, we have used it to read data, to deal with missing values, to conditionally filter data, to compute statistics on fields with and without prior grouping, and to combine datasets using merge and join. This is a good overview of the most commonly used functionality, but it only scratches the surface of what Pandas can do. There are entire books dedicated to Pandas and extensive documentation available online: it is a very useful tool to have in your toolbox. We will be using it throughout the rest of the module and will introduce new functionality as required.