030 Introducing NumPy#

COM6018

Copyright © 2023, 2024 Jon Barker, University of Sheffield. All rights reserved.

In this lab class we are going to use NumPy to analyse the atmospheric gas concentration data that we saw in the previous lab. We will start by performing some simple tasks without using NumPy, and then we will see how NumPy can make our lives easier.

Step 1 - Reading the data#

Use ideas from the previous lab class to read the file data/co2e.csv into a variable called co2e_data that represents the file contents as a list of dictionaries. Make sure that the dictionary entries are of the correct type (i.e., the year entry should be an integer, and the co2_concentration entry should be a float).

Write your solution in the cell below and then run the test cell to check your solution.

# WRITE SOLUTION HERE
print(co2e_data[0])
# TEST
assert (co2e_data[0]) == {'year': 1987,
 'month': 4,
 'day': 3,
 'co2_concentration': 350.84,
 'ch4_concentration': 1.70002,
 'co2e': 393.34049999999996,
 'decimal_year': 1987.2554794520547}
print('All tests passed!')

Step 2 - Extracting the data#

We want to analyse the atmospheric gas concentrations as a sequence of values that change over time. The way that the data is currently stored does not make this very easy. We need to extract the data from each field in the list of dictionaries and store it in a more convenient form, such as a list. For example, we would like all of the co2_concentration values to be stored in a list called co2_concentrations.

Write a function called extract_data that takes the list of dictionaries and a field name as input and returns a list of values for that field. For example, extract_data(co2e_data, 'co2_concentration') should return a list of CO2 concentrations.

# WRITE SOLUTION HERE
co2_concentration = extract_data(co2e_data, 'co2_concentration')
print(co2_concentration)
# TEST
co2_concentration = extract_data(co2e_data, 'co2_concentration')
ch4_concentration = extract_data(co2e_data, 'ch4_concentration')
decimal_year = extract_data(co2e_data, 'decimal_year')

assert co2_concentration[0] == 350.84
assert ch4_concentration[0] == 1.70002
print('All tests passed!')

Step 3 - Smoothing the data#

Before we do anything with the data it is a good idea to visualise it. We can do this using matplotlib. For example, the following code will plot the CO2 concentrations against time:

import matplotlib.pyplot as plt
plt.plot(decimal_year, co2_concentration)

Copy this into the cell below and run it to see the plot.

# WRITE SOLUTION HERE

You should see a graph that trends upwards over time but which also has a season oscillation, i.e., the concentration of CO2 in the atmosphere increases and decreases over the course of a year. This is because the concentration of CO2 in the atmosphere is affected by the seasons. In the summer, plants grow and absorb CO2 from the atmosphere, and in the winter they die and release CO2 back into the atmosphere.

Let’s say that we now want to calculate the date and value of the maximum and minimum CO2 concentrations, i.e., the peaks and the dips in the graph. One way of finding a peak is to look for a value that is greater than the values either side of it. For example, if we have a list of values [1, 2, 3, 2, 1] then the value 3 is a peak because it is greater than the values either side of it. However, if we apply this logic to the data directly we will find that there are many peaks and dips that are not of interest, i.e., if we zoom in on one year we will see that the data is very noisy.

In the cell below, repeat the plot but only using the first 300 elements of decimal_year and co2_concentration. By looking at just this small segment of the data, you should be able to see these unwanted peaks and dips more clearly.

# WRITE SOLUTION HERE

The solution to this is to apply a ‘smoothing filter’ to the data before we look for the picks and dips. A smoothing filter is a function that takes a list of numbers as input and returns a new list of numbers that are ‘smoothed’ versions of the input. It does this by sliding a window over the input list and calculating the weighted sum of the numbers in the window.

For example, say that we have a list of numbers [1, 2, 3, 2, 1, 2, 3, 1] and we want to apply a smoothing filter with a window size of 3 with the window values [0.25, 0.5, 0.25]. The first smoothed output value will be calculated by taking the first three numbers [1, 2, 3] and computing the weighted sum 0.25*1 + 0.5*2 + 0.25*3 = 2. We then step forward one place and take the numbers [2, 3, 2] and compute the smoothed value as 0.25*2 + 0.5*3 + 0.25*2 = 2.25, and so on.

Note, the weights in the window must sum to 1.0. This is because we want the smoothed values to be on the same scale as the input values. If the weights do not sum to one we can simply normalise them by dividing each weight by the sum of the weights.

Write a function called smooth_series(series, window) that takes two parameters: series - a list of numbers to smooth, and window - a list of numbers to use as the weights. Remember to normalise the weights so that they sum to 1.0. The function should return a new list of smoothed values.

Write the function below and then check it by running the test cell.

# WRITE SOLUTION HERE
# TEST
series = [1, 2, 1, 2, 1]
assert smooth_series(series, [1]) == series
assert smooth_series(series, [1, 1]) == [1.5, 1.5, 1.5, 1.5]
print("All tests passed!")

We will now use the smoothing function to smooth the data. The function allows the window to be of different lengths. The longer the window is, the smoother the data will be. If we make the window too long, then we will lose the annual peaks and dips that we are interested in (this is called ‘over-smoothing’).

Test the smoothing by using window lengths of 7, 60, 360 and plot the result. Store the smoothed data in variables called co2_concentration_smooth7, co2_concentration_smooth_60 and co2_concentration_smooth360 respectively. Plot the smoothed data on the same graph as the original data to compare them.

# WRITE SOLUTION HERE

You should observe that the using a window of 7 is too short and the peaks and dips are still very noisy. A window of 360 is too long and the peaks and dips are no longer correct. The window of 60 is about right.

Step 4 - Finding the peaks#

We will now locate the peaks and dips in the smoothed data. We will look for peaks by looking for values that are greater than the values either side of them. We can also call these ‘local maxima’. We will write a function that takes a list of values and compares each value with its neighbours. If the value is greater than both of its neighbours then it is a peak and we will add the value to a list of peaks. We will also record the index of the peak in a list of peak indices. The function will return the list of peaks and the list of peak indices.

Write a function called find_peaks that takes a list of values as input and returns a list of peak values and a list of peak indices. Test your function by running the test cell.

# WRITE SOLUTION HERE
# TEST
peaks, indices = find_peaks([1,2,1,1,1,2,1,1,2,1])
assert len(peaks) == 3
assert len(indices) == 3
assert indices == [1, 5, 8]
print("All tests passed!")

Now, in the cell below, run the find_peaks function on the smoothed CO2 concentration data. How many peaks are there? Does there seem to be about one per year (remember, the data spans a 35 year period). Are the indices of the peaks regularly spaced by about 365 days?

# WRITE SOLUTION HERE

You will find that there are still too many peaks because the data smoothing is not perfect. Rather then increasing the smoothing window length, we can fix things by using a more robust definition of ‘peak’. We will say that a peak needs to be larger than all surrounding values up to a distance of 2 steps away (i.e., the 3 in 1, 2, 3, 2, 1 would be a peak, but the 3 in 4, 2, 3, 2, 1 would not).

Implement this new definition of peak in a function called find_peaks2 and test it by running the test cell.

# WRITE SOLUTION HERE
# TEST

peaks, indices = find_peaks2(co2_concentration_smooth60)
print(len(indices))
print(indices)

You’ll see that there are still too many peaks, i.e. more than one per year. We can quickly see what is going on by plotting points at the peak positions overlaid on the original data. Run the cell below to see this.

plt.plot(co2_concentration_smooth60)
plt.plot(indices, peaks, 'ro')
plt.show()

Observe the figure above. You should see several problems,

  • At least one of the genuine peaks has been missed.

  • One peak has been detected twice.

  • Strangely, there are many peaks being detected at the location of minima.

Peaks are located at the location of minima because the gradient of the curve is low at the minima, so a small amount of noise can raise a false peak, despite all the smoothing.

This exercise has illustrated how difficult it can be to extract something as simple as a peak from a noisy signal. The peaks are clear to the human eye, but it is difficult to write a program that can detect them reliably. We need more sophisticated statistical modelling approaches if we want to do this robustly.

You will have also noticed that we have had to write Python code for some very low level operations, such as smoothing and finding peaks. This is a lot of work and it is easy to make mistakes. Also, the code has involved using loops to process the values in the list. This is slow and inefficient. We will now see how NumPy can help us to solve these problems.

Step 5 - Finding the peaks with NumPy#

We will now repeat the exercise above using NumPy. We will start with the original data series co2_concentration and decimal_year.

Currently, the data are stored as a Python list. To turn them into NumPy arrays, we can use the NumPy.array function. For example, the following code will convert the list of co2_concentration values into a NumPy array called co2_concentration_array.

import numpy as np
co2_concentration = np.array(co2_concentrations)

Convert the co2_concentration and decimal_year lists into NumPy arrays and then plot the co2_concentration_array against the decimal_year_array. You should see the same graph as before.

# WRITE SOLUTION HERE
# TEST

assert type(co2_concentration) == np.ndarray
assert type(decimal_year) == np.ndarray
print("All tests passed!")

We will now rewrite the smooth_series function so that it takes the series and window parameters as NumPy arrays. The function should return a NumPy array of smoothed values. The function should work for any window size.

Note, you can use the fact the NumPy allows two vectors to be multiplied together in an element-wise fashion, e.g. x*y will multiply the first element of x by the first element of y, the second element of x by the second element of y, and so on. You can also make use of the numpy.sum function to sum the elements of an array. You will still need to pass the window over the series using a loop and use slicing notation to extract segments of the series to operate on.

# WRITE SOLUTION HERE
# TEST
series = np.array([1, 2, 1, 2, 1])
assert (smooth_series(series, np.array([1])) == series).all()
assert (smooth_series(series, np.array([1, 1])) == np.array([1.5, 1.5, 1.5, 1.5])).all()
print("All tests passed!")

The NumPy implementation will have been a little shorter and will run faster. However, we can do much better. The operation of sliding a window over a single and multiplying and summing is known as a ‘convolution’. NumPy has a function called numpy.convolve that will do this for us. The following code will apply a smoothing window to a series of values:

smoothed = np.convolve(series, window, mode='same')

This is actually better than our smoothing function because the mode='same' parameter ensures that the function correctly pads the single to make sure the output and input have the same number of elements.

Try using the convolve function to smooth the co2_concentration_array using a window of 90. Store the results in a variable called co2_concentration_smooth90. Plot the result and compare it to the previous result. You should see that the results are the same. Note that when plotting, you will need to ignore the first and last 45 samples because the zero padding will mean that these will not have valid values.

Note that you can use the np.ones function to make the window array. You will need to divide the window by the sum of the window values to make sure that the weights sum to 1.0 (the convolve function does not do this for you).

# WRITE SOLUTION HERE
# TEST
assert co2_concentration.shape == co2_concentration_smooth90.shape

You will have noticed that the smoothed function consistently underestimates the height of the peaks. This is because it is based on an average over 90 days, and so the values before and after the peaks pull the average down. You can improve the situation by using a window function that has bigger weights at the centre of the window and that tapers to zero at the edges. There are many window shapes that can be used (e.g., a triangle), but a common one in signal processing is the Hamming window. The values for an N-element Hamming window can be produced with NumPy using the following code:

window = np.hamming(N)

Try using the np.convolve function again but now use np.hamming to generate the window. Remember, you will need to divide the window values by their sum so that it sums to 1.0.

Store the new smoothed data in a variable, co2_concentration_smooth90_h.

To make the comparison, plot the co2_concentration data, and the 90-day smoothed versions using both the window of ones and the Hamming window. To see the difference more clearly, just look at the first 500 samples of the signals. You should see that the peaks are now better estimated.

# WRITE SOLUTION HERE

We will now rewrite the find_peaks2 function so that it takes a NumPy array as input and returns a NumPy array of peak values and a NumPy array of peak indices.

The function should be implemented without using any for-loops. Instead, you should use NumPy array operations to compare each element of the input array to its neighbours. There are many different ways to do this. The approach should be to build an array of logic values that has ‘True’ at the location of the peaks. You can then use the numpy.where function to extract the indices of the peaks and the values of the peaks.

You might find it useful to use the numpy.roll function to shift the array by one element. Once the array is shifted, you can compare the values in the original array to the shifted array. You can also effectively shift the array using slicing notation, e.g., consider the arrays series[4:], series[3:-1], series[2:-2], series[1:-3], series[:-4]. These all have the same length, but are shifted by different amounts.

You can then use the numpy.logical_and function to combine the results of the comparisons.

If you are very stuck then seek assistance from a demonstrator.

# WRITE SOLUTION HERE
# TEST

peaks, indices = find_peaks2(co2_concentration_smooth90_h)
plt.plot(decimal_year[45:-45],co2_concentration_smooth90_h[45:-45])
plt.plot(decimal_year[indices], peaks, 'ro')
plt.show()

Step 6 - Computing statistics#

We will now extend the analysis. We will first write a function to extract the date and value of the dips in the data. We will then compute some statistics on the peaks and dips: the average year-on-year increase in the peak and dip values, the average time between peaks and dips, and the average difference between the peak and dip values.

First, write a function that uses NumPy to find the dips in the data, which you can call find_dips. The function should take a NumPy array as input and return a NumPy array of dip values and a NumPy array of dip indices. It should be very similar to the find_peaks function.

# WRITE SOLUTION HERE
# TEST

dips, dip_indices = find_dips(co2_concentration_smooth90_h)
plt.plot(decimal_year[45:-45],co2_concentration_smooth90_h[45:-45])
plt.plot(decimal_year[indices], peaks, 'ro')
plt.plot(decimal_year[dip_indices], dips, 'go')
plt.show()

Now, we can write a function to compute the average year-on-year increase in the peak and dip values. The function should take a NumPy array of dip or peak values as input and return the average year-on-year increase. (You can assume that your list of dip and peak values has correctly identified the annual peak and dip.)

Call this function compute_average_difference and test it by running the test cell.

def compute_average_difference(series):
    return np.mean(np.diff(series))
# TEST
# Note, first and last values are not included in the average as they are likely to be incorrect
average_peak_difference = compute_average_difference(peaks[1:-1])
average_dip_difference = compute_average_difference(dips[1:-1])
print(average_peak_difference, average_dip_difference)

The average increases should be close to 2. i.e. the CO2 concentration is increasing by about 2 parts per million per year.

Now write short pieces of code to find answers to the following questions.

Q1. What is the average difference between the annual peak and dip values?#

# WRITE SOLUTION HERE

You should have got an answer of about 5 parts per million.

Q2. What is the average time (in days) between a peak and the next dip?#

# WRITE SOLUTION HERE

You should get an answer of about 140 days.

Q3. What is the average time (in days) between a dip and the next peak?#

# WRITE SOLUTION HERE

You should get an answer of about 225 days.

Note that the average peak-to-dip and average dip-to-peak sums to about 365 days, that is, one year, but the dips are not exactly half way between the peaks. This is because the co2 concentration is increasing over time.

Q4. On which day of the year on average do the peaks and dips occurs?#

To compute this, you need to consider just the fractional part of the decimal year value of the peak and dip locations. You can retrieve this using the np.mod function.


# WRITE SOLUTION HERE

I calculate this to be day 135 (peak) and day 273 (dip), i.e., the peaks occur in mid-May and the dips occur at the end of September.

If you are interested in the reason for this, see

https://keelingcurve.ucsd.edu/2013/06/04/why-does-atmospheric-co2-peak-in-may/

Summary#

This lab class has provided a very short introduction to NumPy covering some basic methods for manipulating simple 1-dimensional data series. Later in the module we will be using it in more mathematical contexts for performing more complex operations involving higher dimensional arrays, e.g. 2-D matrix data. However, we will see that the basic ideas remain the same. We will be leveraging NumPy’s ability to perform operations on arrays very efficiently without the need for for-loops. This will allow us to write code that is more concise and easier to understand and which runs much faster than the equivalent code using only native Python data structures and for-loops.