080 Classification with scikit-learn#
COM6018
Copyright © 2023–2025 Jon Barker, University of Sheffield. All rights reserved.
Introduction#
In this lab class we will be using scikit-learn to build a face classification system.
We will be using another of scikit-learn’s built-in datasets, the ‘Labeled Faces in the Wild’ face recognition dataset. This dataset is a collection of pictures of famous people gathered from the internet. It contains 13233 images of 5749 famous people. For some people there are many examples (e.g., 530 images of George W. Bush), while most people appear in only one or two images. The faces are all labeled with the person’s identity. More details can be found at http://vis-www.cs.umass.edu/lfw/.
We will be using a subset of the data, namely, the people who are represented by at least 50 images in the database. This will give us a dataset of 12 people and 1560 images in total. We will then treat this as a 12 class classification problem, i.e., given one of the images, which of the 12 people is it?
Step 1 - Loading and viewing the data#
To load the data run the cell below which loads the data and also resizes the images to 37 pixels wide by 50 pixels high. (Note, the first time this is run it may take a few minutes to complete. You can use this time to read the rest of this notebook.)
from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=50, resize=0.4)
The returned data is stored in lfw_people which is a dictionary-like object with the following keys:
data - the list of feature vectors, one per image,
images - the list of 37 by 50 pixel images,
target - the list of labels for the images, i.e., the identity of the person in each image,
target_names - the set of labels, i.e., the names of the 12 people in the images,
DESCR - a string describing the dataset.
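If you want a quick sanity check that the data has loaded as expected, you can print the shapes of these fields; a minimal sketch (the expected shapes follow from the 12-person, 1560-image subset described above):
# Quick sanity check of the loaded data
print(lfw_people.data.shape)     # expect (1560, 1850): one 1850-dim feature vector per image
print(lfw_people.images.shape)   # expect (1560, 50, 37): 1560 images of 50 rows by 37 columns
print(lfw_people.target.shape)   # expect (1560,): one integer label per image
print(lfw_people.target_names)   # the names of the 12 people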
Use the cell below to print the DESCR field. Read the description and make sure you understand it.
# SOLUTION
print(lfw_people.DESCR)
.. _labeled_faces_in_the_wild_dataset:
The Labeled Faces in the Wild face recognition dataset
------------------------------------------------------
This dataset is a collection of JPEG pictures of famous people collected
over the internet, and the details are available on the Kaggle website:
https://www.kaggle.com/datasets/jessicali9530/lfw-dataset
Each picture is centered on a single face. The typical task is called
Face Verification: given a pair of two pictures, a binary classifier
must predict whether the two images are from the same person.
An alternative task, Face Recognition or Face Identification is:
given the picture of the face of an unknown person, identify the name
of the person by referring to a gallery of previously seen pictures of
identified persons.
Both Face Verification and Face Recognition are tasks that are typically
performed on the output of a model trained to perform Face Detection. The
most popular model for Face Detection is called Viola-Jones and is
implemented in the OpenCV library. The LFW faces were extracted by this
face detector from various online websites.
**Data Set Characteristics:**
================= =======================
Classes 5749
Samples total 13233
Dimensionality 5828
Features real, between 0 and 255
================= =======================
.. dropdown:: Usage
``scikit-learn`` provides two loaders that will automatically download,
cache, parse the metadata files, decode the jpeg and convert the
interesting slices into memmapped numpy arrays. This dataset size is more
than 200 MB. The first load typically takes more than a couple of minutes
to fully decode the relevant part of the JPEG files into numpy arrays. If
the dataset has been loaded once, the following times the loading times
less than 200ms by using a memmapped version memoized on the disk in the
``~/scikit_learn_data/lfw_home/`` folder using ``joblib``.
The first loader is used for the Face Identification task: a multi-class
classification task (hence supervised learning)::
>>> from sklearn.datasets import fetch_lfw_people
>>> lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
>>> for name in lfw_people.target_names:
... print(name)
...
Ariel Sharon
Colin Powell
Donald Rumsfeld
George W Bush
Gerhard Schroeder
Hugo Chavez
Tony Blair
The default slice is a rectangular shape around the face, removing
most of the background::
>>> lfw_people.data.dtype
dtype('float32')
>>> lfw_people.data.shape
(1288, 1850)
>>> lfw_people.images.shape
(1288, 50, 37)
Each of the ``1140`` faces is assigned to a single person id in the ``target``
array::
>>> lfw_people.target.shape
(1288,)
>>> list(lfw_people.target[:10])
[5, 6, 3, 1, 0, 1, 3, 4, 3, 0]
The second loader is typically used for the face verification task: each sample
is a pair of two picture belonging or not to the same person::
>>> from sklearn.datasets import fetch_lfw_pairs
>>> lfw_pairs_train = fetch_lfw_pairs(subset='train')
>>> list(lfw_pairs_train.target_names)
['Different persons', 'Same person']
>>> lfw_pairs_train.pairs.shape
(2200, 2, 62, 47)
>>> lfw_pairs_train.data.shape
(2200, 5828)
>>> lfw_pairs_train.target.shape
(2200,)
Both for the :func:`sklearn.datasets.fetch_lfw_people` and
:func:`sklearn.datasets.fetch_lfw_pairs` function it is
possible to get an additional dimension with the RGB color channels by
passing ``color=True``, in that case the shape will be
``(2200, 2, 62, 47, 3)``.
The :func:`sklearn.datasets.fetch_lfw_pairs` datasets is subdivided into
3 subsets: the development ``train`` set, the development ``test`` set and
an evaluation ``10_folds`` set meant to compute performance metrics using a
10-folds cross validation scheme.
.. rubric:: References
* `Labeled Faces in the Wild: A Database for Studying Face Recognition
in Unconstrained Environments.
<https://people.cs.umass.edu/~elm/papers/lfw.pdf>`_
Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller.
University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.
.. rubric:: Examples
* :ref:`sphx_glr_auto_examples_applications_plot_face_recognition.py`
The images field contains the list of images, i.e., one image per sample, with each image stored as a 2-D NumPy array. We can display these using the imshow function from matplotlib.pyplot.
In the cell below, import `matplotlib.pyplot` and use it to show the first 16 images in the dataset arranged as a 4 by 4 grid of images.
Hint: You can use plt.axis('off') to avoid displaying the axes and tick marks.
# SOLUTION
import matplotlib.pyplot as plt
for index in range(16):
    plt.subplot(4, 4, index + 1)
    plt.imshow(lfw_people.images[index], cmap='gray')
    plt.axis('off')
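As an optional extra (not required by the exercise), you can label each face with the person's name by looking it up in lfw_people.target_names using the image's target value; a sketch:
# Optional extra: display each face with the person's name as a title
import matplotlib.pyplot as plt

for index in range(16):
    plt.subplot(4, 4, index + 1)
    plt.imshow(lfw_people.images[index], cmap='gray')
    plt.title(lfw_people.target_names[lfw_people.target[index]], fontsize=8)
    plt.axis('off')
plt.tight_layout()  # stop the titles from overlapping neighbouring images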
Step 2 - Splitting the data into training and test sets#
Following the approaches used in the previous lab class and tutorial, split the data into training and test sets. Use 75% of the data for training and 25% for testing. Store the training data in a variable called X_train and the test data in X_test. The training labels should be named y_train and the test labels y_test.
Use random_state=0 to ensure that the data is split in the same way each time the code is run.
Write the code in the cell below and check that the training and test sets have the correct sizes by running the test cell.
# SOLUTION
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(lfw_people.data, lfw_people.target, random_state=0)
# TEST
print(X_test.shape)
assert X_train.shape == (1170, 1850)
assert X_test.shape == (390, 1850)
assert y_train.shape == (1170,)
assert y_test.shape == (390,)
print("All tests passed!")
(390, 1850)
All tests passed!
Step 3 - Using a KNN classifier#
We will start by using the KNN classifier that we introduced in the previous lab class.
In the cell below, import the KNeighborsClassifier from sklearn.neighbors and create a classifier that uses 1 neighbour. Train the classifier on the training data and labels. Then use the classifier to predict the labels for the test data. Call the predicted labels y_pred.
Score the classifier by calculating the accuracy on the test data and labels. Finally print the accuracy.
# SOLUTION
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
score = knn.score(X_test, y_test)
print(score)
0.5025641025641026
You should get an accuracy of about 50.3%.
Now, use the GridSearchCV class to find the best value of k for the KNN classifier. Test all odd values of k from 1 to 21. Use 5-fold cross validation.
# SOLUTION
from sklearn.model_selection import GridSearchCV
parameters = {'n_neighbors': range(1, 22, 2)}
knn = KNeighborsClassifier()
model = GridSearchCV(knn, parameters, cv=5)
model.fit(X_train, y_train)
print(model.best_params_)
model = model.best_estimator_
model.score(X_test, y_test)
{'n_neighbors': 11}
0.517948717948718
You should find that with a value of k=11 you get an accuracy of about 51.8%, i.e., a little better than when k=1.
However, this is still not very good. We can do better by using a more sophisticated classifier but we may also be able to improve performance by representing the images in a different way.
One of the problems here is that the feature vector has 1850 dimensions (i.e., a separate value for every pixel). This is a lot of dimensions for a KNN classifier to work with, and it can lead to overfitting and to the more general difficulties of working in high-dimensional spaces (the so-called ‘curse of dimensionality’). In the next section we will reduce the number of features by using a dimensionality reduction technique to transform the feature vector into a lower dimensional space.
Step 4 - Applying a dimensionality reduction technique#
Our samples are described by the values of 1850 pixels. Many of these pixels are irrelevant to the task, e.g., pixels in the corners of the image which capture information about the background. Also, many of the pixels are highly correlated with each other, e.g., neighbouring pixels will have similar values, meaning that there is a lot of redundancy in the data. It is generally easier to train a classifier if we can describe our samples using a smaller number of features, and using features that are not highly correlated with each other.
Fortunately, there are very powerful and easy-to-implement approaches that can be used to learn a transform that can reduce a high-dimensional feature vector into a lower-dimensional one. These are called dimensionality reduction techniques. They are a very important part of machine learning and we will be looking at them in more detail later in the module. For now we will just use one of these techniques, called Principal Component Analysis (PCA) and see how it can be applied in scikit-learn.
Step 4.1 - Standardising the data#
Before using PCA we will first standardise the data, i.e., apply a scaling and offset to each feature so that all features have a mean of zero and a variance of one. This can be done using the StandardScaler class from the sklearn.preprocessing module. See the example in the tutorial notes and write the necessary code below.
Note, you should fit the transform using the X_train data and apply the same transform to both the X_train and X_test data. Store the results as X_train_scaled and X_test_scaled respectively. You should use the fit_transform and transform methods.
# SOLUTION
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
X_train_scaled = std_scaler.fit_transform(X_train)
X_test_scaled = std_scaler.transform(X_test)
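If you want to convince yourself that the scaling has worked, the training features should now have (very nearly) zero mean and unit standard deviation; a minimal check:
import numpy as np

# Per-feature statistics of the scaled training data
print(np.abs(X_train_scaled.mean(axis=0)).max())  # should be very close to 0
print(X_train_scaled.std(axis=0).mean())          # should be very close to 1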
Step 4.2 - Applying PCA#
The PCA technique learns a specific dimensionality reducing transform from the training data features (in scikit-learn this stage is called ‘fitting’). Once a transform has been learnt it can then be applied to the training and test data to produce the new lower-dimensional feature vectors.
We will now perform the PCA fitting and transforming steps in the cell below. This can be done using the PCA class, which can be imported from sklearn.decomposition. It is used in a very similar way to the StandardScaler class, i.e., via methods called fit_transform and transform. The PCA class has a parameter called n_components that specifies the number of features we want in the output. Set this to 200.
Note, remember to use the X_train_scaled data to learn the PCA transform. Store the transformed data in the variables X_train_pca and X_test_pca.
Write the solution below and run the test cell to check that your data looks correct.
# SOLUTION
from sklearn.decomposition import PCA
pca = PCA(n_components=200)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
# TEST
assert X_test_pca.shape == (390, 200)
assert X_train_pca.shape == (1170, 200)
print('All tests passed!')
All tests passed!
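It can be instructive to check how much of the variance in the original 1850-dimensional data the 200 principal components retain. The fitted PCA object records this in its explained_variance_ratio_ attribute; a minimal sketch:
# Total fraction of the training data's variance captured by the 200 components
print(pca.explained_variance_ratio_.sum())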
Step 4.3 - Evaluating the reduced feature vector#
We will now repeat the KNN classification but this time we will use the X_train_pca and X_test_pca feature vectors instead of the original X_train and X_test feature vectors. Note, the optimal value of k may now have changed, so rerun the grid search on X_train_pca using 5-fold cross-validation as before.
# SOLUTION
from sklearn.model_selection import GridSearchCV
parameters = {'n_neighbors': range(1, 22, 2)}
knn = KNeighborsClassifier()
model = GridSearchCV(knn, parameters, cv=5)
model.fit(X_train_pca, y_train)
print(model.best_params_)
model = model.best_estimator_
model.score(X_test_pca, y_test)
{'n_neighbors': 5}
0.5435897435897435
The result should now be slightly better than before. When I ran this I got a score of 54.4% correct with the optimum value of k being 5. This is a little better than the 51.8% that we got before using the original feature vectors.
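If you are curious how the cross-validation accuracy varied with k, the fitted GridSearchCV object keeps a record in its cv_results_ dictionary. A minimal sketch (run it on the GridSearchCV object before the model variable is reassigned to best_estimator_):
# Mean 5-fold cross-validation accuracy for each value of k tried
for k, score in zip(model.cv_results_['param_n_neighbors'],
                    model.cv_results_['mean_test_score']):
    print(f'k={k}: {score:.3f}')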
Step 5 - Using a pipeline to combine feature extraction and classification#
In the previous steps, we have seen how to use scikit-learn to standardise data, perform dimensionality reduction, tune hyperparameters and then perform classification. The data pre-processing and classification steps were performed separately. This is not ideal as it can lead to errors. For example, we end up with multiple different versions of the data, e.g. X_train, X_train_scaled, X_train_pca, and similarly for the test data, so it would be very easy to get these variables confused. In this section we are going to use a construct called a ‘pipeline’ to combine the feature extraction and classification steps into a single stage.
Review the material in the tutorial notes to see how to use a pipeline. Then, create a pipeline that combines the StandardScaler transform, the PCA transform with the number of components set to 200 and the KNN classifier with the number of neighbours set to 5.
Store the pipeline as a variable called pipeline. Run the fit method with X_train on the pipeline and then the score method with X_test and y_test.
What is the accuracy? It should be the same as the accuracy that you got in the previous step. The pipeline does not change the computation, it just makes it easier to manage.
# SOLUTION
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=200)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(score)
0.5435897435897435
The pipeline can itself be used with the GridSearchCV class in order to tune its hyperparameters. Using GridSearchCV with a pipeline is just the same as using it with one of the built-in models. The only thing to note is that the parameter names have the form <step_name>__<parameter_name>. For example, if the KNeighborsClassifier step of the pipeline has been called knn, then to tune its n_neighbors parameter we would refer to the parameter as knn__n_neighbors.
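If you are ever unsure what the valid parameter names are, the pipeline's get_params method will list them all, including the <step_name>__<parameter_name> forms; a sketch:
# List every parameter the pipeline exposes for tuning,
# e.g. 'pca__n_components' and 'knn__n_neighbors'
print(sorted(pipeline.get_params().keys()))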
Write code to tune the pipeline’s n_neighbors parameter and then evaluate the resulting model using the X_test data.
Note, if we search for values of k from 1 to 21 as we did before, then the GridSearch will take quite a long time to run (several minutes on my MacBook Pro). Just use k values of 1, 3, 5 and 7 in the grid search so that you don’t have to wait too long.
Write your solution below.
# SOLUTION
parameters = {'knn__n_neighbors': range(1, 8, 2)}
model = GridSearchCV(pipeline, parameters, cv=5)
model.fit(X_train, y_train)
print(model.best_params_)
model = model.best_estimator_
model.score(X_test, y_test)
{'knn__n_neighbors': 5}
0.541025641025641
The reason that the GridSearch is slow is that it is re-running the entire pipeline for every value of k that it is testing. This means that the computationally expensive PCA fitting step is getting run many times. This is inefficient because the PCA step is part of the data pre-processing and will be the same for every value of k used by the classifier.
To avoid the recomputation, you can use scikit-learn’s caching mechanism, which is built on the Memory class from the joblib package. See the tutorial notes for an example of how to do this. Implement it below and then retry the search using odd values of k from 1 to 21 again.
# SOLUTION
from tempfile import mkdtemp
from joblib import Memory
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
cachedir = mkdtemp()
memory = Memory(cachedir, verbose=0)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=200)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
], memory=memory)
parameters = {'knn__n_neighbors': range(1, 22, 2)}
model = GridSearchCV(pipeline, parameters, cv=5)
model.fit(X_train, y_train)
print(model.best_params_)
model = model.best_estimator_
model.score(X_test, y_test)
{'knn__n_neighbors': 5}
0.5435897435897435
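Note, the cache is written to a temporary directory on disk. Once you have finished experimenting you can delete it; a minimal clean-up sketch:
# Remove the temporary cache directory created by mkdtemp()
from shutil import rmtree
rmtree(cachedir)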
Step 6 - Experimenting with different classifiers#
Now that we have a pipeline for training and evaluating classifiers, we can easily experiment with different classifiers. We will try the following classifiers:
Random Forests
Support Vector Machines (SVM)
Neural Network
For each one, we will use the techniques used previously to tune hyperparameters before evaluating the classifier on the test set.
Step 6.1 - Using a Random Forest Classifier#
We will now replace the k-nearest neighbour classifier with a random forest classifier. The pipeline should again start with the StandardScaler and PCA transforms but the classifier at the end of the pipeline will be changed. The random forest classifier is provided by the RandomForestClassifier class in the sklearn.ensemble module.
The random forest classifier has many hyperparameters but one of the most important is the number of trees in the forest, which is specified by the parameter n_estimators.
For the GridSearch, search over both the number of trees and the number of PCA components. Use values of 10, 50 and 100 for the number of trees. Use values of 20, 50 and 100 for the number of PCA components. Use a cache to speed up the search. Also, when constructing the GridSearchCV object, set the n_jobs parameter to -1. This instructs the GridSearchCV to use all available cores to perform the search. Depending on your computer, this may provide a significant speed up (on my MacBook Pro, using all the cores makes the processing about 8 times faster).
# SOLUTION
from tempfile import mkdtemp
from joblib import Memory
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
cachedir = mkdtemp()
memory = Memory(cachedir, verbose=0)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=200)),
    ('rf', RandomForestClassifier(n_estimators=100)),
], memory=memory)
parameters = {'pca__n_components': [20, 50, 100], 'rf__n_estimators': [10, 50, 100]}
model = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1)
model.fit(X_train, y_train)
print(model.best_params_)
model = model.best_estimator_
model.score(X_test, y_test)
{'pca__n_components': 50, 'rf__n_estimators': 50}
0.5846153846153846
What is the final accuracy? When I ran this, I got a score of 58.5%, i.e., significantly better than the best score achieved with the KNN classifier (54.4%). Note, random forest training is randomised, so your exact score may differ slightly.
Step 6.2 - Using a Support Vector Machine (SVM) Classifier#
We will now use a Support Vector Machine (SVM) classifier. Again, this should require just a minor edit to the previous code. The pipeline should again start with the StandardScaler and PCA transforms but the classifier at the end of the pipeline will be changed. The SVM classifier is provided by the sklearn.svm.SVC class.
The key parameter for the SVC is the kernel, which is specified by the parameter kernel. The kernel is specified using a string and can be one of linear, poly, rbf or sigmoid. The SVC also has a parameter called C that controls the amount of regularisation. The default value of C is 1.0.
For the GridSearch, search over the kernel type, the C parameter and the number of PCA components. Use values of linear, poly, rbf and sigmoid for the kernel and values of 0.5, 1.0 and 2.0 for C. SVMs work well in high dimensional feature spaces so use values of 100, 200 and 500 for the number of PCA components. Note, we are now searching over 3 parameters with a total of 4 x 3 x 3 = 36 configurations being tested. Use a cache to speed up the search and set the n_jobs to -1 to use all the cores on your machine.
(The processing takes 10 seconds on my laptop. It might take longer on yours, so be patient.)
# SOLUTION
from tempfile import mkdtemp
from joblib import Memory
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
cachedir = mkdtemp()
memory = Memory(cachedir, verbose=0)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=200)),
    ('svc', SVC()),
], memory=memory)
parameters = {'pca__n_components': [100, 200, 500], 'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 'svc__C': [0.5, 1, 2]}
model = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1)
model.fit(X_train, y_train)
print(model.best_params_)
model = model.best_estimator_
model.score(X_test, y_test)
{'pca__n_components': 500, 'svc__C': 0.5, 'svc__kernel': 'linear'}
0.8128205128205128
What score do you get? When I ran this I achieved an accuracy of 81.3% which is hugely better than the previous classifiers that had scores in the 50’s.
Step 6.3 - Using a Neural Network Classifier#
In this final task, we will use a neural network classifier to classify the data. We will use the MLPClassifier from sklearn.neural_network.
The MLPClassifier has many parameters that can be set, but the key parameters are the number of hidden layers and the number of neurons in each hidden layer. This is set using the hidden_layer_sizes parameter. For example, hidden_layer_sizes=(10, 10) will create a neural network with two hidden layers, each with 10 neurons. The default is hidden_layer_sizes=(100,), which creates a neural network with one hidden layer with 100 neurons.
For all other parameters we will keep the default values, except: we will set max_iter=1000 which will allow the network to train for more iterations; random_state=0 which will ensure that the results are reproducible; and early_stopping=True which will stop the training if the validation score does not improve for 10 iterations. Set these parameters in the MLPClassifier constructor when defining the pipeline.
For the GridSearchCV, we will search over PCA n_components values of 20, 50, 100, 200, 500, and we will try the following neural network architectures: (50,), (100,), (200,), (100, 100), (100, 100, 100) and (200, 100, 50). This will result in 5 x 6 = 30 different parameter combinations.
Remember to use a cache and to set n_jobs to -1 to speed things up.
# SOLUTION
from tempfile import mkdtemp
from joblib import Memory
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
cachedir = mkdtemp()
memory = Memory(cachedir, verbose=0)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('mlp', MLPClassifier(max_iter=1000, early_stopping=True, random_state=0))
], memory=memory)
parameters = {'pca__n_components': [20, 50, 100, 200, 500],
              'mlp__hidden_layer_sizes': [(50,), (100,), (200,), (100, 100), (100, 100, 100), (200, 100, 50)]}
model = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1)
model.fit(X_train, y_train)
print(model.best_params_)
model = model.best_estimator_
model.score(X_test, y_test)
What performance do you get with the best model? When I ran this I got a score of 75.9% using an MLP with a single hidden layer with 200 neurons.
Step 7 - Analysing the classifier performance#
In this final step we will analyse the performance of the best classifier from the previous section.
You probably found that the Support Vector Machine (SVM) classifier performed best with a classification accuracy of over 80%. We will therefore use this classifier to analyse the performance.
We will first look at the confusion matrix to see which classes are most often confused with each other. We will then look at the precision, recall and F1 scores for each class.
Step 7.1 - Generating a Confusion matrix#
In the cell below, perform the following steps:
Make a pipeline for the feature standardisation, PCA and SVC classifier. Set the n_components of the PCA and the kernel type and C value of the SVM classifier to the best values you found in the previous section.
Call the pipeline’s fit function using the X_train and y_train data.
Call the predict function of the pipeline using the X_test data and store the result in a variable called y_pred.
Then run the test cell to verify that your y_pred variable is a NumPy array with 390 elements.
# SOLUTION
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=500)),
    ('svc', SVC(kernel='linear', C=0.5)),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# TEST
assert y_pred.shape == (390,)
print('All tests passed!')
To compute and display a confusion matrix for the above result, perform the following steps in the cell below:
Import ConfusionMatrixDisplay from sklearn.metrics.
Call ConfusionMatrixDisplay.from_predictions, passing y_test and y_pred as arguments.
Tips:
To make the confusion matrix easier to interpret you can set the display_labels argument of the from_predictions method to the list of class names, i.e., the people’s names. This list can be found in lfw_people.target_names.
The names on the x-axis will be printed horizontally and will overlap. You can fix this by setting the xticks_rotation argument to ‘vertical’.
# SOLUTION
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=lfw_people.target_names, xticks_rotation='vertical')
Study the confusion matrix and make sure that you understand what it means. Which people appear most often in the test data? Which people are most often misclassified? Which misclassifications are the most common? When I ran this I found some surprising results. For example, Serena Williams is twice mistaken for Ariel Sharon! This is probably not a misclassification that a human would make…
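If you would rather find the most frequent confusion programmatically than read it from the plot, one possible sketch is to compute the raw confusion matrix, zero its diagonal and locate the largest remaining entry:
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
np.fill_diagonal(cm, 0)  # discard the correct classifications
true_idx, pred_idx = np.unravel_index(cm.argmax(), cm.shape)
print(f'Most common confusion: {lfw_people.target_names[true_idx]} '
      f'predicted as {lfw_people.target_names[pred_idx]} '
      f'({cm[true_idx, pred_idx]} times)')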
Step 7.2 - Per class precision and recall.#
We will now compute precision, recall and F1 scores for each class. If you have forgotten, then remind yourself what these scores mean by looking at the notes.
To compute these values use the precision_score, recall_score and f1_score functions from sklearn.metrics. Each of these functions takes two arguments: the true labels and the predicted labels. By default they return a single value averaged over all the classes. To get the per-class scores, set the average parameter to None.
Store the results in variables called precision, recall and f1.
Run the test cell to verify that your results have the correct format
# SOLUTION
from sklearn.metrics import f1_score, precision_score, recall_score
precision = precision_score(y_test, y_pred, average=None)
recall = recall_score(y_test, y_pred, average=None)
f1 = f1_score(y_test, y_pred, average=None)
# TEST
assert precision.shape == (12,)
assert recall.shape == (12,)
assert f1.shape == (12,)
print('All tests passed!')
To display the results in an easy-to-read format, we can store them in a pandas DataFrame. Run the code that is written for you below.
import numpy as np
# Store results in a pandas DataFrame
import pandas as pd
data = np.array([precision, recall, f1]).T # Store the results in the columns of a numpy array
df2 = pd.DataFrame(data, columns=['Precision', 'Recall', 'F1'], index=lfw_people.target_names)
df2.style # Display the DataFrame using a nice HTML table styling
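As an aside, scikit-learn can produce a very similar per-class summary in a single call using the classification_report function:
# Alternative: a ready-made per-class precision/recall/F1 summary
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=lfw_people.target_names))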
Look at the results above. Which people have the highest F1 score and which have the lowest? What do you think determines whether a person is recognised correctly or not?
Summary#
This lab class has covered a lot of ground. We have looked at the following:
How to use scikit-learn to perform classification
How to perform feature normalisation and dimensionality reduction.
How to build pipelines that perform feature preprocessing and classification in a single stage.
How to use our pipeline in a grid search to find the best hyperparameters for our model.
How to analyse results using a confusion matrix and metrics such as precision, recall and F1 score.
Along the way we have also covered a few important details including how to use a cache and multiple cores to speed up the grid search.
With the techniques that you have covered in this lab you should now be able to apply scikit-learn to a wide range of classification problems. You are encouraged to study the solution code when it is released, to play around with the ideas, and to read the documentation of the various functions that we have used. Many of the functions have advanced features that have not been covered in this lab class. As a challenge, see if you can find a classifier and hyperparameter tuning that performs better than the 81% score achieved in the solution code.
Copyright © 2023–2025 Jon Barker, University of Sheffield. All rights reserved.