COM6018 Data Science with Python

Week 11 - More on scikit-learn Pipelines

Jon Barker

Copyright © 2023–2025 Jon Barker, University of Sheffield. All rights reserved.

Overview

  • Evaluating classifiers
  • scikit-learn Pipelines - advanced examples
  • Assignment 2 - Q&A
  • Using LaTeX to write your report

Evaluating Classifiers

Tutorial 100 - Evaluating Classifier Performance

Link to tutorial


Advanced scikit-learn Pipelines

Tutorial 110 - Further scikit-learn Pipelines

Link to tutorial


Assignment 2 - Reminder

The system to be built is a speech speed classifier: it takes a speech recording and decides whether the speech is at 'normal' speed or has been modified to be 'very slow', 'slow', 'fast' or 'very fast'.

You have been provided with:

  • A training dataset.
  • An evaluation dataset.
  • A baseline system that uses a 1-nearest neighbour classifier.

You need to build your own system. Performance will be evaluated on a hidden test set.
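
As a rough illustration of what a 1-nearest neighbour baseline can look like in scikit-learn, here is a minimal sketch. It is not the baseline supplied with the assignment: the data is random stand-in data, and the feature scaling step and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 200 training examples, 20 features, 5 classes.
# The real assignment features and labels will look different.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))
y_train = rng.integers(0, 5, size=200)
X_eval = rng.normal(size=(50, 20))
y_eval = rng.integers(0, 5, size=50)

# Scale the features (an assumption), then classify with one nearest neighbour.
baseline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
baseline.fit(X_train, y_train)

# score() returns the mean accuracy on the data it is given.
eval_accuracy = baseline.score(X_eval, y_eval)
print(f"Evaluation accuracy: {eval_accuracy:.3f}")
```

Because the labels here are random, the accuracy should be close to chance; the point is only the shape of the pipeline.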


Assignment 2 - Reminder (cont.)

There are two separate variants of the task:

  • Speed classification - speech modified by changing the playback speed.
  • Tempo classification - speech modified by simulating slower or faster speaking rates.

You are asked to build separate classifiers for each task.


Rules

  • Your model files should be named 'model.speed.joblib' and 'model.tempo.joblib' and must not exceed 80 MB in size.
  • You can only train your model using the provided training data (augmentation is allowed).
  • You cannot use any pre-trained models.
  • You may only use the standard Python libraries and the following: numpy, matplotlib, seaborn, pandas, scikit-learn, joblib, Pillow (for image processing).
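
One way to check a saved model against the size limit before submitting is sketched below. The model and data are stand-ins, the file is written to a temporary directory, and the limit is taken as 80 × 10^6 bytes, which is a conservative assumption about how "MB" is measured. The compress argument of joblib.dump is optional and is used here only to show one way of reducing file size.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Assumed interpretation of the limit: 80 MB = 80 * 10^6 bytes (conservative).
MAX_BYTES = 80 * 1000 * 1000

# Hypothetical stand-in model; a nearest neighbour model stores its training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = rng.integers(0, 5, size=500)
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Write to a temporary directory so that no real model file is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.speed.joblib")
    joblib.dump(model, path, compress=3)  # compress=0 (the default) disables compression
    size_bytes = os.path.getsize(path)

within_limit = size_bytes <= MAX_BYTES
print(f"Model size: {size_bytes / 1e6:.2f} MB, within limit: {within_limit}")
```

Models that store their training data, such as nearest neighbour classifiers, grow with the size of the training set, so the check is worth repeating after any data augmentation.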

Report Structure

Your report should be no more than two pages (two sides) and should include the following sections:

  • Abstract
  • Introduction
  • System Description
  • Experiments
  • Results and Analysis
  • Conclusions
  • References

Report Structure (cont.)

  • Introduction: A brief description of the speech modification classification problem.
  • System Description: A complete description of your classification system pipeline, highlighting critical hyperparameters to optimise.
  • Experiments: A description of the experiments you conducted to tune your system's hyperparameters, including the construction of your training dataset. Be explicit about which hyperparameters you experimented with and how they influenced the performance of your model. Include results that justify your final choices.
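
The kind of hyperparameter experiment described above could be run with a cross-validated grid search over a pipeline. The sketch below is only an illustration: the data is random, and the choice of classifier, the step names and the parameter grid are assumptions rather than recommendations for the assignment.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in training data: 150 examples, 10 features, 5 classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 10))
y_train = rng.integers(0, 5, size=150)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", KNeighborsClassifier()),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "classifier__n_neighbors": [1, 3, 5],
    "classifier__weights": ["uniform", "distance"],
}

# 3-fold cross-validation on the training data only.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

The search.cv_results_ attribute holds the score for every parameter combination, which is the kind of evidence that can be used to justify a final choice in the report.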

Report Structure (cont.)

  • Results and Analysis: Provide an analysis of the performance of your final system on the evaluation data, including a comparison to the baseline system. Include a table that reports the accuracy of your model. Additionally, provide a brief discussion of any observed trends or insights, highlighting factors that may have influenced the results.
  • Conclusions: A summary of your work, including suggestions for further improvements.
  • References: A list of any references cited in your report.
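
For the accuracy table, the numbers can be computed with scikit-learn's metrics functions. The following sketch uses a handful of made-up labels and predictions purely to show the calls; the row names and the tiny dataset are assumptions, not real results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["very slow", "slow", "normal", "fast", "very fast"]

# Made-up reference labels and predictions, for illustration only.
y_true = np.array(["slow", "normal", "fast", "very fast", "very slow", "normal"])
y_baseline = np.array(["slow", "fast", "fast", "fast", "slow", "normal"])
y_model = np.array(["slow", "normal", "fast", "very fast", "slow", "normal"])

results = {
    "Baseline (1-NN)": accuracy_score(y_true, y_baseline),
    "Proposed system": accuracy_score(y_true, y_model),
}

# Print a small plain-text table of accuracies.
print(f"{'System':<20}{'Accuracy':>10}")
for name, accuracy in results.items():
    print(f"{name:<20}{accuracy:>10.3f}")

# Rows are true classes and columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_model, labels=labels)
print(cm)
```

The confusion matrix shows which classes are mistaken for which, which can help with the discussion of observed trends.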

Final Submission

You will need to submit the following files:

  • report.pdf - a PDF report that describes your system and experiments.
  • train_speed.py and train_tempo.py - the Python scripts that train your classifiers. Include clear comments explaining the training process and key steps.
  • model.speed.joblib and model.tempo.joblib - the joblib files containing your trained models.

The assignment is due by 15:00 on Wednesday, 17th December. Standard lateness penalties will be applied.


Assessment

The final mark will be based on the following criteria:

  • The quality and clarity of your code (20/60)
  • The quality and clarity of the written report (30/60)
  • The performance of your classifier on the hidden test set (10/60)

Using LaTeX to write your report

  • The assignment contains a template for the report in LaTeX.

  • You can upload this to Overleaf and edit it online.

Link to Overleaf


Bits and Bobs


Persisting Models in scikit-learn

  • Once trained, a model can be saved to disk and then loaded again later.

  • We can use this to distribute a trained model to other people.

  • Two approaches to saving a model in scikit-learn:

    • pickle
    • joblib
  • The preferred approach is joblib.


Saving and loading a model with pickle

pickle is a standard Python library for saving and loading Python objects to disk.

If we have a model called model, we can save it to disk with:

import pickle

# Open the file in a with block so that it is closed after writing
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

Then we can load the model later with:

import pickle

# Read the pickle file from disk
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

Saving and loading a model with joblib

If we have a model called model, we can save it to disk with:

import joblib

joblib.dump(model, 'model.joblib')

Then we can load the model later with:

import joblib

model = joblib.load('model.joblib')

# or using a file handle, closed automatically by the with block
with open('model.joblib', 'rb') as f:
    model = joblib.load(f)
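
A quick way to gain confidence in a saved model is to reload it and compare its predictions with those of the original. The sketch below does this with a stand-in model and random data; the classifier, the data and the temporary file location are all assumptions made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data and model.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Save and reload inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should make the same predictions as the original.
predictions_match = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions match after reload:", predictions_match)
```

Saved models are generally only reliable when loaded with the same scikit-learn version that saved them, so it is worth running this kind of check in the environment where the model will be used.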


Next Steps

  • Lab Class on Tuesday:
    • No new material - but come with questions about Assignment 2 or anything else covered in the module
  • Next week is a Reading Week
  • Assignment is due next Wednesday