COM6018 Data Science with Python

Week 8 - Classification with Scikit-Learn

Jon Barker

Copyright © Jon Barker, 2023, 2024 University of Sheffield. All rights reserved

Overview

  • Review of Lab Class 6 (Landmine Detection)
  • Introducing some common classifiers
  • Classification with Scikit-Learn
  • Preview of Lab Class 7 (Face Recognition)

Review of Lab Class 6

  • You were provided with a landmine detection dataset.
  • You were asked to use Scikit-Learn to train and evaluate a kNN classifier.
  • We looked at tuning the value of k.
  • We compared 5-class and 2-class versions of the problem.
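The lab workflow above can be sketched in a few lines. This is a minimal sketch using synthetic data in place of the landmine dataset (the sample counts and feature dimensions here are illustrative assumptions, not the lab's actual data):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the landmine data (the lab used a real dataset)
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try several values of k, as in the lab's tuning step
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: accuracy={accuracy_score(y_test, knn.predict(X_test)):.3f}")
```

In practice you would plot accuracy against k and pick the value that generalises best, ideally using a held-out validation set rather than the test set.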

The Solution Notebook

The solutions to the lab have been released.

Open the Solution Notebook


Some Common Classifiers

We've so far been using a k-Nearest Neighbour classifier.

This week we will introduce some other common classifiers.

  • Support Vector Machines
  • Random Forests
  • Feed-Forward Neural Networks

We will then be using these in the lab class.
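Because every Scikit-Learn estimator shares the same fit/predict interface, swapping between these classifiers is a one-line change. A minimal sketch on synthetic data (the hyperparameter values are illustrative assumptions, not recommended settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Each classifier is a drop-in replacement for the others
classifiers = {
    "Support Vector Machine": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

This uniform API is the main reason Scikit-Learn makes classifier comparison experiments so quick to set up.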


Classifier Comparison

Basic Concept
  • Random Forests: Ensemble of decision trees that vote on the outcome.
  • Feed-Forward Neural Networks: Layers of neurons with weighted connections.
  • Support Vector Machines: Find the hyperplane that best separates the classes.

Handling Non-Linear Data
  • Random Forests: Good (due to the ensemble method).
  • Feed-Forward Neural Networks: Excellent (can model complex non-linear relationships).
  • Support Vector Machines: Excellent (with the use of kernel functions).

Parameter Tuning
  • Random Forests: Relatively few parameters (e.g., number of trees, depth of trees).
  • Feed-Forward Neural Networks: Many parameters (e.g., number of layers, neurons, epochs).
  • Support Vector Machines: Some critical parameters (e.g., C, gamma for the RBF kernel).

Training Speed
  • Random Forests: Fast to moderate (depends on the number of trees).
  • Feed-Forward Neural Networks: Slow (requires backpropagation over many epochs).
  • Support Vector Machines: Moderate to slow (depends on dataset size and choice of kernel).

Inference Speed
  • Random Forests: Fast (simple voting/averaging of trees).
  • Feed-Forward Neural Networks: Fast (once trained, a forward pass is quick).
  • Support Vector Machines: Fast to moderate (depends on the number of support vectors).

Classifier Comparison (continued)

Memory Usage
  • Random Forests: Moderate (stores many trees).
  • Feed-Forward Neural Networks: High (stores weights for all connections).
  • Support Vector Machines: Moderate to high (stores support vectors and coefficients).

Robustness to Outliers
  • Random Forests: Robust (outliers have less influence on an ensemble).
  • Feed-Forward Neural Networks: Sensitive (outliers can significantly affect the loss).
  • Support Vector Machines: Robust (especially with an appropriate kernel choice).

Ability to Handle Large Datasets
  • Random Forests: Good (can handle large datasets well).
  • Feed-Forward Neural Networks: Varies (large datasets require more computational resources).
  • Support Vector Machines: Poor to moderate (large datasets increase training time significantly).

Interpretability
  • Random Forests: High (easy to understand how decisions are made).
  • Feed-Forward Neural Networks: Low (often considered a black box).
  • Support Vector Machines: Medium (the decision boundary is clear, but the reasons are not).

Classifier Use Cases

Image Recognition
  • Random Forests: Less common (not ideal for unstructured data).
  • Feed-Forward Neural Networks: Very common (state-of-the-art results in many cases).
  • Support Vector Machines: Less common (not ideal for high-dimensional data).

Text Classification
  • Random Forests: Common (works well with bag-of-words models).
  • Feed-Forward Neural Networks: Common (especially with word embeddings).
  • Support Vector Machines: Common (especially with linear or non-linear kernels).

Small Datasets
  • Random Forests: Excellent (can achieve good performance).
  • Feed-Forward Neural Networks: Poor (prone to overfitting without regularisation).
  • Support Vector Machines: Excellent (effective with small, clean datasets).

Large-Scale Problems
  • Random Forests: Good (scalable with bagging).
  • Feed-Forward Neural Networks: Excellent (can be distributed and scaled with GPU acceleration).
  • Support Vector Machines: Poor to moderate (computational cost grows quickly).

Structured Data
  • Random Forests: Excellent (captures complex relationships between features).
  • Feed-Forward Neural Networks: Good (with appropriate feature engineering).
  • Support Vector Machines: Excellent (particularly with kernel methods).

Real-Time Prediction
  • Random Forests: Good (fast inference once the model is trained).
  • Feed-Forward Neural Networks: Excellent (fast inference, particularly with optimised models).
  • Support Vector Machines: Good to moderate (depends on the number of support vectors).

Classifier Choice Summary

Each method has its strengths and weaknesses:

  • Random Forests: Great for a mix of numerical and categorical data, when interpretability is important, and when you need a quick and robust model that can handle outliers and non-linear data well.
  • Feed-Forward Neural Networks: Ideal for large-scale, complex problems like image and speech recognition where the model can benefit from large amounts of data and the capacity to capture intricate patterns through deep learning.
  • Support Vector Machines: Suitable for classification problems with a clear margin of separation. They work well with small to medium-sized clean datasets and are effective for text classification and other domains where a good kernel trick can be applied.

The choice of algorithm often depends on the specific requirements of the application, including the size and type of data, the accuracy required, the training and inference speed, and the need for model interpretability.


Classification with Scikit-Learn

This week we will look at how to run experiments with classifiers using Scikit-Learn.

  • Swapping between different types of classifier.
  • Tools for tuning the hyperparameters of a classifier.
  • Using Scikit-Learn 'pipelines' to streamline the workflow.
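As a taste of what this week's material covers, the sketch below chains a scaler and an SVM into a Pipeline and tunes two of its hyperparameters with GridSearchCV. The grid values are illustrative assumptions, not recommended settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A pipeline chains preprocessing and classification into one estimator,
# so scaling is refitted inside each cross-validation fold (no leakage)
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# Hyperparameters are addressed as '<step name>__<parameter>'
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Test accuracy:", grid.score(X_test, y_test))
```

The `step__parameter` naming convention is what lets GridSearchCV reach inside the pipeline to tune any step's hyperparameters.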

Scikit-Learn Introductory Tutorial

Link to the Scikit-Learn tutorial


Lab Class 7 Preview

We will use Scikit-Learn to recognise famous people from photographs.

  • We will use Scikit-Learn's built-in dataset, 'Labeled Faces in the Wild'.
  • Details: http://vis-www.cs.umass.edu/lfw/
  • It contains 13,233 images of 5,749 famous people.
  • Designed for a face verification task, but we will use a subset of it for a classification task.
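The dataset can be loaded directly through Scikit-Learn's fetch_lfw_people function. A minimal sketch; note that it downloads the data (roughly 200 MB) on first use, and min_faces_per_person=70 is an illustrative choice that restricts the data to well-represented people, giving a small classification subset:

```python
from sklearn.datasets import fetch_lfw_people

# Keep only people with at least 70 images, scaled down for speed
lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

print(lfw.images.shape)   # (n_samples, height, width) greyscale images
print(lfw.data.shape)     # flattened pixel features, one row per image
print(lfw.target_names)   # names of the people to be recognised
```

Each row of `lfw.data` is a flattened image, so the dataset plugs straight into any of the classifiers introduced above.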

Lab Class 7 Preview (continued)

Examples from the 'Labeled Faces in the Wild' dataset.


Lab Class 7 Preview (continued)

Link to lab class


Next Steps

  • Prepare for Lab Class 7:
    • Read through the Classification with Scikit-Learn tutorial.