COM6018 Data Science with Python

Reading and Writing Data Files

Jon Barker

Copyright © Jon Barker, 2023, 2024 University of Sheffield. All rights reserved

Overview

  • What is a dataset?
  • Data Formats used in Data Science
    • CSV and TSV files
    • JSON files

Datasets vs Databases

Data scientists often talk about datasets; these should not be confused with databases.

  • A dataset is a structured collection of data, typically stored in a single file.
  • A database is a broader concept, typically a collection of datasets, stored in a database management system (DBMS).

Used for different purposes:

  • A database will typically be designed to allow efficient querying, i.e. recalling information, cross-referencing items, etc.
  • Data scientists are more often interested in learning something by using the whole dataset.

e.g. compare 'what was the weather yesterday?' with 'what will the weather be tomorrow?'


Structure of a Dataset

All datasets have a common structure:

  • A dataset is a collection of records (or data items)
  • Each record is a collection of fields (also called attributes or features)

For example, consider a spreadsheet of student data:

  • Each row represents a student (a single record)
  • For every row, there is the same set of columns (fields), e.g., name, degree title, date of birth, etc.

Note, the fields often have simple types (e.g., a string) but could have more complex types (e.g., a list or a set).
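
As a rough sketch of this structure in Python, a dataset can be held as a list of records, with each record a dictionary of fields. The student names, degrees and module names below are invented purely for illustration.

```python
# Hypothetical student records, invented for illustration only.
students = [
    {"name": "Ada", "degree": "MSc Data Science", "modules": ["Statistics", "Databases"]},
    {"name": "Grace", "degree": "MSc Computer Science", "modules": ["Databases"]},
]

# Every record should have the same set of fields.
field_names = set(students[0].keys())
same_fields = all(set(record.keys()) == field_names for record in students)

# One field with a simple type (a string) and one with a more complex type (a list).
first_name = students[0]["name"]
first_modules = students[0]["modules"]

print(same_fields, first_name, first_modules)
```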


Using a dataset

  • The data we use will typically be provided to us as a dataset stored in a file (or files).
    • Note though, that the data may originally have come from some sort of sensor, or from human input, or from a larger database.
  • We will need to read the data from the file into our programs.
  • We will then need to process the data in some way, e.g., to extract information, to find patterns, to make predictions, etc.
  • We may then need to write the results of our processing back to a file.
    • Perhaps as a new dataset that will be used by another program or another group of people.
  • So, understanding how to load and save datasets is a key skill for a data scientist.

Human-readable vs Machine-readable

Human-readable formats are designed to be read by humans

  • Typically text based, e.g., ASCII or Unicode text files.
  • Can be loaded into a text editor or printed to a terminal.
  • Downside: not very compact, not very fast to load.
  • Examples: CSV, TSV, JSON, XML, YAML
  • Suitable for small or medium sized datasets.

Human-readable vs Machine-readable

Machine-readable formats are designed to be read by computers

  • Typically binary files, i.e. not text files, close to the internal representation of the data.
  • Often very compact, and fast to load.
  • Downside: not human readable, so hard to check/debug.
  • Examples: Avro, HDF5, Parquet
  • Suitable for large datasets.

Data Formats used in COM6018

In this module we will be primarily using two human-readable formats:

  • CSV (Comma Separated Values)
  • JSON (JavaScript Object Notation)

We may also encounter a few other formats for specific tasks, e.g. YAML for configuration files, specific binary formats for storing image data, etc.


CSV (Comma Separated Values)

A simple text-based format for storing tabular data.

  • Each record is stored on a separate line.
  • Each field is separated by a comma.
  • The first line may contain the names of the fields.

Example, a CSV file containing information about wind farms:

"id", "turbines", "height", "power"
"WF1355", 13, 53, 19500
"WF1364", 3, 60, 8250
"WF1356", 12, 60, 24000
"WF1357", 36, 60, 72000

TSV (Tab Separated Values) is similar, but uses tabs instead of commas.
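
Python's standard csv module can read both formats. A minimal sketch of reading TSV data, assuming the delimiter parameter is set to a tab character. The data is held in an in-memory string (via io.StringIO) so that the example runs on its own; the values repeat part of the wind farm example.

```python
import csv
import io

# Part of the wind farm data, written as TSV (fields separated by tabs).
tsv_text = (
    "id\tturbines\theight\tpower\n"
    "WF1355\t13\t53\t19500\n"
    "WF1364\t3\t60\t8250\n"
)

# io.StringIO lets the string be read as if it were an open file.
with io.StringIO(tsv_text) as tsvfile:

    # delimiter='\t' tells the reader that fields are separated by tabs.
    tsv_reader = csv.reader(tsvfile, delimiter="\t")
    tsv_rows = [row for row in tsv_reader]

print(tsv_rows)
```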


Reading with csv.reader

CSV files can be read with the reader function from the csv module.

import csv

# Open the file
with open('data/windfarm.csv') as csvfile:

    # Attach a reader to the file
    windfarm_reader = csv.reader(csvfile, skipinitialspace=True)

    # Read each row (as a list) and store them as a list of lists.
    data = [row for row in windfarm_reader]

Reading with csv.reader

The code on the previous slide will return a list of lists, with every value stored as a string,

[
 ['id', 'turbines', 'height', 'power'],
 ['WF1355', '13', '53', '19500'],
 ['WF1364', '3', '60', '8250'],
 ['WF1356', '12', '60', '24000'],
 ['WF1357', '36', '60', '72000']
]

This can be difficult to work with, as the data is not in a convenient format.
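
To illustrate the extra work involved, here is a rough sketch that separates the header row from the data rows and converts one numeric field by position. It uses io.StringIO with a shortened copy of the wind farm data so that it runs without the data file.

```python
import csv
import io

# A shortened copy of the wind farm file, held in memory.
csv_text = (
    '"id", "turbines", "height", "power"\n'
    '"WF1355", 13, 53, 19500\n'
    '"WF1364", 3, 60, 8250\n'
)

with io.StringIO(csv_text) as csvfile:
    rows = list(csv.reader(csvfile, skipinitialspace=True))

# The first row holds the field names; the remaining rows are the records.
header, records = rows[0], rows[1:]

# Values are read as strings, so numeric fields must be converted by hand,
# and we have to look up which column index holds which field.
height_index = header.index("height")
heights = [int(record[height_index]) for record in records]

print(header)
print(heights)
```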

There is a better approach...


Reading with csv.DictReader

The same file can be read with the DictReader class from the csv module.


```python
import csv

with open('data/windfarm.csv') as csvfile:

    # Attach a DictReader to the file
    windfarm_reader = csv.DictReader(csvfile, skipinitialspace=True)

    # Read each row as a dictionary and store as a list of dictionaries.
    data = [row for row in windfarm_reader]
```

Reading with csv.DictReader

The data will now be returned as a list of dictionaries. Note that every value is still a string,

[
    {
        'id': 'WF1355',
        'turbines': '13',
        'height': '53',
        'power': '19500'
    },
    {
        'id': 'WF1364',
        'turbines': '3',
        'height': '60',
        'power': '8250'
    },
    # etc
]
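
Since the csv module does not convert types, numeric fields usually need converting before they are used in calculations. A minimal sketch, assuming (as in the example file) that the turbines, height and power fields always hold whole numbers. The data is read from an in-memory string so that the example runs on its own.

```python
import csv
import io

# A shortened copy of the wind farm file, held in memory.
csv_text = (
    '"id", "turbines", "height", "power"\n'
    '"WF1355", 13, 53, 19500\n'
    '"WF1364", 3, 60, 8250\n'
)

with io.StringIO(csv_text) as csvfile:
    raw_rows = list(csv.DictReader(csvfile, skipinitialspace=True))

# Assumed for this sketch: these three fields always contain whole numbers.
numeric_fields = ["turbines", "height", "power"]

# Build new dictionaries with the numeric fields converted to int.
windfarms = [
    {key: int(value) if key in numeric_fields else value for key, value in row.items()}
    for row in raw_rows
]

print(windfarms[0])
```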

Using the data

Working with a list of dictionaries is much easier than working with a list of lists.

For example, using a list of lists,

# Get the height of the first windfarm
# (row 0 is the header, so the first windfarm is row 1)

data[1][2]  #  not very clear!!

vs list of dictionaries,

# Get the height of the first windfarm

data[0]['height']
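
Named fields also keep longer calculations readable. A short sketch, using a hand-written list of dictionaries whose numeric values are already integers (the values repeat the wind farm example):

```python
windfarms = [
    {"id": "WF1355", "turbines": 13, "height": 53, "power": 19500},
    {"id": "WF1364", "turbines": 3, "height": 60, "power": 8250},
    {"id": "WF1356", "turbines": 12, "height": 60, "power": 24000},
    {"id": "WF1357", "turbines": 36, "height": 60, "power": 72000},
]

# Total power across all wind farms.
total_power = sum(farm["power"] for farm in windfarms)

# The wind farm with the most turbines.
largest = max(windfarms, key=lambda farm: farm["turbines"])

# The ids of all wind farms with a height of at least 60.
tall_ids = [farm["id"] for farm in windfarms if farm["height"] >= 60]

print(total_power, largest["id"], tall_ids)
```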

JSON (JavaScript Object Notation)

JSON is a text-based format for storing data that can handle more complex data structures than CSV.

  • JSON is based on a subset of JavaScript syntax, but is now used in many other languages.
  • Like Python, JSON uses curly braces to define dictionaries (called objects in JSON) and square brackets to define lists (called arrays in JSON).
  • In Data Science, JSON is often used to represent a dataset as a list of dictionaries.

JSON example - the windfarm data

[
    {
        "id": 1355,
        "turbines": 13,
        "height": 53,
        "power": 19500
    },
    {
        "id": 1364,
        "turbines": 3,
        "height": 60,
        "power": 8250
    }
]

JSON example - climate data

[
  {
    "city": "Amsterdam",
    "country": "Netherlands",
    "monthlyAvg": [
      {
        "high": 7,
        "low": 3,
        "snowDays": 4,
        "rainfall": 68
      },
      {
        "high": 6,
        "low": 3,
        "snowDays": 2,
        "rainfall": 47
      }
      // etc
    ]
  }
  // etc
]

Reading a JSON file

The Python Standard Library has a module called json that can be used to read and write JSON files. It is super easy to use,

import json

# Open the file
with open('data/climate.json') as jsonfile:

    # Read the data
    climate_data = json.load(jsonfile)

We can now print the first entry in the data,

print(climate_data[0])
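
Fields nested inside each entry can be reached by chaining index operations. A minimal sketch, which parses a shortened copy of the climate data (one city, two months) from a string with json.loads, rather than from the file, so that it runs on its own:

```python
import json

# A shortened copy of the climate data, held as a string.
climate_text = """
[
  {
    "city": "Amsterdam",
    "country": "Netherlands",
    "monthlyAvg": [
      {"high": 7, "low": 3, "snowDays": 4, "rainfall": 68},
      {"high": 6, "low": 3, "snowDays": 2, "rainfall": 47}
    ]
  }
]
"""

climate_data = json.loads(climate_text)

# The first city is a dictionary; its monthlyAvg field is a list of dictionaries.
first_city = climate_data[0]
first_month_high = first_city["monthlyAvg"][0]["high"]

# Sum one field over all the months that are present.
total_rainfall = sum(month["rainfall"] for month in first_city["monthlyAvg"])

print(first_city["city"], first_month_high, total_rainfall)
```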

Writing to a JSON file

Writing to a JSON file is equally easy,

# Open a file in 'write' mode...
with open('data/climate_copy.json', 'w') as jsonfile:

    # ...use the `dump` function to write the data to the file
    json.dump(climate_data, jsonfile)
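
By default, dump writes everything on a single line. A small sketch, assuming the optional indent parameter is wanted for readability. It writes to a temporary directory (so no real data file is touched) and reads the file back to check that the data survives the round trip. The one-record dataset is a stand-in for the real climate data.

```python
import json
import os
import tempfile

# A stand-in for the climate data, kept tiny for illustration.
climate_data = [{"city": "Amsterdam", "country": "Netherlands"}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "climate_copy.json")

    # indent=2 spreads the output over several indented lines.
    with open(path, "w") as jsonfile:
        json.dump(climate_data, jsonfile, indent=2)

    # Read the file back in.
    with open(path) as jsonfile:
        reloaded = json.load(jsonfile)

round_trip_ok = reloaded == climate_data
print(round_trip_ok)
```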

JSON supported data types

JSON supports lists and dicts but it does not support all Python data types.

For example, when using the json module,

  • Cannot store tuples, they are converted to lists
  • Cannot store sets, attempting to write one raises a TypeError
  • Cannot store complex numbers, these also raise a TypeError
  • Dictionary keys must be strings (int, float, bool and None keys are converted to strings)

These constraints are rarely a problem in practice but can lead to some unexpected behaviour if care is not taken.
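
A short sketch of what happens in practice with Python's json module, using dumps and loads to convert to and from a JSON string in memory: a tuple comes back as a list, an integer key comes back as a string, and a set is rejected with a TypeError.

```python
import json

# A tuple value and an integer key, neither of which JSON can represent directly.
original = {"point": (1, 2), 3: "three"}

# Convert to a JSON string and back again.
restored = json.loads(json.dumps(original))

# Sets are not converted automatically; the json module raises a TypeError.
try:
    json.dumps({"tags": {"a", "b"}})
    set_error = None
except TypeError as error:
    set_error = error

print(restored)
print(type(set_error).__name__)
```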


The problem with JSON

  • Big problem: If storing a dataset as a list of dictionaries, the JSON file requires opening and closing square brackets.

  • The file cannot be correctly parsed until the whole file has been read, i.e., the closing square bracket has to be found.

  • This is not very convenient if the dataset is very large, or if the data is being streamed from a sensor.

  • Solution: The JSON lines format...
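
To see the parsing problem concretely, here is a small sketch in which only part of a JSON list has arrived. The json module cannot parse it, even though the first record is complete. The truncated string is made up for illustration.

```python
import json

# Only part of the file has been received: the list is never closed.
partial_text = '[{"id": 1355, "turbines": 13}, {"id": 1364, "tur'

try:
    json.loads(partial_text)
    parsed_ok = True
except json.JSONDecodeError:
    # The parser gives up; the complete first record is not returned.
    parsed_ok = False

print(parsed_ok)
```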


JSON lines (.jsonl) files

JSON lines files are a variant of JSON files where each line is a separate JSON object.

  • Each line is a valid JSON object
  • The file as a whole is not a valid JSON document
  • e.g., our windfarm data would look like this,
{"id": 1355, "turbines": 13, "height": 53, "power": 19500}
{"id": 1364, "turbines": 3, "height": 60, "power": 8250}

Note, each data item is on a separate line, but there are no enclosing square brackets or separating commas.


Reading JSON lines files

The json module does not support JSON lines files explicitly, but they can still be easily read and written with a few lines of code.

import json

# Open the file
with open('data/windfarm.jsonl') as jsonfile:

    # Read each line and store as a list of dictionaries.
    data = [json.loads(line) for line in jsonfile]

We are using the loads function to convert each line (stored as a string) into the data structure it represents (i.e., a dictionary).
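
Because each line stands alone, the records can also be processed one at a time without holding the whole dataset in memory. A minimal sketch, using io.StringIO in place of a real file and skipping blank lines (an assumption for robustness; the example data has none):

```python
import io
import json

# The wind farm data in JSON lines form, held in memory.
jsonl_text = (
    '{"id": 1355, "turbines": 13, "height": 53, "power": 19500}\n'
    '{"id": 1364, "turbines": 3, "height": 60, "power": 8250}\n'
)

total_power = 0
with io.StringIO(jsonl_text) as jsonfile:

    # Iterating over the file yields one line at a time.
    for line in jsonfile:

        # Skip any blank lines.
        if not line.strip():
            continue

        # Parse this line on its own and use it straight away.
        record = json.loads(line)
        total_power += record["power"]

print(total_power)
```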


Writing JSON lines files

Writing JSON lines files is also easy,

import json

# Open a file in 'write' mode...
with open('data/windfarm_copy.jsonl', 'w') as jsonfile:

    # iterate through the data...
    for item in data:

        # ...use the `dump` function to write each item to the file
        json.dump(item, jsonfile)

        # ...and add a newline character so it's on a separate line
        jsonfile.write('\n')
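
As a quick check, here is a sketch that writes records to an in-memory buffer (io.StringIO standing in for a file) and reads them back. It uses dumps to build each line as a string, which has the same effect as calling dump followed by writing a newline.

```python
import io
import json

data = [
    {"id": 1355, "turbines": 13, "height": 53, "power": 19500},
    {"id": 1364, "turbines": 3, "height": 60, "power": 8250},
]

# Write one JSON object per line into the buffer.
buffer = io.StringIO()
for item in data:
    buffer.write(json.dumps(item) + "\n")

jsonl_text = buffer.getvalue()

# Read the lines back and parse each one separately.
reloaded = [json.loads(line) for line in jsonl_text.splitlines()]

print(reloaded == data)
```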