COM6018 Data Science with Python

Datasets vs Databases

Data scientists often talk about datasets; these should not be confused with databases.

A dataset is a structured collection of data, typically stored in a single file.
A database is a broader concept, typically a collection of datasets, stored in a database management system (DBMS).
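The distinction can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the weather records, the column names `date` and `temp_c`, and the helper names `load_dataset` and `build_database` are all invented for the example. The dataset is a single block of CSV text read in one go; the database holds the same rows inside SQLite, a small DBMS that ships with Python.

```python
import csv
import io
import sqlite3

# Hypothetical weather records, invented purely for illustration.
CSV_TEXT = """date,temp_c
2024-01-01,4.0
2024-01-02,5.5
2024-01-03,3.5
"""


def load_dataset(text):
    """Read the whole dataset into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_database(rows):
    """Store the same rows in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weather (date TEXT PRIMARY KEY, temp_c REAL)")
    conn.executemany(
        "INSERT INTO weather VALUES (?, ?)",
        [(row["date"], float(row["temp_c"])) for row in rows],
    )
    return conn


dataset = load_dataset(CSV_TEXT)   # a dataset: one file's worth of rows
conn = build_database(dataset)     # a database: rows managed by a DBMS
```

Note that `csv.DictReader` returns every value as a string, so the temperatures are converted with `float` before being stored.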

They are used for different purposes:

A database will typically be designed to allow efficient querying, i.e. recalling information, cross-referencing items, etc.
Data scientists are more often interested in learning something by using the whole dataset.

e.g. compare 'what was the weather yesterday?' with 'what will the weather be tomorrow?'
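The weather comparison can be made concrete with a rough sketch. Everything here is an assumption made for illustration: the temperatures are invented, the function names `query_temperature`, `fit_line` and `predict` are hypothetical, and a straight-line fit is only a stand-in for a real forecasting model. The point is the contrast: the query touches one record, while the prediction is derived from all of them.

```python
# Hypothetical daily temperatures (day index -> degrees C), invented for illustration.
temperatures = {0: 3.0, 1: 4.0, 2: 5.0, 3: 6.0, 4: 7.0}


def query_temperature(records, day):
    """Database-style use: recall one stored fact."""
    return records[day]


def fit_line(records):
    """Data-science-style use: learn a trend from the whole dataset.

    Returns (slope, intercept) of the least-squares straight line.
    """
    xs = list(records.keys())
    ys = list(records.values())
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(records, day):
    """Estimate the temperature on a day that is not in the records."""
    slope, intercept = fit_line(records)
    return slope * day + intercept


yesterday = query_temperature(temperatures, 4)  # 'what was the weather yesterday?'
tomorrow = predict(temperatures, 5)             # 'what will the weather be tomorrow?'
```

The first call could be answered by any system that stores and retrieves records; the second has no stored answer and must be inferred from the data as a whole.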

Reading and Writing Data Files

Jon Barker

Overview

Datasets vs Databases

Structure of a Dataset

Using a dataset

Human-readable vs Machine-readable

Data Formats used in COM6018

CSV (Comma Separated Values)

Reading with csv.reader

Reading with csv.DictReader

Using the data

JSON (JavaScript Object Notation)

JSON example - the windfarm data

JSON example - climate data

Reading a JSON file

Writing to a JSON file

JSON supported data types

The problem with JSON

JSON lines (`.jsonl`) files

Reading JSON lines files

Writing JSON lines files
