Read JSON into Dask DataFrames

Matt Powers | May 9, 2022

This blog post explains how to read JSON into Dask DataFrames.  It also explains the limitations of the JSON file format for big data analyses and alternatives that provide better performance.

Dask is a great technology for working with JSON data because it can read multiple files in parallel.  Other libraries, like pandas, can only read one JSON file at a time, which is comparatively slow for large datasets.
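Here's a minimal sketch of the difference, assuming a hypothetical json-data/ directory of line-delimited JSON files:

import glob

import dask.dataframe as dd
import pandas as pd

# pandas: read each file sequentially, then concatenate the results
df = pd.concat(pd.read_json(path, lines=True) for path in glob.glob("json-data/*.json"))

# Dask: a single call reads all matching files in parallel
ddf = dd.read_json("json-data/*.json", lines=True)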

Dask read JSON: simple examples

Let’s look at a few examples of how to read JSON files into Dask DataFrames.  JSON data can be formatted in several different ways, so it’s good to see how Dask handles each of them.

Suppose you have a people1.json file with the following data.

{ "data": [{"name":"John","age": 30.0,"car":"honda"},{"name":"Sally","age": 54.0,"car":"kia"}] }

Here’s how to read this JSON file into a Dask DataFrame.

import dask.dataframe as dd

ddf = dd.read_json("people1.json", orient="split")

Here are the contents of the Dask DataFrame:

ddf.compute()
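    name   age    car
0   John  30.0  honda
1  Sally  54.0    kia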

The orient="split" argument tells Dask how the JSON is structured so it can split each object’s values into separate columns.

JSON files are usually split into multiple lines so they’re easier to read.  Let’s take a look at the same data with line breaks.  Here are the contents of the people2.json file.

{
	"data": [{
		"name": "John",
		"age": 30.0,
		"car": "honda"
	}, {
		"name": "Sally",
		"age": 54.0,
		"car": "kia"
	}]
}

Here’s how to read people2.json into a Dask DataFrame.

ddf = dd.read_json("people2.json", orient="split")

We can use the same syntax as earlier when reading a JSON file that uses line breaks.

JSON files are sometimes formatted with one object per row.  Take a look at the people3.json file.

{"name":"John","age": 30.0,"car":"honda"}
{"name":"Sally","age": 54.0,"car":"kia"}

Here’s how to read people3.json into a Dask DataFrame.

ddf = dd.read_json("people3.json", lines=True)

When the JSON data contains one object per row, you must set the lines=True argument.
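The resulting DataFrame matches the ones built from people1.json and people2.json:

    name   age    car
0   John  30.0  honda
1  Sally  54.0    kia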

Now let’s look at how to read a JSON file with a more complicated structure into a Dask DataFrame.

Dask read JSON: list

Let’s look at a JSON file that contains a list and see how to read it into a Dask DataFrame.  Here are the contents of the students.json file.

{
	"data": [{
		"name": "Li",
		"age": 15,
		"scores": [34, 99, 86]
	}, {
		"name": "Qu",
		"age": 18,
		"scores": [99, 100, 87]
	}]
}

Here’s how to read students.json into a Dask DataFrame.

ddf = dd.read_json("students.json", orient="split")

ddf.compute()
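   name  age         scores
0    Li   15   [34, 99, 86]
1    Qu   18  [99, 100, 87]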

You don’t need to do anything different when reading JSON data with lists into Dask DataFrames.  Dask is intelligent enough to read the scores list into a single column.

Here’s how to add an average_score column to the DataFrame.

ddf["average_score"] = ddf["scores"].apply(
    lambda x: sum(x) / len(x), meta=("average_score", "float64")
)

ddf.compute()
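   name  age         scores  average_score
0    Li   15   [34, 99, 86]      73.000000
1    Qu   18  [99, 100, 87]      95.333333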

Dask makes it easy to work with JSON files that contain lists.  Let’s look at nested JSON data, which is more complex.

Dask read JSON: nested data

Let’s look at a more complex JSON file with nested data.  Here are the contents of the students2.json file.

{
	"data": [{
		"name": "george",
		"age": 16,
		"exam": {
			"subject": "geometry",
			"score": 56
		}
	}, {
		"name": "nora",
		"age": 7,
		"exam": {
			"subject": "geometry",
			"score": 87
		}
	}]
}

Here’s how to read this JSON data into a Dask DataFrame and split out exam_subject and exam_score to separate columns.

ddf = dd.read_json("students2.json", orient="split")

ddf["exam_subject"] = ddf.exam.apply(lambda x: x["subject"], meta=("exam_subject", "object"))

ddf["exam_score"] = ddf.exam.apply(lambda x: x["score"], meta=("exam_score", "int64"))

ddf.compute()
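     name  age                                  exam  exam_subject  exam_score
0  george   16  {'subject': 'geometry', 'score': 56}      geometry          56
1    nora    7  {'subject': 'geometry', 'score': 87}      geometry          87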

That code is relatively straightforward, but a bit tedious.  Let’s look at a more complex example that’s handled programmatically.

Dask read JSON: flatten JSON

Let’s look at a more complicated JSON file with a list of exams and see how to load it neatly into a Dask DataFrame.  Here are the contents of the students3.json file.

{
	"data": [{
		"name": "george",
		"age": 16,
		"exams": [{
				"subject": "geometry",
				"score": 56
			},
			{
				"subject": "poetry",
				"score": 88
			}
		]
	}, {
		"name": "nora",
		"age": 7,
		"exams": [{
				"subject": "geometry",
				"score": 87
			},
			{
				"subject": "poetry",
				"score": 94
			}
		]
	}]
}

Let’s read the JSON data into a Dask DataFrame.

ddf = dd.read_json("students3.json", orient="split")

Now create a pandas function that’ll explode the exams into different rows and normalize the subject and score data into separate columns.

import pandas as pd

def pandas_fn(df):
    # Explode the exams list so each exam becomes its own row, then
    # flatten the exam dicts into separate subject and score columns.
    exploded = df.explode("exams")
    return pd.concat(
        [
            exploded[["name", "age"]].reset_index(drop=True),
            pd.json_normalize(exploded["exams"]),
        ],
        axis=1,
    )

You can use map_partitions to apply the pandas function to each partition in the DataFrame.

ddf.map_partitions(
    pandas_fn,
    meta=(
        ("name", "object"),
        ("age", "int64"),
        ("subject", "object"),
        ("score", "int64"),
    ),
).compute()
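Here’s the flattened result:

     name  age   subject  score
0  george   16  geometry     56
1  george   16    poetry     88
2    nora    7  geometry     87
3    nora    7    poetry     94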

Dask read JSON: multiple files

Dask is designed to read multiple JSON files into a DataFrame in parallel.

Let’s create a directory with two JSON files and demonstrate how they can both be read into a Dask DataFrame.

Suppose you have the following json-data/pets1.json file:

{
	"data": [{
		"name": "Triss",
		"species": "cat",
		"color": "orange"
	}, {
		"name": "Dale",
		"species": "dog",
		"color": "brown"
	}]
}

And this json-data/pets2.json file:

{
	"data": [{
		"name": "Gregg",
		"species": "bird",
		"color": "green"
	}, {
		"name": "Weston",
		"species": "wolf",
		"color": "gray"
	}]
}

Read all of this data into a Dask DataFrame and print the results.

ddf = dd.read_json("./json-data/pets*.json", orient="split")

ddf.compute()

     name  species   color
0   Triss      cat  orange
1    Dale      dog   brown
0   Gregg     bird   green
1  Weston     wolf    gray

You can use the wildcard character (*) to read multiple JSON files into a Dask DataFrame.

Dask reads all the JSON files in parallel, so the computation executes quickly.  Subsequent computations are also fast because the data is partitioned and workloads run on the partitions in parallel.

Dask read JSON: data stored in S3

Let’s provision a Dask cluster with Coiled and run a query on a 662 million row dataset that’s stored in S3.

Start by provisioning the Dask cluster.

import coiled
import dask.dataframe as dd
from dask.distributed import Client

cluster = coiled.Cluster(name="powers-demo", n_workers=10)

client = Client(cluster)

Read some JSON data into a Dask DataFrame.

ddf = dd.read_json(
    "s3://coiled-datasets/timeseries/20-years/json/*.part",
    storage_options={"anon": True, "use_ssl": True},
    lines=True,
)

You can run ddf.head() to see the first few rows of data.

Now compute the number of unique values in the name column.

ddf["name"].nunique().compute()

As you can see, Dask makes it really easy to read lots of JSON data and run analytical queries.

Conclusion

Dask can read and process large datasets.  There’s a great real-world example that shows how to read 75 GB of JSON data and convert it to Parquet.

JSON isn’t a great file format for big data analyses, so if you’re repeatedly querying the data, it’s best to use a file format that’s optimized for big data workflows, like Parquet.
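For example, here’s a minimal sketch of converting the pets data from earlier to Parquet, assuming the pyarrow engine is installed:

ddf = dd.read_json("./json-data/pets*.json", orient="split")

ddf.to_parquet("./parquet-data", engine="pyarrow")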

That said, there are many large JSON datasets and sometimes you need to work with what you’re provided.  Dask’s parallel processing capabilities make it more than capable for querying large JSON datasets.

