Read JSON into Dask DataFrames
May 9, 2022
This blog post explains how to read JSON into Dask DataFrames. It also explains the limitations of the JSON file format for big data analyses and alternatives that provide better performance.
Dask is a great technology for working with JSON data because it can read multiple files in parallel. Other technologies, like pandas, can only read one JSON file at a time, which is comparatively slow.
Dask read JSON: simple examples
Let’s look at a few examples of how to read JSON files into Dask DataFrames. There are several ways to format JSON data, so it’s good to see how Dask handles each format.
Suppose you have a people1.json file with the following data.
{ "data": [{"name":"John","age": 30.0,"car":"honda"},{"name":"Sally","age": 54.0,"car":"kia"}] }
Here’s how to read this JSON file into a Dask DataFrame.
import dask.dataframe as dd

ddf = dd.read_json("people1.json", orient="split")
Here are the contents of the Dask DataFrame:
ddf.compute()

    name   age    car
0   John  30.0  honda
1  Sally  54.0    kia
The orient="split" argument tells Dask to split the values of each object into different columns.
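For comparison, pandas reads the same single file with the same argument; the difference is that pandas handles one file per call, while Dask can read many files in parallel. A minimal sketch:

import pandas as pd

# Single-file pandas equivalent of the Dask read above
df = pd.read_json("people1.json", orient="split")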
JSON files are usually split into multiple lines so they’re easier to read. Let’s take a look at the same data with line breaks. Here are the contents of the people2.json file.
{
  "data": [
    { "name": "John", "age": 30.0, "car": "honda" },
    { "name": "Sally", "age": 54.0, "car": "kia" }
  ]
}
Here’s how to read people2.json into a Dask DataFrame.
ddf = dd.read_json("people2.json", orient="split")
We can use the same syntax as earlier when reading a JSON file that uses line breaks.
JSON files are sometimes formatted with one object per row. Take a look at the people3.json file.
{"name":"John","age": 30.0,"car":"honda"}
{"name":"Sally","age": 54.0,"car":"kia"}
Here’s how to read people3.json into a Dask DataFrame.
ddf = dd.read_json("people3.json", lines=True)
When the JSON data contains one object per row, you must set the lines=True argument.
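For reference, this one-object-per-line layout is often called JSON Lines or newline-delimited JSON, and it's what pandas produces when writing with orient="records" and lines=True. A minimal sketch that recreates people3.json:

import pandas as pd

df = pd.DataFrame(
    {"name": ["John", "Sally"], "age": [30.0, 54.0], "car": ["honda", "kia"]}
)
# Writes one JSON object per line
df.to_json("people3.json", orient="records", lines=True)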
Now let’s look at how to read a JSON file with a more complicated structure into a Dask DataFrame.
Dask read JSON: list
Let’s look at a JSON file that contains a list and see how to read it into a Dask DataFrame. Here are the contents of the students.json file.
{
  "data": [
    { "name": "Li", "age": 15, "scores": [34, 99, 86] },
    { "name": "Qu", "age": 18, "scores": [99, 100, 87] }
  ]
}
Here’s how to read students.json into a Dask DataFrame.
ddf = dd.read_json("students.json", orient="split")
ddf.compute()

  name  age         scores
0   Li   15   [34, 99, 86]
1   Qu   18  [99, 100, 87]
You don’t need to do anything different when reading JSON data with lists into Dask DataFrames. Dask is intelligent enough to read the scores list into a single column.
Here’s how to add an average_score column to the DataFrame.
ddf["average_score"] = ddf["scores"].apply(
    lambda x: sum(x) / len(x), meta=("average_score", "float64")
)
ddf.compute()

  name  age         scores  average_score
0   Li   15   [34, 99, 86]      73.000000
1   Qu   18  [99, 100, 87]      95.333333
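The meta argument tells Dask the name and dtype of the column the apply will produce, so Dask doesn't have to sample your function to infer them. As an alternative to the per-row lambda, you can explode the list and aggregate; a sketch that assumes each name appears only once:

exploded = ddf.explode("scores")
exploded["scores"] = exploded["scores"].astype("int64")
# Mean score per student
exploded.groupby("name")["scores"].mean().compute()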
Dask makes it easy to work with JSON files that contain lists. Let’s look at nested JSON data, which is more complex.
Dask read JSON: nested data
Let’s look at a more complex JSON file with nested data. Here are the contents of the students2.json file.
{
  "data": [
    { "name": "george", "age": 16, "exam": { "subject": "geometry", "score": 56 } },
    { "name": "nora", "age": 7, "exam": { "subject": "geometry", "score": 87 } }
  ]
}
Here’s how to read this JSON data into a Dask DataFrame and split out exam_subject and exam_score into separate columns.
ddf = dd.read_json("students2.json", orient="split")
ddf["exam_subject"] = ddf.exam.apply(
    lambda x: x["subject"], meta=("exam_subject", "object")
)
ddf["exam_score"] = ddf.exam.apply(
    lambda x: x["score"], meta=("exam_score", "int64")
)
ddf.compute()

     name  age                                  exam exam_subject  exam_score
0  george   16  {'subject': 'geometry', 'score': 56}     geometry          56
1    nora    7  {'subject': 'geometry', 'score': 87}     geometry          87
That code is relatively straightforward, but a bit tedious. Let’s look at a more complex example that’s handled programmatically.
Dask read JSON: flatten JSON
Let’s look at a more complicated JSON file with a list of exams and see how to load them neatly into a Dask DataFrame. Here’s the JSON data in the students3.json file.
{
  "data": [
    {
      "name": "george",
      "age": 16,
      "exams": [
        { "subject": "geometry", "score": 56 },
        { "subject": "poetry", "score": 88 }
      ]
    },
    {
      "name": "nora",
      "age": 7,
      "exams": [
        { "subject": "geometry", "score": 87 },
        { "subject": "poetry", "score": 94 }
      ]
    }
  ]
}
Let’s read the JSON data into a Dask DataFrame.
ddf = dd.read_json("students3.json", orient="split")
Now create a pandas function that’ll explode the exams into different rows and normalize the subject and score data into separate columns.
import pandas as pd

def pandas_fn(df):
    # One row per exam, then flatten each exam dict into columns
    exploded = df.explode("exams")
    return pd.concat(
        [
            exploded[["name", "age"]].reset_index(drop=True),
            pd.json_normalize(exploded["exams"]),
        ],
        axis=1,
    )
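Since pandas_fn is plain pandas code, you can sanity-check it on a small in-memory DataFrame before handing it to Dask. A quick sketch with one record from students3.json:

sample = pd.DataFrame(
    {
        "name": ["george"],
        "age": [16],
        "exams": [
            [
                {"subject": "geometry", "score": 56},
                {"subject": "poetry", "score": 88},
            ]
        ],
    }
)
print(pandas_fn(sample))  # two rows, one per exam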
You can use map_partitions to apply the pandas function to each partition in the DataFrame.
ddf.map_partitions(
    pandas_fn,
    meta=(
        ("name", "object"),
        ("age", "int64"),
        ("subject", "object"),
        ("score", "int64"),
    ),
).compute()

     name  age   subject  score
0  george   16  geometry     56
1  george   16    poetry     88
2    nora    7  geometry     87
3    nora    7    poetry     94
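The meta argument spells out the columns and dtypes of the DataFrame that pandas_fn returns, so Dask can build the task graph without running the function first. The same meta can also be written as a dict, which some find more readable; an equivalent sketch:

ddf.map_partitions(
    pandas_fn,
    meta={"name": "object", "age": "int64", "subject": "object", "score": "int64"},
).compute()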
Dask read JSON: multiple files
Dask is designed to read multiple JSON files into a DataFrame in parallel.
Let’s create a directory with two JSON files and demonstrate how they can both be read into a Dask DataFrame.
Suppose you have the following json-data/pets1.json file:
{
  "data": [
    { "name": "Triss", "species": "cat", "color": "orange" },
    { "name": "Dale", "species": "dog", "color": "brown" }
  ]
}

And this json-data/pets2.json file:

{
  "data": [
    { "name": "Gregg", "species": "bird", "color": "green" },
    { "name": "Weston", "species": "wolf", "color": "gray" }
  ]
}
Read all of this data into a Dask DataFrame and print the results.
ddf = dd.read_json("./json-data/pets*.json", orient="split")
ddf.compute()

     name species   color
0   Triss     cat  orange
1    Dale     dog   brown
0   Gregg    bird   green
1  Weston    wolf    gray
You can use the wildcard operator (*) to read multiple JSON files into a Dask DataFrame.
Dask reads all the JSON files in parallel so the computation executes quickly. Dask will also run subsequent computations quickly because the data is partitioned and workloads are run on the partitions in parallel.
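You can confirm the layout: with the default settings, read_json creates one partition per input file.

print(ddf.npartitions)  # 2, one partition per pets file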
Dask read JSON: data stored in S3
Let’s provision a Dask cluster with Coiled and run a query on a 662 million row dataset that’s stored in S3.
Start by provisioning the Dask cluster.
import coiled
import dask.dataframe as dd
from dask.distributed import Client

cluster = coiled.Cluster(name="powers-demo", n_workers=10)
client = Client(cluster)

Read in some JSON data to a Dask DataFrame.

ddf = dd.read_json(
    "s3://coiled-datasets/timeseries/20-years/json/*.part",
    storage_options={"anon": True, "use_ssl": True},
    lines=True,
)
You can run ddf.head() to see the first few rows of data.
Now compute the number of unique values in the name column.
ddf["name"].nunique().compute()
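If you’re going to run several queries against the same DataFrame, it can pay to persist it in cluster memory first. A sketch, assuming the cluster has enough memory to hold the dataset:

ddf = ddf.persist()  # materialize partitions in distributed memory
ddf["name"].nunique().compute()  # later queries skip re-reading from S3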
As you can see, Dask makes it really easy to read lots of JSON data and run analytical queries.
Conclusion
Dask can read and process large datasets. Another blog post walks through a great real-world example that shows how to read 75GB of JSON data and convert it to Parquet.
JSON isn’t a great file format for big data analyses, so if you’re repeatedly querying the data, it’s best to use a file format that’s optimized for big data workflows, like Parquet.
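For example, a one-time conversion to Parquet lets every later query read the faster columnar copy. A minimal sketch, where the output directory is hypothetical and a Parquet engine like pyarrow is assumed to be installed:

ddf = dd.read_json("./json-data/pets*.json", orient="split")
ddf.to_parquet("./parquet-data/")  # hypothetical output directory

# Later queries hit the Parquet copy instead of re-parsing JSON
ddf = dd.read_parquet("./parquet-data/")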
That said, there are many large JSON datasets and sometimes you need to work with what you’re given. Dask’s parallel processing capabilities make it well suited to querying large JSON datasets.