Reduce memory usage with Dask dtypes

November 17, 2021

Columns in Dask DataFrames are typed, which means they can only hold certain values (e.g. integer columns can’t hold string values). This post gives an overview of DataFrame datatypes (dtypes), explains how to set dtypes when reading data, and shows how to change column types.

Using column types that require less memory can be a great way to speed up your workflows. Properly setting dtypes when reading files is sometimes needed for your code to run without error.

Here’s what’s covered in this post:

  • Inspecting DataFrame column types (dtypes)
  • Changing column types
  • Inferring types when reading CSVs
  • Manually specifying types
  • Problems with type inference
  • Using Parquet to avoid type inference

Inspecting DataFrame types

Let’s start by creating some DataFrames and viewing the dtypes.

Create a pandas DataFrame and print the dtypes.  All code snippets in this post are from this notebook.

import pandas as pd

df = pd.DataFrame({"nums":[1, 2, 3, 4], "letters":["a", "b", "c", "d"]})

print(df.dtypes)

nums        int64
letters    object
dtype: object

The nums column has the int64 type and the letters column has the object type.

Let’s use this pandas DataFrame to create a Dask DataFrame and inspect the dtypes of the Dask DataFrame.

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)

ddf.dtypes

nums        int64
letters    object
dtype: object

The Dask DataFrame has the same dtypes as the pandas DataFrame. 

Changing column types

You can use Dask's astype method to cast a column to a different type. Let's change the nums column to int8.

ddf['nums'] = ddf['nums'].astype('int8')

ddf.dtypes

nums         int8
letters    object
dtype: object

8 bit integers don’t require as much memory as 64 bit integers, but can only hold values between -128 and 127, so this datatype is only suitable for columns that contain small numbers.
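
If you're not sure whether a column's values fit in a smaller type, you can check its range against the limits of the target dtype before casting. Here's a minimal sketch using the toy DataFrame from above and NumPy's iinfo helper:

import numpy as np

# Check the actual value range of the column
print(ddf["nums"].min().compute(), ddf["nums"].max().compute())

# Check the range a given integer dtype can hold (int8: -128 to 127)
print(np.iinfo(np.int8))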

Selecting the best dtypes for your DataFrame is discussed in more detail later in this post.

Inferring types

Let’s read a 663 million row CSV time series dataset from a cloud filestore (S3) and print the dtypes.

We’ll use the Coiled Dask platform to execute these queries on a cluster.

import coiled
import dask.distributed

cluster = coiled.Cluster(name="demo-cluster", n_workers=5)
client = dask.distributed.Client(cluster)

ddf = dd.read_csv(
   "s3://coiled-datasets/timeseries/20-years/csv/*.part",
   storage_options={"anon": True, 'use_ssl': True}
)

print(ddf.dtypes)

timestamp     object
id             int64
name          object
x            float64
y            float64
dtype: object

Dask infers the column types by taking a sample of the data. It doesn’t analyze every row of data when inferring types because that’d be really slow on a large dataset.
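
The amount of data used for inference is configurable. As a rough sketch (the sample keyword takes a number of bytes; the value below is arbitrary), you can ask Dask to look at more data before deciding on dtypes:

ddf = dd.read_csv(
   "s3://coiled-datasets/timeseries/20-years/csv/*.part",
   storage_options={"anon": True, 'use_ssl': True},
   sample=10_000_000,  # bytes used for dtype inference; the default is much smaller
)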

Dask may incorrectly infer types because it only uses a sample.  If the inferred type is wrong then subsequent computations will error out. We’ll discuss these errors and common work-arounds later in this post.

Manually specifying types

This section demonstrates how manually specifying types can reduce memory usage.

Let’s look at the memory usage of the DataFrame:

ddf.memory_usage(deep=True).compute()

Index             140160
id            5298048000
name         41289103692
timestamp    50331456000
x             5298048000
y             5298048000
dtype: int64

The id column takes 5.3 GB of memory and is typed as an int64, even though it only contains values between 815 and 1,193. We don’t need 64 bit integers to store values that small. Let’s type id as a 16 bit integer and see how much memory that saves.

ddf = dd.read_csv(
   "s3://coiled-datasets/timeseries/20-years/csv/*.part",
   storage_options={"anon": True, 'use_ssl': True},
   dtype={
     "id": "int16"
   }
)

ddf.memory_usage(deep=True).compute()

Index             140160
id            1324512000
name         41289103692
timestamp    50331456000
x             5298048000
y             5298048000
dtype: int64

The id column only takes up 1.3 GB of memory when it’s stored as a 16 bit integer.

It’s not surprising that the memory requirements are 4 times lower because int16 is 4x smaller than int64 ;)

Manually specifying column dtypes can be tedious, especially for wide DataFrames, but it’s less error prone than inferring types.
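
One way to reduce the tedium, sketched here rather than taken from the post's notebook, is to start from the dtypes Dask already inferred, override the entries you want to change, and pass the whole mapping back to read_csv. The float32 downcasts are hypothetical tweaks for illustration:

# Start from Dask's current dtypes and override the ones we want to change
dtypes = ddf.dtypes.astype(str).to_dict()
dtypes["x"] = "float32"  # hypothetical: accept lower precision to save memory
dtypes["y"] = "float32"

ddf = dd.read_csv(
   "s3://coiled-datasets/timeseries/20-years/csv/*.part",
   storage_options={"anon": True, 'use_ssl': True},
   dtype=dtypes,
)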

Problems when inferring types

As mentioned earlier, Dask does not look at every value when inferring types. It only looks at a sample of the rows.

Suppose a column has the following values: 14, 11, 10, 22, and “hi”. If Dask only samples 3 values (e.g. 11, 10, and 22), then it might incorrectly assume this is an int64 column.
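
Here's a tiny, self-contained reproduction of that scenario before we look at a real dataset (the file names are made up for the example). The string value lives in the second file, so it never shows up in the inference sample:

import pandas as pd
import dask.dataframe as dd

# Only the second file contains the non-numeric value
pd.DataFrame({"col": [14, 11, 10]}).to_csv("part-0.csv", index=False)
pd.DataFrame({"col": [22, "hi"]}).to_csv("part-1.csv", index=False)

ddf = dd.read_csv("part-*.csv")
print(ddf.dtypes)  # col is inferred as int64
ddf.compute()      # raises ValueError: Mismatched dtypes found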

Let’s illustrate this issue with the New York City Parking Tickets dataset.

ddf = dd.read_csv(
   "s3://coiled-datasets/nyc-parking-tickets/csv/*.csv",
   storage_options={"anon": True, 'use_ssl': True})

len(ddf) # this works

ddf.head() # this errors out

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/d2/116lnkgd0l7f51xr7msb2jnh0000gn/T/ipykernel_9266/95726415.py in <module>
----> 1 ddf.head()

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-----------------------+---------+----------+
| Column                | Found   | Expected |
+-----------------------+---------+----------+
| House Number          | object  | float64  |
| Issuer Command        | object  | int64    |
| Issuer Squad          | object  | int64    |
| Time First Observed   | object  | float64  |
| Unregistered Vehicle? | float64 | int64    |
| Violation Description | object  | float64  |
| Violation Legal Code  | object  | float64  |
| Violation Location    | float64 | int64    |
| Violation Post Code   | object  | float64  |
+-----------------------+---------+----------+

The following columns also raised exceptions on conversion:

- House Number
 ValueError("could not convert string to float: '67-21'")
- Issuer Command
 ValueError("invalid literal for int() with base 10: 'T730'")
- Issuer Squad
 ValueError('cannot convert float NaN to integer')
- Time First Observed
 ValueError("could not convert string to float: '1134P'")
- Violation Description
 ValueError("could not convert string to float: 'BUS LANE VIOLATION'")
- Violation Legal Code
 ValueError("could not convert string to float: 'T'")
- Violation Post Code
 ValueError("could not convert string to float: 'H -'")

Usually this is due to Dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'House Number': 'object',
      'Issuer Command': 'object',
      'Issuer Squad': 'object',
      'Time First Observed': 'object',
      'Unregistered Vehicle?': 'float64',
      'Violation Description': 'object',
      'Violation Legal Code': 'object',
      'Violation Location': 'float64',
      'Violation Post Code': 'object'}

to the call to `read_csv`/`read_table`.

House Number was inferred to be a float64, but it should actually be an object because it contains values like “67-21”.

Let’s look at the full set of dtypes in this DataFrame with print(ddf.dtypes).

Summons Number                         int64
Plate ID                              object
Registration State                    object
Plate Type                            object
Issue Date                            object
Violation Code                         int64
Vehicle Body Type                     object
Vehicle Make                          object
Issuing Agency                        object
Street Code1                           int64
Street Code2                           int64
Street Code3                           int64
Vehicle Expiration Date                int64
Violation Location                     int64
Violation Precinct                     int64
Issuer Precinct                        int64
Issuer Code                            int64
Issuer Command                         int64
Issuer Squad                           int64
Violation Time                        object
Time First Observed                  float64
Violation County                      object
Violation In Front Of Or Opposite     object
House Number                         float64
Street Name                           object
Intersecting Street                   object
Date First Observed                    int64
Law Section                            int64
Sub Division                          object
Violation Legal Code                 float64
Days Parking In Effect                object
From Hours In Effect                  object
To Hours In Effect                    object
Vehicle Color                         object
Unregistered Vehicle?                  int64
Vehicle Year                           int64
Meter Number                          object
Feet From Curb                         int64
Violation Post Code                  float64
Violation Description                float64
No Standing or Stopping Violation    float64
Hydrant Violation                    float64
Double Parking Violation             float64
Latitude                             float64
Longitude                            float64
Community Board                      float64
Community Council                    float64
Census Tract                         float64
BIN                                  float64
BBL                                  float64
NTA                                  float64
dtype: object

Figuring out the right dtypes for each column is really annoying.

We can get past this initial error by specifying the dtypes, as recommended in the error message.

ddf = dd.read_csv(
   "s3://coiled-datasets/nyc-parking-tickets/csv/*.csv",
   dtype={'House Number': 'object',
      'Issuer Command': 'object',
      'Issuer Squad': 'object',
      'Time First Observed': 'object',
      'Unregistered Vehicle?': 'float64',
      'Violation Description': 'object',
      'Violation Legal Code': 'object',
      'Violation Location': 'float64',
      'Violation Post Code': 'object'}
)

ddf.head() will work now, but ddf.describe().compute() will error out with this message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-------------------------+--------+----------+
| Column                  | Found  | Expected |
+-------------------------+--------+----------+
| Date First Observed     | object | int64    |
| Vehicle Expiration Date | object | int64    |
+-------------------------+--------+----------+

The following columns also raised exceptions on conversion:

- Date First Observed
 ValueError("invalid literal for int() with base 10: '01/05/0001 12:00:00 PM'")
- Vehicle Expiration Date
 ValueError("invalid literal for int() with base 10: '01/01/20160411 12:00:00 PM'")

Usually this is due to Dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'Date First Observed': 'object',
      'Vehicle Expiration Date': 'object'}

to the call to `read_csv`/`read_table`.

These dtype inference problems are common when using CSV files. This is one of the many reasons to avoid the CSV file format and use files better suited for data analyses.

Avoiding type inference

You should avoid type inference whenever possible.

Certain file formats, like Parquet, let you store file metadata in the footer. The Parquet metadata includes the schema, so you don’t need to manually specify the dtypes or rely on Dask to perform schema inference.
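
As a rough sketch (the output path below is hypothetical), you could write the cleaned-up DataFrame to Parquet once and then read it back with no dtype arguments at all:

# Write once; the schema travels with the data in the Parquet metadata
ddf.to_parquet("s3://my-bucket/nyc-parking-tickets.parquet")  # hypothetical destination

# Subsequent reads need no dtype inference and no manual dtype dict
ddf = dd.read_parquet("s3://my-bucket/nyc-parking-tickets.parquet")
print(ddf.dtypes)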

See this blog post on the Advantages of the Parquet format for more detail.

New string type

See this video for the string[pyarrow] datatype that was added to pandas 1.3.

The object datatype is notoriously memory hungry and this new datatype will help a lot!
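
As a quick sketch (assuming pandas 1.3+ and pyarrow are installed), opting into the new dtype is just another astype call; the name column here is from the timeseries dataset earlier in the post:

# Convert an object column to the Arrow-backed string dtype
ddf["name"] = ddf["name"].astype("string[pyarrow]")
print(ddf.dtypes)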

Conclusion

Dask DataFrame columns are typed, just like pandas DataFrames.

You’ll sometimes run into type-inference errors, especially when you’re working with CSV files.

Use file formats that don’t require type inference so you don’t need to deal with this headache.

It’s great to see the PyData ecosystem constantly evolving and the pandas team adding a new dtype for strings that’s a lot more efficient. Constant improvements are part of what make the PyData ecosystem so wonderful!

Thanks for reading! If you’re interested in taking Coiled for a spin, it provides hosted Dask clusters, docker-less managed software, and one-click deployments, and you can try it for free today:


$ pip install coiled
$ coiled setup
$ ipython
>>> import coiled
>>> cluster = coiled.Cluster(n_workers=500)