Reduce memory usage with Dask dtypes
November 17, 2021
Columns in Dask DataFrames are typed, which means they can only hold certain values (e.g. integer columns can’t hold string values). This post gives an overview of DataFrame datatypes (dtypes), explains how to set dtypes when reading data, and shows how to change column types.
Using column types that require less memory can be a great way to speed up your workflows. Properly setting dtypes when reading files is sometimes needed for your code to run without error.
Here’s what’s covered in this post:
- Inspecting DataFrame column types (dtypes)
- Changing column types
- Inferring types when reading CSVs
- Manually specifying types
- Problems with type inference
- Using Parquet to avoid type inference

Inspecting DataFrame types
Let’s start by creating some DataFrames and viewing the dtypes.
Create a pandas DataFrame and print the dtypes. All code snippets in this post are from this notebook.
import pandas as pd

df = pd.DataFrame({"nums": [1, 2, 3, 4], "letters": ["a", "b", "c", "d"]})
print(df.dtypes)

nums        int64
letters    object
dtype: object
The nums column has the int64 type and the letters column has the object type.
Let’s use this pandas DataFrame to create a Dask DataFrame and inspect the dtypes of the Dask DataFrame.
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)
ddf.dtypes

nums        int64
letters    object
dtype: object
The Dask DataFrame has the same dtypes as the pandas DataFrame.
Changing column types
Change the nums column to int8. You can use Dask's astype method to cast a column to a different type.
ddf['nums'] = ddf['nums'].astype('int8')
ddf.dtypes

nums         int8
letters    object
dtype: object
8-bit integers don't require as much memory as 64-bit integers, but they can only hold values between -128 and 127, so this datatype is only suitable for columns that contain small numbers.
Selecting the best dtypes for your DataFrame is discussed in more detail later in this post.
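Before downcasting a real column, it's worth confirming that every value actually fits in the smaller type. Here's a minimal sketch using the toy ddf from above (the numpy import is only needed for this check):

import numpy as np

# Compute the actual range of the column
col_min = ddf["nums"].min().compute()
col_max = ddf["nums"].max().compute()

# Limits of the candidate dtype
info = np.iinfo(np.int8)

# Only downcast when every value fits
if info.min <= col_min and col_max <= info.max:
    ddf["nums"] = ddf["nums"].astype("int8")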
Inferring types
Let’s read a 663 million row CSV time series dataset from a cloud filestore (S3) and print the dtypes.
We’ll use the Coiled Dask platform to execute these queries on a cluster.
import coiled
import dask

cluster = coiled.Cluster(name="demo-cluster", n_workers=5)
client = dask.distributed.Client(cluster)

ddf = dd.read_csv(
    "s3://coiled-datasets/timeseries/20-years/csv/*.part",
    storage_options={"anon": True, "use_ssl": True},
)
print(ddf.dtypes)

timestamp     object
id             int64
name          object
x            float64
y            float64
dtype: object
Dask infers the column types by taking a sample of the data. It doesn’t analyze every row of data when inferring types because that’d be really slow on a large dataset.
Dask may incorrectly infer types because it only uses a sample. If the inferred type is wrong then subsequent computations will error out. We’ll discuss these errors and common work-arounds later in this post.
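One knob worth knowing about is read_csv's sample argument, the number of bytes Dask reads when inferring dtypes. Increasing it makes the guess more reliable, though never bulletproof. A sketch, reusing the timeseries dataset from above:

ddf = dd.read_csv(
    "s3://coiled-datasets/timeseries/20-years/csv/*.part",
    storage_options={"anon": True, "use_ssl": True},
    sample=10_000_000,  # sample roughly 10 MB for dtype inference instead of the small default
)

There's also assume_missing=True, which tells Dask to treat inferred integer columns as floats so missing values don't break the read. Even so, a bad value near the end of a file can slip past any sample, which is why explicit dtypes (or a self-describing file format, covered below) are the more reliable fix.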
Manually specifying types
This section demonstrates how manually specifying types can reduce memory usage.
Let’s look at the memory usage of the DataFrame:
ddf.memory_usage(deep=True).compute()

Index             140160
id            5298048000
name         41289103692
timestamp    50331456000
x             5298048000
y             5298048000
dtype: int64
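Those figures are in bytes. A quick convenience sketch (not part of the original output) to view them in gigabytes:

# Convert the byte counts to gigabytes for easier reading
(ddf.memory_usage(deep=True).compute() / 1e9).round(2)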
The id column takes 5.3 GB of memory and is typed as an int64. The id column contains values between 815 and 1,193. We don't need 64-bit integers to store values that are so small. Let's type id as a 16-bit integer and see how much memory that saves.
ddf = dd.read_csv(
    "s3://coiled-datasets/timeseries/20-years/csv/*.part",
    storage_options={"anon": True, "use_ssl": True},
    dtype={"id": "int16"},
)
ddf.memory_usage(deep=True).compute()

Index             140160
id            1324512000
name         41289103692
timestamp    50331456000
x             5298048000
y             5298048000
dtype: int64
The id column only takes up 1.3 GB of memory when it's stored as a 16-bit integer. It's not surprising that the memory requirements are 4 times lower, since int16 is 4x smaller than int64 😉
Manually specifying column dtypes can be tedious, especially for wide DataFrames, but it’s less error prone than inferring types.
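One way to take some of the tedium out of it is to let pandas downcast a small sample and then reuse the resulting dtypes for the full Dask read. This is only a sketch, the part-000.csv path and part-*.csv glob are hypothetical, and it carries the same caveat as Dask's own inference: the sample has to be representative.

# Read a modest sample of one file with pandas (hypothetical local file)
sample = pd.read_csv("part-000.csv", nrows=100_000)

# Downcast numeric columns to the smallest type that holds the sampled values
for col in sample.columns:
    kind = sample[col].dtype.kind
    if kind == "i":
        sample[col] = pd.to_numeric(sample[col], downcast="integer")
    elif kind == "f":
        sample[col] = pd.to_numeric(sample[col], downcast="float")

# Reuse those dtypes when reading the full dataset with Dask (hypothetical glob)
dtypes = {col: str(dtype) for col, dtype in sample.dtypes.items()}
ddf = dd.read_csv("part-*.csv", dtype=dtypes)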
Problems when inferring types
As mentioned earlier, Dask does not look at every value when inferring types. It only looks at a sample of the rows.
Suppose a column has the following values: 14, 11, 10, 22, and "hi". If Dask only samples 3 values (e.g. 11, 10, and 22), then it might incorrectly assume this is an int64 column.
Let’s illustrate this issue with the New York City Parking Tickets dataset.
ddf = dd.read_csv(
    "s3://coiled-datasets/nyc-parking-tickets/csv/*.csv",
    storage_options={"anon": True, "use_ssl": True},
)

len(ddf)    # this works
ddf.head()  # this errors out

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/d2/116lnkgd0l7f51xr7msb2jnh0000gn/T/ipykernel_9266/95726415.py in <module>
----> 1 ddf.head()

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-----------------------+---------+----------+
| Column                | Found   | Expected |
+-----------------------+---------+----------+
| House Number          | object  | float64  |
| Issuer Command        | object  | int64    |
| Issuer Squad          | object  | int64    |
| Time First Observed   | object  | float64  |
| Unregistered Vehicle? | float64 | int64    |
| Violation Description | object  | float64  |
| Violation Legal Code  | object  | float64  |
| Violation Location    | float64 | int64    |
| Violation Post Code   | object  | float64  |
+-----------------------+---------+----------+

The following columns also raised exceptions on conversion:

- House Number
  ValueError("could not convert string to float: '67-21'")
- Issuer Command
  ValueError("invalid literal for int() with base 10: 'T730'")
- Issuer Squad
  ValueError('cannot convert float NaN to integer')
- Time First Observed
  ValueError("could not convert string to float: '1134P'")
- Violation Description
  ValueError("could not convert string to float: 'BUS LANE VIOLATION'")
- Violation Legal Code
  ValueError("could not convert string to float: 'T'")
- Violation Post Code
  ValueError("could not convert string to float: 'H -'")

Usually this is due to Dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'House Number': 'object',
       'Issuer Command': 'object',
       'Issuer Squad': 'object',
       'Time First Observed': 'object',
       'Unregistered Vehicle?': 'float64',
       'Violation Description': 'object',
       'Violation Legal Code': 'object',
       'Violation Location': 'float64',
       'Violation Post Code': 'object'}

to the call to `read_csv`/`read_table`.
House Number was inferred to be a float64, but it should actually be an object because it contains values like "67-21".
Let's look at the full set of dtypes in this DataFrame with print(ddf.dtypes).
Summons Number                         int64
Plate ID                              object
Registration State                    object
Plate Type                            object
Issue Date                            object
Violation Code                         int64
Vehicle Body Type                     object
Vehicle Make                          object
Issuing Agency                        object
Street Code1                           int64
Street Code2                           int64
Street Code3                           int64
Vehicle Expiration Date                int64
Violation Location                     int64
Violation Precinct                     int64
Issuer Precinct                        int64
Issuer Code                            int64
Issuer Command                         int64
Issuer Squad                           int64
Violation Time                        object
Time First Observed                  float64
Violation County                      object
Violation In Front Of Or Opposite     object
House Number                         float64
Street Name                           object
Intersecting Street                   object
Date First Observed                    int64
Law Section                            int64
Sub Division                          object
Violation Legal Code                 float64
Days Parking In Effect                object
From Hours In Effect                  object
To Hours In Effect                    object
Vehicle Color                         object
Unregistered Vehicle?                  int64
Vehicle Year                           int64
Meter Number                          object
Feet From Curb                         int64
Violation Post Code                  float64
Violation Description                float64
No Standing or Stopping Violation    float64
Hydrant Violation                    float64
Double Parking Violation             float64
Latitude                             float64
Longitude                            float64
Community Board                      float64
Community Council                    float64
Census Tract                         float64
BIN                                  float64
BBL                                  float64
NTA                                  float64
dtype: object
Figuring out the right dtypes for each column is really annoying.
We can get past this initial error by specifying the dtypes, as recommended in the error message.
ddf = dd.read_csv(
    "s3://coiled-datasets/nyc-parking-tickets/csv/*.csv",
    storage_options={"anon": True, "use_ssl": True},
    dtype={'House Number': 'object',
           'Issuer Command': 'object',
           'Issuer Squad': 'object',
           'Time First Observed': 'object',
           'Unregistered Vehicle?': 'float64',
           'Violation Description': 'object',
           'Violation Legal Code': 'object',
           'Violation Location': 'float64',
           'Violation Post Code': 'object'},
)
ddf.head() will work now, but ddf.describe().compute() will error out with this message:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-------------------------+--------+----------+
| Column                  | Found  | Expected |
+-------------------------+--------+----------+
| Date First Observed     | object | int64    |
| Vehicle Expiration Date | object | int64    |
+-------------------------+--------+----------+

The following columns also raised exceptions on conversion:

- Date First Observed
  ValueError("invalid literal for int() with base 10: '01/05/0001 12:00:00 PM'")
- Vehicle Expiration Date
  ValueError("invalid literal for int() with base 10: '01/01/20160411 12:00:00 PM'")

Usually this is due to Dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'Date First Observed': 'object',
       'Vehicle Expiration Date': 'object'}

to the call to `read_csv`/`read_table`.
These dtype inference problems are common when using CSV files. This is one of the many reasons to avoid CSVs and use a file format better suited to data analysis.
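In practice this often turns into a loop: run a computation, read the error, add the suggested dtypes, and try again. For this dataset, combining the suggestions from the two error messages above gets past both failures (other computations could still surface additional mismatches):

ddf = dd.read_csv(
    "s3://coiled-datasets/nyc-parking-tickets/csv/*.csv",
    storage_options={"anon": True, "use_ssl": True},
    dtype={
        "House Number": "object",
        "Issuer Command": "object",
        "Issuer Squad": "object",
        "Time First Observed": "object",
        "Unregistered Vehicle?": "float64",
        "Violation Description": "object",
        "Violation Legal Code": "object",
        "Violation Location": "float64",
        "Violation Post Code": "object",
        "Date First Observed": "object",
        "Vehicle Expiration Date": "object",
    },
)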
Avoiding type inference
You should avoid type inference whenever possible.
Certain file formats, like Parquet, let you store file metadata in the footer. The Parquet metadata includes the schema, so you don’t need to manually specify the dtypes or rely on Dask to perform schema inference.
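For example, writing a cleaned-up DataFrame out to Parquet once means that every later read picks the dtypes up from the file's schema instead of guessing. A sketch, with a hypothetical output bucket:

# Write the DataFrame to Parquet once (hypothetical destination bucket)
ddf.to_parquet("s3://my-bucket/nyc-parking-tickets.parquet")

# Later reads get their dtypes from the Parquet schema: no inference, no dtype argument
ddf = dd.read_parquet("s3://my-bucket/nyc-parking-tickets.parquet")
ddf.dtypes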
See this blog post on the Advantages of the Parquet format for more detail.
New string type
See this video for the string[pyarrow] datatype that was added in pandas 1.3. The object datatype is notoriously memory hungry, and this new datatype will help a lot!
Conclusion
Dask DataFrame columns are typed, just like pandas DataFrames.
You’ll sometimes run into type-inference errors, especially when you’re working with CSV files.
Use file formats that don’t require type inference so you don’t need to deal with this headache.
It’s great to see the PyData ecosystem constantly evolving and the pandas team adding a new dtype for strings that’s a lot more efficient. Constant improvements are part of what make the PyData ecosystem so wonderful!

Thanks for reading! If you're interested in taking Coiled Cloud for a spin, which provides hosted Dask clusters, docker-less managed software, and one-click deployments, you can do so for free today.