Columns in Dask DataFrames are typed, which means they can only hold certain values (e.g. integer columns can’t hold string values). This post gives an overview of DataFrame datatypes (dtypes), explains how to set dtypes when reading data, and shows how to change column types.
Using column types that require less memory can be a great way to speed up your workflows. Properly setting dtypes when reading files is sometimes needed for your code to run without error.
Let’s start by creating some DataFrames and viewing the dtypes.
Create a pandas DataFrame and print the dtypes. All code snippets in this post are from this notebook.
import pandas as pd
df = pd.DataFrame({"nums":[1, 2, 3, 4], "letters":["a", "b", "c", "d"]})
print(df.dtypes)
nums int64
letters object
dtype: object
The nums column has the int64 type and the letters column has the object type.
Let’s use this pandas DataFrame to create a Dask DataFrame and inspect the dtypes of the Dask DataFrame.
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=2)
ddf.dtypes
nums int64
letters object
dtype: object
The Dask DataFrame has the same dtypes as the pandas DataFrame.
Change the nums column to int8. You can use Dask's astype method to cast a column to a different type.
ddf['nums'] = ddf['nums'].astype('int8')
ddf.dtypes
nums int8
letters object
dtype: object
8-bit integers don’t require as much memory as 64-bit integers, but they can only hold values between -128 and 127, so this datatype is only suitable for columns that contain small numbers.
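If you’re not sure which integer width is safe for a column, NumPy (which backs Dask DataFrame columns) can report the range each dtype supports. Here’s a minimal sketch:

import numpy as np

print(np.iinfo("int8").min, np.iinfo("int8").max)    # -128 127
print(np.iinfo("int16").min, np.iinfo("int16").max)  # -32768 32767
print(np.iinfo("int64").min, np.iinfo("int64").max)  # roughly ±9.2 quintillion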
Selecting the best dtypes for your DataFrame is discussed in more detail later in this post.
Let’s read a 663 million row CSV time series dataset from a cloud filestore (S3) and print the dtypes.
We’ll use the Coiled Dask platform to execute these queries on a cluster.
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(name="demo-cluster", n_workers=5)
client = Client(cluster)

ddf = dd.read_csv(
    "s3://coiled-datasets/timeseries/20-years/csv/*.part",
    storage_options={"anon": True, "use_ssl": True},
)
print(ddf.dtypes)
timestamp object
id int64
name object
x float64
y float64
dtype: object
Dask infers the column types by taking a sample of the data. It doesn’t analyze every row of data when inferring types because that’d be really slow on a large dataset.
Dask may incorrectly infer types because it only uses a sample. If the inferred type is wrong then subsequent computations will error out. We’ll discuss these errors and common work-arounds later in this post.
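One knob to be aware of (a sketch, not something shown in the original snippets): dd.read_csv accepts a sample argument that controls how many bytes are read from the head of the data for dtype inference. Increasing it makes the sample more representative, though it’s still no guarantee.

ddf = dd.read_csv(
    "s3://coiled-datasets/timeseries/20-years/csv/*.part",
    storage_options={"anon": True, "use_ssl": True},
    sample=1_000_000,  # read roughly 1 MB for dtype inference instead of the default sample size
)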
This section demonstrates how manually specifying types can reduce memory usage.
Let’s look at the memory usage of the DataFrame:
ddf.memory_usage(deep=True).compute()
Index 140160
id 5298048000
name 41289103692
timestamp 50331456000
x 5298048000
y 5298048000
dtype: int64
The id column takes 5.3 GB of memory and is typed as int64. It contains values between 815 and 1,193, so we don’t need 64-bit integers to store numbers that small. Let’s type id as a 16-bit integer and see how much memory that saves.
ddf = dd.read_csv(
    "s3://coiled-datasets/timeseries/20-years/csv/*.part",
    storage_options={"anon": True, "use_ssl": True},
    dtype={"id": "int16"},
)
ddf.memory_usage(deep=True).compute()
Index 140160
id 1324512000
name 41289103692
timestamp 50331456000
x 5298048000
y 5298048000
dtype: int64
The id column only takes up 1.3 GB of memory when it’s stored as a 16 bit integer.
It’s not surprising that the memory requirements are 4 times lower because int16 is 4x smaller than int64 ;)
Manually specifying column dtypes can be tedious, especially for wide DataFrames, but it’s less error prone than inferring types.
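One pattern that cuts down on the typing for wide DataFrames (a sketch using the timeseries columns from earlier, not code from the original post): start every column at a conservative default and override only the columns you know better.

# Start with a conservative default dtype for every column, then override specific columns.
columns = ["timestamp", "id", "name", "x", "y"]
dtypes = dict.fromkeys(columns, "object")
dtypes.update({"id": "int16", "x": "float64", "y": "float64"})

ddf = dd.read_csv(
    "s3://coiled-datasets/timeseries/20-years/csv/*.part",
    storage_options={"anon": True, "use_ssl": True},
    dtype=dtypes,
)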
As mentioned earlier, Dask does not look at every value when inferring types. It only looks at a sample of the rows.
Suppose a column has the following values: 14, 11, 10, 22, and “hi”. If Dask only samples 3 values (e.g. 11, 10, and 22), then it might incorrectly assume this is an int64 column.
Let’s illustrate this issue with the New York City Parking Tickets dataset.
ddf = dd.read_csv(
    "s3://coiled-datasets/nyc-parking-tickets/csv/*.csv",
    storage_options={"anon": True, "use_ssl": True},
)
len(ddf) # this works
ddf.head() # this errors out
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/d2/116lnkgd0l7f51xr7msb2jnh0000gn/T/ipykernel_9266/95726415.py in <module>
----> 1 ddf.head()
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+-----------------------+---------+----------+
| Column | Found | Expected |
+-----------------------+---------+----------+
| House Number | object | float64 |
| Issuer Command | object | int64 |
| Issuer Squad | object | int64 |
| Time First Observed | object | float64 |
| Unregistered Vehicle? | float64 | int64 |
| Violation Description | object | float64 |
| Violation Legal Code | object | float64 |
| Violation Location | float64 | int64 |
| Violation Post Code | object | float64 |
+-----------------------+---------+----------+
The following columns also raised exceptions on conversion:
- House Number
ValueError("could not convert string to float: '67-21'")
- Issuer Command
ValueError("invalid literal for int() with base 10: 'T730'")
- Issuer Squad
ValueError('cannot convert float NaN to integer')
- Time First Observed
ValueError("could not convert string to float: '1134P'")
- Violation Description
ValueError("could not convert string to float: 'BUS LANE VIOLATION'")
- Violation Legal Code
ValueError("could not convert string to float: 'T'")
- Violation Post Code
ValueError("could not convert string to float: 'H -'")
Usually this is due to Dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'House Number': 'object',
'Issuer Command': 'object',
'Issuer Squad': 'object',
'Time First Observed': 'object',
'Unregistered Vehicle?': 'float64',
'Violation Description': 'object',
'Violation Legal Code': 'object',
'Violation Location': 'float64',
'Violation Post Code': 'object'}
to the call to `read_csv`/`read_table`.
House Number was inferred to be a float64, but it should actually be an object because it contains values like “67-21”.
Let’s look at the full set of dtypes in this DataFrame with print(ddf.dtypes).
Summons Number int64
Plate ID object
Registration State object
Plate Type object
Issue Date object
Violation Code int64
Vehicle Body Type object
Vehicle Make object
Issuing Agency object
Street Code1 int64
Street Code2 int64
Street Code3 int64
Vehicle Expiration Date int64
Violation Location int64
Violation Precinct int64
Issuer Precinct int64
Issuer Code int64
Issuer Command int64
Issuer Squad int64
Violation Time object
Time First Observed float64
Violation County object
Violation In Front Of Or Opposite object
House Number float64
Street Name object
Intersecting Street object
Date First Observed int64
Law Section int64
Sub Division object
Violation Legal Code float64
Days Parking In Effect object
From Hours In Effect object
To Hours In Effect object
Vehicle Color object
Unregistered Vehicle? int64
Vehicle Year int64
Meter Number object
Feet From Curb int64
Violation Post Code float64
Violation Description float64
No Standing or Stopping Violation float64
Hydrant Violation float64
Double Parking Violation float64
Latitude float64
Longitude float64
Community Board float64
Community Council float64
Census Tract float64
BIN float64
BBL float64
NTA float64
dtype: object
Figuring out the right dtypes for each column is really annoying.
We can get past this initial error by specifying the dtypes, as recommended in the error message.
ddf = dd.read_csv(
    "s3://coiled-datasets/nyc-parking-tickets/csv/*.csv",
    storage_options={"anon": True, "use_ssl": True},
    dtype={
        "House Number": "object",
        "Issuer Command": "object",
        "Issuer Squad": "object",
        "Time First Observed": "object",
        "Unregistered Vehicle?": "float64",
        "Violation Description": "object",
        "Violation Legal Code": "object",
        "Violation Location": "float64",
        "Violation Post Code": "object",
    },
)
ddf.head() will work now, but ddf.describe().compute() will error out with this message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+-------------------------+--------+----------+
| Column | Found | Expected |
+-------------------------+--------+----------+
| Date First Observed | object | int64 |
| Vehicle Expiration Date | object | int64 |
+-------------------------+--------+----------+
The following columns also raised exceptions on conversion:
- Date First Observed
ValueError("invalid literal for int() with base 10: '01/05/0001 12:00:00 PM'")
- Vehicle Expiration Date
ValueError("invalid literal for int() with base 10: '01/01/20160411 12:00:00 PM'")
Usually this is due to Dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'Date First Observed': 'object',
'Vehicle Expiration Date': 'object'}
to the call to `read_csv`/`read_table`.
These dtype inference problems are common when using CSV files. This is one of the many reasons to avoid the CSV file format and use files better suited for data analyses.
You should avoid type inference whenever possible.
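If you’re stuck with CSVs and just need the read to succeed, one blunt work-around (a sketch, not taken from the original post) is to skip inference entirely: read every column as a string and convert back only the columns you need.

ddf = dd.read_csv(
    "s3://coiled-datasets/nyc-parking-tickets/csv/*.csv",
    storage_options={"anon": True, "use_ssl": True},
    dtype=str,  # every column is read as a string, so there's nothing to infer
)
# Convert back just the columns you need (they may need cleaning first if they contain missing values).
ddf["Feet From Curb"] = ddf["Feet From Curb"].astype("int16")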
Certain file formats, like Parquet, let you store file metadata in the footer. The Parquet metadata includes the schema, so you don’t need to manually specify the dtypes or rely on Dask to perform schema inference.
See this blog post on the Advantages of the Parquet format for more detail.
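As a concrete sketch (the output bucket below is a hypothetical placeholder, not a real dataset): writing the DataFrame out to Parquet once means every later read picks up the dtypes from the file metadata instead of sampling.

ddf.to_parquet("s3://my-bucket/nyc-parking-tickets/parquet/")  # hypothetical output location

ddf = dd.read_parquet("s3://my-bucket/nyc-parking-tickets/parquet/")
print(ddf.dtypes)  # dtypes come straight from the Parquet schema, no dtype= argument needed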
See this video for more on the string[pyarrow] datatype that was added in pandas 1.3.
The object datatype is notoriously memory hungry and this new datatype will help a lot!
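As a rough sketch of what that looks like (assuming pandas 1.3+ and pyarrow are installed in your environment; this isn’t code from the original post):

ddf = dd.read_csv(
    "s3://coiled-datasets/timeseries/20-years/csv/*.part",
    storage_options={"anon": True, "use_ssl": True},
)
ddf["name"] = ddf["name"].astype("string[pyarrow]")  # Arrow-backed strings instead of Python objects
print(ddf.dtypes)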
Dask DataFrame columns are typed, just like pandas DataFrames.
You’ll sometimes run into type-inference errors, especially when you’re working with CSV files.
Use file formats that don’t require type inference so you don’t need to deal with this headache.
It’s great to see the PyData ecosystem constantly evolving and the pandas team adding a new dtype for strings that’s a lot more efficient. Constant improvements are part of what make the PyData ecosystem so wonderful!
Thanks for reading! If you’re interested in taking Coiled for a spin, which provides hosted Dask clusters, docker-less managed software, and one-click deployments, you can do so for free today when you click below.