Converting Dask DataFrames object columns with to_datetime

Sultan Orazbayev May 3, 2022


Introduction

This blog post explains how to convert Dask DataFrame object columns to datetime using Dask DataFrame’s to_datetime function. The to_datetime function is a useful tool in the data cleaning/processing toolkit; this function converts a string representation of a date (and time) into a datetime object. A datetime object opens up a broad range of new operations that can be performed, such as extracting the weekday from a given date, rounding time to the desired resolution, or converting the time across different time zones.

Let’s get started with a minimal example of a dataset with two columns: timestamp_start, which indicates the time at which a person started working, and time_worked, which indicates how long the person worked.

from io import StringIO

from dask.dataframe import from_pandas
from pandas import read_csv

data = StringIO(
    """timestamp_start,time_worked
    2021-01-01  9:25 AM,3 hours 12 minutes
    2021-02-03  4:25 PM,2 hours
    2021-03-05  1:25 PM,15 minutes
    2021-03-05 11:25 PM,55 minutes
    """
)
df = read_csv(data)
ddf = from_pandas(df, npartitions=2)

print(ddf.dtypes)
# timestamp_start    object
# time_worked        object
# dtype: object

Right now, the Dask DataFrame columns have object dtype, as inherited from the pandas DataFrame.

timestamp_start is a string column, which makes it difficult to perform datetime operations. For example, if we would like to see all the rows that correspond to a Friday, then we would have to write a custom function to parse the date and determine which day of the week that date corresponds to. Converting the timestamp_start to a datetime object representation will make this operation much easier.
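To make the pain point concrete, here is a minimal sketch of what that manual approach could look like (the is_friday helper and its format string are hypothetical, written just for this toy data):

from datetime import datetime

# hypothetical manual approach: parse each string by hand and check the weekday
# (Monday is 0, ..., Friday is 4)
def is_friday(raw: str) -> bool:
    parsed = datetime.strptime(raw.strip(), "%Y-%m-%d %I:%M %p")
    return parsed.weekday() == 4

fridays = ddf[ddf["timestamp_start"].apply(is_friday, meta=("timestamp_start", "bool"))]

This works, but every new datetime operation would need another hand-written parser, which is exactly what to_datetime saves us from.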

Let’s use the to_datetime function to convert the strings into datetime objects:

from dask.dataframe import to_datetime

ddf["converted_timestamp_start"] = to_datetime(ddf["timestamp_start"])

print(ddf.dtypes)
# timestamp_start                      object
# time_worked                          object
# converted_timestamp_start    datetime64[ns]
# dtype: object

Note that the converted_timestamp_start column is now in datetime64[ns] format. Many new functions/methods are now available via the .dt accessor. For example, calculating the day of the week can be done with .dt.dayofweek, which returns 0 for Monday, 1 for Tuesday, etc.

ddf["day_of_week"] = ddf["converted_timestamp_start"].dt.dayofweek

print(ddf[["converted_timestamp_start", "day_of_week"]].compute())
#   converted_timestamp_start  day_of_week
# 0       2021-01-01 09:25:00            4
# 1       2021-02-03 16:25:00            2
# 2       2021-03-05 13:25:00            4
# 3       2021-03-05 23:25:00            4
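This also means the Friday filter mentioned earlier reduces to a simple comparison, along these lines:

# select only the rows that fall on a Friday (dayofweek == 4)
fridays = ddf[ddf["day_of_week"] == 4]
print(fridays[["converted_timestamp_start", "day_of_week"]].compute())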

Let’s consider another task that would be difficult to achieve with a string representation of a date-time, but is simple when dealing with datetime objects. Suppose we are interested in calculating the timestamp that corresponds to the time at which the process has been completed. To achieve this we can convert the time_worked column into a Timedelta object, which expresses an absolute difference in times.

from pandas import to_timedelta

ddf["converted_time_worked"] = ddf["time_worked"].apply(
    to_timedelta, meta=("time_worked", "timedelta64[ns]")
)

print(ddf[["converted_timestamp_start", "converted_time_worked"]].compute())
#   converted_timestamp_start converted_time_worked
# 0       2021-01-01 09:25:00       0 days 03:12:00
# 1       2021-02-03 16:25:00       0 days 02:00:00
# 2       2021-03-05 13:25:00       0 days 00:15:00
# 3       2021-03-05 23:25:00       0 days 00:55:00
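As an aside, Dask DataFrame also ships a to_timedelta function that maps pandas.to_timedelta over the partitions, so the same conversion could likely be written without apply (a sketch, assuming dask.dataframe.to_timedelta is available in your Dask version):

from dask.dataframe import to_timedelta as dd_to_timedelta

# equivalent conversion without apply
ddf["converted_time_worked"] = dd_to_timedelta(ddf["time_worked"])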

Now that we have converted both columns (converted_timestamp_start and converted_time_worked), calculating the end time of work can be expressed as a summation. Note that this operation correctly calculates the date of completion in the last row by carrying over the work onto the following day.

ddf["work_completed"] = ddf["converted_timestamp_start"] + ddf["converted_time_worked"]

print(
    ddf[
        ["converted_timestamp_start", "converted_time_worked", "work_completed"]
    ].compute()
)
#   converted_timestamp_start converted_time_worked      work_completed
# 0       2021-01-01 09:25:00       0 days 03:12:00 2021-01-01 12:37:00
# 1       2021-02-03 16:25:00       0 days 02:00:00 2021-02-03 18:25:00
# 2       2021-03-05 13:25:00       0 days 00:15:00 2021-03-05 13:40:00
# 3       2021-03-05 23:25:00       0 days 00:55:00 2021-03-06 00:20:00

To give another example: suppose that, for preparing a report, we wanted to round down the starting time to 15-minute increments. This would be as easy as running:

ddf["converted_timestamp_start"].dt.floor('15 min').compute()
# 0   2021-01-01 09:15:00
# 1   2021-02-03 16:15:00
# 2   2021-03-05 13:15:00
# 3   2021-03-05 23:15:00
# Name: converted_timestamp_start, dtype: datetime64[ns]

There are many other useful functions/methods available via .dt, such as extracting specific parts of the datetime or changing its resolution; for the full list, refer to the Dask docs.
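As a quick illustration of a couple of these (a minimal sketch reusing the column created above):

# extract individual components of the datetime
print(ddf["converted_timestamp_start"].dt.year.compute())
print(ddf["converted_timestamp_start"].dt.hour.compute())

# or change the resolution, e.g. round to the nearest hour
print(ddf["converted_timestamp_start"].dt.round("1H").compute())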

Dask to_datetime: dealing with data that cannot be converted to datetime

In dealing with real-world data, there will often be errors and inconsistencies. The Dask DataFrame function to_datetime has a useful keyword argument, errors, that allows some flexibility in handling them. Let’s use messier data for the next example:

from io import StringIO

from dask.dataframe import from_pandas, to_datetime
from pandas import read_csv

data = StringIO(
    """timestamp_start,time_worked
    2021-01-01  9:25 AM,3 hours 12 minutes
    2021-02-03  4:25 PM,2 hours
    missing            ,15 minutes
    2021-03-05 11:?? PM,55 minutes
    """
)

df = read_csv(data)
ddf = from_pandas(df, npartitions=2)

print(ddf.dtypes)
# timestamp_start    object
# time_worked        object
# dtype: object

In the example above, the last two timestamp_start entries are malformed, so running to_datetime on this column will raise an error: ParserError: Unknown string format: 2021-03-05 ?.

This is the default behavior of to_datetime: if a string cannot be converted to a datetime object, the function raises an error. The function also allows more flexible handling of errors, namely coercing or ignoring them.
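Because Dask evaluates lazily, the error from the default behavior typically surfaces only when the result is computed. A rough sketch of what that looks like (catching ValueError is an assumption that holds for dateutil's ParserError, which subclasses it):

converted = to_datetime(ddf["timestamp_start"])  # no error yet; only the task graph is built

try:
    converted.compute()  # parsing happens here and fails on the malformed rows
except ValueError as exc:
    print(f"conversion failed: {exc}")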

Coercion in this case means that Dask will try to convert the given object to a datetime object and, if it fails, will return a NaT (not a time) entry, which represents a missing value in datetime data. To apply coercion, we need to pass the keyword argument errors='coerce':

print(to_datetime(ddf["timestamp_start"], errors="coerce").compute())
# 0   2021-01-01 09:25:00
# 1   2021-02-03 16:25:00
# 2                   NaT
# 3                   NaT
# dtype: datetime64[ns]
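Once the unparseable entries have been coerced to NaT, the usual missing-data tools can be used to inspect or drop them, for example:

converted = to_datetime(ddf["timestamp_start"], errors="coerce")

# keep only the rows that parsed successfully...
valid_rows = ddf[converted.notnull()]

# ...or look at the rows that failed to parse
print(ddf[converted.isnull()].compute())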

Ignoring in this case means that Dask will try to convert the object to datetime and, if it fails, the original value will be retained. To apply this option, we need to pass the keyword argument errors="ignore":

print(to_datetime(ddf["timestamp_start"], errors="ignore").compute())
# 0        2021-01-01 09:25:00
# 1        2021-02-03 16:25:00
# 2        missing
# 3        2021-03-05 11:?? PM
# dtype: object

Note that if we ignore the errors, the advantages of the datetime column disappear: the .dt accessor methods work only on datetime objects, and by ignoring the errors we retain the original string values of the data.

Dask to_datetime: dealing with different date formats

Another common scenario is encountering different datetime formats in the raw data.

from io import StringIO

from dask.dataframe import from_pandas, to_datetime
from pandas import read_csv

data = StringIO(
    """timestamp_start,time_worked
2021-01-01  9:25 AM,3 hours 12 minutes
"Thursday, October 9, 2022 14:25",2 hours
"January 12, 2022 14:25",15 minutes
    """
)

df = read_csv(data)
ddf = from_pandas(df, npartitions=2)

ddf["converted_timestamp_start"] = to_datetime(ddf["timestamp_start"])
print(ddf[["timestamp_start", "converted_timestamp_start"]].compute())
#                    timestamp_start converted_timestamp_start
# 0              2021-01-01  9:25 AM       2021-01-01 09:25:00
# 1  Thursday, October 9, 2022 14:25       2022-10-09 14:25:00
# 2           January 12, 2022 14:25       2022-01-12 14:25:00

The example above shows that to_datetime will handle many different formats out of the box.

For more fine-grained control it’s also possible to state the format of the data explicitly. Consider the example below, where the date column uses a non-standard layout, e.g. produced by a particular logging application. We can parse this data by specifying the keyword argument format, which describes the layout of the datetime in the data using the codes listed in the Python docs.

from io import StringIO

from dask.dataframe import from_pandas, to_datetime
from pandas import read_csv

data = StringIO(
    """timestamp_start,time_worked
year 2021: 01/01 9:25 AM,3 hours 12 minutes
year 2021: 01/03 3:25 PM,2 hours
year 2021: 01/05 11:25 AM,2 hours
"""
)
df = read_csv(data)
ddf = from_pandas(df, npartitions=2)

ddf["converted_timestamp_start"] = to_datetime(
    ddf["timestamp_start"], format="year %Y: %m/%d %I:%M %p"
)
print(ddf[["timestamp_start", "converted_timestamp_start"]].compute())
#              timestamp_start converted_timestamp_start
# 0   year 2021: 01/01 9:25 AM       2021-01-01 09:25:00
# 1   year 2021: 01/03 3:25 PM       2021-01-03 15:25:00
# 2  year 2021: 01/05 11:25 AM       2021-01-05 11:25:00

Dask to_datetime: dealing with different time zones

Another challenge of working with datetime is converting data across different time zones. For example, suppose a financial analyst is working with a dataset of news reported at different geographical locations and is interested in converting the local news time to another time zone, e.g. converting it to the US Eastern time zone to examine whether there is market activity at the time of a news release.

In general this problem is difficult: time zone boundaries and offsets change over time, so one location could be in different time zones in different years. In addition, daylight saving time changes affect some locations but not others. Fortunately, most of these complex conversions can be handled with the .dt accessor.

from io import StringIO

from dask.dataframe import from_pandas, to_datetime
from pandas import read_csv

data = StringIO(
    """timestamp_local,location
2021-01-01 09:01:12,Asia/Almaty
2021-01-01 09:01:12,Europe/London
2021-01-01 09:01:12,America/New_York
"""
)
df = read_csv(data)
ddf = from_pandas(df, npartitions=2)

ddf["converted_date"] = to_datetime(ddf["timestamp_local"], utc=False)
print(ddf[["timestamp_local", "converted_date"]].compute())
#        timestamp_local      converted_date
# 0  2021-01-01 09:01:12 2021-01-01 09:01:12
# 1  2021-01-01 09:01:12 2021-01-01 09:01:12
# 2  2021-01-01 09:01:12 2021-01-01 09:01:12

Now that the timestamp has been converted into a datetime object, we can use the supporting method tz_localize to make these datetime objects time zone-aware, i.e. attach the correct time zone given in the location column. After that, we can use tz_convert to convert these timestamps to the target time zone, in our case "America/New_York".

Let’s define a utility function that will help us achieve this:

def convert_tz(datetime_object, local_timezone):
    # attach the local time zone, then convert to New York time
    timezone_aware = datetime_object.tz_localize(local_timezone)
    timezone_est = timezone_aware.tz_convert("America/New_York")
    return timezone_est

ddf["converted_date_tz_aware"] = ddf[["converted_date", "location"]].apply(
    lambda row: convert_tz(row["converted_date"], row["location"]),
    axis=1,
    meta=("converted_date_tz_aware", "datetime64[ns, America/New_York]"),
)

print(ddf[["location", "converted_date_tz_aware"]].compute())
#            location   converted_date_tz_aware
# 0       Asia/Almaty 2020-12-31 22:01:12-05:00
# 1     Europe/London 2021-01-01 04:01:12-05:00
# 2  America/New_York 2021-01-01 09:01:12-05:00

From the above we can see that the New York time is correctly left without adjustment, while events that happened in London and Almaty are correctly converted to the corresponding (earlier) times in the New York time zone.
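To tie this back to the motivating example, we could then check whether each news release falls within regular trading hours in New York; a rough sketch, using a simplified 09:00 to 16:00 window as the assumed definition of market hours:

ny_hour = ddf["converted_date_tz_aware"].dt.hour

# keep only the releases that occurred between 09:00 and 16:00 New York time
during_market_hours = ddf[(ny_hour >= 9) & (ny_hour < 16)]
print(during_market_hours[["location", "converted_date_tz_aware"]].compute())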

Conclusion

When working with time-related data, such as dates and timestamps, it’s useful to convert the raw data to the datetime format using to_datetime. Other time-related data types have corresponding functions, e.g. timedelta objects can be created with to_timedelta. Having data in a datetime format makes it easy to perform datetime operations on the data, without the need to parse the string representations or to run custom calculations. This will save you time and enable you to get more value out of your data.

