Dask Heartbeat by Coiled: September 2021

The Coiled Team September 17, 2021


Introduction

The Dask community is highly distributed, with different teams working independently. This is powerful, but it sometimes makes it hard for people within the community to see everything that is going on. The Dask Heartbeat by Coiled is a monthly publication intended to centralize and broadcast Dask news from the previous month.

If you want something added to this list, either send an email to info@coiled.io or tweet and tag @dask_dev, and we’ll try to include it. Keep reading for the latest updates.

Dask Heartbeat

Performance of Parquet with Dask

Recently, the community noticed some challenges with processing large datasets stored in the Parquet file format on Amazon S3 buckets. These include exceeding the S3 request rate limits and difficulty parsing large metadata files, among other issues. James Bourbeau, Joris Van den Bossche, Martin Durant, and Richard (Rick) Zamora are leading the effort to reproduce and fix these issues. You can follow the conversations here.

Dask + Google Summer of Code 2021

As a part of GSoC’21, Freyam Mehta improved task graph visualizations and extended the HTML representations for Dask high-level graphs. Task graphs now show the size of each layer and can be color-coded according to layer type. Dask also has new HTML representations in Jupyter Notebooks for the Process Interface and Security classes. Check out the project outcomes and more details in this blog post.
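For readers who have not looked at high-level graphs before, here is a minimal sketch of where the per-layer sizes come from. It assumes a small example array (the shape and chunking are arbitrary) and inspects the `HighLevelGraph` attached to a Dask collection. The commented-out `visualize` call, including the `color="layer_type"` option, is an assumption about the rendering API and needs the optional graphviz dependency.

```python
import dask.array as da

# A small illustrative computation; shape and chunks are arbitrary choices.
x = da.ones((10, 10), chunks=(5, 5))
y = (x + 1).sum()

# Every Dask collection built this way carries a HighLevelGraph,
# which maps layer names to Layer objects (each a mapping of tasks).
hlg = y.dask

# The "size" of a layer is the number of tasks it contains.
layer_sizes = {name: len(layer) for name, layer in hlg.layers.items()}

for name, size in layer_sizes.items():
    layer_type = type(hlg.layers[name]).__name__
    print(f"{name}: {size} task(s), layer type {layer_type}")

# Optional rendering, assuming graphviz is installed:
# hlg.visualize(filename="hlg.svg", color="layer_type")
```

In a Jupyter Notebook, displaying `y.dask` as the last expression in a cell shows the HTML representation of the same graph.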

Optimizing High-Level Expressions

Aaron Meurer and Jim Crist-Harif are exploring how to optimize high-level queries and expressions. As Matthew Rocklin mentioned in the corresponding issue, the idea here is to understand user intent and rewrite their code to improve performance with Dask.

Dask DataFrame ORC API

Richard (Rick) Zamora worked on refactoring and expanding the Dask DataFrame ORC API. Dask DataFrame now has better support for reading and writing to ORC files, including support for working with partitioned datasets.

Update on HTML Representations

In the ongoing effort to improve and add more HTML Representations to Dask, Jacob Tomlinson migrated HTML representations to use jinja2 templates. This migration makes it easier to create and maintain these representations. It also makes this tooling more accessible to projects that are downstream of Dask.

Releases

Over the month of August, both Dask and Distributed versions 2021.08.0 and 2021.08.1 were released.

Dask Monthly Community Meeting 

Some highlights from the September Dask community meeting:

  • The NVIDIA team released a new package, CUDA Python, last month. It “provides Cython bindings and Python wrappers for the driver and runtime API for existing toolkits and libraries to simplify GPU-based accelerated processing.” ~ What is CUDA Python?

Full meeting notes are available here.

You’re All Caught Up On Dask

That’s it. Thanks for reading.

Coiled Cloud provides hosted Dask clusters, Docker-less managed software, and one-click deployments. If you’re interested in taking it for a spin, you can do so for free today when you click below.

Try Coiled Cloud