Dask Heartbeat by Coiled
• December 17, 2020
Introduction
The Dask community is highly distributed with different teams working independently. This is powerful but sometimes makes it hard for people within the community to see everything that is going on. The Dask Heartbeat by Coiled is a bi-weekly publication intended to centralize and broadcast Dask news over the previous two weeks.
If you want something added to this list, either send an e-mail to info@coiled.io or tweet and tag @dask_dev, and we’ll try to include it.
Dask 2020.12.0 release
Dask and Distributed version 2020.12.0 was released last week. This release contains many updates (it’s the first release in two months). Some highlights include:
- Switching to CalVer for the versioning scheme. We plan to write more about the motivations for this next month. The previous release was version 2.30.1, while this version is 2020.12.0.
- The scheduler can now receive Dask HighLevelGraphs instead of raw dictionary task graphs. This allows for a much more efficient communication of task graphs from the client to the scheduler. This is currently off by default but is configurable for early adopters with the optimization.fuse.active config value.
- Introduction of new HighLevelGraph layer objects including BasicLayer, Blockwise, BlockwiseIO, ShuffleLayer, and more.
- Added support for applying custom Layer-level annotations like priority, retries, etc. with the new dask.annotate context manager.
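As a rough illustration of the annotation support mentioned in the last highlight, the sketch below creates a Dask array inside a dask.annotate block and then inspects the annotations recorded on the resulting graph layers. The array shape, chunk sizes, and annotation values are arbitrary choices for this example, and how the scheduler acts on each annotation may vary between versions.

```python
import dask
import dask.array as da

# Annotation values are arbitrary, chosen only for illustration.
with dask.annotate(priority=10, retries=3):
    # Graph layers created inside this block pick up the annotations.
    x = da.ones((10, 10), chunks=(5, 5))

# Collect the annotations stored on each HighLevelGraph layer.
graph = x.__dask_graph__()
annotated = {
    name: layer.annotations
    for name, layer in graph.layers.items()
    if layer.annotations
}

for name, annotations in annotated.items():
    print(name, annotations)
```

Layers built outside the with block carry no annotations, so this can be used to mark only the expensive or flaky part of a larger computation.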
XGBoost 1.3.0 release
The newly released version 1.3.0 of XGBoost contains several updates that improve XGBoost + Dask integration. This is part of the larger effort to migrate the functionality of Dask-XGBoost into the mainline XGBoost codebase.
NVTabular is now built on Dask-CuDF
NVTabular, a library for processing tabular data needed to train and deploy recommender-systems models on GPUs, introduced a new Dask-CuDF backend to support scalable preprocessing. Rick Zamora (NVIDIA) outlines some recent NVTabular developments in this blogpost https://medium.com/rapids-ai/nvtabular-all-in-on-dask-6241b4e9ca19.
Dask-SQL Updates
Nils Braun (Bosch) shares his blogpost on using SQL to drive Dask on Kubernetes.
Also, in other fun Dask/Pandas/SQL news, we discovered that the Dask-SQL project also magically works on Pandas.
This is one nice side effect of the close partnership between the two projects.
Stumpy 1.6.0 release
The STUMPY library for time series analysis improves its Dask support in its recent release.
yt integration continues
Maintainers of the popular yt framework for computation and visualization of volumetric data are busy implementing Dask support, and have shared a slide deck on their recent progress.
Quansight delivers Dask Webinar
Dhavide Aruliah (Quansight) delivered a webinar on Dask: https://twitter.com/quansightai/status/1334161550504968192
CZI EOSS Grantee Program
Ben Zaitlen (NVIDIA) presented Dask to OSS maintainers and Life Science practitioners at the Chan Zuckerberg Initiative’s Essential Open Source Software for Science gathering.
Video available here (Dask was on Day Three at the end)
We’re also glad to announce that Genevieve Buckley will be joining full time in February as the Dask Life Science fellow (generously funded by the CZI EOSS program). We’ll have a more detailed announcement next month, and are very excited. Genevieve will be the first employee of Dask itself as an organization, rather than one of the supporting companies.
Deploying Jupyter for Dask on ARM on Kubernetes
Holden Karau walks through how to deploy Jupyter Lab/Notebook on ARM on Kubernetes with Dask support in this blogpost https://scalingpythonml.com/2020/12/12/deploying-jupyter-lab-notebook-for-dask-on-arm-on-k8s.html
KNN blogpost by NVIDIA RAPIDS
Activity at annual AGU conference
The American Geophysical Union runs an annual conference. Dask took this community by storm a couple of years ago with the Pangeo project. This year is no different.
For reference, CMIP is the Coupled Model Intercomparison Project. It’s the standard multi-institutional model for climate change and one of the grander humanity-focused projects we see today.
There are many other happenings at this conference, including this announcement from the climpred project
2i2c is hiring
2i2c is looking to hire an open-source infrastructure engineer to work on cloud infrastructure for research and education using projects like JupyterHub and Dask. For more information, see their job posting at https://2i2c.org/job/osie-pangeo.
Micro-optimizing and refactoring the Distributed Scheduler
John Kirkham (NVIDIA) has continued making micro-optimizations to the scheduler as part of a larger effort to boost performance:
- https://github.com/dask/distributed/pull/4358
- https://github.com/dask/distributed/pull/4355
- https://github.com/dask/distributed/pull/4351
- https://github.com/dask/distributed/pull/4344
- https://github.com/dask/distributed/pull/4348
- https://github.com/dask/distributed/pull/4341
- https://github.com/dask/distributed/pull/4342
- …
He has also recently begun to decouple the state machine and networking communication parts of the scheduler.
Dask-Jobqueue 0.7.2 release
See https://jobqueue.dask.org/en/latest/changelog.html for the full list of changes.
Wrapping Up
That’s all. Thanks for reading.
If you’re interested in taking Coiled Cloud for a spin, you can do so for free today when you click below.