Design Principles of Distributed Systems: Dask and PySpark
November 23, 2020
We were recently joined by Holden Karau, Spark maintainer and former Princess of the Covariance Matrix (long story), for a discussion on the design of Dask, how it compares to PySpark, and why these tradeoffs were made. In this conversation with Coiled’s CEO Matt Rocklin and Head of Data Science Evangelism and Marketing Hugo Bowne-Anderson, Holden brought her years of experience working on “big data” (speaking for herself, not for the project or anyone else), as well as her more recent explorations of Python-specific tooling beyond Spark (you can follow along on her YouTube channel and blog).
You can catch the live stream replay below.
In this post, we cover:
- Holden’s journey learning Dask,
- Key trade-offs between Dask and PySpark,
- Dask’s approach to adaptive scaling, and
- Resource management and memory use.
Getting started with Dask
On learning Dask in public: “I want people to know that it’s normal for this stuff to be hard. It’s really fun, but it’s hard.”
We asked Holden to define Dask, and we loved her unique answer: she described what’s going on under the hood of Dask, rather than what it does for scaling things like pandas or NumPy code. Since Holden is a Spark maintainer, we knew she would bring a new perspective to a Dask conversation.
Feeling slightly burnt out on Spark, Holden wanted to return to the perspective of someone who is new to a tool and revisit the process of learning a distributed system, a process seasoned developers have often long forgotten. Holden is learning Dask in public, and when Hugo asked about the motivation behind that, she said, “I want people to know that it’s normal for this stuff to be hard. It’s really fun, but it’s hard.”
Dask vs. Spark
“It’s about the challenges of being a monolith vs. the challenges of being one of many packages, and there’s no clear solution.”
We jumped into some of Holden’s early experiences and roadblocks when learning Dask, and Matt offered suggestions, solutions, and insights into the world of Dask. A lot of the bumps Holden ran into provided good context for a larger conversation on the differences between Spark and Dask. On this, Matt noted, “It’s about the challenges of being a monolith vs. the challenges of being one of many packages, and there’s no clear solution.”
Then we covered local modes and how Dask approaches local execution compared to Spark. Matt opened up the Dask documentation and showed us a demo, noting, “There’s a thread pool sitting around and so Dask can by default just run on a thread pool…and what’s interesting is we started as a single-machine system first.” With Dask, local was the default mode and distributed came later; Spark did the exact opposite, starting distributed and then building out a local mode.
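To make this concrete, here is a minimal sketch of that default local execution: building a small task graph with `dask.delayed` and running it on the local thread pool, with no cluster setup at all (the function names are made up for illustration).

```python
import dask


# Two trivial tasks; @dask.delayed makes them lazy, so calling them
# builds a task graph instead of executing immediately.
@dask.delayed
def inc(x):
    return x + 1


@dask.delayed
def add(a, b):
    return a + b


total = add(inc(1), inc(2))  # a three-node task graph, nothing has run yet

# compute() executes the graph; "threads" is Dask's default local
# scheduler -- a thread pool on the current machine.
result = total.compute(scheduler="threads")
print(result)  # 5
```

The same graph can later be pointed at a distributed scheduler without changing the task code, which is the payoff of the local-first design Matt described.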
Adaptive scaling, resource management, and memory use
Dask has low-level telemetry tracking, which is the backbone of how Dask makes smart scheduling decisions.
After we worked through her specific questions about Dask and the related architectural and design differences between Dask and Spark, Holden wanted to make sure we covered two things in particular: Dask’s adaptive scaling, and distributed resource management as it connects to memory use. This led to a focus on Dask’s low-level telemetry tracking, which is the backbone of how Dask makes smart scheduling decisions. We then talked about how Spark’s smarts sit at a high level, with SQL query optimization, while Dask’s smarts sit at a low level, with sophisticated task scheduling.
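As a sketch of what adaptive scaling looks like from user code (assuming the `dask.distributed` package is installed; the in-process `LocalCluster` here just stands in for a real deployment on Kubernetes, SLURM, etc.):

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# An in-process local cluster, used here only for illustration.
cluster = LocalCluster(n_workers=1, processes=False)

# Ask the scheduler to add and remove workers based on its own load
# telemetry, within the given bounds.
cluster.adapt(minimum=1, maximum=4)

client = Client(cluster)

# Some throwaway work so the scheduler has load to react to.
x = da.random.random((1_000, 1_000), chunks=(250, 250))
result = x.mean().compute()

client.close()
cluster.close()
```

The key call is `cluster.adapt(...)`: because the scheduler already tracks per-task timing and memory telemetry, it can estimate whether more workers would help and scale accordingly, rather than relying on a fixed cluster size.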
Lastly, we tackled resource management and memory use with Dask. Explaining Dask’s philosophy on this subject, Matt said, “A lot of Dask users are on resource management systems that have long queue times and so they would much rather Dask kill the worker and bring it back up again than the overbearing job manager kill the worker and then resubmit a new job and wait an hour for it to arrive in the queue.” Holden went on to compare Dask’s resource management process to Spark’s and noted that some of the Dask features Matt discussed are now being introduced to Spark.
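Matt’s point about Dask preferring to kill and restart its own workers maps onto the worker-memory thresholds in Dask’s configuration. A hedged sketch: the config keys below are Dask’s real names, but the fraction values are illustrative, not a recommendation.

```python
import dask

# Worker memory thresholds, each a fraction of the worker's memory limit.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling data to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # kill the worker; the scheduler
                                                  # restarts it and reruns lost work
})

terminate_at = dask.config.get("distributed.worker.memory.terminate")
print(terminate_at)
```

The last threshold is exactly the behavior Matt described: rather than letting an external job manager kill the whole job and send it back to the queue, Dask terminates the over-budget worker itself and recovers by rescheduling its tasks.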
A huge thank you to Holden for taking the time to speak with us! Be sure to follow along on her Dask journey, and to check out her new book.
For more on distributed systems and getting the most out of Dask, check out Coiled Cloud by signing up below for free.