If you want to run Python code at scale, you have several options. Arguably the two most popular today are Apache Spark and Dask.
There are valid reasons to choose either of these tools. Historically, Spark performs better on workloads that consist mostly of SQL queries. Dask is implemented natively in Python, which makes it trivial for Python developers to adopt. Dask also offers significantly more flexibility than Spark, which makes it a great fit when you have to escape pre-defined SQL-like structures.
Let's look at why you would choose one over the other in more detail. We’ll cover a few dimensions you may consider:
We think that both projects are great, but we’re heavily biased towards Dask. Dask made pragmatic design choices that bring benefit to Python users. That said, we’ll try to keep things on-the-level for most of the article. We’ll occasionally include subsections where we express our opinions a bit more freely.
First, most parallel work is simple embarrassingly parallel work like this:
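A sketch of what that work often looks like, here using only the standard library's `concurrent.futures` (the `clean` function and `records` list are hypothetical stand-ins). Both Spark and Dask offer near-drop-in equivalents of this map-over-a-collection pattern:

```python
from concurrent.futures import ThreadPoolExecutor

def clean(record):
    # Hypothetical per-record transformation; tasks share no state,
    # so they can all run independently and in parallel
    return record.strip().lower()

records = ["  Alpha ", "BETA", " gamma  "]
with ThreadPoolExecutor(max_workers=4) as pool:
    cleaned = list(pool.map(clean, records))
print(cleaned)  # -> ['alpha', 'beta', 'gamma']
```

The defining feature is that no task depends on any other task, which is why almost any parallel framework handles it well.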
For this kind of work, both systems will be fine. This is like comparing brands of toothpaste or Coke vs Pepsi. There’s some nuance in the flavor, but they’re pretty much the same product.
There are no bad options.
Given that in most cases anything is fine, folks often choose what they’re most familiar with. Often this is governed by social cues or familiarity more than architectural or technology decisions. If your colleagues mostly use Apache projects then you’ll probably default to Spark. If you are in the Python data science community then you’ll probably choose Dask.
Community often matters more than technology. It gets you started and helps you along the way.
The surrounding software ecosystem also matters. Spark will have better support among enterprise data storage products for example. Dask will generally have better coverage in the chaotic bazaar of PyData projects. There is value in following your local crowd.
If you primarily run SQL queries, choose Apache Spark.
Spark SQL is better than Dask’s efforts here (despite fun and exciting developments in Dask to tackle this space). Spark is also more battle tested and produces reliably decent results, especially if you’re building a system for semi-literate programmers like SQL analysts.
Conversely, if you want to run generic Python code, Dask is much more flexible.
Spark is fundamentally a Map-Shuffle-Reduce paradigm. It's much easier to use than Hadoop, but it is fundamentally the same programming abstraction under the hood. If you want to do something more complex, Spark cannot express those computations. As an example, consider a credit risk model from a retail bank: many interdependent tasks with no regular structure.
This is not the regular structure of Map-Shuffle-Reduce, so Spark couldn’t actually run this. You need something more flexible. Dask can do this stuff easily.
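As a toy illustration (a minimal sketch assuming Dask is installed, not the actual bank model), here is an irregular task graph built with `dask.delayed`, where different results feed forward into different downstream tasks rather than following a uniform map-shuffle-reduce shape:

```python
import dask

@dask.delayed
def load(src):
    # Hypothetical loader standing in for reading a data source
    return {"a": 1, "b": 2}[src]

@dask.delayed
def score(x, y):
    # Depends on two separate upstream tasks
    return x + y

@dask.delayed
def adjust(s, factor):
    # Depends on one scored result plus a constant
    return s * factor

a = load("a")
b = load("b")
s = score(a, b)
final = adjust(s, 10)
print(final.compute(scheduler="threads"))  # -> 30
```

Dask builds an arbitrary directed graph from these calls and schedules it in parallel; the graph can be as lopsided as your problem requires.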
Databases are good at SQL. Dask and Python were designed to do stuff that databases can’t do. Our experience is that people choose Python because they need to break out of pre-defined abstractions like SQL or Spark. It’s been exciting seeing all of the new kinds of data and computations people do with Python and Dask, and the new science and technological benefits that arise.
Both systems can run at scale. Both systems can run on a laptop. Both systems have some configuration and setup pain.
If you're in a corporate environment, you should use a managed platform to handle this for you; the comparison then becomes Databricks/EMR/Synapse for Spark versus Coiled/Saturn/Nebari for Dask.
If you’re on a personal computer they’re all still options, with varying degrees of automated setup.
OK, we think Dask is just way easier to use on a single machine. Dask is super-lightweight.
Spark requires more expertise to configure well and brings along the JVM. Dask is just pure Python. It experiences a mild performance hit as a result (which most users never notice) but it’s trivial to use locally and easier to debug, both of which help with onboarding and accessibility.
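To illustrate the local-debugging point (a minimal sketch assuming Dask is installed; the `double` function is hypothetical): Dask's single-threaded scheduler runs every task in the local process, so ordinary `print` and `pdb` debugging work unchanged:

```python
import dask

@dask.delayed
def double(x):
    # A breakpoint or print here behaves normally under the
    # synchronous scheduler, since nothing leaves this process
    return 2 * x

total = dask.delayed(sum)([double(i) for i in range(5)])
print(total.compute(scheduler="synchronous"))  # -> 20
```

Switching the same code to threads, processes, or a distributed cluster is a one-argument change, which is much of what makes local onboarding painless.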
In a perfect world the underlying technology doesn't matter. Unfortunately, it sometimes does:
Both have excellent Python APIs and are primarily used by Python developers. Spark has to cross the JVM-Native barrier, which sometimes makes people sad. Dask is native (and so using native code is easier). Dask feels a little simpler to hack on for Python users.
In general, both projects can probably do what you want (unless you’re doing something very strange). They’ve each made design choices that have subtle impacts on different workflows. These design choices were determined by their origins:
Speaking with our Dask bias, we focused on Python user pain constantly. We’re pretty proud of all of the tradeoffs and decisions that have been made and will continue to be made.
Ten years ago “Parallel Python” was a fringe topic that no one took seriously. A lot has changed since then. It’s wonderful that today users can choose between several excellent choices and that many well-funded projects compete for attention.