Eric Dill, Director of Data Science platform at DTN, joined us recently for Coiled’s first Science Thursday of 2021. Hugo Bowne-Anderson, Head of Marketing and Evangelism at Coiled, and James Bourbeau, Lead OSS Engineer at Coiled, talked to Eric about the current scalable computing landscape, why Spark is still popular, and where Dask fits in the broader ecosystem. Eric revisited his PyData NYC 2019 talk on the same topic and updated it for 2021.
In this post, we will answer:
- Is Spark still relevant, and why?
- How does scalable computing relate to the business world?
- What does today’s scalable data science landscape look like?
- How do Dask and Coiled fit in this landscape?
You can check out the livestream replay here:
Is Spark Still Relevant?
According to Eric, the answer is yes:
“Of course Spark is still relevant, because it’s everywhere. Everybody is still using it. There are lots of people doing lots of things with it and selling lots of products that are powered by it.”
The interesting question here is: Why are enterprises still choosing Spark?
Most data scientists clearly prefer Pythonic frameworks over JVM-based Spark. The data science space also draws people from diverse backgrounds, many of whom don’t have the software engineering experience needed to work comfortably with Java and Spark.
In 2019, few companies backed Dask with products or support, so many teams kept choosing Spark.
With those questions answered, the more interesting topic is the current scalable data science landscape and the infrastructure around it.
What is Scalable Computing?
Scalable computing is anything that isn’t a single thread on a single core, and it expands rapidly from there: local multi-core, to local single-GPU, to local multi-GPU, to distributed CPU, to even distributed GPU!
There is a common misconception that scalable computing is about larger and larger datasets. In reality, it’s about larger workloads, larger models, complex systems, and much more. Another way to think about scalable computing is memory-bound problems vs compute-bound problems, as Tom Augspurger describes in his talk on Scalable Machine Learning with Dask.
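The first step of that progression, from a single core to local multi-core, can be sketched with nothing but Python’s standard library. The function names (`sum_of_squares`, `split`, `serial_total`, `parallel_total`) and the chunking scheme are illustrative assumptions, not something from Eric’s talk. The point is only that the same compute-bound work can be split into chunks and handed to a pool of workers.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(chunk):
    """Compute-bound work on one chunk of numbers."""
    return sum(n * n for n in chunk)


def split(numbers, n_chunks):
    """Split a list into at most n_chunks roughly equal pieces."""
    size = max(1, -(-len(numbers) // n_chunks))  # ceiling division
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def serial_total(numbers):
    """Baseline: one thread on one core."""
    return sum_of_squares(numbers)


def parallel_total(numbers, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Same result, with the chunks mapped over a pool of local workers."""
    chunks = split(numbers, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))
```

When calling `parallel_total` with the default process pool from a script, place the call under an `if __name__ == "__main__":` guard, which some platforms require. Passing `ThreadPoolExecutor` as `executor_cls` keeps the same structure but is unlikely to speed up pure-Python compute-bound work because of the GIL. The distributed steps of the progression follow the same split, map, and combine shape, with the local pool replaced by a cluster scheduler.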
Spark is still relevant because it’s not a risky choice. As Eric says:
“Spark is the new IBM. You’re never going to get fired for choosing Spark/IBM.”
In the context of using Dask or RAPIDS, most businesses are going to require enterprise support contracts, professional services engagements, and enterprise-ready products. Most businesses are not prepared to bring in cutting-edge open source projects and manage them entirely in-house. In 2019, Eric’s thesis was that Spark remained relevant because there were too few Dask companies to fill this gap. New companies, like Coiled, are now stepping in to fill it, making Dask more enterprise-ready and an increasingly strong alternative to Spark.
Broader Scalable Computing Landscape
The discussion then moved to the broader scalable data science landscape and where tools like Dask and Spark fit in. Eric explained this with the chart shown below:
In enterprise data science, there are roughly 5 axes – data, development environment, scalable computing, workflow managers, and dashboarding. There are also more niche problems like model management, packaging, CI/CD, etc. Infrastructure also needs to be considered for every element in the above image: Are you on-prem, on the cloud (AWS, GCP, Azure, etc.), both? And, there’s also the consideration of how your workflows for development and production break down. In many cases, this whole chart is duplicated for production with different ACLs on different pieces. In short, it’s supremely complex.
Additionally, you face a build-vs-buy choice for most of these elements, which complicates matters further. Each of these axes (and more that we haven’t even discussed) has specialized software and companies behind it. To maximize your investment in data science teams, you’ll need to provide a holistic solution that simultaneously addresses all the axes that matter to your business problems.
Dask and Coiled in this Landscape
Dask is only one component of this complex system, and Coiled is building products on the scalable computing axis of this landscape, an axis that is becoming increasingly important. Coiled is trying to meet data scientists where they are: in the Python data science ecosystem. A major pain point today is that vendor lock-in takes data scientists out of their element. Asking Pythonistas to jump into Spark notebooks is challenging, especially if they need the latest versions of libraries like TensorFlow.
Matthew Rocklin notes:
“Coiled today tries to focus as much as possible on scalable compute, in such a way that it can augment the other components of this chart without friction. Building products that own exactly one component of this chart and enable the others without friction is actually really hard :)”
Eric is skeptical about any platform that tries to incorporate all the axes. He likes Coiled’s positioning here because it’s not trying to do everything.
We feel that best-of-breed products are going to become increasingly important compared to all-in-one platforms. We’re excited about interoperability between these best-of-breed products, as we write in O’Reilly Radar. Coiled, much like Dask, interoperates with the other axes in this landscape. For instance, Coiled (and Dask) can interact with workflow managers like Airflow and Prefect, with S3 in the data column, and with development environments like Jupyter notebooks.
Comparing Ecosystems: Dask vs RAPIDS vs Spark
- Spark is distributed DataFrames in the JVM on the CPU
- Dask is distributed DataFrames in Python on the CPU
- RAPIDS is distributed DataFrames on the GPU, running on Dask (the more mature option) or Spark
During the livestream, Dask commented:
“I am more than just DataFrames!”
We agree. Dask makes it very easy to scale NumPy, pandas, and scikit-learn, but it’s much more than that. It’s a complete toolbox for distributed computing and building distributed applications.
Eric’s perspective now isn’t Dask vs Spark vs RAPIDS, but SQL vs Dask vs Spark, and RAPIDS can power it all.
The Stack Overflow Developer Surveys from 2019 and 2020 show that Python and pandas (and hence Dask) are trending up, as is SQL, whereas Scala, Spark, and Hadoop are trending down. People are already looking for what’s next after Spark.
Another interesting way to look at this landscape is using interfaces: SQL or DataFrame. The following slide explains this distinction:
Ecosystem summary for January, 2021
This slide is a feature comparison. As RAPIDS is an acceleration engine, we’re interested in comparing Spark and Dask.
Some points to note:
- The Spark DataFrame API gained Koalas, which makes it more accessible to Pythonistas.
- Dask’s maturity story has changed dramatically since 2019. There are at least five companies offering products and services around Dask, including Coiled!
- Job scheduling for Dask has made huge strides and Dask-Kubernetes has come a long way. (Coiled is also actively working on the Kubernetes story, so keep an eye out for that!)
- Dask’s local scalability has also improved, predominantly because of RAPIDS.
- Dask-SQL is a big change since 2019: it’s a brand-new project!
Dask also taps into the Python visualization ecosystem shown below, which is extremely powerful. It can interoperate with visualization and dashboarding libraries, which is a huge win because you can scale out of a notebook and into a dashboard very easily.
The future of OSS institutional adoption
Eric is excited about bringing businesses into the data science revolution. The tooling, the infrastructure, and the knowledge are there, so he is interested in using these resources to solve business problems. He says:
“You need to teach product teams how to think about data science and you need to teach data scientists how to talk to business folks. The next step is transforming the whole business to speak data science.”
We’re going through a phase transition. Earlier, businesses were full of individuals using OSS tooling; now we’re entering a phase of institutional adoption of OSS tooling. It isn’t clear how to support institutions doing this at scale, and Coiled is excited to be part of this next phase.
If you’re interested in taking Coiled Cloud for a spin, which provides hosted Dask clusters, docker-less managed software, and zero-click deployments, you can do so for free today when you click below.