And how this has and has not worked out
Distributed computing is hard in large part because every machine involved needs a consistent configuration. Hosted Jupyter notebooks make it easy for the group that controls the distributed computing environment to also control the user’s development environment. By removing the user’s ability to control their environment we ensure consistency and a predictable experience; this is the good news.
The bad news is that we now own the user’s development environment and the user doesn’t. Users have strong opinions and high expectations about development environments, which are hard to meet, especially in the cloud.
At Coiled we encourage users to use Coiled from their own dev environment. This decision goes against current trends. This post discusses some of the history and motivation around this decision, our approach, and some lessons learned about where this has worked well and where it has proved problematic.
History and Motivation 1: Spark, Databricks, and Livy
Users today expect to run distributed computing systems, like Databricks/Spark, within hosted notebooks, like Databricks notebooks. This is because of a technical limitation of Spark: the Spark driver needs to be on the same network as the Spark workers/executors.
If you ran a Spark driver on your laptop and tried to run the Executors on the cloud somewhere, performance would be poor and your network admin would be sad about the large number of bi-directional connections outside of a secure network.
Historically there have been several attempts to work around this limitation, the most famous of which is probably Apache Livy. Livy does a good job in a bad situation, but it’s still not a first-class experience.
Fortunately, Dask has a more flexible architecture than Spark. The equivalent of the Spark driver is actually split into two components, the Dask Client which sits in the user’s session, and the Dask Scheduler which co-exists with the workers wherever they live on the network. In many cases these both live in the same process, but they don’t have to. They can be separated.
This means that you can set up a Dask scheduler and workers on some remote compute resource (like the cloud) and then drive it comfortably from a local resource (like your laptop). This provides opportunities for a more ergonomic local+remote compute experience.
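As a rough sketch of what this split looks like in practice, the snippet below creates a Dask Client in the local session and points it at a scheduler running somewhere else. The host name is hypothetical, the helper names are ours for illustration, and the port is simply Dask’s usual default of 8786; treat it as a sketch under those assumptions rather than as how Coiled itself connects.

```python
def scheduler_address(host: str, port: int = 8786, protocol: str = "tcp") -> str:
    """Build a Dask scheduler address such as tcp://host:8786."""
    return f"{protocol}://{host}:{port}"


def connect(address: str):
    """Create a Dask Client in this (local) process, pointed at a remote scheduler."""
    # Imported lazily so the address helper works even without dask installed.
    from dask.distributed import Client

    return Client(address)


def remote_sum(client, numbers):
    """Submit work from the local session; it executes on the remote workers."""
    future = client.submit(sum, numbers)
    return future.result()


# Usage, assuming a scheduler is reachable at the hypothetical host below:
#   client = connect(scheduler_address("scheduler.example.com"))
#   remote_sum(client, [1, 2, 3])
```

Only the Client lives on the laptop here. The scheduler and workers stay together on the remote network, which is the part Spark’s unified driver cannot do.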
Everyone has been trained to expect that distributed jobs must be run from inside the compute network. This is because of a technical limitation of Spark’s unified driver. It doesn’t have to be this way. We can do better.
History and Motivation 2: Pangeo
The first major Dask-on-the-Cloud deployment was probably Pangeo, a system that we made to serve the earth science community. It was a combination of JupyterHub + Kubernetes + Dask and it allowed earth scientists to navigate to a website, be taken to a live Jupyter notebook session, and from there be able to deploy Dask to run interactive analyses at scale. It was a tremendous success and has fundamentally changed the way that domain views data analysis today.
However, very quickly the development team was buried in a pile of reasonable but hard-to-satisfy feature requests. Here are a few representative ones.
- I conda install a new library, but when I restart my session that library is no longer there.
- How do I run a batch job?
- Can I use VSCode/PyCharm/Emacs instead of Jupyter?
- I logged on again but my changes are no longer here.
- How do I share this notebook with a colleague?
- How do I access data stored on my computer?
These are all questions that are trivial on one’s personal computer, but become tricky on the cloud. The technologies that this system is based on, like Docker, S3, and Kubernetes, lack the malleability and consistency of a personal laptop computer. Mimicking the laptop experience in the cloud is hard to do well. Just ask any Databricks user.
History and Motivation 3: Existing Enterprise IT Stacks
Additionally, Coiled’s initial customers were large companies with pre-existing IT solutions for data science. They already had infrastructure for Jupyter, batch jobs, hosting models, and so on. They wanted Dask infrastructure, but did not want to throw away all of their existing infrastructure just to suit us. They wanted a system that played well with other infrastructure, rather than one that created a walled garden like Databricks.
Coiled integrates into existing user environments
So at Coiled we embraced the idea of driving the cluster from the user’s laptop. This is hard, but powerful.
It’s hard because we need to think a lot more about auth and networking in order to move data and credentials safely and securely between the Dask cluster and the user’s dev environment. We also need to work to mirror aspects of the user’s local dev environment on the remote Dask workers.
Let’s look at a few of these challenges:
- The user needs to be authenticated to, and granted the correct access to, cloud resources.
- After the Dask cluster is started the user needs to connect to that cluster securely. Both sides need a trusted intermediary to pass around connection certificates.
- In more secure settings we need Coiled to proxy that connection for the user.
- The user may have installed something locally on their laptop, and we need to install that remotely as well. (See the Coiled documentation on Software Environments and the PipInstall Worker Plugin.)
- The user may have cloud credentials locally on their laptop, which we may need to forward to the workers so that they can access the same data that the user can from their dev environment. (See Forwarding Cloud Credentials in the Coiled documentation.)
- The user’s local networking policies may forbid network access across unfamiliar ports, so we are in the process of adding Websocket Comms to Dask to route traffic on the same port as standard HTTPS traffic (which is typically allowed within most corporate intranets)
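To make the software-mirroring challenge above a little more concrete, here is a minimal sketch that reads package versions from the local environment and asks the workers to install the same pins using Dask’s PipInstall worker plugin. The helper names are hypothetical and this is not how Coiled’s software environments are built; it only illustrates the idea with the open source plugin.

```python
from importlib import metadata


def local_versions(package_names):
    """Return {name: version} for the packages that are installed locally."""
    found = {}
    for name in package_names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed locally, so there is nothing to mirror.
            pass
    return found


def pin_requirements(versions):
    """Turn {name: version} into sorted pip requirement strings."""
    return [f"{name}=={version}" for name, version in sorted(versions.items())]


def install_on_workers(client, requirements):
    """Ask every worker to pip install the given requirements."""
    # Imported lazily so the helpers above work without dask installed.
    from dask.distributed import PipInstall

    plugin = PipInstall(packages=requirements)
    # Newer distributed releases also offer client.register_plugin(plugin).
    client.register_worker_plugin(plugin)


# Usage, assuming `client` is an already connected dask.distributed.Client:
#   install_on_workers(client, pin_requirements(local_versions(["pandas"])))
```

Installing at worker start-up is convenient for a quick experiment, but it is slower and less reproducible than a prebuilt environment, which is why a managed software environment is usually the better long-term answer.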
The result is that Coiled enables users to run Dask securely from anywhere, including Coiled hosted Jupyter notebooks, local Jupyter notebooks, local terminal sessions, rich IDEs, batch jobs managed with Coiled, batch jobs managed with local tools like cron, other products like Prefect or Plot.ly, and so on.
By tackling these hard network accessibility issues we’ve made Dask accessible virtually anywhere, ranging from individual students to Coiled customers in some of the more locked-down institutions.
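The certificate and port challenges listed earlier can be sketched roughly as follows. The example assumes a Dask version in which websocket comms (the `ws://` and `wss://` protocols) are available, assumes the scheduler is exposed on port 443 for the websocket case, and uses hypothetical helper names and certificate file paths. It is an illustration of the idea, not Coiled’s actual connection logic.

```python
def pick_address(host: str, restricted_network: bool) -> str:
    """Choose a scheduler address based on what the local network allows."""
    if restricted_network:
        # Secure websockets on the standard HTTPS port (assumed to be 443 here).
        return f"wss://{host}:443"
    # Otherwise use Dask's TLS protocol on its usual default port.
    return f"tls://{host}:8786"


def connect_securely(address: str, ca_file: str, cert_file: str, key_file: str):
    """Connect a local Client using certificates handed out by a trusted party."""
    # Imported lazily so pick_address works without dask installed.
    from dask.distributed import Client
    from distributed.security import Security

    security = Security(
        tls_ca_file=ca_file,
        tls_client_cert=cert_file,
        tls_client_key=key_file,
        require_encryption=True,
    )
    return Client(address, security=security)


# Usage with hypothetical host and certificate paths:
#   address = pick_address("scheduler.example.com", restricted_network=True)
#   client = connect_securely(address, "ca.pem", "client-cert.pem", "client-key.pem")
```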
User Experience and Initial Feedback
For users coming from Databricks this change in thinking can take some time. Users often assume that they need to run in hosted notebooks and it takes them a few days to unlearn that behavior.
> I realized I could just use Coiled from PyCharm and I’m good now. Feel free to ignore my Jupyter related feature requests from last week.
Negative feedback and lessons learned
Some users love this way of working. It feels more natural. But it’s not best for everyone. Let’s see some counter-examples.
- Educators: Or in general anyone who wants to provide a highly uniform and controlled environment to less sophisticated users. These people prefer restriction. This includes university professors and also data science leads of groups with less technical experience. For these folks, the prescribed experience of Databricks may be better.
- RAPIDS: The RAPIDS stack generally assumes that you have a GPU wherever you’re running your code. Some workflows, like GPU-accelerated XGBoost, are smooth with Coiled while others, like Dask-cuDF, are awkward if you don’t have a GPU on your local machine. In these cases having a GPU-hosted notebook can be convenient.
- Large graphs: For larger workloads, sending large amounts of metadata and task graphs over the internet can take a while (see Accelerating the Dask Scheduler, a recent talk). We’re addressing this with High Level Graphs in Dask, but this is still an opt-in feature.
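To get a feel for the large-graph problem above, the sketch below builds a toy task graph in the plain dictionary form that Dask uses for low-level graphs, measures its pickled size, and estimates the upload time over a given link. Plain pickle is only a rough stand-in for Dask’s real serialization, and the function names and bandwidth figure are illustrative assumptions.

```python
import operator
import pickle


def chain_graph(n_tasks: int) -> dict:
    """Build a toy low-level task graph: each task adds 1 to the previous one."""
    graph = {"x-0": 1}
    for i in range(1, n_tasks):
        # A task is a tuple of (callable, *arguments); string arguments are keys.
        graph[f"x-{i}"] = (operator.add, f"x-{i - 1}", 1)
    return graph


def serialized_size(graph: dict) -> int:
    """Size in bytes of the pickled graph (a rough proxy for what gets sent)."""
    return len(pickle.dumps(graph))


def transfer_seconds(n_bytes: int, megabits_per_second: float) -> float:
    """Estimated time to upload n_bytes over a link of the given speed."""
    return n_bytes * 8 / (megabits_per_second * 1_000_000)


small = serialized_size(chain_graph(1_000))
large = serialized_size(chain_graph(100_000))
print(f"1,000 tasks: {small} bytes, 100,000 tasks: {large} bytes")
print(f"Upload of the larger graph at 10 Mbit/s: {transfer_seconds(large, 10):.2f} s")
```

The per-task cost looks small, but it grows linearly with the number of tasks, and real graphs carry more metadata per task than this toy one. High Level Graphs help by sending a compact description and expanding it on the scheduler side.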
Integrating remote distributed computing with local development environments has been a challenge. The resulting user experience is game-changing though. There is still more to do in both Dask and Coiled to smooth over this experience, and I’m excited about what we’re seeing today.
The use cases where we’ve seen this fall over, though, have been informative and give me pause. It’s not clear if we should solve this problem by …
- Growing and developing a smoother notebook experience, or
- Showing how Coiled can work well with other notebook services out there like AWS Sagemaker or Azure notebooks, or
- Improving OSS packages like Dask and RAPIDS to remove the pain associated with these issues at a lower level
If you’re interested in taking Coiled Cloud for a spin, which provides hosted Dask clusters, docker-less managed software, and zero-click deployments, you can learn more and sign up for free today.