James Bourbeau, Dask maintainer and software engineer at Coiled, recently joined us for a Science Thursday session on “Scalable Python Deployments as a Service”.
In this post, we summarize the key takeaways from the stream. We’ll cover:
- A brief overview of Dask
- An introduction to Coiled and its offerings
- Spinning up a cluster on AWS with Coiled
- Creating a custom software environment for a cluster
- Collaboration and cost monitoring tools
- How you can get started with Coiled
You can also watch the live stream replay by clicking below.
Dask: A Brief Overview
Dask is a popular library for parallel and distributed computing in Python
Dask is an open-source library that scales data science and machine learning workloads in Python. It reuses familiar APIs from the PyData ecosystem, such as NumPy, pandas, and scikit-learn, and it integrates with many libraries you may already know and love, like Xarray, RAPIDS, and XGBoost. It works well both on a single machine and across multiple machines in a cluster. James is a maintainer of and contributor to Dask.
A common question is: how can I launch a Dask cluster with more computational resources? Several existing OSS projects, such as Dask-Kubernetes and Dask-Yarn, launch clusters on various hardware platforms. The caveat is that using these projects typically involves manually setting up or having access to infrastructure (e.g. a Kubernetes cluster) and having deep knowledge of the system the cluster is launched on, such as how to create an AWS IAM role with appropriate permissions. These projects also lack helpful features like software environment management. This leads us to…
Coiled is a deployment-as-a-service library for scaling Python
Beyond his OSS work on Dask, James is a software engineer at Coiled. Coiled provides:
- Easily launchable, cloud-based Dask clusters
- Support for managed software environments
- Tools for collaborating and monitoring costs
Launching a Dask Cluster with Coiled
James showed us how to launch a Dask cluster with Coiled.
The scheduler and workers ran on AWS, and he connected to the remote cluster from his laptop with client = Client(cluster).
We then ran a groupby-aggregation on that Dask cluster on AWS.
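In code, this flow might look roughly like the sketch below. It assumes a Coiled account with credentials already configured, picks an arbitrary worker count, and substitutes Dask's built-in synthetic dataset (dask.datasets.timeseries) for the dataset used in the stream. The function name is hypothetical.

```python
def run_groupby_on_coiled(n_workers=10):
    """Launch a Coiled cluster, connect to it, and run a groupby-aggregation.

    Imports are deferred so this file can be loaded without Coiled installed.
    """
    import coiled
    import dask
    from dask.distributed import Client

    # The scheduler and workers are provisioned in the cloud (AWS in the stream).
    cluster = coiled.Cluster(n_workers=n_workers)
    # Connect the local Python session to the remote scheduler.
    client = Client(cluster)
    try:
        # Assumption: synthetic stand-in data with columns id, name, x, y.
        ddf = dask.datasets.timeseries()
        # The work runs on the remote workers; only the small result comes back.
        return ddf.groupby("name")["x"].mean().compute()
    finally:
        client.close()
        cluster.close()


if __name__ == "__main__":
    print(run_groupby_on_coiled())
```

Closing the client and cluster explicitly avoids leaving cloud resources running once the result is in hand.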
Custom Software Environments with Coiled
An early pain point, and how custom software environments in Coiled address it
Next, we took the analysis further by using the dataset to train an XGBoost classification model. We got a ModuleNotFoundError because dask_xgboost wasn’t installed on the cluster’s workers. This is one of the earliest pain points users tend to hit with distributed computing: now that more than one machine is involved in our computations, every machine needs the appropriate libraries installed to execute our tasks.
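One way to spot this kind of mismatch before a task fails is to ask each worker whether the module can be found. The sketch below was not part of the stream; the helper names are hypothetical, and it relies on Client.run from dask.distributed, which executes a function on every worker and returns the results keyed by worker address.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(name) is not None


def check_workers(client, name="dask_xgboost"):
    """Ask every worker whether `name` is importable.

    `client` is expected to be a dask.distributed.Client connected to a cluster.
    The result maps each worker address to True or False.
    """
    return client.run(has_module, name)


if __name__ == "__main__":
    # Local check only; pass a connected Client to check_workers for the cluster.
    print("dask_xgboost available locally:", has_module("dask_xgboost"))
```

A False for any worker means tasks that import the module on that worker will raise ModuleNotFoundError.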
Coiled supports building custom software environments using familiar packaging conventions, like conda and pip, that you’re probably already using.
We created a custom software environment with coiled.create_software_environment, which took little effort, and then ran the exact same code on a cluster that now had dask_xgboost installed. And it worked! Ta-da!
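A rough sketch of that step follows. The environment name, channel, and package list are illustrative assumptions rather than the exact values from the stream, and the parameter names (name, conda, software) follow the Coiled Python API as we understand it from around the time of the stream, so check the current Coiled documentation before relying on them.

```python
# Assumption: the environment name and package list are illustrative only.
ENV_NAME = "xgboost-demo"
CONDA_SPEC = {
    "channels": ["conda-forge"],
    "dependencies": ["dask", "xgboost", "dask-xgboost"],
}


def launch_cluster_with_env(n_workers=10):
    """Build the software environment, then start a cluster that uses it.

    Imports are deferred so this file can be loaded without Coiled installed.
    """
    import coiled
    from dask.distributed import Client

    # Build the environment on Coiled's side from the conda specification.
    coiled.create_software_environment(name=ENV_NAME, conda=CONDA_SPEC)
    # Every worker in this cluster starts from that environment.
    cluster = coiled.Cluster(n_workers=n_workers, software=ENV_NAME)
    return Client(cluster)


if __name__ == "__main__":
    client = launch_cluster_with_env()
    print(client)
```

Because the environment is defined once and referenced by name, every worker gets the same packages without any per-machine setup.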
Hugo noted, “It looks straightforward but you’ve just done something really challenging – moving from your local environment, running an analysis to getting up in a cloud, getting your Dask cluster up and running and getting the same analysis done on a larger data set…you made it look simple but in reality, it’s quite an achievement”.
Collaboration and Cost Monitoring
Coiled provides tools for seamless collaboration and keeping track of your spending
James explained that Coiled allows for:
- Sharing your software environments & cluster settings with your friends and colleagues
- Tracking resource usage on a per person and per account basis
- Automatic cluster shutdown after 20 minutes of inactivity to prevent large bills when you accidentally leave a cluster running
You can start leveraging Dask and Coiled today
A big thank you to our very own James for taking the time to deliver this fantastic Coiled demo. You can chat with Coiled users and members of the Coiled team, like James, when you join the Coiled community Slack.
You can spin up a cluster and get your own software environment up and running today for free when you sign up with Coiled Cloud. Click below to get started.