Dask for Everyone, Everywhere
September 28, 2020
Data scientists increasingly solve large machine learning and data problems with Python. But historically Python struggled with parallel computing. This led many of us in the community to make Dask, a library for parallel computing and data science for Python.
Dask has been a go-to solution for scalability in the Python data science stack for years, with deep integrations with dozens of the most commonly used libraries. However, while some groups get great results from Dask, others struggle, typically with deployment challenges. Setting up distributed computing systems within an organizational environment is hard.
That’s why we created Coiled: to make computing accessible to everyone. Today we’re happy to announce Coiled Cloud, a new product that addresses this problem, as well as our recent funding. Read on, get involved, and help us make this a great experience for scalable data science and machine learning.
Python is great for data science and machine learning at scale
Python drives some of the most exciting research today across several verticals:
- LIGO’s discovery of gravitational waves, announced in 2016, used many OSS tools from the PyData ecosystem
- JP Morgan’s trading platform Athena contains 35 million lines of Python code
- Walmart uses Python, Dask, and XGBoost to “tear through their massive-scale data analytics and machine learning”
- Netflix runs over 150,000 batch Jupyter notebook jobs a day
- The Event Horizon Telescope team used the OSS PyData stack to create the first ever image of a black hole!
These domains derive scientific and business value from massive amounts of data, showing what is possible when we apply community-based open source software to the world’s toughest problems.
Open Source Infrastructure is Hard
Unfortunately, many organizations have trouble adopting OSS institution-wide. PyData at scale is great, but only if you are able to:
- Provision machines on the cloud or on prem
- Set up Kubernetes
- Authenticate users and apply quotas to them
- Manage custom software environments and rapidly changing docker images
- Secure networking and data access end-to-end
- Keep a system up 24/7 with limited staff
These problems are critical to get right, yet they fall outside the experience of most data science and machine learning practitioners. These DevOps challenges are the primary bottleneck to the adoption of data tooling that we see today.
Coiled features make Dask easy
To address this, we’re launching a product, Coiled Cloud, which manages Dask across diverse contexts. Coiled provides everything that we’ve seen groups need in our long history of deploying Dask in different institutions:
- Hosted Dask clusters on cloud resources, or on-prem
- Managed software environments which let you build and share docker images from conda/pip environment specifications
- GPU support which allows people to easily switch hardware architectures, and explore newly accelerated libraries like XGBoost, RAPIDS, PyTorch, and more
- Global multi-region support, because it turns out that not all data is stored in a data center on the US East Coast
- Cost management, which allows people paying the bill to turn on features like user quotas, toggleable GPU access, and default idle timeouts
- End to end network security, which gives confidence that the right people can access data, and the wrong people can’t
- Enterprise integration which makes it possible to deploy Coiled in large organizations
- … and lots more
We could talk about Coiled features for several blog posts, but for now we’ll point you to the Coiled product page and the docs, and suggest that you give it a try (spinning up a cluster by following the quickstart takes about two minutes). In the rest of this post, we’d rather talk a little about our objectives and some recent funding news.
Infrastructure accessible to everyone
Dask is used both by some of the largest companies on Wall Street and by countless individual researchers and students around the world. At Coiled we strive to support all of these stakeholders as they in turn strive to impact their world.
This accessibility design constraint forced us to rethink how we architect hosted systems, focusing on accelerating the individual user experience. The result is, we believe, the smoothest way for anyone to scale computation today.
If you’re a data scientist and want to try it out, then try the following:
$ pip install coiled
>>> import coiled
>>> cluster = coiled.Cluster()
You’ll be up and running in about a minute.
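Once a cluster exists, ordinary Dask code can be pointed at it. The sketch below is illustrative rather than official Coiled usage: the helper names `square`, `sum_of_squares`, and `run_on_coiled` are made up for this example, and it assumes that `coiled.Cluster()` behaves like any other Dask cluster object that can be handed to `dask.distributed.Client`. The task graph is built the same way whether it runs on local threads or on a remote cluster; only the scheduler changes.

```python
import dask


@dask.delayed
def square(x):
    """A tiny unit of work; each call becomes one task in the graph."""
    return x * x


def sum_of_squares(numbers):
    """Build a lazy task graph. Nothing runs until .compute() is called."""
    return dask.delayed(sum)([square(n) for n in numbers])


def run_on_coiled(numbers):
    """Hypothetical helper: run the same graph on a Coiled cluster.

    Assumes a Coiled account and the default coiled.Cluster() settings
    shown in this post. Imports are local so the rest of this file works
    without coiled installed.
    """
    import coiled
    from dask.distributed import Client

    cluster = coiled.Cluster()
    client = Client(cluster)  # registers the cluster as the default scheduler
    try:
        return sum_of_squares(numbers).compute()
    finally:
        client.close()
        cluster.close()


if __name__ == "__main__":
    # Local run on a thread pool; no cluster or account needed.
    print(sum_of_squares(range(10)).compute(scheduler="threads"))
```

Calling `run_on_coiled(range(10))` instead would send the same graph to the hosted workers, which is the point of the design: the code that describes the computation does not need to know where it runs.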
There are many things that you will want to change over time. You’ll want to create your own software environments. If you work for a company you’ll want to provide cloud credentials so that computations are run in your private account. If you live outside of North America you’ll want to specify different regions to run in. If you have students you’ll want to manage teams, craft notebooks, and share them.
Coiled supports all of these features, as you would expect, but it doesn’t force them on you. It starts simple, with room to grow. The result is a product that is both easy to get started with and incredibly nimble.
Run Coiled anywhere
While Coiled hosts Jupyter notebooks, it doesn’t force you into them. You can connect to Coiled from your laptop or from other cloud services. This opens up Coiled to a whole host of applications outside of the typical cloud-hosted data science workflow.
As an example, this makes it easy to couple Coiled to scientific imaging applications like Napari that operate on the desktop, or combine with other cloud services, like Prefect, without Coiled having to make explicit integrations. Like Dask itself, Coiled was designed for integration.
To get things started, we took on some investment. We’re happy with how this turned out.
Earlier this year, Coiled raised $5M in seed funding in a round co-led by Costanoa and IA Ventures, with individual investments by Kaggle co-founders Anthony Goldbloom and Ben Hamner, Techammer (spearheaded by Cloudera co-founder Jeff Hammerbacher), and early Mesosphere employee Tim Chen.
Their help and advice have been instrumental so far in setting up the company, and we’re excited about the capacity that this gives us to explore this space quickly.
Let’s enable scalable data science and machine learning together
We hope that, along with contemporary advances in algorithms, user interfaces, and datasets, Coiled’s focus on accessible, scalable computing improves society’s ability to benefit from modern at-scale data science and machine learning.
Building a product that solves the challenges of modern data professionals means we’re taking every opportunity we can to speak with them. If you’d like to chat with us about the challenges you face in scaling your Pythonic data work, we’d love to hear from you.
If you’re doing data science and/or machine learning at scale and you like to break things, we’d love for you to take Coiled for a test drive.
—matt, hugo, and the entire Coiled team