Scalable Machine Learning in Python

Tom Augspurger, Data Scientist at Anaconda and lead maintainer of Dask-ML, recently joined us to discuss how he thinks about scalable machine learning in Python. As Tom shared with us,

“You have your machine learning workflow that works well for small problems. Then there are different types of scaling challenges you can run into … scaling the size of your data and scaling the size of your model.”

In this post, we’ll summarize the key takeaways from the stream. We cover:

  • Introducing Dask-ML, for Scalable Machine Learning in Python
  • Firing up a Coiled cluster
  • When NOT to do distributed ML
  • When distributed ML is a good idea

You can find the code for the session here. You can also check out the YouTube video here:

Introducing Tom Augspurger!

Here’s a little bit about Tom:

  • Tom works at Anaconda as a data scientist
  • Tom is the lead maintainer of Dask-ML, a library that aims to do parallel scalable machine learning in Python
  • According to Coiled founder Matt Rocklin, Tom “maintains pandas, maintains Dask, and maintains the connection between pandas and Dask, which is not a trivial process.”
Tom Augspurger presenting his "Scalable Machine Learning with Dask" talk at AnacondaCon 2018.
From https://www.youtube.com/watch?v=tQBovBvSDvA

Introducing Dask-ML, for Scalable Machine Learning in Python

Dask-ML wasn’t always as grown up as it is now. The earliest iteration of Dask-ML, in fact, was just some documentation on how to use Dask and libraries like scikit-learn together. At the time, the ecosystem was a mish-mash of libraries without a central home—libraries like dask-glm, dask-xgboost, dask-tensorflow, plus utilities like train_test_split.

Then in October 2017, the Python community started a proper Dask-ML library, which collected these various pieces and provided a new home for tools like custom estimators and much more. And, perhaps most simply and importantly according to Tom (as we’ll see later), Dask-ML provides a home for Dask’s joblib backend.
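That joblib backend is worth a quick illustration. Scikit-learn parallelizes much of its work through joblib, so pointing joblib at a Dask cluster sends those tasks to the cluster instead of local processes. Here’s a minimal sketch of the pattern, assuming you have dask.distributed and scikit-learn installed; the estimator, dataset, and worker setup are purely illustrative, not Tom’s exact code:

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Connect to a Dask cluster (a local one here; it could be a remote Coiled cluster)
client = Client()

# Illustrative in-memory dataset and estimator
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)

# Anything scikit-learn parallelizes with joblib now runs on the Dask cluster
with joblib.parallel_backend("dask"):
    clf.fit(X, y)
```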

Dask-ML is not just one project. It is several different projects, as very different machine learning algorithms live under the same library umbrella. There wasn’t and isn’t one clear way to combine machine learning with Dask’s distributed computing infrastructure, just as there is no one clear way to do machine learning in general. This makes Tom’s job fun but challenging. Though much of the low-hanging fruit has been implemented, there is so much more that could be added, so contribute if that interests you!

Let’s start coding in a Jupyter Notebook!

Tom started his live coding session by firing up a hosted and scalable Dask cluster provided by Coiled. A Coiled cluster!

Tom Augspurger's Jupyter Notebook setting up a Coiled cluster.
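For context, the setup looks roughly like the sketch below. This is a hedged approximation rather than Tom’s exact notebook: the worker count is illustrative, and Coiled’s API options have evolved since the stream.

```python
import coiled
from dask.distributed import Client

# Spin up a hosted Dask cluster on Coiled (worker count is illustrative)
cluster = coiled.Cluster(n_workers=10)

# Connect a Dask client so subsequent Dask / Dask-ML work runs on that cluster
client = Client(cluster)
client
```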

Before beginning, Tom emphasized that while Python has great tools for single-node machine learning, distributed computing is fundamentally harder. So always ask yourself before you jump in, “Do I actually need distributed machine learning?”

When might you need scalable/distributed ML? Tom uses this framework: compute-bound problems vs. memory-bound problems.

A four-quadrant graph with model size on the y-axis and data size on the x-axis.

When NOT to do distributed ML

“Before you try to do scalable distributed machine learning, you should definitely plot learning curves. I’d fire myself if I didn’t mention that.”

If your dataset isn’t too large—that is, it fits comfortably in RAM—and your model isn’t too large, you don’t need Dask. You’re in the bottom-left quadrant. Here, you’re much better off using something like scikit-learn, XGBoost, and similar libraries. Dask-ML is for folks running into scaling challenges with those libraries.

A four-quadrant graph with model size on the y-axis and data size on the x-axis. The bottom-left quadrant is circled.

As the quote above suggests, checking whether more data would even help can itself use Dask: you can fit a learning curve in parallel, i.e., use a distributed cluster to figure out whether you should do distributed machine learning at all. This works because scikit-learn’s built-in learning_curve uses joblib internally, so it can run on the cluster through the Dask joblib backend.
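Here’s a minimal sketch of that idea, assuming the Dask Client from earlier is still running; the dataset, estimator, and train_sizes are illustrative, not from Tom’s notebook. Because learning_curve parallelizes over CV folds and training-set sizes with joblib, wrapping it in the Dask backend pushes that work to the cluster:

```python
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Illustrative dataset; in practice this would be your own (sampled) data
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)

# Each (train_size, CV fold) fit is a joblib task, so it runs on the Dask cluster
with joblib.parallel_backend("dask"):
    train_sizes, train_scores, test_scores = learning_curve(
        LogisticRegression(max_iter=1000),
        X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),
        cv=5,
        n_jobs=-1,
    )

# If the test score plateaus well before the full dataset,
# you likely don't need distributed training for this model.
print(train_sizes, test_scores.mean(axis=1))
```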

Here’s an example (which Tom dives into later in the stream) where the answer is no:

Tom Augspurger plotting learning curves.

Tom notes, “Past like 15000-20000 training examples, there’s basically no difference between how the model’s performing. So as the number of examples grows, this model isn’t getting any better. There’s no reason to feed it the full million-row dataset. There’s no need for distributed machine learning in this case.”

When distributed ML is a good idea

👏 BIG 👏 DATA 👏 ISN’T 👏 THE 👏 ONLY 👏 USE 👏  CASE 👏 FOR 👏 DISTRIBUTED 👏 COMPUTING 👏

Let’s move horizontally to the bottom-right quadrant. Your model isn’t too large, but your data is massive: the dataset has so many rows that they can’t all fit into your machine’s memory. This means you’re RAM-bound (or “memory-bound”). Dask-ML can help you here.
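One common pattern in this quadrant is to hold the data as a larger-than-memory Dask array and train an estimator that supports partial_fit through Dask-ML’s Incremental wrapper. The sketch below uses synthetic data and an SGDClassifier purely for illustration (on older scikit-learn versions the loss is spelled "log" rather than "log_loss"):

```python
import dask.array as da
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

# A larger-than-memory dataset, lazily chunked across the cluster (synthetic here)
X = da.random.random((10_000_000, 20), chunks=(100_000, 20))
y = (X[:, 0] > 0.5).astype(int)

# Incremental feeds each chunk to the estimator's partial_fit,
# so the full dataset never has to sit in one machine's RAM.
clf = Incremental(SGDClassifier(loss="log_loss"), scoring="accuracy")
clf.fit(X, y, classes=[0, 1])
```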

Let’s move back to the top-left quadrant. Here, you’ve got a big ol’ model: big in size, in complexity, and in how long it takes to train. This is our framework’s vertical axis, where problems become CPU-bound (or “compute-bound”). Tom notes that this is a “very fuzzy line as you can fit any given model with infinite time but at some point it interrupts your workflow and you become CPU bound. The cycle of developing a model and iterating on it is just too slow so you want to speed that up with a cluster.” Dask-ML can help you here too. People often forget about this type of problem: big data isn’t the only use case for distributed computing.
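A typical move in this quadrant is to parallelize model selection, for example a grid search, across the cluster. Here’s a hedged sketch using Dask-ML’s scikit-learn-compatible GridSearchCV; the estimator and parameter grid are illustrative choices, not from the stream:

```python
from dask_ml.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# The data fits in memory; it's the many candidate fits that are expensive
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}

# Dask-ML's GridSearchCV schedules each candidate fit as a task on the cluster
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```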

This four-quadrant framework works for all distributed computing problems, but Tom loves it especially for distributed machine learning “since the strategies you have to employ differ so wildly depending on the axis you’re scaling along.”

Tom’s tips

Next, Tom shared his strategies for solving both compute-bound and memory-bound machine learning problems. We’ll summarize his process in an upcoming blog post, so be sure to subscribe to Coiled’s email list below to get that as soon as it’s released! 
We’re grateful for Tom’s time and expertise. You can follow him on Twitter at @TomAugspurger.

Do you need to do Machine Learning at Scale?

It was great to see Tom do all of this scalable machine learning using a Coiled cluster! You too can get started on a Coiled cluster immediately. Coiled also handles security, conda/docker environments, and team management, so you can get back to doing data science. Get started for free by joining our Beta.
