Gil Forsyth and Mike McCarty joined us recently for a Webinar on Distributed Machine Learning at Capital One. Gil is a Senior Manager of Software Engineering and Mike is a Director of Software Engineering at Capital One. They discussed how Capital One scales machine learning workflows, the challenges of distributed computing in an industry setting, and their learnings from using Dask. You can access the webinar recording by filling out the form below.
In this post, we will cover:
- How Capital One scaled machine learning before Dask;
- Why they moved to a Dask-based workflow;
- The challenges they faced and what we can learn from that;
- Some examples – Scaling XGBoost and adopting RAPIDS for GPU acceleration.
Scaling Machine Learning before Dask
Capital One has a large community of data scientists who come from diverse backgrounds like physics and statistics. Most of these data scientists use Python as their daily-driver language, but a small subset of them also use Java, Scala, R, and Spark.
Data Scientists at Capital One have the following options to scale their workflows:
- Use a single cloud instance provided by Capital One.
- Scale to a distributed computing system with the assistance of a tech-team.
- Learn frameworks and platforms like Spark, H2O, and Databricks.
For a long time, teams scaled their workflows by rewriting them. Imagine data scientists training a model in scikit-learn, then handing it over to another team to reimplement in Spark using Java. This practice is old and has largely gone away, but some companies still follow it, and it carries a lot of unwanted risk. It becomes a source of friction when Spark and the PyData ecosystem do not get along, and sometimes the distributed architecture diverges so much that the result is no longer the same model. Capital One wanted a way to robustly scale their Python computations in line with the existing PyData stack, and that's where Dask came in.
Growing into Dask
Gil says not everyone needs scalable machine learning. There are two ways to look at scalable computing:
- Memory bound – not enough memory on your local machine, or
- Compute bound – not enough CPU capacity on your system.
Memory constraints are comparatively easier to overcome because many cloud instances offer large amounts of memory, but CPU constraints require distributed computing, which is challenging but also worth it. The next logical question is: how do you move to distributed computing? It's a two-part answer: people and infrastructure.
The People Angle of Distributed Computing
People need to handle both machine learning and distributed computing, so a big part of moving to a distributed computing workflow is educating them. Capital One held internal tutorial sessions that had good turnout rates, which shows that there is an appetite to learn Dask.
It's important to train people because the distributed computing paradigm is very different from local computation, and there are very different costs to keep in mind. Learning some under-the-hood details is valuable for predicting and tuning performance. Gil says:
"I call it the 'original sin of Databricks': they made spinning up a Spark cluster super easy!"
The Infrastructure Angle of Distributed Computing
There are many ways to deploy Dask. Capital One uses SSH, YARN, Kubernetes, and sometimes even manual deployments. They also use AWS EMR and Dask Cloud Provider. Maintaining all of these is difficult, so services that encapsulate all of these are getting increasingly popular, like Coiled Cloud!
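Whichever deployment route you take, the client-side API stays the same. The sketch below is illustrative, not Capital One's setup: only the local cluster actually runs here, and the hostnames and file names in the comments are placeholders; the SSH, YARN, Kubernetes, and AWS variants come from the `distributed`, `dask-yarn`, `dask-kubernetes`, and `dask-cloudprovider` projects respectively.

```python
# Minimal sketch: swapping Dask cluster backends behind one Client API.
# Only LocalCluster is exercised; the other options need real infrastructure.
from dask.distributed import Client, LocalCluster

# Local machine (all cores on one box):
cluster = LocalCluster(n_workers=2, threads_per_worker=1)

# SSH:        from dask.distributed import SSHCluster
#             cluster = SSHCluster(["scheduler-host", "worker-1", "worker-2"])
# YARN:       from dask_yarn import YarnCluster
#             cluster = YarnCluster(environment="environment.tar.gz")
# Kubernetes: from dask_kubernetes import KubeCluster  # API varies by version
# AWS:        from dask_cloudprovider.aws import FargateCluster

client = Client(cluster)
result = client.submit(sum, range(10)).result()  # runs on the cluster
print(result)  # 45
client.close()
cluster.close()
```

Because everything downstream only talks to the `Client`, the same workload can move between deployment options as the infrastructure evolves.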
Gil talks about Capital One’s story of their RAPIDS Infrastructure on AWS — a small number of nodes gave them massive compute!
Parquet is a very powerful file format for distributed computing. Most data at Capital One is in a tabular format, and Parquet is perfect for this scenario; it is very close to a universal serialization format. Gil says this is an interesting place to insert Dask into your workflow: it's not uncommon to do ETL in Spark and then pick up from there using Dask.
There are some pain points to be aware of, though. Spark and Dask both follow the Parquet spec, and the problems lie in the:
- very tiny partitions, which Dask doesn't like very much, and
- lack of a global
Watch the webinar recording to learn more about how this combination is used by the team!
Scaling XGBoost and GPU Acceleration with RAPIDS
The first step while scaling is always to understand your problem: is it CPU bound or memory bound? It's important to consider this because the workflow is different for each case.
Exhaustive grid search over a bunch of folds is a CPU-bound problem; you can use Dask with a joblib backend and train all the folds in parallel. On the other hand, training with more data is a memory-bound problem; here, instead of regular XGBoost, you use distributed XGBoost.
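As a sketch of the CPU-bound case (the toy dataset and parameter grid are illustrative, not Capital One's), scikit-learn's grid search can fan its fold-by-parameter fits out to Dask workers through joblib:

```python
# Minimal sketch: exhaustive grid search with folds trained in parallel on Dask.
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

client = Client(processes=False)  # or point at a real cluster address

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)

# Each fold/parameter fit becomes a task on the Dask cluster.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
client.close()
```

For the memory-bound case, the `xgboost.dask` module (e.g. `xgboost.dask.DaskDMatrix` and `xgboost.dask.train`) plays the analogous role, sharding the training data across workers instead of parallelizing independent fits.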
At this point, you might be curious whether we can do both at the same time. It is possible, but very difficult, because the above methods aren't designed to work together optimally. However, it can be accomplished by adding a new layer of orchestration using tools like Prefect.
Let's look at another example – adopting the scikit-learn API. Generally, you will work through three paradigms: a single-core workflow on your local machine, then Dask to leverage all your CPU cores, then RAPIDS to work with GPUs:
Key points to consider here are:
- Which data structure you're using to load data into memory, and
- Whether your algorithm knows the nuances of working with that data structure (for example, the differences between pandas and Dask DataFrames).
In the webinar recording, Mike introduces practical and useful checks the team uses as guidelines for which data structure to use in a given situation.
Scaling with Dask and Coiled
Mike goes on to discuss different options to scale with Dask. He also uses Coiled to scale to the cloud and demonstrates some checks to verify which paradigm to use. You can follow along with the supporting notebooks and learn more in their blog post: Custom Machine Learning Estimators at Scale on Dask & RAPIDS!