Distributed Data Science for IT professionals

by Hugo Bowne-Anderson, Matthew Rocklin

Scaling Data Science is a Team Sport

An increasing number of organizations need to scale data science to larger datasets and larger models. However, deploying distributed data science frameworks in secure enterprise environments can be surprisingly challenging because we need to simultaneously satisfy multiple sets of stakeholders within the organization: data scientists, IT, and management.

Solving simultaneously for all sides of this problem is a cultural and political challenge as much as a technical one. This is the problem that we’re passionate about solving at Coiled, and that we recently spoke about in our PyCon 2020 talk.

In this post, we’ll discuss the pain points felt by IT when trying to deploy data processing technologies to provide data scientists with distributed computing. In other posts, we do the same for data scientists and for management.

We often see the pain points felt by IT boil down to three main challenges:

  1. Predictability: Can I ensure that critical production workloads will continue unimpeded?
  2. Security: Is our sensitive data appropriately protected from external or internal threats?
  3. Observability: Can we see what’s going on?

We’ll call out these challenges in each of the sections below.

Slide from our PyCon 2020 talk “Challenges of Deploying Distributed Computing”

1 - Predictability

Can I ensure that critical production workloads will continue unimpeded?

As we discussed in the companion post from the data science perspective, data science workloads are highly volatile in a variety of ways:

  1. Data science workloads are bursty, often requiring many workers for a very short time at unpredictable intervals;
  2. Data science workloads often require broad access to install a variety of software and access a variety of data;
  3. Data scientists themselves are often not as technically familiar with distributed systems as IT or data engineering professionals.

This volatility can negatively impact systems that run critical workloads. As IT professionals, keeping these systems up and running is often our primary mandate, and so we may have to say “no” to the data science crew who wants to train a really big machine learning model on the production cluster.
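One way to say “yes, up to a point” instead of “no” is to reserve capacity for production and cap each burst. The following is a minimal sketch of that idea, not a feature of any particular scheduler: the `ClusterPolicy` class, the `grant_workers` function, and all of the numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterPolicy:
    """Hypothetical capacity policy; field names and values are illustrative."""
    total_workers: int            # workers in the shared cluster
    reserved_for_production: int  # workers that ad hoc jobs may never take
    per_request_cap: int          # upper bound for any single burst request


def grant_workers(policy: ClusterPolicy, in_use_by_data_science: int, requested: int) -> int:
    """Return how many workers a burst request may receive (possibly zero)."""
    if requested < 0 or in_use_by_data_science < 0:
        raise ValueError("worker counts must be non-negative")
    # Only the part of the cluster outside the production reservation is shared.
    shared_pool = policy.total_workers - policy.reserved_for_production
    available = max(shared_pool - in_use_by_data_science, 0)
    # The grant is limited by the request, the per-request cap, and what is free.
    return min(requested, policy.per_request_cap, available)


policy = ClusterPolicy(total_workers=100, reserved_for_production=60, per_request_cap=25)
print(grant_workers(policy, in_use_by_data_science=10, requested=50))  # capped at 25
print(grant_workers(policy, in_use_by_data_science=35, requested=50))  # only 5 left
```

In this sketch a bursty request is trimmed rather than rejected, so production keeps its reserved workers no matter how many data science jobs arrive.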

2 - Security

Ensuring the right people can access their data, and the wrong people can’t

Every company has private data that they are compelled to protect for business, legal, or ethical reasons. Distributed data science workflows are difficult to secure for a few reasons:

  1. They often touch many different data sources;
  2. Many machines are involved, all of which will need proper auth from the user;
  3. We communicate data over potentially insecure wires.

Data scientists are accustomed to working on a personal computer over which they exercise physical control. As a result, their daily practices rarely account for authentication or network security. This isn’t something that they are accustomed to, or incentivized to, think about. That is our job as IT professionals.

Fortunately, robust solutions for these problems have been around for decades. It’s our job to properly hook these systems up. Given the growth of distributed computing systems in the last decade, however, keeping up with integrating each of them with internal auth systems and certificate authorities (CAs) can be a challenge.
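As an illustration of one such long-standing solution, mutual TLS lets every machine in a cluster prove its identity with a certificate signed by an internal CA, which also encrypts data on the wire. Below is a minimal sketch using only Python’s standard `ssl` module. The function name `build_mutual_tls_context` and its file-path parameters are our own assumptions; a real distributed framework will have its own way of accepting certificates, which this sketch does not attempt to show.

```python
import ssl


def build_mutual_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a server-side TLS context that also requires a client certificate.

    The file paths are placeholders for certificates issued by an internal CA.
    """
    # Purpose.CLIENT_AUTH yields a context for the server side of a connection.
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Refuse any peer that does not present a certificate we can verify.
    context.verify_mode = ssl.CERT_REQUIRED
    if ca_file is not None:
        # Trust only certificates signed by our own certificate authority.
        context.load_verify_locations(cafile=ca_file)
    if cert_file is not None:
        # The certificate and private key this machine presents to its peers.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


context = build_mutual_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
```

The point of the sketch is that the cryptography is already solved; the work that remains for IT is issuing certificates and distributing them to every machine involved.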

3 - Observability

On most days IT plays a reactive role: they respond to fires and keep things running smoothly. To do this job well, they need to know what is going on.

The most common request we get from IT departments is about observability. On most days IT plays a reactive role: they respond to fires and keep things running smoothly. To do this job well, they need to know what is going on. Fortunately, many systems exist today to collect and aggregate basic metrics. As long as distributed data science frameworks publish metrics in a standard format, such as the one Prometheus scrapes, existing systems should pick everything up.

However, open source projects will need to curate a set of pragmatic metrics, and recommend views of those metrics relevant for diagnosing system health.
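To make “publishing metrics” concrete, here is a minimal sketch that renders a small, curated set of gauges in the Prometheus text exposition format. The `render_prometheus` function and the metric names are hypothetical, picked only to show the shape of the output; they are not the metrics of any specific framework.

```python
def render_prometheus(metrics):
    """Render gauges in the Prometheus text exposition format.

    `metrics` maps a metric name to a (help text, numeric value) pair.
    """
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The format is line-oriented and the final line ends with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical metrics a cluster might choose to curate for IT.
example = {
    "cluster_workers": ("Number of connected workers", 4),
    "cluster_tasks_waiting": ("Tasks waiting to be scheduled", 17),
}
print(render_prometheus(example), end="")
```

Serving this text from an HTTP endpoint that the existing monitoring system scrapes is usually all that is needed for the metrics to show up alongside everything else IT already watches.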

Ergonomics give us control

As a design principle, let’s make our systems as ergonomic as possible so that data science users stay within our guidelines rather than working around them.

Finally, we are incentivized to create systems that are ergonomic for data scientists; otherwise, data scientists will subvert our efforts.

As IT, it is often our job to say “no”:

  • No, you can’t take over the entire cluster tonight;
  • No, you can’t run that GPL-licensed software in production;
  • No, you can’t give your credentials to your friend who is having trouble accessing the database.

And yet we know from experience that if we say “no” too often, data scientists find ways to work around our constraints. What they sometimes lack in experience using distributed systems they more than make up for in creativity. And so, as a design principle, it is in our interest to make our systems as ergonomic as possible so that data science users stay within our guidelines by choice rather than feel compelled to work around them.

This is really hard to do. On a cultural level, most IT professionals don’t have experience operating as data scientists and vice-versa. What seems ergonomic to an IT professional rarely seems ergonomic to a data scientist. Consider the AWS APIs as a classic example :)

Conclusion

It is straightforward but time-consuming to deploy distributed data science frameworks in a mature setting. Integrating with auth systems, setting up certificate authorities, passing around credentials, setting up metrics, and more are all normal activities for IT teams, but this is certainly an investment.

Anecdotally, we find that this IT time investment is where most companies stall on adopting transformational data science stacks. Many companies have a plan in place to transition to data science at scale, but in the meantime are mostly asking their data scientists to operate on single large machines. This time in purgatory always lasts longer than expected.

Instead of paying for the IT team to develop this infrastructure and incurring that delay, companies increasingly purchase managed solutions like Coiled Cloud. In our experience, this often ends up saving money and increasing velocity.

We’re really excited to be building products for scaling data science in Python to larger datasets and larger models, particularly for organizations that want a seamless transition from working with small data to big data. If the challenges we’ve outlined resonate with you, we’d love it if you got in touch with us to discuss our product development.

