Challenges of Scaling Data Science in Enterprise

by Matthew Rocklin

Summary

Deploying data science and machine learning frameworks to data science teams is made complex by organizational constraints like security, observability, and cost management.

This post lays out the challenges that arise when exposing scalable computing to data science teams in large institutions, and the enterprise infrastructure necessary to meet those challenges.

We then finish with a short pitch for, and then against, all-in-one data science and machine learning platforms.

Motivation

Data science and machine learning frameworks like Dask, Spark, and TensorFlow promise to let you scale your workloads on large distributed clusters of machines as easily as if you were operating on your own laptop.

This is never as simple as advertised, except in research lab settings where data science groups have full control over their infrastructure.

To be clear, these frameworks are doing exactly the job that they promised. They provide consistent APIs that let data science users scale from laptops to massively parallel machines. However, in order to operate within an institutional context they need support from other infrastructure to do things like …

  • Authenticate users
  • Secure data access
  • Secure connections
  • Manage resources among teams
  • Gather distributed logs and metrics
  • Track, limit, and optimize costs
  • Manage custom software environments

… and much more.

Systems like Dask, Spark, and TensorFlow don’t entirely resolve these issues on their own. They need help from other infrastructure like Docker, Kubernetes, Prometheus, TLS, certificate authorities, LDAP, SSO, and so on in order to be used within an institutional setting.

This post motivates the need for that infrastructure.

Manual Dask Deployments

First, let’s take a look at how we would have set up Dask (or any similar system) 20 years ago on unmanaged hardware.

We would walk up to one computer, log in, and then install Dask and run the Dask scheduler.

$ conda install dask
$ dask-scheduler
Scheduler listening at tcp://scheduler-address

We would then walk up to several other computers, install Dask, and run Dask workers, pointing them to the Dask scheduler.

# Machine one
$ conda install dask
$ dask-worker tcp://scheduler-address

# Machine two
$ conda install dask
$ dask-worker tcp://scheduler-address

# Machine three
$ conda install dask
$ dask-worker tcp://scheduler-address

Then on a final machine we would log in, set up a Jupyter notebook, and connect to the scheduler from a Python session.

# Machine four
$ conda install dask jupyter
$ jupyter notebook
>>> from dask.distributed import Client
>>> client = Client("tcp://scheduler-address")
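
At this point the cluster behaves much like a local Python session. As a quick illustration (a minimal sketch; the array computation is only an example):

>>> import dask.array as da
>>> x = da.random.random((10000, 10000), chunks=(1000, 1000))  # chunks live on the workers
>>> x.mean().compute()                                          # result returns to the notebook, roughly 0.5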

This workflow looks somewhat similar for other systems like Spark, TensorFlow, or a database.

This is really simple. Anyone with access to multiple computers can do this today, and they’ll be pretty happy, at least until one of the following things happens:

  1. They get tired of logging into each of the machines manually
  2. They decide to update a Python library, but fail to update it consistently across all of the machines
  3. Someone else walks into the room and wants to use the same machines
  4. Someone else walks into the room and listens to the data being sent over the network
  5. Someone else walks into the room and connects to the same Dask cluster, to impersonate the original researcher
  6. … there are many other fail cases here

You can’t hold a cluster in your hand

In principle, all of the challenges that we run into come from the fact that our hardware has been split from one machine into several, and we need to learn how to share that hardware with others in a way that is safe, secure, and cheap while still being effective.

Challenges to Deploying Data Science Frameworks

We can split these challenges into three main categories

  1. Environment and data management: Innovate

    Everything that helps simulate the smooth laptop experience for the data scientist

    1. Software: Do these machines all have the same software installed?
    2. Resource sharing: Can many people share the same hardware?
    3. Data Access: Where is the user’s data?
  2. Security and Compliance: Don’t get sued

    1. Auth: Do they have access to these machines?
    2. Security: What stops others from connecting and running arbitrary code as the user?
  3. Cost Management: Don’t go broke

    Especially important when we start renting infinite scalability on the cloud

    1. Avoidance: What stops a novice from idling 100 GPUs?
    2. Tracking: How much money is everyone spending?
    3. Optimization: How do we profile and tune for cost?

We’ll go over each topic in the following sections.

Environment and data management

Data science and machine learning workloads introduce IT constraints beyond typical data engineering workloads.

  1. They’re often interactive, and so require rapid spin-up/spin-down of machines with very bursty usage patterns.
  2. Software environments change rapidly, often several times a day.
  3. Data science / ML users may not be as technically comfortable with tooling like Kubernetes or Docker as their data engineering counterparts.

This section is arguably the least well understood of the three. We have solutions for these problems, but they don’t take the data science workflow into account.

Software

Data scientists experiment with different libraries constantly. As a result their software environments are often a mess. Managing this mess across several machines simultaneously is highly error prone, and regularly results in inconsistencies if done manually.

# machine one
>>> import pandas
>>> pandas.__version__
1.1.0

# machine two
>>> import pandas
>>> pandas.__version__
0.14.2
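
Dask can at least surface this kind of mismatch when a client connects. A quick check, sketched below (dask.distributed’s Client.get_versions with check=True compares package versions across the scheduler, workers, and client):

>>> from dask.distributed import Client
>>> client = Client("tcp://scheduler-address")
>>> client.get_versions(check=True)  # complains if scheduler, workers, and client disagree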

To resolve this we typically rely on either …

  1. Docker on container-based systems
  2. A shared network file system (NFS) on HPC systems

The NFS system generally works well if the data scientist/ML researcher is allowed to manage their own environment.

However, Docker presents a bit of a cultural challenge. Data scientists may not be comfortable with Docker, or if they are they may not have access to manage images, or if they do they may just not want to spend the time to do this every hour.

Here is a recent tweet from Eric Ma, a senior data scientist at Novartis:

What we see today

If you’re on an HPC system then a user-managed conda installation on an NFS is swell.

If you’re on a container-based system (cloud, Kubernetes) then we see one of two solutions:

  1. Data scientists can’t update environments. Only IT/team leads can do this. This functions, but creates frustration for all involved.
  2. There is some home-grown solution that consumes conda/pip environment files and produces Docker images on the fly (sketched roughly below).
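
As a rough sketch of that second, home-grown approach (the base image, environment file, and image tag below are hypothetical placeholders):

# build_image.py -- hypothetical glue: conda environment file in, Docker image out
import subprocess
import tempfile
from pathlib import Path

DOCKERFILE = """\
FROM condaforge/miniforge3:latest
COPY environment.yml /tmp/environment.yml
RUN conda env update --name base --file /tmp/environment.yml && conda clean --all --yes
"""

def build_image(environment_yml: str, tag: str) -> None:
    """Build a Docker image whose base conda environment matches environment_yml."""
    with tempfile.TemporaryDirectory() as context:
        Path(context, "Dockerfile").write_text(DOCKERFILE)
        Path(context, "environment.yml").write_text(Path(environment_yml).read_text())
        subprocess.run(["docker", "build", "--tag", tag, context], check=True)

build_image("environment.yml", "registry.example.com/data-science/analysis:latest")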

Resource Sharing

Alice: I’d like the cluster please
Bob: Sorry, I’m on it now
Alice: Well, can I use half?
Bob: Nope, sorry
Alice: Well, I’m just going to sign in anyway

Decades ago people used to manage cluster resources manually with social interactions over e-mail, much like the situation above. Today this is mostly handled with systems like YARN, Kubernetes, or HPC job schedulers like SLURM/PBS/LSF. These systems allow IT to define policies about what happens when everyone wants the cluster at once: who has to wait for resources, and how those resources should be split on an as-needed basis.

However, configuring these systems appropriately can be a challenge, and it’s uncommon to find the more volatile needs of data scientists and ML researchers accounted for in today’s policies. Proper configuration also tends to happen only once an organization has grown large enough to have its own full-time system administrator.
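
To make that concrete: on an HPC system a data scientist would ideally request workers through the existing queue rather than grabbing machines directly, for example with dask-jobqueue (a sketch; the queue name and resource sizes are assumptions):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Ask SLURM for workers instead of logging into machines by hand;
# the site's queueing policies decide when, and whether, these jobs run.
cluster = SLURMCluster(
    queue="interactive",  # hypothetical queue name
    cores=4,
    memory="16GB",
    walltime="01:00:00",
)
cluster.scale(jobs=10)    # request ten worker jobs
client = Client(cluster)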

What we see today

On HPC systems the existing queues work well enough not to be a huge pain (although see the related post Tips for Interactive HPC).

For container-based systems (cloud, Kubernetes) we typically don’t find policies organized around the highly volatile workloads and novice users from data science / ML. Quotas are typically enforced after the fact through tracking, rather than by stopping people from using too much in the moment. Data scientists are given access to the Kubernetes cluster on a case-by-case basis until there are too many of them, at which point they’re usually shunted off to some side cluster.

Data Access

Where is my data?

While working on their own hard drive, a data scientist has full control over their data. They download it from the internet, put it in some /datasets directory, and can navigate to and discover it from their operating system’s user interface. It’s like any other tool they’ve used in the last decade.

But when they move to distributed systems their data needs to move to some shared data store, either a shared network file system or a cloud object store that is visible from all of their Dask workers (or Spark executors, or TensorFlow …). That shared data store comes with a new set of tools, user interfaces, and permissioning systems. The data scientist can no longer apply the intuition they’ve built up from using computers over a lifetime to this new system.
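
For example, “browsing for data” now means talking to an object store through a library rather than a file manager. A minimal sketch, assuming the s3fs library and a hypothetical bucket name:

>>> import s3fs
>>> fs = s3fs.S3FileSystem()          # credentials come from the environment, not a login prompt
>>> fs.ls("my-team-bucket/datasets")  # hypothetical bucket; no folders to double-click through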

Security and Compliance

Don’t get sued

When our machine leaves our hands, is split apart into many machines connected by insecure wires, and is managed by or rented from other people, we suddenly have to start caring about who can see which bits of data.

Fortunately, security is an old problem and there are well trusted solutions.

In a data science setting our objective is to use these solutions in a way that minimally disrupts the innovative process of a data scientist / ML researcher. This is both to improve their workflow, and to ensure that they don’t feel the urge to work around the system. Policies that are easy to follow get followed.

Auth

>>> import dask.dataframe
>>> dask.dataframe.read_csv("s3://bucket/myfile.csv")
PermissionError:
    sorry, machine-two lacks permissions
    to access this data on your behalf

You probably already have a system for individuals to log in.

