Challenges of Scaling Data Science in Enterprise
• May 3, 2020
Deploying data science and machine learning frameworks to data science teams is made complex by organizational constraints like security, observability, and cost management.
This post lays out the challenges that arise when exposing scalable computing to data science teams in large institutions, and the enterprise infrastructure necessary to meet those challenges.
We then finish with a short pitch for, and then against, all-in-one data science and machine learning platforms.
Data science and machine learning frameworks like Dask, Spark, and TensorFlow promise to let you scale your workloads on large distributed clusters of machines as easily as if you were operating on your own laptop.
This is never as simple as advertised, except in research lab settings where data science groups have full control over their infrastructure.
TODO: image of dask green check marks
To be clear, these frameworks are doing exactly the job that they promised. They provide consistent APIs that let data science users scale from laptops to massively parallel machines. However in order for them to operate within an institutional context they need support from other infrastructure to manage things like …
- Authenticate users
- Secure data access
- Secure connections
- Manage resources among teams
- Gather distributed logs and metrics
- Track, limit, and optimize costs
- Manage custom software environments
… and much more.
Systems like Dask, Spark, and TensorFlow don’t entirely resolve these issues on their own. They need help from other infrastructure like Docker, Kubernetes, Prometheus, TLS, certificate authorities, LDAP, SSO, and so on in order to be used within an institutional setting.
This post motivates the need for that infrastructure.
Manual Dask Deployments
First, let’s take a look at how we would have set up Dask (or any similar system) 20 years ago on unmanaged hardware.
We would walk up to one computer, log in, and then install Dask and run the Dask scheduler.
$ conda install dask $ dask-scheduler Scheduler listening at tcp://scheduler-address
We would then walk up to several other computers, install Dask, and run Dask workers, pointing them to the Dask scheduler
# Machine one $ conda install dask $ dask-worker tcp://scheduler-address # Machine two $ conda install dask $ dask-worker tcp://scheduler-address # Machine three $ conda install dask $ dask-worker tcp://scheduler-address
Then on a final machine we would log in, set up a Jupyter notebook, and connect
to the scheduler from a Python session
# Machine four $ conda install dask jupyter $ jupyter notebook >>> from dask.distributed import Client >>> client = Client("tcp://scheduler-address")
This workflow looks somewhat similar for other systems like Spark, TensorFlow or a database.
This is really simple. Anyone with access to multiple computers can do this today, and they’ll be pretty happy, at least until one of the following things happen:
- They get tired of logging into each of the machines manually
- They decide to update a Python library, but make a mistake in updating it universally across the machines
- Someone else walks into the room and wants to use the same machines
- Someone else walks into the room and listens to the data being sent over the network
- Someone else walks into the room and connects to the same Dask cluster, to impersonate the original researcher
- … there are many other fail cases here
You can’t hold a cluster in your hand
In principle, all of the challenges that we run into come from the fact that our hardware has been split from one machine into several, and we need to learn how to share that hardware with others in a way that is safe, secure, and cheap while still being effective.
Challenges to Deploying Data Science Frameworks
We can split these challenges into three main categories
- Environment and data management: Innovate
Everything that helps simulate the smooth laptop experience for the data scientist
- Software: Do these machines all have the same softawre installed?
- Resource sharing: Can many people share the same hardware?
- Data Access: Where is the user’s data?
- Security and Compliance: Don’t get sued
- Auth: Do they have access to these machines?
- Security: What stops others from connecting and running arbitrary code as the user?
- Cost Management: Don’t go broke
Especially important when we start renting infinite scalability on the cloud
- Avoidance: What stops a novice from idling 100 GPUs?
- Tracking: How much money is everyone spending?
- Optimization: How do we profile and tune for cost?
We’ll go over each topic in the following sections
Environment and data management
Data science and machine learning workloads introduce IT constraints beyond typical data engineering workloads.
- They’re often interactive, and so require rapid spin-up/spin-down of machines with very bursty distributions.
- Software environments change rapidly, often several times a day
- Data science / ML users may not be as technically comfortable with tooling like Kubernetes or Docker
This section is arguably the less well understood of the three. We have solutions for these problems, but they don’t take the data science workflow into account.
Data scientists experiment with different libraries constantly. As a result their software environments are often a mess. Managing this mess across several machines simultaneously is highly error prone, and regularly results in inconsistencies if done manually.
# machine one >>> import pandas >>> pandas.__version__ 1.1.0 # machine two >>> import pandas >>> pandas.__version__ 0.14.2
To resolve this we typically rely on either …
- Docker on container-based frameworks
- A shared network file system (NFS) on HPC frameworks
The NFS system generally works well if the data scientist/ML researcher is allowed to manage their own environment.
However Docker provides a bit of a cultural challenge. Data scientists may not be comfortable with Docker, or if they are they may not have access to manage images, or if they do they may just not want to spend the time to do this very hour.
Here is a recent tweet from senior data scientist at Novartis, Eric Ma:
What we see today
If you’re on an HPC system then a user-managed conda installation on an NFS is swell.
If you’re on a container based system (Cloud, Kubernetes) then we see one of
the two solutions:
- Data scientists can’t update environments. Only IT/team leads can do this.
This functions, but creates frustration for all involved.
- There is some home-grown solution that consumes conda/pip environment files
and produces Docker images on the fly
Alice: I’d like the cluster please
Bob: Sorry, I’m on it now
Alice: Well, can I use half?
Bob: Nope, sorry
Alice: Well, I’m just going t sign in anyway
Decades ago people used to manage cluster resources manually with social
interactions over e-mail, much like the situation above. Today this is mostly
handled with systems like Yarn, Kubernetes, or HPC job schedulers like
SLURM/PBS/LSF. These systems allow IT to define policies about what to do when
everyone wants the cluster, and how people should have to wait for resources,
or how those resources should be split on an as-needed basis.
However configuring these systems appropriately can be a challenge,
and it’s uncommon to find the more volatile needs of data scientists and ML
researchers accounted for in today’s policies. This is also something that
only occurs once an organization has grown large enough to have its own
full-time system administrator.
What we see today
On HPC systems the existing queues work enough so as not to be a huge pain
(although see a related post Tips for Interactive HPC).
For Container based systems (cloud, Kubernetes) we typically don’t find
policies organized around the highly volatile workloads and novice users from
data science / ML. Quota systems are typically done after-the-fact with
tracking rather than stopping people from using too much as they’re using it.
Data scientists are given access to the Kubernetes cluster on a case-by-case
basis until there are too many of them, at which point they’re usually shunted
off to some side cluster.
Where is my data?
While working on a hard drive a Data Scientist has full control over their
data. They download it from the internet, put it in some
directory, can navigate to and discover it from their operating system’s user
interface. It’s like any other tool they’ve used in the last decade.
But when they move to distributed systems their data needs to move to some
shared data store, either a shared network file system, or a cloud object
store that is visible from all of their Dask workers (or Spark executors, or
Tensorflow …). That shared data store comes with a new set of tools, user
interfaces, and permissioning systems. The data scientist can no longer apply
the intuition they’ve built up from using computers over a lifetime to this new
Security and Compliance
Don’t get sued
When our machine leaves our hands,
is split apart into many machines connected by insecure wires,
and is managed by or rented from other people,
we suddenly have to start caring about who can see which bits of data.
Fortunately, security is an old problem and there are well trusted solutions.
In a data science setting our objective is to use these solutions in a way that
minimally disrupts the innovative process of a data scientist / ML researcher.
This is both to improve their workflow, and to ensure that they don’t feel the
urge to work around the system. Policies that are easy to follow get followed.
>>> dask.dataframe.read_csv("s3://bucket/myfile.csv") PermissionError: sorry, machine-two lacks permissions to access this data on your behalf
You probably already have a system to have individuals log in.