Deploying data science and machine learning frameworks in large institutions is made complex by organizational constraints like security, observability, and cost management.
This post lays out the challenges that arise when exposing scalable computing to data science teams in large institutions, and the enterprise infrastructure necessary to meet those challenges.
We then finish with a short pitch for, and then against, all-in-one data science and machine learning platforms.
Data science and machine learning frameworks like Dask, Spark, and TensorFlow promise to let you scale your workloads on large distributed clusters of machines as easily as if you were operating on your own laptop.
This is never as simple as advertised, except in research lab settings where data science groups have full control over their infrastructure.
To be clear, these frameworks are doing exactly the job that they promised. They provide consistent APIs that let data science users scale from laptops to massively parallel machines. However, in order to operate within an institutional context, they need support from other infrastructure to manage things like …
… and much more.
Systems like Dask, Spark, and TensorFlow don’t entirely resolve these issues on their own. They need help from other infrastructure like Docker, Kubernetes, Prometheus, TLS, certificate authorities, LDAP, SSO, and so on in order to be used within an institutional setting.
This post motivates the need for that infrastructure.
First, let’s take a look at how we would have set up Dask (or any similar system) 20 years ago on unmanaged hardware.
We would walk up to one computer, log in, and then install Dask and run the Dask scheduler.
```
$ conda install dask
$ dask-scheduler
Scheduler listening at tcp://scheduler-address
```
We would then walk up to several other computers, install Dask, and run Dask workers, pointing them at the Dask scheduler:
```
# Machine one
$ conda install dask
$ dask-worker tcp://scheduler-address

# Machine two
$ conda install dask
$ dask-worker tcp://scheduler-address

# Machine three
$ conda install dask
$ dask-worker tcp://scheduler-address
```
Then on a final machine we would log in, set up a Jupyter notebook, and connect to the scheduler from a Python session:
```
# Machine four
$ conda install dask jupyter
$ jupyter notebook
```
```python
>>> from dask.distributed import Client
>>> client = Client("tcp://scheduler-address")
```
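For readers who haven't used Dask before, here is a hedged sketch of what a session looks like once the client is connected. To keep the example self-contained it asks `Client` to start a local in-process cluster; in the walkthrough above you would pass the scheduler's address instead.

```python
# A minimal sketch of using a connected Dask client. Client(processes=False)
# starts a local in-process cluster so this runs anywhere; in the scenario
# above you would pass "tcp://scheduler-address" instead.
from dask.distributed import Client

client = Client(processes=False)  # local threads stand in for remote workers

def square(x):
    return x ** 2

# Submit work; the scheduler farms tasks out to whichever worker is free
# and hands back futures that resolve to results.
futures = client.map(square, range(5))
print(client.gather(futures))  # [0, 1, 4, 9, 16]
```

The same two lines of connection code work unchanged whether the workers are threads on your laptop or hundreds of machines behind a scheduler, which is exactly the promise these frameworks make.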
This workflow looks broadly similar for other systems like Spark, TensorFlow, or a database.
This is really simple. Anyone with access to multiple computers can do this today, and they’ll be pretty happy, at least until one of the following things happens:
In principle, all of the challenges that we run into come from the fact that our hardware has been split from one machine into several, and we need to learn how to share that hardware with others in a way that is safe, secure, and cheap while still being effective.
We can split these challenges into three main categories:

1. Environment and data management (innovate): everything that helps simulate the smooth laptop experience for the data scientist
2. Security and compliance (don’t get sued)
3. Cost management (don’t go broke): especially important when we start renting infinite scalability on the cloud
We’ll go over each topic in the following sections.
Data science and machine learning workloads introduce IT constraints beyond typical data engineering workloads.
This section is arguably the least well understood of the three. We have solutions for these problems, but they don’t take the data science workflow into account.
Data scientists experiment with different libraries constantly. As a result their software environments are often a mess. Managing this mess across several machines simultaneously is highly error prone, and regularly results in inconsistencies if done manually.
```python
# Machine one
>>> import pandas
>>> pandas.__version__
'1.1.0'

# Machine two
>>> import pandas
>>> pandas.__version__
'0.14.2'
```
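One lightweight way to catch this kind of drift automatically is to fingerprint each machine's installed packages and compare. This is a sketch using only the standard library; the function name is ours, not part of any framework.

```python
# Hedged sketch: hash the installed package set so that version mismatches
# like the pandas example above can be detected before a job runs.
import hashlib
from importlib import metadata

def environment_fingerprint() -> str:
    """Hash the sorted (name, version) pairs of installed distributions."""
    pairs = sorted(
        (dist.metadata["Name"] or "", dist.version or "")
        for dist in metadata.distributions()
    )
    return hashlib.sha256(repr(pairs).encode()).hexdigest()

# Run this on every machine; differing digests mean differing environments.
print(environment_fingerprint()[:12])
```

In practice this check would run at worker startup, refusing to join the cluster (or at least warning loudly) when the digest doesn't match the scheduler's.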
To resolve this we typically rely on either …
The NFS system generally works well if the data scientist/ML researcher is allowed to manage their own environment.
However, Docker presents a bit of a cultural challenge. Data scientists may not be comfortable with Docker; if they are, they may not have access to manage images; and even if they do, they may not want to spend the time doing this every hour.
Here is a recent tweet from Eric Ma, a senior data scientist at Novartis:
What we see today
If you’re on an HPC system then a user-managed conda installation on an NFS is swell.
If you’re on a container-based system (cloud, Kubernetes) then we see one of two solutions:
Alice: I’d like the cluster please
Bob: Sorry, I’m on it now
Alice: Well, can I use half?
Bob: Nope, sorry
Alice: Well, I’m just going to sign in anyway
Decades ago, people managed cluster resources manually with social interactions over e-mail, much like the exchange above. Today this is mostly handled by systems like YARN, Kubernetes, or HPC job schedulers like SLURM, PBS, and LSF. These systems allow IT to define policies about what happens when everyone wants the cluster at once: who has to wait for resources, and how those resources should be split on an as-needed basis.
However, configuring these systems appropriately can be a challenge, and it’s uncommon to find the more volatile needs of data scientists and ML researchers accounted for in today’s policies. This is also something that only comes up once an organization has grown large enough to have its own full-time system administrator.
What we see today
On HPC systems the existing queues work well enough not to be a huge pain (although see the related post, Tips for Interactive HPC).
For container-based systems (cloud, Kubernetes) we typically don’t find policies organized around the highly volatile workloads and novice users coming from data science and ML. Quota systems typically track usage after the fact rather than stopping people from over-consuming as it happens. Data scientists are given access to the Kubernetes cluster on a case-by-case basis until there are too many of them, at which point they’re usually shunted off to some side cluster.
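To make the distinction concrete, here is a toy sketch of up-front enforcement versus after-the-fact tracking. This is our own illustration, not a real Kubernetes or SLURM API.

```python
# Toy quota: deny a request before resources are consumed, instead of
# discovering the overrun in next month's usage report.
class ClusterQuota:
    def __init__(self, max_workers: int):
        self.max_workers = max_workers
        self.in_use = 0

    def request(self, user: str, workers: int) -> bool:
        """Enforce the limit at request time."""
        if self.in_use + workers > self.max_workers:
            return False  # denied up front; nothing to claw back later
        self.in_use += workers
        return True

    def release(self, workers: int) -> None:
        self.in_use = max(0, self.in_use - workers)

quota = ClusterQuota(max_workers=100)
print(quota.request("alice", 80))  # True: granted
print(quota.request("bob", 40))    # False: would exceed the quota
quota.release(80)
print(quota.request("bob", 40))    # True: capacity was freed
```

Real schedulers are vastly more sophisticated (priorities, preemption, fair-share), but the key property is the same: the policy says no at request time, rather than generating a report afterwards.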
Where is my data?
While working on a hard drive, a data scientist has full control over their data. They download it from the internet, put it in some directory, and can navigate to and discover it from their operating system’s user interface. It’s like any other tool they’ve used in the last decade.
But when they move to distributed systems their data needs to move to some shared data store, either a shared network file system or a cloud object store that is visible from all of their Dask workers (or Spark executors, or TensorFlow …). That shared data store comes with a new set of tools, user interfaces, and permissioning systems. The data scientist can no longer apply the intuition they’ve built up from using computers over a lifetime to this new system.
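As a small illustration of the shift (standard library only; the bucket name is made up), a location on a shared object store is a URL with a scheme and a bucket, not a folder you can browse:

```python
# A local path versus an object-store URL: the latter has a scheme and a
# bucket, and no "open the folder and look" affordance.
from urllib.parse import urlparse

local = urlparse("/home/alice/data/myfile.csv")
remote = urlparse("s3://bucket/myfile.csv")

print(local.scheme or "(plain local path)")      # no scheme at all
print(remote.scheme, remote.netloc, remote.path)  # s3 bucket /myfile.csv
```

Every tool in the remote case (listing, permissions, discovery) is new, which is exactly where the built-up intuition stops transferring.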
Don’t get sued
When our machine leaves our hands, is split apart into many machines connected by insecure wires, and is managed by or rented from other people, we suddenly have to start caring about who can see which bits of data.
Fortunately, security is an old problem and there are well trusted solutions.
In a data science setting our objective is to use these solutions in a way that minimally disrupts the innovative process of a data scientist / ML researcher. This is both to improve their workflow, and to ensure that they don’t feel the urge to work around the system. Policies that are easy to follow get followed.
```python
>>> dask.dataframe.read_csv("s3://bucket/myfile.csv")
PermissionError: sorry, machine-two lacks permissions to access this data on your behalf
You probably already have a system for individuals to log in.