Distributed Computing for Data Scientists

Scaling Data Science is a Team Sport

An increasing number of organizations need to scale data science to larger datasets and larger models. However, deploying distributed data science frameworks in secure enterprise environments can be surprisingly challenging because we need to simultaneously satisfy multiple sets of stakeholders within the organization: data scientists, IT, and management.

Solving simultaneously for all sides of this problem is a cultural and political challenge as much as a technical one. This is the problem that we’re passionate about solving at Coiled, and that we recently spoke about in our PyCon 2020 talk.

In this post, we’ll discuss the pain points data scientists face and the tooling you need around distributed computing technology to make it usable for them. In future posts, we’ll do the same for IT and for management.

A diagram showing how your cluster is not a laptop.

Slide from our PyCon 2020 talk “Challenges of Deploying Distributed Computing”

Data Scientists need flexibility at scale

The end users of these cloud platforms or on-premise clusters are typically data engineers, data scientists, or analysts. These data professionals are used to working on their laptop, where they have full control and full visibility over their system, can modify it at will, and use familiar tools.

They switch to a distributed system because they have to, usually because of scale, not because they want to. To be successful, we need to give them an experience that is as flexible and comfortable as what they had before; otherwise, they will find ways to stay on their laptops.

We often see this reduce to three main challenges:

  1. Software: Do my machines all have the same software installed? Can I upgrade packages easily?
  2. Resource sharing: Can I share these same machines with my team? How quickly can I get 100 machines, even if only for a few minutes?
  3. Data Access: Where is my data? Can my machines access it too?

Solving these problems for data scientists is particularly difficult because data science workloads are less predictable than the data engineering workloads for which many of today’s cluster resources are currently optimized. We’ll call out these challenges in each of the sections below.

Software

How can I pip/conda install the latest scikit-learn onto the cluster?

Data scientists change their software environments several times a day. They tend to use a combination of conda environments, Python packages installed with pip, custom source code in a local /src directory, and functions defined in Jupyter notebooks, each of which changes rapidly, and many of which depend on bleeding-edge software. Data scientists live on the edge of software development.

However, most distributed infrastructure today is designed around more slowly moving data engineering or production workloads, which typically use Docker images updated on a weekly or monthly basis. The Docker image workflow doesn’t make sense for data scientists: there is too much friction, and it’s not a familiar tool for many of them.

Data scientists need a system that lets them modify Python packages, and then automatically builds and deploys Docker containers for them.
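One small piece of such a system is simply noticing when the laptop and the cluster have drifted apart. The following is a minimal sketch, not a description of any particular product: it uses Python's standard importlib.metadata to snapshot installed packages and compares two snapshots. The function names and the example package versions are assumptions made up for illustration.

```python
from importlib import metadata


def environment_snapshot():
    """Map each installed distribution name (lowercased) to its version."""
    snapshot = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            snapshot[name.lower()] = dist.version
    return snapshot


def diff_environments(local, remote):
    """Return {package: (local_version, remote_version)} for every mismatch.

    None on either side means the package is missing from that environment.
    """
    mismatches = {}
    for name in sorted(set(local) | set(remote)):
        local_version = local.get(name)
        remote_version = remote.get(name)
        if local_version != remote_version:
            mismatches[name] = (local_version, remote_version)
    return mismatches


# Hypothetical snapshots: in practice, `worker` would be gathered by running
# environment_snapshot() on a cluster machine.
laptop = {"scikit-learn": "1.4.2", "pandas": "2.2.1", "my-project": "0.3.0"}
worker = {"scikit-learn": "1.3.0", "pandas": "2.2.1"}

drift = diff_environments(laptop, worker)
for package, (on_laptop, on_worker) in drift.items():
    print(f"{package}: laptop={on_laptop} worker={on_worker}")
```

A check like this only detects drift; closing the gap still requires something that rebuilds and redeploys the environment on the data scientist's behalf.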

An Eric Ma tweet about bursting to the cloud and dynamically-changing project source code.

Resource Sharing

Is anyone using the cluster? Is it ok if I take over all of the GPU nodes for the evening?

Groups transitioning from ad-hoc in-house clusters may be familiar with emailing colleagues to ask whether they can take over the cluster for a while. This works well for smaller groups, but breaks down in larger organizations. Fortunately, there are mature and robust resource management systems like YARN, Kubernetes, HPC job schedulers like PBS/SGE/SLURM/LSF, or various cloud APIs that can manage sharing resources between different groups.

However, the policies around these systems are typically not well optimized for data science workloads. This is for a few reasons:

  1. Data science workloads tend to be bursty. A data scientist may want 1000 machines for a few minutes, and then will spend the next hour staring at a plot. They’ll then tweak a parameter and want 1000 machines again relatively quickly. They don’t want to sit in a queue for an hour.
  2. Data scientists are often unfamiliar with technologies like Kubernetes or cloud APIs. As a result, these production systems are often not available to them.
  3. Even when data scientists are knowledgeable about Kubernetes, IT groups will often grant access only grudgingly, having been burned before by other data science groups who misused these resources.
  4. Distributed computing resources in a company are often designed to handle production workloads, so there is frequently some political wrangling about who gets access to the cluster. Data science usually isn’t on that list, even though access to bursty compute is key for scalable data science.
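
To put rough numbers on the bursty pattern in the first item, here is a back-of-the-envelope sketch. The burst length (5 minutes), the burst frequency (once an hour), and the 8-hour working day are assumptions chosen for illustration; only the 1000-machine figure comes from the example above.

```python
def machine_hours(machines, minutes_per_burst, bursts):
    """Total machine-hours consumed by a number of short bursts."""
    return machines * minutes_per_burst * bursts / 60


# Assumed scenario: one 5-minute burst per hour over an 8-hour day.
machines = 1000
workday_hours = 8
burst_minutes = 5
bursts = workday_hours

# Compute actually used when machines are requested only for each burst.
bursty = machine_hours(machines, burst_minutes, bursts)

# Compute reserved when the same machines are held for the whole day.
held = machines * workday_hours

# Fraction of the reserved machine-hours that does useful work.
utilization = bursty / held

print(f"bursty: {bursty:.0f} machine-hours")
print(f"held all day: {held} machine-hours")
print(f"utilization when held: {utilization:.1%}")
```

Under these assumptions, a reservation held all day is idle for more than nine-tenths of its machine-hours, which is why queue time and allocation policy matter so much for this kind of workload.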

Supporting highly volatile workloads is critical to get right. When this fails, we see some of the following negative effects:

  1. When it takes a long time to get machines, data scientists will choose to hold onto those machines all day, even if they spend most of their time staring at plots.
  2. When it’s difficult to get access at all, data scientists will often download samples of datasets to their local machines, resulting in many disparate copies of data floating around the company in insecure side channels.

An image of a woman with her hands over her face.

What it can feel like trying to get access to machines

Data Access

Where is my data?

As data scientists working on our laptops, we’re accustomed to managing data files on our hard drives. We have tools to discover datasets, move them around, download them from the web, and so on. Tools like web browsers and file explorers are familiar to us from decades of personal computing use.

However, when our data moves out to some remote storage, we’re asked to learn a whole new set of tools. Those tools are rarely as ergonomic as the operating system on our laptop, and, even worse, they tend to be very different depending on where we get our data (S3 looks different from HDFS, which looks different from a database).

As a result, we often miss out on critical data that our company has painstakingly collected for our benefit, or we rely on the same data subsets that our peers have previously downloaded. Ease of access encourages centralization.

Data science workloads are particularly challenging here (relative to data engineering or analysis workloads) because data scientists are very often asked to fuse many disparate data sources into novel analyses. For example, we might be asked to join the web logs stored in Parquet on S3 with the customer database stored in Snowflake. Knowing how to find and authenticate against many systems is the first and largest roadblock we face.
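One way to picture the "find and authenticate" problem is as a small catalog that records, for each dataset, where it lives and which credential it needs. The sketch below is purely illustrative and uses only the Python standard library: the catalog entries, URIs, and environment variable names are hypothetical placeholders, not real endpoints or the conventions of any particular tool.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    uri: str             # where the data lives
    credential_env: str  # environment variable expected to hold credentials


# Hypothetical catalog covering the two sources from the example above.
CATALOG = {
    "web_logs": DatasetEntry(
        "web_logs", "s3://example-bucket/web-logs/", "EXAMPLE_S3_CREDENTIALS"
    ),
    "customers": DatasetEntry(
        "customers",
        "snowflake://example-account/analytics/customers",
        "EXAMPLE_SNOWFLAKE_CREDENTIALS",
    ),
}


def access_plan(names, environ):
    """For each dataset, report its storage backend and whether the
    credential it needs is present in the given environment mapping."""
    plan = []
    for name in names:
        entry = CATALOG[name]
        plan.append(
            {
                "dataset": name,
                "backend": urlparse(entry.uri).scheme,
                "credential_env": entry.credential_env,
                "ready": entry.credential_env in environ,
            }
        )
    return plan


# Example: S3 credentials are configured, Snowflake credentials are not.
plan = access_plan(["web_logs", "customers"], {"EXAMPLE_S3_CREDENTIALS": "..."})
for step in plan:
    status = "ready" if step["ready"] else f"missing {step['credential_env']}"
    print(f"{step['dataset']} ({step['backend']}): {status}")
```

In a real session you would pass os.environ as the second argument. Even this toy version shows the shape of the problem: two datasets, two backends, two separate credentials to obtain before the join can start.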

Final thoughts

These are the types of data science problems we’re building products to solve at Coiled. We’re really excited to be building products for scaling data science in Python to larger datasets and larger models, particularly for data scientists and teams that want a seamless transition from working with small data to big data. If the challenges we’ve outlined resonate with you, we’d love it if you got in touch with us to discuss our product development.
