Posts by Matthew Rocklin

A diagram of a multi-scheduler architecture in Kubernetes.

Dask in production: Multi-Scheduler architectures

I ran across an interesting problem yesterday: a company wanted to serve many Dask computations behind a web API endpoint. This is pretty common whenever people offer computation as a service or data as a service. Today the company uses the single-machine Dask scheduler inside a web request, but they were curious about moving …

Cloud Pricing

AWS computation costs roughly the following today:

            On Demand   Spot
  CPU hour  $0.04       $0.0125
  GiB hour  $0.0045     $0.0015

On top of that different services charge a premium:

                 Premium
  AWS EMR        40%
  AWS SageMaker  40%
  DataBricks     100%

However, when you pre-commit to a large allocation then you can usually negotiate this down, and get …
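As a rough illustration of how the rates quoted in the Cloud Pricing excerpt combine, here is a minimal Python sketch. The function name `estimate_cost`, the 4-CPU / 16-GiB machine shape, and the 10-hour duration are hypothetical, and the sketch assumes the service premium is applied as a simple multiplier on the base compute cost, which the excerpt does not spell out.

```python
# Rates copied from the excerpt above (USD per hour).
RATES = {
    "on_demand": {"cpu_hour": 0.04, "gib_hour": 0.0045},
    "spot": {"cpu_hour": 0.0125, "gib_hour": 0.0015},
}

# Service premiums from the excerpt, expressed as fractions of the base cost.
PREMIUMS = {
    "none": 0.0,
    "AWS EMR": 0.40,
    "AWS SageMaker": 0.40,
    "DataBricks": 1.00,
}


def estimate_cost(cpus, gib, hours, market="on_demand", service="none"):
    """Estimate a bill in USD (hypothetical helper).

    Assumption: the premium multiplies the whole base compute cost.
    """
    rate = RATES[market]
    base = hours * (cpus * rate["cpu_hour"] + gib * rate["gib_hour"])
    return base * (1 + PREMIUMS[service])


# Hypothetical workload: 4 CPUs and 16 GiB of memory for 10 hours.
print(f"on demand:       ${estimate_cost(4, 16, 10):.2f}")
print(f"on demand + EMR: ${estimate_cost(4, 16, 10, service='AWS EMR'):.2f}")
print(f"spot:            ${estimate_cost(4, 16, 10, market='spot'):.2f}")
```

Under these assumptions the same workload costs about $2.32 on demand, $3.25 with a 40% premium, and $0.74 on spot, which gives a feel for how much the choice of market and managed service moves the total.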

The Unbearable Challenges of Data Science At Scale

Scaling Data Science is a Team Sport: An increasing number of organizations need to scale data science to larger datasets and larger models. However, deploying distributed data science frameworks in secure enterprise environments can be surprisingly challenging because we need to simultaneously satisfy multiple sets of stakeholders within the organization: data scientists, IT, and management. …

A diagram of the promise: big data plus data science team equals profit.

Distributed Data Science for Management

Summary: An increasing number of organizations need to scale data science to larger datasets and larger models. However, deploying distributed data science frameworks in secure enterprise environments can be surprisingly challenging because we need to simultaneously satisfy multiple sets of stakeholders within the organization: data scientists, IT, and management. Solving simultaneously for all sides of …

Distributed Data Science for IT Professionals

Scaling Data Science is a Team Sport: An increasing number of organizations need to scale data science to larger datasets and larger models. However, deploying distributed data science frameworks in secure enterprise environments can be surprisingly challenging because we need to simultaneously satisfy multiple sets of stakeholders within the organization: data scientists, IT, and management. …

A diagram showing how your cluster is not a laptop.

Distributed Computing for Data Scientists

Scaling Data Science is a Team Sport: An increasing number of organizations need to scale data science to larger datasets and larger models. However, deploying distributed data science frameworks in secure enterprise environments can be surprisingly challenging because we need to simultaneously satisfy multiple sets of stakeholders within the organization: data scientists, IT, and management. …

A Brief History of Dask

TL;DR: Dask, the open source package for scalable data science, was developed to meet the needs of modern data professionals. This post describes the evolution of the Dask project and how it meets the needs of people working with medium-to-large datasets across industries (such as energy, finance, and the geosciences) and basic research (such as …

PSF survey with Dask users highlighted

Who Uses Dask?

Motivation: People new to Dask often ask “Who uses Dask?” They typically mean one of two questions: (1) Do many people use Dask? (2) Do people in my field use Dask? The answer to both questions is “definitely”. Dask users are numerous and varied. The user community spans a wide range of applications and professions. This post …

Challenges of Scaling Data Science in Enterprise

Summary: Deploying data science and machine learning frameworks to data science teams is made complex by organizational constraints like security, observability, and cost management. This post lays out the challenges that arise when exposing scalable computing to data science teams in large institutions, and the enterprise infrastructure necessary to meet those challenges. We then finish …

Seven Stages of Open Software

This post lays out the different stages of openness in Open Source Software (OSS) and the benefits and costs of each. Motivation: “Open Source Software” is a hot term today. As a result, people are reasonably encouraged to open up their software. This is great, but means that the term “open source” can get a …

Coiled logo with slogan "Scaling Python Simply".

Announcing Coiled Computing

Last month I announced that I was forming a Dask company. This month I am pleased to announce Coiled Computing, a Dask company. This post outlines what this company will do, and the various choices that I’ve made in its configuration. You may also want to … Visit https://coiled.io. Follow @CoiledHQ on Twitter. What will …
