Dask Contributor Spotlight: Jacob Tomlinson
December 15, 2021
Dask is built and maintained by hundreds of people collaborating from around the world. In this series, we talk to some of these Dask contributors, discuss their journey into open source development, and hear their thoughts on the PyData ecosystem of tools. In this edition, we’re excited to chat with Jacob Tomlinson, Senior Software Engineer at NVIDIA and a core maintainer of the Dask project.
Jacob has been a key member of the Dask community for over 5 years, and has had a significant impact on almost all aspects of Dask — from tooling and project management, to documentation and community building. He primarily focuses on Dask deployment tooling for distributed systems, but also likes working on web technologies like the Dask Dashboard. When he’s not working on open source projects himself, Jacob helps others do so by writing blog posts and giving presentations about Dask, RAPIDS, and broader OSS development practices.
Reflecting on his career trajectory to this point, Jacob says:
“Before NVIDIA, I worked for the Met Office, the UK’s weather/climate government agency. I joined straight out of High School and worked my way up through QA and IT to eventually join an internal R&D team called the Informatics Lab. In small R&D teams, everyone must wear many hats, so I spent time working on DevOps tooling, Software Engineering, Cloud Architecture, Geoscientific Research and High Performance Computing.”
I’m sure Jacob’s journey will bring a smile to your face, so keep reading! 🙂
How did you get started with programming?
As a teenager, I was fascinated by computers. I got my first job at 13 just to earn money to upgrade parts in our family computer. Looking back, the early 00s were a strange time to be coming into programming. Windows XP had arrived, DOS and programming had been abstracted away, and desktop application programming seemed tied to monolithic IDEs like Visual Studio, which cost money and had a steep learning curve.
Building PHP web applications led me down the path to running Linux servers cobbled together out of spare parts, and getting to grips with Bash. This would lead me to Python as a systems language and eventually to building applications and libraries with it too.
Why is open source important to you?
I love open source because it is the opposite of the closed ecosystem I grew up in. The code for everything I run is available, and I can look at it and change it. If I find a bug in something I can fork the project and fix it.
It brings the ability to right-click a web page and view the source to every other application I use. Then I can open up my editor and make changes to it just like I would with websites 15 years ago.
There is something satisfying about finding a tool or application that does almost exactly what you need and then fixing it up a little to be perfect for your use case.
What open source projects do you contribute to?
At work my primary focus is RAPIDS and Dask. RAPIDS is a suite of GPU-accelerated open source Python tools that mimic APIs from the PyData stack, including those of NumPy, pandas and scikit-learn. These tools work well with Dask and enable advanced parallelism for analytics with out-of-core computation, lazy evaluation, and distributed execution across many GPUs in a cluster.
In my spare time, I maintain a Python chatbot framework called Opsdroid. Opsdroid was born as a weekend project out of my interest in automation and a desire to learn asyncio. I was also using and contributing to open source at the time but wanted to experience being a maintainer too. Today, Opsdroid integrates with many chat applications and connects to a range of third-party natural language inferencing tools to enable folks to build powerful chat-based automations in Python.
I am also just a serial contributor in general. Whenever I find a bug in something I’m using I try to raise a pull request. I always like to leave something better than when I found it.
How did you first get introduced to Dask?
In 2016, a couple of my colleagues flew to the US to attend SciPy. When they returned, they organised a seminar to play back their favourite talks from YouTube for others. One of those talks was “Dask Parallel and Distributed Computing” by Matthew Rocklin and Jim Crist.
The Met Office already had a relationship with Matt through Anaconda, and Dask had been in part inspired by a Met Office tool called Biggus, which provided out-of-core computation and lazy evaluation for NumPy. As a result of seeing this talk, the team who were developing Biggus and other related tools decided to switch it out for Dask across their projects.
Once Dask had been placed at the core of the Met Office’s Python tooling, the R&D team I was a member of started exploring how we could push it to its limits and deploy it on the HPC and cloud resources we had available.
How did you start contributing to Dask?
In 2017, I had been working on a project to use Kubernetes to enable researchers to scale their Jupyter and NumPy workloads from their local workstations onto cloud resources. As Dask had been replacing Biggus and NumPy at the core of the tools they were using, I started exploring how to run Dask on Kubernetes.
Through collaboration with partners including STFC and UC Berkeley, this work grew into the dask-kubernetes library and was moved under the Dask org on GitHub. Many of my early contributions were in maintaining that library, but this slowly grew into contributing to the wider Dask ecosystem.
What part of Dask do you mainly contribute to?
I continue to be the primary maintainer of dask-kubernetes, and I have also built and now maintain dask-cloudprovider, which provides native cloud integration. I tend to be the primary maintainer for other Dask deployment and packaging repos too, such as the Dask Docker images and Helm Chart.
I frequently contribute to Distributed and do a lot of work on the Cluster managers, Scheduler and Workers, and I make updates to the Dashboard from time to time. I also try to keep an eye on other Dask deployment projects that live outside of the core Dask and Distributed repos, including dask-jobqueue, dask-mpi, dask-yarn, dask-gateway, etc.
Why does being a contributor excite you?
Dask’s adoption over the last 5 years has been incredible. Knowing that code I’ve written is running on supercomputers at institutions like NASA is a particular point of pride. Empowering others through automating things has always been a key point of satisfaction, so knowing how easily folks can get up and running with Dask makes me happy.
Do you only develop on Dask, or do you use it too? If so, how does one affect the other?
Since leaving the Met Office I would no longer consider myself a user of Dask, but having the context and knowledge that I acquired as a user is invaluable to me. With Dask being a core component of RAPIDS, though, I find myself using it a lot when teaching others about RAPIDS.
Dask and RAPIDS are both tools that accelerate a user’s workflow, but ideally should be invisible to the user. So being able to understand what the user is actually trying to do helps me take a step back and make changes in Dask that will benefit them without them needing to worry about it.
What is your favorite part of Dask?
I’ve always thought of the Dashboard as Dask’s killer feature. Running any code that takes more than a few seconds is nerve-wracking, as you don’t know what is going on inside the black box or whether it has gone wrong.
Being able to open up the dashboard and see progress bars and the task stream ticking along gives you confidence that something is happening. Then being able to profile your code and explore what is taking the time is really powerful for improving performance.
Of course, there are many tools out there for observing and debugging code, but the fact that Dask gives this insight by default is amazing.
What are some things that you want to see improved in Dask?
Heterogeneous clusters are not well supported in Dask today. Much of the deployment tooling assumes each worker in your cluster will be identical. When deploying clusters manually you can mix different worker types and use annotations to pin tasks to those workers, but it means the user has to really care about Dask and, as I said earlier, Dask should be as invisible as possible.
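The idea of pinning tasks to particular worker types can be pictured with a toy matcher. This is a plain-Python sketch under stated assumptions, not Dask's scheduler or API: the names `Worker`, `Task` and `eligible_workers`, and the worker and task names, are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """A hypothetical worker that advertises the resources it has."""
    name: str
    resources: dict = field(default_factory=dict)  # e.g. {"GPU": 1}


@dataclass
class Task:
    """A hypothetical task annotated with the resources it needs."""
    name: str
    requires: dict = field(default_factory=dict)  # e.g. {"GPU": 1}


def eligible_workers(task, workers):
    """Return the names of workers that offer every resource the task requires."""
    return [
        w.name
        for w in workers
        if all(w.resources.get(key, 0) >= amount
               for key, amount in task.requires.items())
    ]


# A mixed cluster: two CPU-only workers and one worker with a GPU.
workers = [
    Worker("cpu-worker-0"),
    Worker("cpu-worker-1"),
    Worker("gpu-worker-0", resources={"GPU": 1}),
]

load_data = Task("load_data")                            # can run anywhere
train_model = Task("train_model", requires={"GPU": 1})   # pinned to GPU workers

print(eligible_workers(load_data, workers))
print(eligible_workers(train_model, workers))
```

The point of the sketch is the burden it places on the user: someone has to know which tasks need which hardware and label them by hand.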
I’m excited to think about workflows where perhaps a handful of functions in a computation graph would be heavily accelerated by a GPU and Dask knows how to provision and utilize that hardware automatically.
Dask also allows you to scale from one machine to thousands with ease. However, being able to scale so high means it can be very easy to spend money too. Increase one keyword argument in your cluster setup by 10x and your bill increases by 10x too. So providing better control over, and observability into, how much a computation will cost is becoming more and more important.
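The 10x arithmetic can be made concrete with a back-of-the-envelope estimate. This is a minimal sketch, assuming a flat hourly price per worker and a fixed runtime; the function `estimated_cost`, the rate of 0.50 per worker-hour and the two-hour runtime are made-up values for illustration, not real cloud prices or part of any Dask API.

```python
def estimated_cost(n_workers, hourly_rate_per_worker, hours):
    """Rough cluster cost: workers x price per worker-hour x hours running."""
    return n_workers * hourly_rate_per_worker * hours


HOURLY_RATE = 0.50  # hypothetical price per worker-hour
RUNTIME_HOURS = 2.0  # assumed to stay the same regardless of cluster size

small = estimated_cost(10, HOURLY_RATE, RUNTIME_HOURS)
large = estimated_cost(100, HOURLY_RATE, RUNTIME_HOURS)

print(f"10 workers:  {small:.2f}")
print(f"100 workers: {large:.2f}")
print(f"ratio: {large / small:.0f}x")
```

In practice a larger cluster may finish sooner, so the real bill depends on how well the computation scales. That uncertainty is part of why cost visibility matters.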
What do you see in the future of Dask? And, scalable computing in general?
Today, folks are working to scale from their local machine to clusters of machines. Dask very much trades in terminology like clusters, workers, schedulers, scaling, etc. I see this all melting away to being about threads and cores and lower-level parallelism.
In the future, I think users will shift further towards datacenter scale computing where we think about servers in a datacenter the way we think about cores on a CPU today. We have a pool or large resource that we can run applications on top of.
Cloud services like AWS Lambda are already going down this path but are targeting applications like web services, which involve calling the same cloud function many, many times. For large computation we still think about Virtual Machines and clusters: we run EC2 instances and connect them together with Kubernetes or some other service in order to run our code. Being able to run a large computation with the same level of abstraction as Lambda will allow users to focus less on the computer science and more on whatever they are actually employed to do, be that data science, finance, geoscience, medicine, etc.
Thank you, Jacob, for all your contributions to the Python open-source ecosystem, and especially, for building and maintaining Dask. We’re so grateful to have you be a part of our community!
Who would you like to hear from next? Let us know on Twitter @CoiledHQ or send an email to email@example.com.
Thanks for reading! And if you’re interested in trying out Coiled Cloud, which provides hosted Dask clusters, docker-less managed software, and one-click deployments, you can do so for free today.