Last week I finished my position at NVIDIA. Next week I form a company. This company will work to scale Python’s open source data science ecosystem, primarily using Dask.
This post mostly talks about the history of funding Dask, and then goes into the motivation for creating a company. I will follow up with plans for the new company in a future post.
Funding Dask, a brief history
Dask is an open source Python library that helps to parallelize other Python libraries to work on large datasets and on large clusters of distributed hardware.
As community driven open source projects go, Dask has been remarkably well funded. We’ve always had a few people (2-10) paid to work on Dask about half-time over the past five years.
While at Anaconda we pulled the money to support this from a diverse set of sources:
- The US Government, with an early DARPA grant
- Consulting revenue, at first from financial services companies that were Dask’s first institutional users, and then from a wider variety of institutions
- Philanthropic non-profits, in particular the Gordon and Betty Moore Foundation
- Margin on some of Anaconda’s other consulting projects
- US Research grants from NSF and NASA, with collaborations like Pangeo
(tip: this was the highest bang-for-buck out of all these options)
This money went to fund developers. These developers focused full time on adding features, fixing bugs, engaging in community ecosystem design, and answering user questions. This full-time funded development was critical to the success of Dask.
Then, about a year ago, we found a new source of funding:
- NVIDIA: A large hardware company that wanted to do a lot of work with Dask, but was much more comfortable hiring people internally than paying people externally.
Rather than try to change the culture of a large institution we went along with it, and I personally changed my employer to NVIDIA to build out this team.
“Same job, new employer” I often said, and in many cases this was true. My job was to maintain Dask, and also to make sure that the GPU work that the 50-person RAPIDS team was doing would interact well with Dask and with the surrounding Python data science community. I’m really proud of this work, and I’m honored to be a part of such a large effort.
We also hired six additional people to work on Dask specifically, along with a mix of GPU-Python ecosystem things.
Their work included the obvious Dask + GPU work:
- Dask + RAPIDS gives us GPU accelerated distributed dataframes
- Dask + CuPy gives us GPU accelerated distributed multi-dimensional arrays
As well as many community ecosystem improvements:
- You can build GPU packages on Conda-Forge now
- Dask’s Parquet reader is way nicer, with some changes upstreamed to Arrow
- It’s easier to build dashboards with Bokeh and JupyterLab
- We can trivially share data between pretty much any pair of GPU libraries like PyTorch, CuPy, RAPIDS, MPI4Py, MxNet.
(everyone except Tensorflow really (come on Google, catch up))
- High performance networking hardware like Infiniband is now available in Python due to bindings for UCX
- The Numpy API is easier to use on non-Numpy libraries
Today NVIDIA generally operates as a good partner to the ecosystem. Yes, they’re motivated to sell you GPUs,
but along the way they’re cleaning up a bunch of stuff, which is great. It’s like when someone new moves into the neighborhood and starts picking up trash and planting flowers.
The Dask team within RAPIDS within NVIDIA is going to continue working on these same problems and more. The first person we hired onto this team was my good friend and long-time colleague Ben Zaitlen, who has been managing this team for the last couple of months (honestly he does a better job at this than I ever did). Ben has been in the Python community for a
long time and I am thrilled knowing that he is leading this team into the future. More on my experience managing within NVIDIA in a future post.
Dask as a Funded Community Project
Dask’s development is distinct in that it has both …
- Been consistently funded to employ many people for years
- Operated completely in the open as a community governed and developed project
Most OSS projects that are built and maintained by for-profit entities feel different from those that are built and maintained by a volunteer workforce (like Numpy, Scipy, Matplotlib, Pandas, Scikit-Learn, and Jupyter for example).
In general, I’ve gotten the impression that the community trusts Dask in the same way that it trusts other PyData projects, despite Dask’s history of funded development. I am personally grateful for this trust. I believe that this is for many reasons:
- We hold discussions and make decisions in public
- We give non-Anaconda/NVIDIA community leaders full commit rights and ownership over decisions
- We engage in ecosystem discussions of protocols and design with other packages and make sure to avoid lock-in APIs
- We contribute to the surrounding ecosystem of packages and are generally good citizens
As an organizer for the Dask project I’ve been on the lookout for how to increase our maintainer pool and how to secure the kind of long-term baseline funding a project needs in order to employ maintainers long term. When I find those sources, I go after them. This is what motivated my move to NVIDIA a year ago and this is what motivates my move now.
A Dask Company Seems Right Today
Today I see large and well funded companies and institutions asking for help deploying Dask, and making it work well within their institution. They’re used to paying for these things (indeed, they often ask me for someone that they can pay for these things) so I figure it’s best to make an organization that they can pay.
There is a lot of demand for scalable Python with Dask today. I think that if we channel this demand carefully that it can have a strong positive impact on Dask, on Python, on open source software, and more importantly on the social and environmental problems that we all want to solve with that software.
(Looking for more? I’ll describe my plans in more depth in a follow up blogpost)