3 Things I Learnt at the Dask Summit
• June 1, 2021
The third week of May’21 was an exciting week for Coiled. We announced the general availability of Coiled Cloud, launched our new brand, announced Series A funding, and participated in the Dask Summit – something we’ve been looking forward to since last year.
The Dask Summit is a gathering of Dask users, contributors, and Dask enthusiasts in general. Everyone comes together to share knowledge and collaborate. This year, it was a remote gathering, and it was called the Dask Distributed Summit. (I really appreciate the pun there. 🙂)
You can learn about the event, sessions, and more at summit.dask.org.
This was my first Dask Summit and I loved so many things about it! This blog post highlights a few things I took away from the event.
Social space gather.town for fireside chats at the Dask Summit
Did you attend the Summit this year? We would love to hear about your experience! Let us know on Twitter @CoiledHQ.
Dask is helping communities make meaningful advances in GeoSpatial Imagery, the Finance Sector, and Astrophysics, among many others.
We know that Dask is used around the world across many different domains, as we have discussed in our blog, Who Uses Dask?. There was still something really impactful in seeing all of these use-cases first hand, and in one place. The Summit included presentations and workshops by geospatial researchers, financial data analysts, life science researchers, particle physicists, astrophysicists, and so many others.
The Geospatial workshop started with a demo of dask-geopandas and spatialpandas. These are powerful libraries that extend the capabilities of pandas and Dask to geospatial datasets. It was followed by a talk on Datashader, a data visualization library capable of handling large datasets.
After these, the conversation moved towards use-cases, and one specifically stood out to me. Stefanie Lumnitz shared how the European Space Agency (in collaboration with NASA) uses Dask-powered geospatial tools to study patterns of forest structures. This study is essential to understanding forest biodiversity and its impact on climate change.
Source: Scalable Geospatial Data Analysis with Dask, Dask Distributed Summit 2021
The Finance workshop included sessions by teams from Capital One, Barclays, and Two Sigma, who each discussed how they use Dask in their pipelines. The best part of these workshops was the discussion at the end. It was nice to see everyone using Dask in a specific domain come together to share their workflows, challenges, and solutions. I found this dialogue to be incredibly helpful. For example, at the end of the Finance workshop, there was a discussion on how financial institutions can better support and encourage open source contributions. All the workshop participants understood the industry dynamics around this problem, which led to valuable discussions that would not have been possible at a general gathering.
I also loved Michael Wood-Vasey’s talk on Dark Energy with Dask: Analyzing data from the Next Generation of Large Astronomical Surveys. Dark Energy is the energy that scientists think is responsible for the accelerated expansion of our Universe. With the help of Dask and Dask-powered visualization tools, scientists at the Vera C. Rubin Observatory LSST are studying terabyte-scale images of the Universe to learn more about Dark Energy. As someone who enjoys hobby-reading about astronomy, this was one of my favorite talks at the Summit!
Dask’s Internals are really interesting
Dask is a large project. It abstracts away a lot of tiny details to provide an effortless experience for its users, but there are cases where a deeper understanding of Dask can be helpful. This is especially true if you’re using Dask extensively in your workflows, where even the smallest performance gains have a compounding effect. I love that there is always so much more to learn about Dask, and I enjoyed seeing multiple sessions that focused on demystifying parts of Dask and helping participants level up their Dask skills.
My favorite workshop at the Summit was Hacking Dask: Diving Into Dask’s Internals by Julia Signell and James Bourbeau. They covered topics including advanced Dask collections, graph optimizations, and the distributed scheduler in detail. A lot of the things they discussed were completely new to me! I think this workshop is a must-watch for everyone.
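To give a flavor of what "internals" means here, below is a minimal sketch of my own (not taken from the workshop) that builds a tiny computation with `dask.delayed` and then peeks at the task graph behind it. The functions `inc` and `add` are made up for illustration, and the synchronous scheduler is chosen only to keep the example small and single-threaded.

```python
import dask


@dask.delayed
def inc(x):
    # Hypothetical helper: calling it records a task instead of running it.
    return x + 1


@dask.delayed
def add(x, y):
    # Hypothetical helper: depends on the results of two other tasks.
    return x + y


# Nothing is computed yet; Dask only builds a graph of tasks.
total = add(inc(1), inc(2))

# Every Dask collection exposes its graph through __dask_graph__().
graph = total.__dask_graph__()
n_tasks = len(graph)
for key in graph:
    print("task key:", key)

# Hand the graph to a scheduler; "synchronous" runs it in the current thread.
result = total.compute(scheduler="synchronous")
print("tasks:", n_tasks, "result:", result)
```

The same `compute` call could be pointed at the threaded or distributed scheduler instead; the graph itself stays the same, which is roughly why understanding it pays off across deployments.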
The workshop on Deploying Dask was also very interesting. I got a chance to learn how to deploy Dask on Kubernetes clusters, Hadoop clusters, and natively using dask-cloudprovider. In the talk on The Dask JupyterLab extension, Ian Rose demonstrated how to create custom layouts for Dask’s diagnostic dashboards, how to use the integrated Dask cluster manager, and what the team is working on next. Visualizations are fun!
Source: The Dask JupyterLab extension, Dask Distributed Summit 2021
Key Factors that Contributed to the Success of the Scientific Python Ecosystem
The Scientific Python ecosystem consists of libraries including NumPy, pandas, scikit-learn, Dask, and matplotlib, among many others, all interoperating seamlessly. The very first bricks to build this giant ecosystem were laid by Peter Wang and Travis Oliphant over a decade ago. You can learn more about how this came to be in the keynote Humble Scoping: The DNA of Open Innovation in SciPy and PyData.
Peter talked about how the original creators of these libraries did not set out to build all the tools themselves; they just wanted to solve a very specific problem. Then, they built interfaces so that others could develop any required solutions or capabilities around their library. This “minimum viable scope” or “humble scope” approach is at the foundation of our ecosystem.
At the beginning of the talk, Travis noted that they needed people who were not only passionate about the project but also willing to do the work. Peter and Travis discussed the origins of Conda, NumPy, SciPy, and Dask in particular. They also chatted about the importance of community and of a platform for the projects to work together, which led to the formation of NumFOCUS.
Source: Keynote by Peter Wang and Travis Oliphant, Dask Distributed Summit 2021
Talking about the history of Dask, Travis says: “It was a labor of love and a 20-year vision”.
At that time, Spark was the major tool for scaling, but it pushed people towards using the JVM. Matthew Rocklin came up with some great ideas for a Python framework for distributed computing, and the team at Anaconda decided to fund this effort. Peter and Travis reflect on how finding passionate individuals and then supporting them leads to the emergence of important projects like Dask.
Thank You for a Wonderful Summit!
Organizing a remote conference is challenging, but the Dask Distributed Summit organizers did a fantastic job!
- They covered multiple time zones, including Asia/Pacific timings. This meant I could attend some live sessions, even while at home in India!
- All sessions were recorded and were available to watch at the end of each day. I’m yet to watch some sessions, but I can now do so at my own pace.
- The social spaces and trivia games added an element of fun and a unique touch to the conference.
On behalf of the entire Coiled team, I’d like to extend a huge thanks to everyone who helped organize this conference – Dask contributors and maintainers, Summit volunteers, sponsors, the amazing team at NumFOCUS, all the speakers, and everyone who participated.
The Dask Summit was very special for us, here at Coiled. We love Dask, we contribute to it regularly, help maintain the project, and work to make Dask accessible to everyone. Keeping this in mind, we’re building Coiled Cloud to make it easy to deploy Dask on the cloud.