How Dask Helped Model Diseases and Specifically COVID-19
• June 23, 2021
This post will:
- Explain what disease models are
- Explain why High-Performance Computing is important to enhance disease models
- List advantages that Dask has over other High-Performance Computing frameworks
- Report how Dask helped simulation of the first multi-scale COVID-19 model
What are Disease Models?
Disease models used to be based on differential equations, and even today, most of the infectious disease models are based on the SIR model that models the number of individuals in a specific disease state where the disease state can be S=Susceptible, I=Infected, R=Recovered. There are many variations on this model, yet most of those are based on differential equations that can be numerically solved efficiently on one machine in a reasonable time.
The popularity of those models is based on work created early in the 20th century before computers were available and mathematical models were revolutionary. At that time, numerical solutions were difficult and people used logarithmic slide rulers to conduct numerical calculations. Nevertheless, those differential equation models still dominate the publication landscape and many variations of such models were published during the COVID-19 pandemic.
Generally, such population models, often called compartmental models, lack the ability to model individuals since they model quantities of individuals in a specific disease state. This deficiency and the availability of computing power led modelers to seek other types of models such as Agent Based Models or Micro-Simulation models. Those approaches model each individual in the population, and each individual can have parameters beyond the disease state such as age, or location, that can change during the simulation. The modeler typically models how those agents interact with their environment using rules or statistical expressions that depend on their parameters or interaction with other agents. This kind of simulation is time-consuming and often requires repeating the simulation multiple times since each individual has to be simulated. Don’t forget, this number grows with larger populations. Also, each time we repeat the simulation, we may reach other results since those simulations are often based on the randomness of that model and if a specific individual died or recovered before transmitting the disease. So to get an idea of how a simulation behaves on average we need to repeat the simulation many times. In some simulations, the disease may not even spread beyond patient zero, and in others, the disease will be super-spread. One advantage of these models is that we can see a distribution of possible scenarios in the results. However, the price is the intensive use of computing power. The need for computing power rises even further when considering multiple populations, considering the optimization of a model that adds iterations or considering an ensemble model that merges many models together.
The Reference Model and its Technology
The Reference Model for disease progression is an example of such a modern ensemble model. This model uses High Performance Computing (HPC) to run simulations for:
1. Multiple individuals in each population cohort
2. Multiple population cohorts
3. Multiple repetitions of each simulation
4. Multiple models in the ensemble
5. Multiple optimization iterations – each attempting to improve the model
The number of simulations can quickly become unreasonable for one computer to handle on one CPU core. Fortunately, those simulations are often embarrassingly parallel and can be easily assigned to many computers. So HPC can help with those simulations.
The Reference Model uses a simulation engine called the MIcro Simulation Tool (MIST) that has such HPC capabilities. Before MIST was developed, there were attempts to run simulations using PBS/Torque. However, this environment was quickly replaced by the SLURM workload manager that was a much more modern resource allocation system. Both SLURM and PBS/Torque had issues with defining dependencies between jobs, and those had to be shell jobs that were executed in a Linux shell environment. Those systems were aimed to organize many users using a cluster aimed by an organization such as a university. A user outside an organization that owned a cluster had to find other solutions, and write their own code to launch thousands of jobs and orchestrate those on more than one computer.
The Reference Model started with modest computing requirements that could be executed outside a large organization on one strong computer with 8 cores. Once simulations became unmanageable on one machine with 8 cores, the first solution was to execute a MIST simulation on the cloud using 20 machines for a total of 160 cores using StarCluster. StarCluster used Sun Grid Engine (SGE) , which was a very nice system that had a nice GUI to support installation. This was an easy HPC solution to install that could be quickly deployed on a local cluster of machines and on the cloud. It still had issues and required launching jobs from the operating systems, yet it was a good solution that MIST used for several years since it allowed running simulations in the cloud and on a local cluster without changing code.
However, this solution was not supported effectively on Ubuntu operating systems that were used by the author to create a local cluster. Meaning that the cluster had to remain on Ubuntu 12.04 to work and no solution was provided for newer versions of the operating system.
How Dask Helped and its Advantages
After a few years, a solution was required that would be much better and I remembered a presentation on Dask and decided to ask a question:
Perhaps the question was ill posed since the initial answer from Dask creator was “No”.
And indeed, Dask cannot replace all the things a system like SLURM is doing to allocate resources to many users. However, for the use case that I had where I needed to launch many Python jobs, it was a superior solution. After a few changes in the code, it was very easy to launch simulations on a cluster with the following advantages:
1. Only Ubuntu and Anaconda had to be installed without any resource management system like PBS / SLURM / SGE which requires difficult configurations.
2. It allowed running code on multiple operating systems by just installing the Anaconda Python distribution. It allowed running simulations on a Microsoft based development machine. And it worked almost the same way on all the machines.
3. Dask provided a nice web browser based monitor to follow the simulations without any special installations.
4. It allowed passing job dependencies within the Python code – this required very complicated shell scripting with other systems.
5. It allowed quick destruction of bad simulations which could take hours on SGE.
So, Dask was superior to a resource management system in a situation where there was only one user involved that needed a lot of resources. And indeed Dask was used for several years to run simulations and supported the most validated diabetes model known. For several years simulations were conducted on different machines with 4 cores, 8 cores, a local cluster with 16 cores, and a server with 64 cores. In all cases, the installation was trivial and the benefits were enormous.
Dask and the First Multi-Scale Ensemble for COVID-19
The true benefit of Dask became evident when the COVID-19 pandemic started and simulations needed to be executed on a much larger scale to model the United States with over 50 US states and territories from information within The COVID Tracking Project. To properly simulate COVID-19 in those US states, there was a need to repeat many simulations and models for many US states.
The Rescale Cloud provided credits from Amazon and Azure to run a simulation on 20 nodes of 40 cores for a total of 880 cores for 9 hours – almost a year of computation on one core. Dask proved extremely useful since it adapted easily to the new cloud environment with very little effort. The ability to self-deploy on a cluster using the `dask-ssh` command made things so much easier. Moreover, it allowed optimization of computation by sending Python modules to the different cluster nodes. The results were published in Cureus.
However, Dask proved itself to be even more resilient when a new ensemble model of COVID-19 was created and was executed on the Rescale Cloud using Amazon AWS credits and then on the High-Performance Computing Environment provided by the MIDAS Coordination Center, supported by NIGMS grant 5U24GM132013 and the NIH STRIDES program. The largest simulation executed on the MIDAS cloud used 36 nodes of 32 cores each for a total of 1152 cores for almost 49 hours. This simulation would take approximately 6.6 years on one core. Incidentally, the MIDAS cloud was provisioned using SLURM that allocated the nodes on the scaling cloud. So Dask worked on top of SLURM in this case. The time it took to deploy Dask in those different environments was reasonable and deployment was easy. Simulation results and detailed descriptions are published in this interactive paper.
The model output can compute best fitting infectiousness and mortality.
Currently Dask is an important engine that MIST uses to execute:
1. The most validated diabetes model known worldwide.
2. The only multi-scale COVID-19 ensemble model known.
Without Dask, executing those simulations would have been an ordeal involving many people. Once Dask appeared, all this was possible to be accomplished by mostly one person with little technical support. This is only a glimpse of what will be possible in the future.
Thanks for reading
We hope you enjoyed this post outlining an impactful Dask use case! Coiled extends a big thank you to Jacob Barhak for taking the time to write this article and contribute to the blog. You can find out more about Jacob and his work as a multidisciplinary researcher and developer on his website.
To learn more about Dask, click the button below.