It would be a tough sell to say that anyone worked harder around the clock than the news stations and hosts that covered the U.S. presidential election starting on Tuesday, November 3rd, and through the near-week that followed while the world awaited the final results. We seriously need to know what the training regimen was like for those folks leading up to the election (and what kind of coffee they were drinking). But…what else was training hard before the 2020 U.S. election?
The machine learning models predicting voter turnout, of course! The COVID-19 pandemic completely transformed the way the United States elected its next President in the 2020 Election. As a result, the process for predicting voter turnout had to be completely transformed as well.
Andrew Therriault, Ph.D., a Data Scientist and founder of Civin, answered the call when he created a series of machine learning models used by Bloomberg News to forecast voter turnout in the 2020 U.S. elections. The Coiled team had the opportunity to connect with Andrew and discuss how he leveraged Dask and Coiled Cloud to get the job done.
In this post, we will:
- Contextualize the need for ML in voter prediction for the 2020 election
- Summarize the methodology used to create Andrew’s model
- Outline how he used Dask and Coiled to get the job done
- Discuss Bloomberg’s predicted voter turnout and the results of the 2020 election
Biden vs Trump, Voter Turnout, and Machine Learning
Though numerous factors made the 2020 U.S. presidential election different from its predecessors, nothing transformed it structurally more than the coronavirus pandemic. COVID-19 led many states to change the voting process, especially when it came to early voting, both in-person and by mail. More than 100 million voters cast their ballots by mail or voted early in the weeks leading up to Election Day, more than double the previous record set in 2016. In most years, media outlets track vote-counting progress by looking at the number of Election Day voting districts that have reported results so far, but that isn’t a useful measure when most votes were cast ahead of time. With this in mind, Bloomberg asked Andrew to leverage his expertise and create a voter turnout forecasting model that estimated how many votes were expected in the election. These estimates were then used to give context for the vote counts being reported on election night and throughout the week that followed.
Anatomy of a Voter Turnout ML Model
Andrew modeled historical presidential election turnout rates (the proportion of voting-age citizens casting ballots) at the county level from the 2004, 2008, 2012, and 2016 elections to generate Bloomberg’s 2020 turnout forecasts. These estimates drew on each county’s demographic characteristics, voting rules, the competitiveness of the races, and turnout rates from past elections, all of which were combined using three different model algorithms:
- gradient-boosted decision trees,
- random forest regression, and
- extra-trees regression.
The predictions of each algorithm (using the latest data available for the 2020 election) were combined to form an average “ensemble” prediction that was more accurate than any individual prediction could be, as each algorithm handles different types of data differently and has its own strengths and weaknesses. These predictions were then re-aggregated to provide forecasts at the state and congressional district levels as well as the county level. Finally, the presidential forecasts were combined with models of down-ballot “drop-off” to produce more specific forecasts for Senate, House, and gubernatorial elections.
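To make the ensemble and re-aggregation steps concrete, here is a minimal sketch using scikit-learn. It is not Andrew’s actual code: the data is synthetic, and the feature matrix, state labels, population figures, and hyperparameters are all assumptions made up for illustration. It fits the three algorithm types named above, averages their county-level predictions, and then rolls the county rates up to the state level, weighting by voting-age population (an assumed but natural choice of weight, given how the turnout rate is defined).

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 300 historical county rows with 4 features
# and a known turnout rate, plus 50 "current" counties to forecast.
X_hist = rng.normal(size=(300, 4))
y_hist = np.clip(
    0.6 + 0.05 * X_hist[:, 0] - 0.03 * X_hist[:, 1]
    + rng.normal(scale=0.02, size=300),
    0.0,
    1.0,
)
X_now = rng.normal(size=(50, 4))
voting_age_pop = rng.integers(1_000, 500_000, size=50)  # hypothetical populations
state = rng.choice(["AA", "BB", "CC"], size=50)          # hypothetical state labels

# One model per algorithm family mentioned in the post.
models = [
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
    ExtraTreesRegressor(n_estimators=50, random_state=0),
]
for model in models:
    model.fit(X_hist, y_hist)

# Ensemble: a simple unweighted average of the three predictions per county.
preds = np.column_stack([model.predict(X_now) for model in models])
ensemble_rate = preds.mean(axis=1)

# Re-aggregate: expected votes per county, summed by state, divided by the
# state's total voting-age population (a population-weighted average rate).
expected_votes = ensemble_rate * voting_age_pop
state_forecast = {}
for s in np.unique(state):
    mask = state == s
    state_forecast[str(s)] = expected_votes[mask].sum() / voting_age_pop[mask].sum()

print(state_forecast)
```

Averaging is the simplest way to combine models; a weighted or stacked ensemble would be another option, and the post does not say which weights, if any, were used.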
Dask, PyData, and voter turnout prediction results
Where did Coiled come in? Andrew first attempted to build the model using the PyData stack locally (he was running models in scikit-learn and using cross-validated random hyperparameter searches for model tuning), but he ran into time bottlenecks. “It was a quick turnaround project with a lot of iteration,” Andrew told us. “As I got more data and tried out new features in the weeks leading up to election day, I moved the computation over to Coiled from my local machine so I could re-tune and refresh my models quickly.” Though Andrew only had to swap a few lines of code to get Dask scaled out with Coiled, it made his workflow “about 50x faster”. Incredibly powerful but, as Andrew noted, fantastically underwhelming to look at. Scaled compute so easy it’s boring: a sentiment aligned with Dask’s goal, from its inception, to invent nothing.
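For readers who haven’t seen this pattern, the sketch below shows what a cross-validated random hyperparameter search looks like in scikit-learn, and how the same search can be pointed at a Dask cluster through joblib’s `"dask"` backend. It is an illustration under assumptions, not Andrew’s code: the data is synthetic, the parameter grid is invented, and the helper name `tune` is hypothetical. The local run uses joblib’s `"threading"` backend so the example works without a cluster; the Dask and Coiled lines are left as comments because they need `dask.distributed` (and a Coiled account) to run.

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV


def tune(X, y, backend="threading", n_iter=5, seed=0):
    """Run a cross-validated random search under the given joblib backend."""
    # Hypothetical search space, chosen only to keep the example fast.
    param_distributions = {
        "n_estimators": [20, 40, 80],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 2, 5],
    }
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=seed),
        param_distributions,
        n_iter=n_iter,      # number of random parameter combinations to try
        cv=3,               # 3-fold cross-validation for each combination
        random_state=seed,
        n_jobs=-1,          # let the active joblib backend decide parallelism
    )
    # The backend is the only thing that changes between local and cluster runs.
    with joblib.parallel_backend(backend):
        search.fit(X, y)
    return search


# Small synthetic dataset (an assumption, standing in for county-level data).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 0.6 + 0.05 * X[:, 0] + rng.normal(scale=0.02, size=120)

search = tune(X, y)  # runs locally on threads
print(search.best_params_)

# To scale out, create a Dask client first and switch the backend:
#   from dask.distributed import Client
#   client = Client()                  # local Dask cluster, or for Coiled:
#   # import coiled
#   # client = Client(coiled.Cluster(n_workers=10))  # worker count is an assumption
#   search = tune(X, y, backend="dask")
```

The model and search code stay the same in both cases, which matches the “swap a few lines” experience described above.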
Check out Andrew’s tweets from his experience scaling out with Coiled:
When all was said and done, Andrew’s baseline forecast predicted that 63.8% of voting-age citizens would turn out in 2020, assuming overall turnout levels were what we would expect in a “typical” year. This predicted turnout rate was higher than turnout in 2004, 2012, and 2016 but lower than in 2008. Since they knew 2020 was not going to be a typical year, though, Bloomberg also included “high turnout” and “historic turnout” forecasts, which estimated what turnout would be in each state, county, and district if the overall turnout levels were substantially higher than in previous years. Early on election night, Bloomberg switched from the baseline forecast to the “high turnout” forecast, after early results from states like Florida showed much higher than normal turnout levels. Just before dawn the next day, the forecast was changed again, this time to the “historic turnout” forecast, after many states and counties reported turnout levels in excess of the totals predicted by the “high turnout” forecast.
As of Monday, Dec. 7th, Bloomberg reported that about 158.2 million votes had been counted. For comparison, the final Bloomberg forecast based on Andrew’s models gave an expected total between 157 and 165 million votes, so the counted total fell within the model’s predicted range. And in swing states, the models performed particularly well: the final results were within 1% of the forecast total in many key states such as Michigan, Georgia, Florida, and North Carolina. Overall, this project showed how machine learning can be a highly effective tool for forecasting election turnout, even in such an unpredictable year as 2020.
You can read the latest on the vote-counting here. For a deeper look at the methodology behind Andrew’s models, check out Bloomberg’s Explaining the Bloomberg News 2020 Election Turnout Model.
Turbocharge your data science
Thank you to Andrew for taking the time to share this Coiled use case with us! Be sure to give him a follow on Twitter for some excellent data science AND corgi picture content.
You can check out Coiled Cloud for free when you sign up by clicking below.