Creating a Better COVID-19 Vaccine Distribution Strategy: Part 1


I recently worked on a project with Alyssia Oh, Jesse Tao, Letty Wu, and Rahul Parab that attempts to improve COVID-19 vaccine distribution in California by allocating vaccines to areas where COVID-19 is more widespread rather than on population alone. Currently, the COVID-19 vaccines developed by Moderna and Pfizer are distributed purely on the basis of population, which is a serious problem when an outbreak happens. As the COVID-19 vaccine comes in 2 doses that have to be taken 3 weeks apart, we would like to predict the areas that are most likely to have an outbreak in 3 weeks' time and allocate more vaccines to them. Doing this should reduce the spread more effectively than assuming that outbreaks are only likely to happen in areas of high population.

Definitions for COVID-19 outbreaks are relative to the local context, so a working definition of “outbreak” is recommended for planning investigations. One recommended definition is a situation that is consistent with either of two sets of criteria:

During (and because of) a case investigation and contact tracing, two or more contacts are identified as having active COVID-19, regardless of their assigned priority.

OR

Two or more patients with COVID-19 are discovered to be linked, and the linkage is established outside of a case investigation and contact tracing (e.g., two patients who received a diagnosis of COVID-19 are found to work in the same office, and only one or neither of them was listed as a contact to the other).

In this first part of a two-part series, we will go over the Data Cleaning and EDA of our project. Part 2 will cover the Modeling as well as the conclusions we came to for the project.

Data Cleaning

Since real-world data is messy, we did some cleaning before moving on to EDA and modeling. We used the pandas library in Python to read in the CSV files we collected, then cleaned the data by filling in null values and merging datasets together. We also scraped the CDC site that tracks the number of vaccines administered per state, running the scraper as a cron job, because no existing dataset contained the daily values. The code used to scrape the site is shown below.

# Import libraries here
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep, time

# initialize what site we want to scrape
home_url = ''
browser = webdriver.Chrome(ChromeDriverManager().install())

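# open browser and get html code
browser.get(home_url)
sleep(5)  # assumed wait so the page can finish rendering; adjust as needed
html = browser.page_source

home_soup = BeautifulSoup(html, 'lxml')

# close browser
browser.quit()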

states = home_soup.find_all('tr')

# put the info we need in a dictionary
states_info = []
no_header = states[1:]
# each table row holds one <td> per column; the final column is assumed to be at index 8
ind_state_info = [{'state': j.find_all('td')[0].text, 'total_doses_distributed': j.find_all('td')[1].text,
                      'total_doses_administered': j.find_all('td')[2].text, 'doses_distributed_100k': j.find_all('td')[3].text,
                       'doses_administed_100k': j.find_all('td')[4].text, 'people_1+_dose': j.find_all('td')[5].text,
                       'doses_1+_100k': j.find_all('td')[6].text, 'people_2_doses': j.find_all('td')[7].text,
                       'doses_2_100k': j.find_all('td')[8].text} for j in no_header]
states_info += ind_state_info

# save our important info to dataframe and save as csv
final_info = pd.DataFrame(states_info)
final_info.to_csv(f'../data/vaccinations_{int(time())}.csv', index = False)
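The scraper stores every table cell as text, so the counts still need converting to numbers before they can be analyzed. The snippet below is a minimal sketch of one way to do that, not the code from our notebooks: the helper name `clean_scraped_counts` and the sample values are made up for illustration, and it assumes counts use commas as thousands separators.

```python
import pandas as pd

def clean_scraped_counts(df, text_col="state"):
    """Return a copy where every column except text_col is numeric."""
    cleaned = df.copy()
    for col in cleaned.columns:
        if col == text_col:
            # only trim stray whitespace on the label column
            cleaned[col] = cleaned[col].str.strip()
            continue
        # drop thousands separators, then turn anything unparseable into NaN
        as_text = cleaned[col].astype(str).str.replace(",", "", regex=False).str.strip()
        cleaned[col] = pd.to_numeric(as_text, errors="coerce")
    return cleaned

# made-up rows in the shape the scraper produces
sample = pd.DataFrame({
    "state": [" Alaska ", "California"],
    "total_doses_distributed": ["174,400", "4,906,525"],
    "total_doses_administered": ["96,000", "N/A"],
})
cleaned_sample = clean_scraped_counts(sample)
```

Using `errors="coerce"` keeps placeholder cells such as "N/A" as missing values, so they can be handled explicitly later instead of breaking the conversion.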

Here are some links to notebooks of the data cleaning process we went through:

Cleaning the vaccine and mask data

Cleaning population and vaccine data

Cleaning the web scraped data

Cleaning the vaccine allocation data

Cleaning the hospital data
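To give a feel for the two cleaning steps mentioned earlier (filling in null values and merging datasets), here is a small sketch with pandas. The column names and numbers are invented for the example, and filling missing counts with 0 is an assumption; the right fill value depends on why the data is missing.

```python
import pandas as pd

# made-up county-level rows; column names are assumptions for illustration
cases = pd.DataFrame({
    "county": ["Alameda", "Alpine", "Fresno"],
    "new_cases": [120.0, None, 85.0],
})
population = pd.DataFrame({
    "county": ["Alameda", "Alpine", "Fresno"],
    "population": [1_600_000, 1_100, 1_000_000],
})

# fill missing counts (assumed here to mean "no cases reported")
cases["new_cases"] = cases["new_cases"].fillna(0)

# merge on the shared key; validate raises if a county appears twice on either side
merged = cases.merge(population, on="county", how="left", validate="one_to_one")
```

The `validate="one_to_one"` argument is a cheap guard against duplicated keys silently multiplying rows during the merge.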

EDA and Visualizations

After cleaning all our data, we moved on to EDA and visualizations to see which features we could use in our model to predict COVID-19 outbreaks. The first thing we looked at was vaccine distribution and administration per state, to visualize our problem statement. Below we can see that distribution from the US government to individual states is based on population, with the exception of Alaska.


Here is a visual showing the total vaccinations for the top 10 states, with their populations on top.


Next we wanted to see how many vaccines are actually being administered to people once a state obtains them. Here we see that California has administered less than half of its vaccines, while Michigan has administered 70% of its vaccines.
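The share administered is simply doses administered divided by doses distributed. A minimal sketch of that calculation is below; the column names match the keys used by the scraper, while the states and numbers are placeholders rather than real figures.

```python
import pandas as pd

# placeholder values, not real state figures
doses = pd.DataFrame({
    "state": ["State A", "State B"],
    "total_doses_distributed": [1000, 2000],
    "total_doses_administered": [700, 900],
})

# percentage of distributed doses that have actually been given to people
doses["pct_administered"] = (
    doses["total_doses_administered"] / doses["total_doses_distributed"] * 100
)

# rank states from fastest to slowest rollout
ranked = doses.sort_values("pct_administered", ascending=False)
```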


To see why the administration of vaccines was so slow, we looked at the number of people that had received a vaccine from January 16th onward. Interestingly, from January 16th to January 20th we don’t see any change, and then we see a sudden spike afterwards. One concerning thing, though, is that the number of people that have received their 2nd dose is lower than expected.


Afterwards, we looked to see if a survey conducted about mask use in each county was indicative of an area being more susceptible to COVID-19 outbreaks. This turned out not to be the case, as the discrepancy between counties was very small and wasn’t helpful. As many outbreak statistics are reported per 100k population, we decided to scale our data the same way. Below are some plots showing the 7 day rolling average of newly confirmed cases and new deaths per 100k population.

Newly Confirmed Cases

Deaths
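For readers who want to reproduce the scaling, the per 100k rate is the daily count divided by the county population and multiplied by 100,000, and the 7 day rolling average is then taken within each county. The sketch below shows one way to do this in pandas. The function name `per_100k_rolling` and the columns `county`, `date`, and `population` are assumptions for the example, not necessarily what our notebooks use.

```python
import pandas as pd

def per_100k_rolling(df, value_col, window=7):
    """Add per-100k and rolling-average columns, computed per county."""
    out = df.sort_values(["county", "date"]).copy()
    rate_col = f"{value_col}_per_100k"
    out[rate_col] = out[value_col] / out["population"] * 100_000
    # rolling mean within each county; the first window-1 days stay NaN
    out[f"{rate_col}_{window}d_avg"] = (
        out.groupby("county")[rate_col]
           .transform(lambda s: s.rolling(window, min_periods=window).mean())
    )
    return out

# made-up daily counts for a single county of 200,000 people
daily = pd.DataFrame({
    "county": ["A"] * 8,
    "date": pd.date_range("2021-01-01", periods=8),
    "new_cases": [10, 20, 30, 40, 50, 60, 70, 80],
    "population": [200_000] * 8,
})
result = per_100k_rolling(daily, "new_cases")
```

Grouping by county before taking the rolling mean matters: without it, the window would average across the boundary between one county's last days and the next county's first days.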

We then moved on to the hospital data to see how similar it was to the cases data. While the hospital data doesn’t line up perfectly with the cases data, it still follows a similar trend. We expected this, since someone who gets COVID-19 doesn’t necessarily have to stay at the hospital if their symptoms are minor. Below are charts showing the hospital data for each county, based on the total number of available hospital beds as well as the total number of available ICU beds.

Total Hospital Beds

ICU Beds

Finally, we looked at the total number of cases based on population density in each county and plotted a map of it as shown below.


If you want a deeper dive into our EDA process you can check out the notebooks below.

Cases and Mask Use EDA

Hospital Beds EDA

Vaccination Distribution by State EDA

Vaccine Administration in California EDA

COVID-19 Density by County EDA

That’s It For This Part

If you would like to check out part 2 of this series, click here.