Glad you’re excited for Part 2
Hopefully you have read Part 1 of this two-part series and are ready to check out the modeling process for our project. Let’s dive in.
To predict the number of COVID-19 cases in each county in California, we decided to build three models: a long short-term memory (LSTM) neural network using TensorFlow, an ARIMA model using pmdarima, and a SARIMA model using statsmodels.
As neural networks tend to overfit, we addressed part of this issue with dropout layers, L2 regularization in our hidden layers, and early stopping. The network has 5 layers in the following order: a 128-node LSTM input layer, a 64-node LSTM layer, a 32-node dense layer with L2 regularization of 0.001, a 16-node dense layer with L2 regularization of 0.001, and finally a single-node output layer containing the model’s prediction. Each hidden layer is followed by a dropout layer with a rate of 0.2.
One additional limitation of our LSTM setup was that we were only able to use one feature to generate our predictions, so we decided to use the number of hospitalized COVID-19 patients to predict the number of cases, as the two are highly correlated. Also, because we had to model all 58 counties, we did not grid search any hyperparameters to improve our models. Using this topology, we achieved a root mean squared error (RMSE) of 8.51 on our training data and 17.39 on our testing data. Below is a graph of the training vs. testing loss of our model over 124 epochs for Los Angeles County.
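The topology described above might be sketched in Keras roughly as follows. This is a minimal sketch, not the notebook code: the 14-day look-back window, the ReLU activations, the Adam optimizer, the early-stopping patience of 10, and the reading of "hidden layers" as the 64-, 32-, and 16-node layers are all assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import Sequential, callbacks, layers, regularizers

WINDOW = 14  # assumed look-back window in days (not stated in the post)


def build_lstm(window: int = WINDOW) -> tf.keras.Model:
    """LSTM(128) -> LSTM(64) -> Dense(32) -> Dense(16) -> Dense(1)."""
    model = Sequential([
        layers.Input(shape=(window, 1)),          # one feature: hospitalized patients
        layers.LSTM(128, return_sequences=True),  # hands the full sequence to the next LSTM
        layers.LSTM(64),
        layers.Dropout(0.2),
        layers.Dense(32, activation="relu",       # activation is an assumption
                     kernel_regularizer=regularizers.l2(0.001)),
        layers.Dropout(0.2),
        layers.Dense(16, activation="relu",
                     kernel_regularizer=regularizers.l2(0.001)),
        layers.Dropout(0.2),
        layers.Dense(1),                          # predicted number of cases
    ])
    model.compile(optimizer="adam", loss="mse",   # optimizer is an assumption
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model


# Early stopping halts training once the validation loss stops improving.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)

model = build_lstm()
```

Training would then pass `callbacks=[early_stop]` and a validation set to `model.fit`, with inputs shaped `(samples, WINDOW, 1)`.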
Of course, one of the challenges we faced while making predictions for each county was predicting drastic changes in the number of COVID-19 cases. A perfect example of this is LA County, where our model predicts the early data quite well but is late on the spike between November and December and doesn’t adapt well to the sudden decrease in the most recent data, as seen below.
For our ARIMA model, we used pmdarima’s auto_arima function to automatically pick the best parameters for us. The best parameters it found were p = 0 (no autoregressive lags), d = 2 (the series is differenced twice), and q = 2 (a moving-average term of order 2). Here we have the diagnostics for our ARIMA model. The residuals are fairly steady in the beginning, but there is a lot more variance in our forecast as time moves on. For our Q-Q plot, we are ideally looking for a straight line, which we almost have; however, the linearity breaks down at the extremes of the plot. The story is similar for our histogram plus estimated density and our correlogram, where we see that we can capture about 70-80% of the data accurately but can’t really predict extremes.
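To make the d = 2 term concrete: an ARIMA(0, 2, 2) model differences the series twice and then models each differenced value as the current error plus a weighted sum of the previous two errors. The short sketch below only illustrates the differencing step, using a made-up quadratic series rather than our county data; the helper name `difference` is hypothetical.

```python
def difference(series, d=1):
    """Apply first-order differencing (y[t] - y[t-1]) d times."""
    values = list(series)
    for _ in range(d):
        values = [later - earlier for earlier, later in zip(values, values[1:])]
    return values


# Toy series with a quadratic trend: 1, 4, 9, 16, 25, 36
quadratic = [t * t for t in range(1, 7)]

once = difference(quadratic, d=1)   # [3, 5, 7, 9, 11]: still trending upward
twice = difference(quadratic, d=2)  # [2, 2, 2, 2]: trend removed
```

Each round of differencing shortens the series by one value, which is why the twice-differenced list has four entries instead of six.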
Unfortunately, ARIMA did not improve much on our LSTM neural network: our predictions and the actual results still differ a lot, as can be seen below using LA County as an example again. The root mean squared error for our testing set using ARIMA is 22.89, which is worse than the 17.39 of our LSTM model.
Next, we moved on to testing a SARIMAX model to see if adding a seasonality factor to our time series model could improve it. Again, we tested a variety of parameters for p, d, and q to see which would be the best to use. In this case, our parameters are slightly different from our ARIMA model: p, d, and q are 0, 1, and 1 respectively. Looking at the diagnostics for our SARIMAX model, we again have a similar story to our ARIMA model. In most cases, it unfortunately looks worse than our ARIMA model, as can be seen below.
Again, we see that our predictions are far from the actual values, and SARIMAX actually does worse than both ARIMA and LSTM. Our RMSE when predicting cases for LA County was 41.43 using SARIMAX.
In the end, LSTM gave the best predictions for one of the counties with the most variance, so we decided to use it to predict the number of COVID-19 cases in every county.
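For reference, the comparison behind this choice comes down to RMSE, the square root of the average squared difference between actual and predicted values. The sketch below defines it in plain Python and picks the lowest of the three test-set scores reported in this post; the function and variable names are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Test-set RMSE values reported in this post.
reported_rmse = {"LSTM": 17.39, "ARIMA": 22.89, "SARIMAX": 41.43}

# Lower RMSE means predictions sit closer to the actual values.
best_model = min(reported_rmse, key=reported_rmse.get)
```

Because errors are squared before averaging, a few badly missed spikes raise RMSE much more than many small misses do.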
Obviously, looking at these predictions for 58 different counties and making a decision isn’t easy, so I’ve decided to display these predictions graphically on a map: Click here to see the map
If you want to see in depth how all our models were built, as well as our LSTM model for all 58 counties, check out the following notebooks:
Generating the Recommended Vaccine Distribution Numbers for each County
After building our models to predict the number of COVID-19 cases in each county, we estimated the number of vaccinations that could be administered per day. We then built a model that recommends the number of vaccines that should be distributed to each county for the week. A table for the week of January 31st, 2021 can be found here.
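The recommendation step is not detailed here. As a hypothetical illustration only, and not the model we built, one simple approach would split a weekly supply in proportion to each county's predicted cases while never giving a county more than it could administer that week. The function name, county names, and all numbers below are made up.

```python
def allocate_doses(predicted_cases, daily_capacity, total_doses, days=7):
    """Split total_doses by predicted cases, capped at each county's capacity.

    Doses that a capped county cannot use are re-shared among the others.
    A few doses may stay unallocated because shares are rounded down.
    """
    caps = {c: daily_capacity[c] * days for c in predicted_cases}
    allocation = {c: 0 for c in predicted_cases}
    open_counties = {c for c in predicted_cases
                     if predicted_cases[c] > 0 and caps[c] > 0}
    remaining = total_doses
    while remaining > 0 and open_counties:
        total_cases = sum(predicted_cases[c] for c in open_counties)
        handed_out = 0
        for county in sorted(open_counties):
            share = int(remaining * predicted_cases[county] / total_cases)
            give = min(share, caps[county] - allocation[county])
            allocation[county] += give
            handed_out += give
        if handed_out == 0:
            break  # only rounding leftovers remain
        remaining -= handed_out
        open_counties = {c for c in open_counties if allocation[c] < caps[c]}
    return allocation


# Made-up inputs: County A expects more cases but both can give 100 doses a day.
plan = allocate_doses(
    predicted_cases={"County A": 300, "County B": 100},
    daily_capacity={"County A": 100, "County B": 100},
    total_doses=1000,
)
# plan == {"County A": 700, "County B": 300}
```

In this toy case County A's proportional share of 750 doses exceeds its weekly capacity of 700, so the extra 50 doses go to County B.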
To sum up, after a thorough analysis of a variety of COVID-19 data in California and of vaccine distribution, we were able to build a model that recommends how vaccines for the virus should be allocated by county based on future outbreaks rather than just population. However, implementing this model might not be wise due to the inaccuracy of our county-level case predictions. Forecasting is extremely difficult, and even top institutions with much more data have a hard time forecasting the pandemic. Using this model could lead to a county receiving too little or too much of the vaccine, which is dangerous when people’s lives are at stake.
Another thing we would want to question about our model is its ethics. Can we justify using a purely mathematical model that ignores all human emotions? Should there be a priority queue for certain people to get the vaccine, such as the elderly and healthcare workers? Given more time, we would like to account for these considerations as well as others. You can find ways we thought of to improve our model here.