Post #1101: New statistical analysis: When the facts change I change my mind. What do you do?

Posted on April 7, 2021

This is a quick redo of the analysis in Post #1092.  That was a simple regression using factors such as vaccination rates and prevalence of the new COVID-19 variants to try to explain the recent trend in daily new COVID-19 cases.  The data points were U.S. states.  If you want the CAVEATS and the background, look back to that post.

The brief summary is that once you have enough states to look at, the picture snaps into focus.  It’s no longer muddy.  It’s exactly as it has been described by our public health establishment.  With (now) 40 states’ worth of data, it certainly does appear that we’re in a race between the new, more-infectious COVID-19 variants, and overall immunity of the population via prior infection and immunization.

The major change between that analysis and this one is sample size.  In the past few days, CDC published updated information on the incidence of the new COVID-19 variants.  Before, they showed it for just 17 states.  Now they’ve updated that information by a couple of weeks, and published it for 40 states.

And the results of a simple regression analysis now seem vastly clearer and a lot closer to what you’d expect.  Now, with a much larger sample of states, and more recent data, the standard story that is being told about the U.S. fourth wave of COVID simply pops out of a simple statistical analysis.

First, with 40 states in the picture, there’s a strong correlation between the U.K. variant and growth of new cases.  That’s almost surely due to the inclusion of Michigan and Minnesota in the analysis, as those are the only two states that you might readily characterize as “having an outbreak”.

Secondarily, but reassuringly, where I have information for all 50 states (second line), the correlations look much like the correlations for the subset of 40 for which I have the CDC data.  That suggests that the 40 states provide a reasonably good representation of the full set of states.

Second, when I put these key factors into a “regression analysis”, to try to assess their independent effects, the results begin to mirror the conventional wisdom about the fourth U.S. COVID wave.  It’s a race between more infectious variants and immunity.

If you don’t do this sort of thing professionally, I’m sure that looks like a meaningless jumble of numbers.  (And, as with any statistical analysis of observational data, it may in fact be just that).

Let me try to walk you through what this says.

First, not shown on this chart, the thing I’m trying to explain is the month-to-month change in new COVID-19 cases per capita, by state.  In effect, I’m contrasting states that saw their new case rates go up, compared to one month ago, and states that saw them go down.

Here’s a bunch of technical detail on how this works.  Let me minimize this because nobody ever cares about the methodology.  I’ll do that with all the fine points of methodology below.

The “coefficient” column shows, in effect, the correlation between each of these factors and that new-case trend.  A positive number means that the factor is associated with higher new-case growth rates; a negative number means that it’s associated with lower growth rates.  And because this is “a regression”, each factor is adjusted for the presence of all the others.  E.g., if high-U.K.-variant states also happen to be cold-climate states, those two factors are now, in theory, sorted out independent of one another.
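For anyone who wants to see that “adjusted for the presence of all the others” idea in action, here is a minimal Python sketch.  It is not the Excel analysis from this post.  The data are synthetic, the variable names (`temperature`, `variant_share`, `case_trend`) and the “true” effects are made up for illustration, and it uses more rows than 40 purely so the illustration is stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 400  # synthetic rows, NOT real state data

# Hypothetical setup: colder places tend to have more of the variant.
temperature = rng.normal(0.0, 1.0, n_rows)
variant_share = -0.7 * temperature + rng.normal(0.0, 0.7, n_rows)

# Made-up "true" effects used only to generate the outcome.
true_variant_effect = 1.0
true_temperature_effect = -0.5
case_trend = (true_variant_effect * variant_share
              + true_temperature_effect * temperature
              + rng.normal(0.0, 0.5, n_rows))


def ols(y, *columns):
    """Ordinary least squares; returns coefficients with the intercept first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Variant alone: its coefficient also soaks up part of the climate effect.
alone = ols(case_trend, variant_share)
# Variant and temperature together: each is adjusted for the other.
together = ols(case_trend, variant_share, temperature)

print("variant coefficient, variant only:    ", round(alone[1], 2))
print("variant coefficient, with temperature:", round(together[1], 2))
print("temperature coefficient:              ", round(together[2], 2))
```

In this made-up setup, the variant-only coefficient overstates the variant effect, because cold climate and high variant share travel together.  Putting both in the same regression is what pulls the two apart.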

As I’ve put this together, you can’t directly compare the sizes of most of these coefficients to one another.  So, pay attention to the sign (positive or negative), not the size.

Then there’s “statistical significance”, as measured either by t-statistic or p-value.  All this does is filter out the noise.  If something is “not statistically significant”, traditionally taken as a p-value of 0.05 or higher, then there’s too great a chance that you could see a number like that just by chance.  Too great a chance that it’s “noise”, not “signal”.  And so you only really pay attention to variables where the p-value is below 0.05.  I’ve highlighted those in red.
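If you want to see where a t-statistic comes from, here is a rough Python sketch on synthetic data.  Again, this is not the regression from this post: `real_factor` and `noise_factor` are hypothetical, and the cutoff of about 2.03 is an approximate 5% two-sided critical value for a sample of this size, standing in for a proper p-value lookup.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40  # same order as the number of states; the data are synthetic

real_factor = rng.normal(size=n)    # hypothetical factor with a true effect
noise_factor = rng.normal(size=n)   # hypothetical factor with no effect at all
y = 1.0 * real_factor + rng.normal(0.0, 0.5, n)

X = np.column_stack([np.ones(n), real_factor, noise_factor])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ coef
dof = n - X.shape[1]                     # observations minus fitted parameters
sigma2 = residuals @ residuals / dof     # estimated residual variance
cov = sigma2 * np.linalg.inv(X.T @ X)    # covariance matrix of the coefficients
std_err = np.sqrt(np.diag(cov))
t_stat = coef / std_err                  # coefficient divided by its standard error

# Approximate 5% two-sided cutoff for roughly 35 to 40 degrees of freedom.
# An exact p-value would come from the t distribution instead.
T_CRITICAL = 2.03
names = ["intercept", "real_factor", "noise_factor"]
for name, b, t in zip(names, coef, t_stat):
    flag = "significant" if abs(t) > T_CRITICAL else "not significant"
    print(f"{name:>12}: coef={b:+.2f}  t={t:+.2f}  {flag}")
```

The point of the filter is that a factor with no real effect will usually, though not always, land below the cutoff, while a factor with a strong real effect clears it easily.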

To be clear, that doesn’t mean that those factors are in fact causing variation in the new-case trend.  All it means is that the correlation is clear enough that it probably didn’t occur merely by chance. 

(For example, if states with big outbreaks really push to get people vaccinated, you’ll see a big positive correlation between vaccination and new-case trend.  That doesn’t mean that vaccination causes COVID-19 cases.  It means that the level of vaccination was responding to the COVID-19 outbreak.)

Finally, let me be clear that this is hardly a “robust” analysis.  There’s a good chance that if you (e.g.) found another significant factor such as propensity to wear masks, and put that in here, all of these coefficients would change some.  But with 40 states now in the picture, it now looks like you can pick up a strong effect of the U.K. variant, and all of the effects have “the right sign”.  So I’m forging ahead. 

For those who care, the adjusted R-squared of this regression is about 0.66.  Something like two-thirds of the state-to-state variation in the change in new case rates was captured by these five factors.
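For reference, the usual textbook adjustment to R-squared can be sketched as below.  The function names are hypothetical, and the inputs assume 40 observations and five explanatory variables plus an intercept, which is my reading of the setup rather than something taken from the Excel output.  Under those assumptions, an adjusted R-squared of 0.66 corresponds to a plain R-squared of about 0.70.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjustment: penalize R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def plain_r2_from_adjusted(adj_r2, n_obs, n_predictors):
    """Invert the adjustment to recover the unadjusted R-squared."""
    return 1.0 - (1.0 - adj_r2) * (n_obs - n_predictors - 1) / (n_obs - 1)


# Assumed inputs: 40 states, 5 factors (not confirmed by the regression output).
implied_r2 = plain_r2_from_adjusted(0.66, n_obs=40, n_predictors=5)
print("implied plain R-squared:", round(implied_r2, 3))
print("round trip back to adjusted:", round(adjusted_r2(implied_r2, 40, 5), 3))
```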

And here’s how I read the results.

First, the U.K. variant was a strong predictor of an upward trend in COVID-19 cases per capita over this period.  But, outside of this statistical analysis, there is sound clinical and other evidence that this should happen.  (In the jargon, you have a theory that determines the path of causality in this case).  So, that background, plus the very low likelihood that you’d see this strong a correlation merely by chance (p < .001), gives you pretty good confidence that this is a real effect.

Second, there’s no such evidence for any impact of the California variant.  This also seems to match the background theory, because, at best, the increased infectiousness of that strain was half that of the U.K. strain.  So the available theory said that California should matter less, and the data say it doesn’t matter at all.  There’s no statistically significant correlation between the incidence of the California strain and an upward trend in new COVID-19 cases.

Third, note that there’s now an extremely strong effect of state-average mean temperature.  The negative coefficient says that the warm-climate areas of the country are, all other things equal, having much lower increases in daily new cases than cold-climate areas.

The reason this is now such a pronounced effect is that this is almost certainly the regression’s way of accounting for the difference between Michigan+Minnesota, versus Florida+Texas.  All of those states have high incidence of the U.K. variant.  The regression would otherwise predict high rates of new-case growth in all of them.  And it accounts for that North/South split by saying, O.K., warm climates seem to have lower rates.  That was true in the earlier version, but is much more pronounced here.

In theory, I should directly run a “cross term” here, including both temperature and variant incidence as a separate explanatory variable.  But I’m doing all of this in Excel, and I just don’t want to go to the trouble to gin all that up by hand.  This is good enough as-is to make the point that warm-climate states appear systematically different from cool-climate states.
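For the curious, a “cross term” is nothing more than an extra column holding the product of the two variables.  Here is a minimal Python sketch of the idea, on synthetic data with made-up effects.  It was not run on the state data in this post, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400  # synthetic rows, NOT real state data

temperature = rng.normal(size=n)
variant_share = rng.normal(size=n)

# Hypothetical truth: the variant pushes cases up, but less so where it is warm.
y = (1.0 * variant_share
     - 0.3 * temperature
     - 0.6 * variant_share * temperature
     + rng.normal(0.0, 0.5, n))

# The cross term: one new column, the product of the two existing ones.
cross = variant_share * temperature

X = np.column_stack([np.ones(n), variant_share, temperature, cross])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, b in zip(["intercept", "variant", "temperature", "variant x temp"], coef):
    print(f"{name:>15}: {b:+.2f}")
```

A negative coefficient on the product column is how a regression would say “the variant matters less in warm places”, rather than leaving temperature to carry that story on its own.  In a spreadsheet, the same thing amounts to adding one column that multiplies the two existing columns.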

Finally, there’s “immunity”, which is represented here by the fraction vaccinated against COVID-19, and the fraction that already tested positive for COVID-19 at some point.  Here, the signs on the coefficients now both point in the “correct” direction (more vaccination and more prior infection both predict lower rates of new-case growth), but the effects are much more tenuous than for the impact of the U.K. variant.

The fraction already infected has a statistically significant effect (p < .05), but the vaccination rate does not.  I think that more-or-less makes sense.  During the time period in question (between March 1 and April 1, more or less), people who were immune via prior infection were a vastly larger piece of the population than people who were immune via vaccination.  Particularly given the lag between vaccination and development of full immunity.  Let me see if I can dig up a “herd immunity” chart circa March 15:


At that time, somewhere around three-quarters of persons with any immunity to COVID-19 had that immunity due to prior infection.  Toss in the fact that the vaccination rate is muddled because it’s strongly affected by state policy (little rural states were better at it than large urbanized states).  Add in the lag between vaccination and full immunity.  And the upshot is that it’s no particular surprise that, at that time, the apparent impact of vaccination was much more tenuous than the impact of prior infection rate.

There’s a second, extras-for-experts reason that the fraction already infected might have a lower-than-expected association with the new-case growth rate.  It’s a classic example of an “errors-in-variables” problem.  What actually should affect the growth rate is the true fraction of people who have already been infected.  What we have to work with, by contrast, is the fraction who have tested positive.  The truly-infected number is some large multiple of the tested-positive number, perhaps five-to-one.  But that ratio almost certainly varies state-to-state, depending on (e.g.) state testing policy.  So the true number is more-or-less a multiple of the number we can actually get, plus a potentially large random “error”.  The presence of that large “error” in our “percent already infected” estimate will reduce the correlation between that and the new-case trend.  In theory, it will introduce some bias of unknown sign on that variable, and potentially on the other variables in the equation as well.  In practice, every time I’ve run across it and had some way to check it, the practical impact of a significant errors-in-variables problem was to bias the coefficient on the variable in question toward zero.  In other words, we’re probably looking at too small an impact of the prior-infection rate, because we’re using that persons-tested-positive variable as our proxy for the true-but-unknowable persons-actually-infected variable.
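That bias toward zero is easy to see in a simulation.  The sketch below is the simplest textbook case, and only an assumption about how the error behaves: a single explanatory variable, with purely random noise added to the true value.  The numbers are synthetic, and it says nothing about the actual ratio of infected to tested-positive persons.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000  # synthetic observations

# Hypothetical "true" explanatory variable and its made-up effect.
true_x = rng.normal(0.0, 1.0, n)
true_slope = -1.0
y = true_slope * true_x + rng.normal(0.0, 0.5, n)

# What we actually get to observe: the true value plus random measurement error.
observed_x = true_x + rng.normal(0.0, 1.0, n)


def slope(x, outcome):
    """Simple-regression slope: covariance over variance."""
    return np.cov(x, outcome)[0, 1] / np.var(x, ddof=1)


slope_true = slope(true_x, y)
slope_observed = slope(observed_x, y)
print("slope using the true variable: ", round(slope_true, 2))
print("slope using the noisy variable:", round(slope_observed, 2))
```

With error variance equal to the variance of the true variable, as assumed here, the estimated slope shrinks to roughly half its true size.  With several explanatory variables in the equation, the other coefficients can be pushed in either direction, which is the “unknown sign” caveat above.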

My conclusion is that what may look like a meaningless jumble of numbers actually captures most of the stylized facts of the current situation.  Once you have 40 states in the picture, the U.K. variant appears to have a huge impact — so long as you can separate out the warm-climate states from the cold-climate states.  The California variant doesn’t matter.  And the impact of immunity is beginning to be felt.  Mostly, up to now, immunity via prior infection.  Vaccination, as of mid-March, didn’t much matter.  But you can have a strong expectation that, as we move forward and the fraction immune via vaccination increases, that vaccination coefficient will rise and become a statistically significant predictor of lower rates of new COVID-19 cases.

Probably the least-defensible part of this analysis is the inclusion of mean state temperature.  There isn’t really a strong, well-accepted theory behind that.  But I think the past (and to some degree, present) clustering of trends by region suggests that there might be something there. After all, the U.S. second wave was mostly limited to hot-climate high-AC-use states.  The U.S. third wave clearly started in the cold, dry-winter-climate Midwest and Mountain areas.  And there is at least one potential theory behind that (Post #894, Post #895).

Source:  The American Society of Heating, Refrigerating and Air-Conditioning Engineers.  This is from the 2016 ASHRAE Handbook—HVAC Systems and Equipment (SI), Chapter 22:  Humidifiers.

So it does not seem far-fetched that climate might play some role.  That said, if I remove the temperature variable, I get qualitatively similar results, but smaller in magnitude and with lower levels of statistical significance.

So, differentiating between the warm-climate and cold-climate states certainly sharpens up the analysis quite a bit.  (E.g., the adjusted R-squared is 0.49 without temperature included, 0.66 with temperature included.)  Based on purely statistical criteria, you’d have to include that.  But that doesn’t prove that it’s actually correct to do so.  I’d appeal to the chart above before I’d include it simply based on “explanatory power”.

In short, now that most of the data is in hand, it really does look like this is a race between vaccination and the new COVID-19 variants.  And I now have to take back everything I said about there being no evidence to support that.

I’ll leave you with one of my favorite quotes, supposedly from an economist:  “When the facts change, I change my mind.  What do you do?”