Post #933: The CDC’s eight-times estimate.

Posted on January 1, 2021

This is a post about the count of U.S. residents who have had a COVID-19 infection, but were never formally diagnosed.

To get to the bottom line, the answer is, lots.  There are lots of people who have had a COVID-19 infection but have not been diagnosed.  The number of actual COVID-19 infections is several multiples of the cases that have been formally diagnosed.  But if you want much more accuracy than that, I think you’ll be disappointed.  Particularly if you want a state-level estimate, instead of one for the U.S. as a whole.

Why does this matter?  You need to know that before you can start talking about herd immunity. Presumably, most or all of the people who’ve had an infection are now immune, and won’t act as disease carriers.  Together, they are the group that will generate herd immunity to COVID-19.

In recent posts, I’ve been talking about herd immunity by sloppily applying a single U.S. estimate (from CDC staff) to all states.  By report, CDC staff have estimated that there are about eight actual COVID-19 cases for every one that’s been formally diagnosed.

Now I have a few questions about that estimate.  Did CDC staff actually say eight-to-one? (No, you actually have to calculate that yourself, from their paper, as about 7.7 total infections for every person with a positive test.)  How, exactly, did the CDC come up with that?  Are there other equally credible estimates?  How much does that vary across states and over time?  Does that include asymptomatic cases or not?

We’re not looking for pinpoint precision here.  The whole herd immunity concept is fuzzy when applied to humans, and this disease.  Plus, the goalposts for herd immunity are now moving (Post #928).  So I’m really looking for some information on just how gross an error I am likely to make by taking that one US CDC estimate (through September 2020) and simply slapping it onto the COVID-19 counts for every state.

I’m sure this is dry as dust to the average reader.  But I got a lot of interesting surprises.  That CDC estimate isn’t at all what I thought it was.  It’s not based on blood tests for COVID-19 antibodies, but instead is a cobbled-together estimate based in large part on (e.g.) voluntary reporting of flu symptoms on a single website.  (Say what?) Separately, the CDC also produces estimates of COVID prevalence by testing blood for COVID-19 antibodies, but they won’t produce any sort of national number from those.   Even though they painstakingly do those in all 50 states.  Again, say what? 

This whole mini-quest turned out a lot harder and a lot weirder than I expected.  I thought I’d share what I found.

Background, including relevant information on flu.

A lot of U.S. residents have had a COVID-19 infection, but were never formally diagnosed with COVID-19. These people don’t show up in any official counts of the number of COVID-19 cases.    Based on news reports, a November 2020 publication by CDC staff estimated that the ratio of actual to diagnosed COVID-19 cases was eight-to-one.  As of September 2020.

This finding, that total cases are vastly higher than diagnosed cases, is not unique to COVID-19.  For example, in any given year, the CDC’s estimate of total flu cases is typically a large multiple of the count of laboratory-diagnosed flu cases.  Just to give an example, CDC estimated that the ratio of actual cases to lab-diagnosed cases for the 2009 H1N1 flu was somewhere around 79-to-1.

That pool of undiagnosed COVID-19 cases includes people who were completely asymptomatic (never had any material symptoms).  It includes people who had symptoms but didn’t seek medical care and get tested.  Or they sought care, but didn’t get tested.  It also includes a fairly large number of individuals who got tested, but did not get diagnosed with COVID-19 due to the high false negative rate of the nasal swab/PCR test (see Post #859).

Again, there’s nothing unique in this regard about COVID-19.  All of those factors apply to seasonal flu as well.  On average, 16% of flu infections are asymptomatic cases.  There were about 2.5 adults with flu symptoms for every adult who sought care for flu (in 2009).  And the most common type of flu test (a rapid test with a throat swab) has a false negative rate of somewhere between 30% and 50% (50% to 70% sensitivity, per the CDC).

As it turns out, there are two (or three) completely different ways of going about estimating total cases of flu — or COVID-19 — whether diagnosed or not.

First, you can draw blood samples from some population and look for antibodies.  The idea is that most (but not necessarily all) individuals who have had a COVID-19 infection will still have the antibodies to it.  This is how most (but not all) countries have gone about estimating total cases, and how the US CDC provided some early, very preliminary estimates of total cases versus diagnosed cases.  (This blood-test approach is also how the CDC evaluates the effectiveness of flu vaccine every year (Post #741), and that’s how we know the fraction of flu cases that are asymptomatic.)

With that method, you just get a number that’s completely separate from your count of diagnosed individuals.  Presumably, it’s a much more trustworthy number.  And the ratio of total cases to diagnosed cases is just the ratio of those two completely independent numbers.

The huge weakness of this approach is the population.  In the U.S., for COVID-19, they didn’t randomly select individuals and ask for a blood sample.  They used blood that was collected for some other purpose (e.g., cholesterol test, health panel screening, routine testing) and tested it for COVID-19 antibodies.  So you have a sample, of something.  And you probably get a pretty accurate estimate of the fraction of those samples that have COVID-19 antibodies.  But it’s definitely not a cross-section of the U.S. population.

Second, you can start with your known count of cases, and inflate those up by some sort of estimates of under-reporting.  Some estimate of people who were infected, but never had symptoms.  Or an estimate of the fraction who had symptoms, but never went to the doctor.  Went to see a doctor but didn’t get tested.  And the fraction who got tested, but had a false negative.

With that method, you just get a number that’s a known multiple of your count of diagnosed individuals.  You don’t have any independent estimate of disease prevalence.  And all of the art is in figuring out some data-driven way to inflate your test counts to arrive at your estimate of total cases.

The huge weakness of this is, of course, those estimates.  Where, exactly, would you get an estimate for people who have COVID symptoms, but didn’t go to the doctor?  And where could you even think of getting an estimate for asymptomatic individuals?  It’s not as if you can ask people whether they didn’t have COVID-19 symptoms.

That is, in fact, exactly how the CDC currently estimates total flu cases each year.  As described on this CDC web page and subsequent links.  They take their data-based estimates of hospitalizations and outpatient visits for influenza-like-illness (ILI), then inflate those using some estimates of the factors listed in the paragraph above.

And this is how the CDC produced its most recent “eight-times” estimate of total COVID-19 cases.  They took the methodology that they use for flu, and modified it for use with COVID-19.

So that CDC “eight-times” estimate of total COVID cases doesn’t come from some independent data source.  It’s the count of people with a positive COVID-19 test, inflated for various factors to account for “under-detection” of COVID-19.

The CDC November 2020 “eight times” estimate.

You have to read the November 2020 CDC original research to see what they actually did, and what they actually said.  (In fact, you have to download the .pdf from that web page to get the details).

Even then, it’s impossible to figure out exactly what they did, from the publication.  Among other things, they typically assumed a range of estimates, then did the calculation using thousands of combinations of those numbers, drawn from those ranges. I’m going to ignore that nicety.

If you read through their Table 2, the bare bones of the calculation are pretty clear.  You start with the known count of positive COVID-19 tests, and then you keep inflating that for the cases that you think you are missing.  And, obviously, I’m rounding their numbers as I go through this, for the outpatient setting.

  • Positive COVID tests
  • x 1.12 (account for false negative rate for COVID PCR test)
  • x 2.00 (account for people who had symptoms, went to doctor, didn’t get tested)
  • x 3.00 (account for people who had symptoms, didn’t bother to go to doctor)
  • x 1.16 (account for people who never had symptoms).

When you multiply that all up, you end up with 7.8 times as many actual COVID-19 cases, as you had positive COVID-19 tests.  Which is almost an exact match for the 7.7 net total multiplier calculated from the final results of the article itself. I think that shows that this is, in fact, a reasonably fair representation of their methods.
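As a quick check on that arithmetic, here is a minimal Python sketch of the chain of multipliers.  The factor values are my rounded ones from the list above, not the paper’s exact figures, and the names in the code are just labels I made up for illustration.

```python
# Rounded outpatient multipliers from the list above.
# These are the post's rounded values, not the CDC paper's exact figures.
factors = {
    "false negatives on the PCR test": 1.12,
    "had symptoms, saw a doctor, not tested": 2.00,
    "had symptoms, did not see a doctor": 3.00,
    "never had symptoms": 1.16,
}


def total_multiplier(factors):
    """Multiply the individual inflation factors into one overall multiplier."""
    result = 1.0
    for value in factors.values():
        result *= value
    return result


multiplier = total_multiplier(factors)
print(f"Estimated infections per positive test: {multiplier:.1f}")
```

Because the factors simply multiply, the two big ones (the x 2 and the x 3) drive nearly all of the result.  The false-negative and asymptomatic adjustments barely move it.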

I’m going to provide some details, and point out some potential weaknesses of this.   Putting aside questions about the data sources (most of which are, at best, samples of convenience), one thing stands out as a potential bias in this calculation.

As an economist, if I had to point out the single largest potential for error, it’s that they do nothing to account for self-selection.  In other words, I believe they assume that those who saw a physician but weren’t tested were as likely to have COVID-19 as those who saw a doctor and were tested.  They assume that those who had symptoms, but didn’t see a doctor, were as likely to have COVID-19 as those who had symptoms and were seen in an outpatient setting.

In both cases, I think that probably leads to an over-estimate.  I would bet that doctors were in fact more likely to test those whose symptoms were more severe and most closely matched the COVID-19 profile.  And similarly, those who went to an outpatient setting for care probably had more severe symptoms, and a closer match to the COVID profile, or a known exposure to COVID, relative to those who didn’t bother to go see a physician.

And so, even if everything about this analysis was perfectly accurate (and it’s not), I’d expect this to give an upper bound on the true count of cases.
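To put rough numbers on that self-selection point, here is a hedged sketch.  It assumes the x 2 and x 3 multipliers come from "about half were tested" and "about a third sought care", and it adds a hypothetical parameter, relative_rate, for how likely the uncounted people were to have COVID-19 compared with the counted ones.  The paper, as I read it, effectively sets that to 1.0.  The lower values below are purely illustrative, not estimates from any data.

```python
def multiplier_with_selection(share_counted, relative_rate):
    """
    Inflation factor for one step of the calculation.

    share_counted: fraction of the group that was counted
                   (got tested, or sought outpatient care).
    relative_rate: COVID-19 rate among the uncounted, relative to the counted.
                   1.0 means "equally likely", which reproduces 1 / share_counted.
    """
    uncounted_per_counted = (1 - share_counted) / share_counted
    return 1 + relative_rate * uncounted_per_counted


TESTED_SHARE = 0.5        # about half of outpatients with symptoms were tested
SOUGHT_CARE_SHARE = 1 / 3  # about a third of people with symptoms sought care

# Illustrative relative rates only; none of these come from the CDC paper.
for relative_rate in (1.0, 0.75, 0.5):
    not_tested = multiplier_with_selection(TESTED_SHARE, relative_rate)
    no_doctor = multiplier_with_selection(SOUGHT_CARE_SHARE, relative_rate)
    print(f"relative rate {relative_rate:.2f}: "
          f"x {not_tested:.2f} (not tested), x {no_doctor:.2f} (no doctor visit)")
```

The only point of the sketch is direction: any relative rate below 1.0 shrinks both multipliers, which is why I read the published figure as an upper bound.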

Details follow.  In tiny type, because, let’s face it, on this website, nobody but me ever gives a crap about the details of the methodology.

1:  They started from the known count of diagnosed cases (positive PCR test).

2:  They assumed a roughly 11% false negative rate.  They got that by taking a single meta-analysis that listed false negative rates of anywhere from 2% to 21%, and just assuming a uniform (straight-line) distribution between those two numbers.  So the average false negative rate would have been about 11%.

If I were to guess, I’d guess that’s too low, but it doesn’t much matter in the overall calculation.  This would increase the number of true positives by about 12%.
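Here is a small sketch of how a range of false negative rates turns into an inflation factor.  It assumes the usual relationship (true cases = positives / (1 - false negative rate)), which is my reading rather than something the paper spells out.  The second half mimics, in a very crude way, the paper’s habit of drawing thousands of values from a range.  The seed and the number of draws are arbitrary choices of mine.

```python
import random

FN_LOW, FN_HIGH = 0.02, 0.21  # range of false negative rates cited above


def inflation_for_false_negatives(fn_rate):
    """If a share fn_rate of true cases test negative, true = positives / (1 - fn_rate)."""
    return 1.0 / (1.0 - fn_rate)


# Point estimate: plug in the mean of the uniform distribution.
mean_fn = (FN_LOW + FN_HIGH) / 2
point_estimate = inflation_for_false_negatives(mean_fn)

# Monte Carlo version: draw many rates from the range, average the factors.
rng = random.Random(0)  # fixed seed so the run is repeatable (arbitrary choice)
draws = [
    inflation_for_false_negatives(rng.uniform(FN_LOW, FN_HIGH))
    for _ in range(10_000)
]
simulated = sum(draws) / len(draws)

print(f"mean false negative rate:     {mean_fn:.3f}")
print(f"inflation at the mean rate:   {point_estimate:.3f}")
print(f"average inflation over draws: {simulated:.3f}")
```

Either way you land at an inflation of roughly 12% to 13%, which is why this step hardly matters to the final answer.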

3:  They estimated that a little over half of people who showed up with respiratory illness in the outpatient setting got a COVID test.  That’s based on two samples-of-convenience.  One is a set of (what I believe are) electronic medical records maintained by an IBM Watson system.  The other is self-reports based on persons who reported symptoms and care via a single internet website (COVID and You).

And so, as I understand it (though not explicitly stated), they assumed that the persons not tested actually had COVID at the same rate as the persons tested.  I think.  So this would increase the number of cases roughly two-fold (since half of the people who sought care weren’t tested).

4:  They estimated that about a third of individuals who had symptoms actually sought out outpatient care.  That’s again based on a self-reported sample of convenience (Covid and You and Flu near You).  I’m not really sure what flu is doing here, but that’s the citation.

I believe they assumed that those who didn’t seek care were as likely as those who did seek care to be true COVID-19 cases.  (Although that was not explicitly stated).  And so this would increase the count of COVID cases roughly three-fold.

5:  Separately, they had an estimate of the fraction of cases that were asymptomatic.  Looks like they assumed an average of about 16%.  This would then increase total cases by … a further 16% of the estimated grand total, I think.  Near as I can tell from scanning the references, that’s based on the prevalence of asymptomatic viral respiratory infections of all types.  (And so, it’s no great surprise that it matches the 16% for flu infections, cited earlier in this post.)
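That "I think" matters a little, because there are two ways to read a 16% asymptomatic share, and they give slightly different multipliers.  The sketch below shows both.  It isn’t clear to me which one the paper used.  The x 1.16 in my rounded list corresponds to the first reading.

```python
ASYMPTOMATIC_SHARE = 0.16  # average share assumed, per the text above

# Reading A: asymptomatic cases add 16% on top of the symptomatic count.
add_on_top = 1 + ASYMPTOMATIC_SHARE

# Reading B: asymptomatic cases are 16% of ALL infections,
# so symptomatic cases are the remaining 84% of the grand total.
share_of_total = 1 / (1 - ASYMPTOMATIC_SHARE)

# Product of the other three rounded multipliers used in this post.
OTHER_FACTORS = 1.12 * 2.00 * 3.00

print(f"Reading A: x {add_on_top:.2f}, overall x {OTHER_FACTORS * add_on_top:.1f}")
print(f"Reading B: x {share_of_total:.2f}, overall x {OTHER_FACTORS * share_of_total:.1f}")
```

The gap between the two readings is small next to the uncertainty in the other steps, so I wouldn’t lose sleep over it.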

The other CDC November 2020 estimate:  Seroprevalence of COVID-19 antibodies.

The CDC “eight times” estimate got all the headlines.  But in fact, in a completely separate effort, CDC staff also published data from which you could perform the same calculation, but in a much more straightforward manner.  That’s this entirely different November 2020 CDC research.

And although they caveat it, and say you shouldn’t use it to estimate national prevalence of disease, and so on, they provide all the data to allow you to do just that.  And when I did the obvious thing and aggregated their state data to the US, their seroprevalence data suggests something more like a four-times or five-times estimate.  That is, total COVID-19 cases were about five-times or four-times the number of diagnosed cases, as of their study periods (roughly August 1 for 5x, September 15 for 4x).

For this estimate, they have blood drawn by labs in all 50 states.  The blood was drawn for some other purpose, typically routine testing of some sort.  But the CDC paid to have it tested for COVID-19 antibodies.

Their earliest round of national data came from roughly August 1, 2020.  For that round, if you add up the numbers in their Table E2, you find about five times as many COVID-19 cases from their blood samples, relative to the count of diagnosed cases at roughly the same time.

In the final round, circa September 15, 2020, doing the same thing, you come up with about four times as many COVID-19 cases from blood samples as were diagnosed (at that time) from PCR tests.  It’s not completely clear that this represents an actual shift in the underlying data, or just a drift in the methods (which labs, which locations) between the two rounds.
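For anyone who wants to repeat that aggregation, the calculation has roughly the following shape.  Everything in this sketch is a placeholder: the three "states" and all of their numbers are invented to show the mechanics, and the function name is mine.  It also assumes you weight each state’s antibody rate by its population, which is how I’d do it but not something the CDC endorses.

```python
# PLACEHOLDER inputs only, invented to show the mechanics:
# (population, share of blood samples with antibodies, diagnosed cases).
states = {
    "State A": (1_000_000, 0.05, 10_000),
    "State B": (2_000_000, 0.02, 12_000),
    "State C": (500_000, 0.10, 8_000),
}


def national_ratio(states):
    """Estimated total infections (from antibodies) divided by diagnosed cases."""
    estimated = sum(pop * prevalence for pop, prevalence, _ in states.values())
    diagnosed = sum(cases for _, _, cases in states.values())
    return estimated / diagnosed


ratio = national_ratio(states)
print(f"Total infections per diagnosed case: {ratio:.1f}")
```

Note that the two totals come from completely independent sources, so nothing forces the ratio to be above one for any individual state.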

If you look at the site-by-site detail, it’s clear that you can’t trust the state-level estimates.  Too much must depend on the exact sample-of-convenience in each state.  In some states, for some rounds, they actually found fewer infections in the blood samples than had been reported via nasal-swab PCR testing.

Beyond the obviously-too-low data in some states, there are, I think, some behavioral reasons to think that these results understate true prevalence.  These people were all under a physician’s care, and most were presumably being tested for some type of chronic condition.  My guess is that the population getting their blood drawn is going to be more cautious about getting COVID-19 than the population in general.

Bottom line.

I can’t really say that either method represents a gold standard.  The most recent CDC “modeled” approach relies heavily on some fairly questionable data.  For example, a key parameter is based on a single website where individuals voluntarily self-report flu and COVID symptoms.  The seroprevalence data, by contrast, are a sample of convenience, consisting of blood that was drawn for some unrelated purpose (e.g., cholesterol testing) in an outpatient setting.  And it’s clear that small-area estimates from the seroprevalence data are all-but-unusable.

The upshot of this is that, even ignoring the potential for state-to-state variation, I need to temper any claim that (e.g.) North Dakota has probably achieved herd immunity.  It all hinges on the correct estimate of the fraction of cases that have gone undiagnosed in North Dakota.  And all I have for estimating that is a couple of not-very-good national estimates, with a lot of air between them.

This does not exhaust the available data sources.  I may come back and review the remainder of the CDC’s seroprevalence data at a later date.

Finally, for completeness, I should mention that there is yet a third way to get at this figure.  For the Wuhan epidemic, mathematical analysis fairly convincingly showed that there had to be a very large population of undiagnosed individuals, in order to account for the rapidity of spread.  So you can arrive at some sort of an estimate by “curve fitting” to the pandemic data — a purely statistical approach.  I don’t think anyone has done that for the U.S. pandemic.