Post #1309: Ivermectin

 

At the end of August, I got a request to look at the evidence regarding ivermectin as a treatment for COVID-19.  Before you scoff, let me be clear that you can’t just dismiss something like this out-of-hand.  That drug wasn’t chosen out of thin air.  There were sound, basic-science reasons to think that it might — emphasis might — interrupt a key part of the replication of coronavirus.

But I was stumped by what I saw.  I could not make head or tail out of the scientific literature on the topic.

That was unusual.  If you know what to look for, typically a question as straightforward as “does this drug work” has a reasonably clear answer.  Even if the answer is an unsatisfying “somewhat” or similar mediocre result.  At least the evidence will clearly show that mediocrity.

But in this case, the techniques that had served me so well in my career completely failed.  There seemed to be really compelling evidence pointing toward completely contradictory conclusions.  And in my orderly universe, that just shouldn’t happen.

Today, I finally understand why the literature on ivermectin and COVID was so abnormally confusing:  They lied.  “They” being the authors of the studies showing the strongest results for ivermectin as a COVID treatment.  Everything from including deaths in the control group while excluding them from the ivermectin group, to just plain Xeroxing blocks of data to make a tiny study appear to have had huge numbers of participants.  You can read about it in “The Real Scandal About Ivermectin”, by James Heathers, in this month’s issue of The Atlantic.

So now, belatedly, I guess I can offer an opinion: Nope, it doesn’t work.  Plus, liver toxicity makes it downright dangerous.  If you’re of a mind to self-medicate for COVID-19, try vodka instead.  It’ll be every bit as effective as ivermectin and you’ll end up with less liver damage.

And a lesson learned for me.  I’ve reviewed my share of scholarly publications in health services research.  I’ve seen things get published despite glaring errors.  Heck, I used to earn good money by pointing out the errors, to the people who really needed to know the underlying reality (as in this report, for example).  So in a very real sense, I profited from the lack of rigor in the standard scholarly review.

But those always seemed to be honest errors.  It was always an author who used some non-standard method, got some spectacular result, and let newsworthiness get in the way of objectivity or skepticism.  Or who didn’t understand some basic point of statistics.  In my entire career, I never ran up against results that I thought were deliberately fabricated.  Heard about it, but had never seen it.  And now I know better.


Strength-of-inference hierarchy as a way to sort through the medical literature on a drug.

Let me first narrow this down. 

In the social sciences, you get all kinds of statistical studies that purport to provide evidence of something.  For a lot of those, broadly defined, it’s extremely hard to say, one way or the other, whether any particular study is actually giving you useful information.

If you want a concrete example, just think about all the seemingly-contradictory “scientific” information you’ve seen about diet.  Take any non-extreme point of view, Google for studies showing that it’s good for you, and chances are, you’ll be able to find some.

There are concrete reasons why that literature, in particular, is such a mess.  I’ll just note my favorite one, which almost everyone ignores.  And that is, almost anything is healthier for you than the SAD (standard American diet).  And so, if you take a bunch of average Americans and put them on a diet — it doesn’t really matter which — sure enough, their health will improve.  Low-fat?  Low-sugar?  Vegetarian?  Paleo?  Mediterranean?  Low-salt?  High-fluid?  Anti-inflammatory?  Biblical foods only?

To a close approximation, the particular question doesn’t much matter.  The answer is “yes”.  Yes, a controlled plan of eating fill-in-the-blank is superior to uncontrolled consumption of SAD.  Feel free to fill in that blank with any non-extreme diet.

So, to be clear, in the outline below, we’re only considering a small subset of the medical literature, dealing with the idea of purposefully consuming some drug in order to treat a disease.

And so it boils down to a very simple question, which is: how sure can you be that taking that drug will actually improve the outcome of that disease?  What’s the strength of inference?

Finally, there’s a not-so-subtle issue that layers on top of this, in the practice of medicine.  And that’s whether people will actually use the proposed treatment, correctly, once prescribed.  I can never remember which term is which, but that’s the distinction between “efficacy” (does it work under ideal, controlled conditions) and “effectiveness” (does it work in actual practice).

Example:  Eat less and exercise more is absolutely an effective way to lose weight.  It’s just not something that most obese people will sustain.  Perfectly good advice, but not a solution for the obesity epidemic.  We’re abstracting from issues of that sort.

Without further ado, my hierarchy of strength-of-inference.

Here’s my list, for all the types of studies you might commonly see that try to answer the question “will taking drug X improve outcomes in disease Y”.  I’ve listed them from lowest to highest strength of inference.

Unlike the opening section of this post, all of this assumes that the research is honestly and competently performed and presented.

  1. Computer modeling of drugs for potential effectiveness in a disease.
  2. In vitro (“test-tube”) studies, as of cell cultures.
  3. Studies in animals (mice, rats, … primates).
  4. Non-randomized trials in humans, aka “observational data”.
    1. Cross-sectional.
      1. Cross-sectional without comparison group.
        1. Anecdotes (case studies).
        2. “Open-label” studies.
      2. Cross-sectional with comparison group.
        1. With small/no differences in behavior between the treatment and control groups.
        2. With large/obvious differences in behavior between the treatment and control groups.
    2. Time-series:
      1.  Pre-post comparison, no control group.
        1. Anecdotes (case studies).
        2. “Open-label” studies.
      2.  Pre-post comparison with control group.
      3. Crossover designs (swap treatment and control groups over time).
  5. Small-scale randomized trials, typically not double-blind.
  6. Large, properly-designed double-blind randomized trials.

The first and most obvious question is, why don’t people just do research “right”, that is, study things with randomized trials only?  There are a lot of reasons, but foremost is cost.  I believe that the recent COVID-19 vaccine trials had an average cost of somewhere between $5,000 and $10,000 per observation.  (And so, when you read about the “30,000-person trials” of these vaccines?  Think “third-of-a-billion-dollars”.  Not to develop the vaccine, or to manufacture it.  Just to test it at that scale.  This is why drug trials often migrate to developing nations, where costs are lower.  And why statisticians who can do an accurate “power test” are key to minimizing costs, because you really don’t want to pay for one more observation than you need to, in order to prove that your drug works.)
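
For a sense of what that “power test” (more formally, a power or sample-size calculation) involves, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions.  Every rate and cost plugged in below is a made-up number for illustration only, not a figure from any actual trial.

```python
# Rough sketch of the sample-size ("power") calculation behind a trial budget.
# All rates and costs below are illustrative assumptions, not real trial figures.
from math import sqrt, ceil
from statistics import NormalDist


def n_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate sample size per arm for detecting a difference between two
    proportions, using the usual normal-approximation formula."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)          # two-sided significance threshold
    z_beta = z(power)                   # desired power
    p_bar = (p_control + p_treated) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treated * (1 - p_treated))) ** 2
    return ceil(numerator / (p_control - p_treated) ** 2)


# Hypothetical example: 1% infection rate in the placebo arm during follow-up,
# 0.3% in the vaccine arm (roughly 70% efficacy).
n = n_per_arm(0.010, 0.003)
total = 2 * n
print(f"About {n:,} participants per arm, {total:,} total.")
print(f"At $5,000-$10,000 per observation: ${total * 5_000:,} to ${total * 10_000:,}.")
```

Every observation you can trim from that calculation is real money.  Real trials typically end up larger still, since they also have to pin down the size of the effect and accumulate safety data, not just clear a bare yes/no threshold.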

The second, and maybe not-so-obvious, reason is that there are typically a lot of potential candidates, at least at the start of research.  And so, weeding those out and/or selecting them in matters greatly, from the standpoint of both feasibility and cost.  Drug companies typically test a whole lot of stuff, at some level, before they come up with one that seems to work.  They have to be pretty sure something will work before they’ll shell out the big bucks for a randomized trial.

The third reason is that all of the “observational” methods are widely available.  Any physician can feel free to write up an interesting case study (a patient or a handful of patients).  Anyone with access to electronic medical records or (in my case) abstracts of bills can do a comparison of individuals who did and did not receive a particular treatment.  And then publish the results.

It’s not that these non-randomized methods are useless.  They are a legitimate path toward identifying what works and what doesn’t.  They can provide some useful information, in some cases.  It’s just that they have low strength-of-inference.  They might give you a hint regarding some particular drug/disease combination.  But they can’t prove the point.  Which is why, for example, the FDA requires randomized controlled trials before it will approve a new drug in the U.S.

A final reason is that in some cases (not necessarily this one), observational data is all you can get, and all you ever will get.  It would be difficult to do a randomized controlled trial of (say) whether air bags save lives in head-on collisions.  There, the only thing you can do is appeal to first principles, or maybe compare equivalent crashes, from equivalent models, in the pre-air-bag era.

And so, what you typically find is that research works its way through this hierarchy over time, from the weakest methods toward the strongest.  You start with some vague notion that some class of drugs might have an effect.  You mix them up with some cultured cells, in a test tube, to see if that’s even remotely plausible.  If possible, you try them on some cheap lab animals.  You ask whether anyone is using those drugs now, and you get the observational data to see whether or not there’s any effect apparent.  And then, when you’ve got some notion that the drug actually works, you test it formally with an expensive randomized trial.

Place research on that hierarchy, then look for the signature of publication bias.

The best advice I ever came across on doing a literature review was from Light and Pillemer. They argue strongly for a quantitative review of scholarly literature.  It’s not enough just to give the usual text recitation-of-study-after-study.  You need to make some judgement about the likely accuracy or strength-of-inference of each study, and plot the results accordingly.

In the ideal case, you literally pull the key numbers from each study and plot them on a graph.  You start with the earliest, lowest-quality, least-certain studies on the left, moving to the highest-quality, strongest studies on the right.  If there is something to be found — some underlying true-and-real effect — the sorted results will converge to an answer.  Light and Pillemer present that as a “funnel plot”.  The reported answers will be all over the place among the less-reliable studies, but should narrow down as you move toward more reliable studies.  So that the resulting plot is funnel-shaped.  And if the answers don’t settle down, then chances are that there is nothing there.
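
For concreteness, here is a minimal sketch of that kind of plot, following the description above, with entirely made-up study results.  The point is the shape, not the numbers.

```python
# Sketch of a Light-and-Pillemer-style quantitative review plot.
# Each point is one hypothetical study: its reported effect, plotted against
# a crude strength-of-inference score.  The data are fabricated purely to
# show the funnel shape; a real review would pull these from the papers.
import matplotlib.pyplot as plt

# (strength-of-inference score, reported effect size) -- made up
studies = [
    (1, 0.90), (1, -0.40), (1, 0.75),
    (2, 0.60), (2, -0.20), (2, 0.45),
    (3, 0.30), (3, 0.05),
    (4, 0.15), (4, 0.10),
    (5, 0.08),
    (6, 0.05),
]

scores = [s for s, _ in studies]
effects = [e for _, e in studies]

plt.scatter(scores, effects)
plt.axhline(0, linestyle="--", linewidth=1)
plt.xlabel("Strength of inference (weakest to strongest)")
plt.ylabel("Reported effect size")
plt.title("If the effect is real, results narrow as study quality rises")
plt.show()
```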

In this case, if you are looking at nonsense research results, you’ll see a big break between the results at Level 4 and the results at Level 6 of the hierarchy above.

And that’s my main red flag: the big about-face in the results when you finally get a randomized clinical trial.  It’s a red flag for three reasons: shotgun research, publication bias, and the tyranny of the t-statistic.

Shotgun research:  Up through Level 4 in that hierarchy, anybody can take a shot at it.  In the computer era, it doesn’t cost much to (e.g.) re-process electronic medical records data.  But the methods are inherently error-prone.  Or, to rephrase, a lot of people have the opportunity to take a shot at research, but they are all using guns of dubious accuracy.

Publication bias:  But if, by chance, you happen to hit the bullseye, then that’s publishable.  If you do 20 studies of a particular effect, then, just by chance, one of them is going to find a result that is “statistically significant at the 5% level”.  (That’s what the 5% level means.)  And that one will get published.  Because positive findings, i.e., rejecting the null hypothesis, are the way the entire process works.
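
To put a rough number on that, here is a back-of-the-envelope calculation, under the simplifying assumption that the 20 studies are independent and the drug truly does nothing:

```python
# If 20 independent studies test a drug with no real effect, each at the
# 5% significance level, how often does at least one "significant" result
# appear purely by chance?  (Independence is an assumption for illustration.)
alpha, k = 0.05, 20
expected_false_positives = alpha * k            # about 1 study
p_at_least_one = 1 - (1 - alpha) ** k           # about 64%
print(f"Expected false positives among {k} studies: {expected_false_positives:.1f}")
print(f"Probability of at least one false positive: {p_at_least_one:.0%}")
```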

The tyranny of the t-statistic:  What’s worse, you will tend to get those chance “statistically significant” findings only when you get a really spectacular result.  For a given amount of random variation in the data, it’s the extreme results that will generate the coveted “t-statistic > 2” and pass the standard test of statistical significance.

The upshot?  What you end up seeing published, up to Level 4 in the hierarchy, is not just studies that say a drug works, but studies that say a drug works magnificently well.  This screening process tends to produce a body of literature showing absolutely phenomenal, promising results, purely by chance.
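
Here is a small simulation of that selection effect, under assumed conditions (tiny two-arm studies of a drug with zero true effect): the studies that happen to clear the t > 2 bar are exactly the ones reporting the most exaggerated results.

```python
# Simulate many small two-arm studies of a drug with ZERO true effect, and
# look at what the "statistically significant" ones report.  The sample size
# and the zero effect are assumptions chosen purely for illustration.
import random
import statistics

random.seed(0)

def one_study(n=20):
    """One small study: n treated vs. n control, outcome in standard-deviation units."""
    treated = [random.gauss(0, 1) for _ in range(n)]
    control = [random.gauss(0, 1) for _ in range(n)]
    diff = statistics.mean(treated) - statistics.mean(control)
    se = (statistics.variance(treated) / n + statistics.variance(control) / n) ** 0.5
    return diff, diff / se          # estimated effect, and its t-statistic

results = [one_study() for _ in range(10_000)]
winners = [abs(diff) for diff, t in results if abs(t) > 2]

print("True effect of the drug:                      0.00")
print(f"Share of studies clearing |t| > 2:            {len(winners) / len(results):.1%}")
print(f"Average |effect| reported by those studies:   {statistics.mean(winners):.2f}")
```

By the publication-bias logic above, it is only that last, inflated number that readers ever see.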

(Of course I have an anecdote here.  It’s about losing a client.  My client had several different small-scale observational studies of his medical device.  Each study measured outcomes along several different dimensions.  He wanted me to build a “value model” showing the benefits hospitals would get from this device, by taking all the best results, along each dimension, from among all of the studies.  In other words, gather up all the bullseyes and ignore the rest.  I refused, and explained why that would be misleading, using the arguments you’ve just read above.  I got fired.)

And then those spectacular results evaporate in light of a proper randomized double-blind experiment.  That complete about-face between observational data studies and randomized controlled trials is the signature of publication bias.

Don’t misinterpret this.  If a drug actually works, then (by and large) it may sail through those first four levels of the hierarchy as well.  That may be the only way it gets entered into a randomized trial.  But it’s only when the results don’t flip around, at the randomized trial stage, that you can be truly confident that you’re looking at a real impact, and not just artifacts of unreliable methodology and publication bias.

The canonical crash-and-burn.

This leads to what I call the canonical crash-and-burn.  This is a pattern that you will see repeated for many of the promising-but-ultimately-ineffective treatments for COVID-19.

It’s not that you see early research suggesting some drug provides a modest benefit, and then that benefit can’t be found in a proper randomized trial.  It’s that, for the combination of reasons outlined above, you always hear about drugs that seem to offer GAME-CHANGING RESULTS, and then those results disappear once you do a randomized trial.

And when you see that, believe the randomized trial, not the prior research.

This is where all the hard-core believers in these drugs just fail to understand the process.  To them, it’s all research.  They literally don’t grasp the hierarchy above.  So when they see those spectacular results early on, they stick with that.  They think, well, hey, some say yes, some say no, who’s to say?  How could you have gotten those spectacular “yes” results if it didn’t work?  And so, when a proper randomized trial shows nothing, they don’t understand that the randomized trial results aren’t just another bit of evidence; they replace everything weaker on the hierarchy.

The key to much of this is understanding that you only get to see what makes it into publication.  Drugs have to pass those in-vitro tests before anybody will bother to test them on animals (if appropriate).  By-and-large, they have to pass those animal tests before they become available for human use.  Practically speaking, those observational studies in humans have to show results before anybody will bother to publish them.  Separately, small-scale observational studies are likely to have the required “t-statistic greater than 2.0” when they, by chance, identify some seemingly-spectacular results.

And as a result, everything about the research and academic publication industry means that results bubble up to the point where they hit the popular press if and only if the results look pretty damned good.  The gooder, the better.

And then reality intervenes, in the form of a reliable randomized double-blind controlled trial.  Which may or may not confirm the results based on the less-reliable methodologies.

And if not, then the house of cards collapses.

Except for the population that prefers to continue to live in that house.  Largely consisting of people for whom all research is more-or-less hocus-pocus.  Who have no way to differentiate research based on a strength-of-inference hierarchy.

And who are probably out there right now, shopping on-line for ivermectin with a side order of hydroxychloroquine.

Ultimately, the message is that when the facts change, you need to change your mind.  And when spectacular results appear to evaporate in the face of an actual randomized trial, that means the results were imaginary to begin with.  No matter how much you want them to have been real.