How to Irritate Regulators: A Primer
Janet Wittes (Florida Atlantic University, janet@wittesllc.com)
“…I wouldn’t stand by and see the rules broke – because right is right and wrong is wrong, and a body ain’t got no business doing wrong when he ain’t ignorant and knows better.”
Tom’s lesson to Huck in The Adventures of Huckleberry Finn – Mark Twain
Highlights
What are the important elements that can make or break a trial's success? Reflect on how to design and execute randomized controlled trials that not only meet regulatory standards but also yield reliable results.
What are some common strategies that can irritate regulators and jeopardize your trial's approval? From ambiguous hypotheses to operational deficiencies, what critical mistakes should you avoid?
Dive into two real-world case studies: the 1993 NEJM paper on respiratory syncytial virus immune globulin and the recent study on pimavanserin in dementia-related psychosis. What went wrong, and what lessons do these examples teach us?
As we statisticians are wont to do, I start with a set of assumptions so that you, the reader, know where I am coming from. I assume that you are designing, carrying out, or analyzing data from a Phase 3 randomized controlled trial conducted by a pharmaceutical company. You are honest; you do not commit fraud; and you are after truth. You know that the purpose of your trial is to assess the efficacy of Drug D in Population P with Condition C, but you are really thinking, “This trial is designed to show that D is efficacious in P with C.” You come to the project armed with new statistical methods at your fingertips and the promise that artificial intelligence (AI) can help lead you on your path to truth. You know that to land in the regulators’ good graces the study should have a design that addresses the question of interest; the protocol should be as unambiguous as you know how to make it; the statistical analysis plan (the SAP) should be rigorous but not so opaque that only statisticians can understand it; the study sample size should be large enough to provide reliable answers to the questions posed; and the team must carry out the study nearly perfectly. If all this is true, and if D really benefits P with C, the answers will emerge as you expect. You, with the rest of the study team, will write the Clinical Study Report and the primary manuscript carefully, following the SAP precisely. If you conduct analyses not identified in the SAP, or if you modify the methods even slightly, you will forthrightly declare the changes as post hoc. You will report p-values only when they reflect statistically valid tests, and you will present numbers to a reasonable degree of precision.
But sometimes (and all too often) the ideal does not hold. You may have found yourself involved in a study that someone else had designed before you joined the team, and the protocol, the SAP, or both were, in your mind, flawed. The study operations may have had problems beyond your control. Some of these deviations will irritate the regulators who will review your application. You know that if you try to hide them, the regulators will find them and be even more irritated, because they do not appreciate your shading the truth. If you had wanted to irritate them even more, you should have designed your study without an unambiguous hypothesis. That would have allowed you to tweak your hypothesis to make it fit your data. You could have written the SAP in a way that is unclear, to give yourself lots of wiggle room. You could have selected a sample size too small to produce reliable answers to the questions the trial poses. You could have winked at the sloppiness of the operations by, for example, not pressing investigators to do all they could reasonably do to prevent missing data. And then, if the answers from your study differ from what you expected (e.g., the p-values are not below 0.05), you could rely on the inconsistencies and lack of precision in your documents to construct an analysis, find a subgroup, and search for an outcome with a comfortably small p-value. You may be able to use some of your many new statistical methods to find the most convincing result. Perhaps you ask your AI tool to help you find a summary that you hope will convince the regulators.
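To see in miniature why that search so reliably succeeds, here is a hypothetical simulation sketch of my own (the 20 looks, the sample sizes, and the normally distributed data are arbitrary choices, and real subgroup analyses within one trial are correlated rather than independent): even when the drug does nothing at all, examining 20 subgroups or outcomes produces at least one p-value below 0.05 roughly two times out of three.

```python
# Hypothetical illustration (my sketch, not part of the original talk):
# how often does searching across many null subgroups or outcomes turn up
# at least one p-value below 0.05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)
n_looks = 20      # arbitrary number of post hoc subgroups/outcomes examined
n_per_arm = 100   # arbitrary sample size per arm for each comparison
n_sims = 2000     # simulated "trials" in which the drug truly has no effect

# Analytic answer if the looks were independent: 1 - (1 - 0.05)^k
print(f"Analytic chance of at least one p < 0.05: {1 - 0.95 ** n_looks:.2f}")

hits = 0
for _ in range(n_sims):
    smallest_p = min(
        stats.ttest_ind(rng.normal(size=n_per_arm),
                        rng.normal(size=n_per_arm)).pvalue
        for _ in range(n_looks)
    )
    hits += smallest_p < 0.05

print(f"Simulated chance of at least one p < 0.05: {hits / n_sims:.2f}")
```

With correlated looks the inflation is smaller, but it does not go away.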
When you write your Clinical Study Report and your manuscript, you could avoid identifying what had been prespecified and what was ad hoc. You might toss out observations that make no sense to you without questioning whether the apparent anomaly had a reason (maybe the use of a different unit). Another approach might be to use all observations in the database even when some are ridiculous. One trick I have seen comes from a large Contract Research Organization. Suppose an observation that is obviously incompatible with life has an attached text notation that says, “This blood sample came from a cow.” Use the data and ignore the text. “I don’t manipulate data,” you say to yourself self-righteously. When confronted with the absurdity of the number, defend yourself with, “My program doesn’t read text.” (Confession: I am only slightly exaggerating this story; the cow was from one study, the quotation from another.)
You know that you will irritate the regulators, and feel uncomfortable yourself, if you sprinkle your report with p-values, drawing particular attention to the small ones, even those that are data-dredged. Pretend that saying, “Please interpret this with caution,” absolves you of responsibility. Use lots of digits for your p-values to emphasize how certain you are. (Why should anyone believe p=0.001452 is an accurate representation of a probability?) If you feel your argument is not strong enough, deliver the coup de grâce: an appeal to biology to defend the clinical and statistical meaningfulness of your post hoc p-values. Lest you think that no one would use these strategies, the following two examples may change your mind. The first dates from 1993; the second from over a quarter century later.
In 1993, the NEJM published a paper on the use of respiratory syncytial virus (RSV) immune globulin as prophylaxis for RSV infection in high-risk infants and young children [1]. The paper described the primary endpoints as follows:
The sample size was determined on the basis of two primary end points: reduction in the incidence of lower respiratory tract infection caused by respiratory syncytial virus and reduction in the severity of respiratory syncytial virus disease.
The paper provided no discussion of the planned analysis of the two doses or the two endpoints. The tables in the article presented many p-values (18 in Table 3; 8 in Table 4), some of which were below 0.05. An editorial in the same issue of the journal was very positive: “RSV – successful immunoprophylaxis at last”. Two weeks later, the FDA Blood Products Advisory Committee voted against approval. The FDA, on reviewing the data, did not approve the product. In a letter to the editor, Ellenberg et al. [2] explained the reasons for the FDA’s rejection. The letter pointed out that the paper failed to describe the method of randomization. The paper said the analysis was by intention to treat, but eight children, seven active and one control, had been removed after randomization. Moreover, 17 children whose caregivers had signed an informed consent form were not included. Importantly, the study was not blind, so the failure to include so many children could not be attributed to chance. Clearly, the authors had irritated the FDA. As a postscript, a subsequent study in children with cardiac problems [3] showed benefit, and the product was approved for them.
The second example comes from a much more recent study, one that examined the use of pimavanserin (the D) in dementia-related psychosis (the C) [4]. The population P was complicated. On one level, it was composed of people with a history of dementia-related psychosis regardless of the underlying type of dementia. On another level, it comprised people with Alzheimer’s disease dementia, Parkinson’s disease dementia, dementia with Lewy bodies, frontotemporal dementia, and vascular dementia, all of whom had a history of dementia-related psychosis. What is the P? Is it a single P with the five types of dementia simply being subgroups, or is it five populations with an umbrella diagnosis of dementia-related psychosis? Interestingly, the NEJM paper did not report data by dementia subgroup. Importantly, prior to this study, pimavanserin had received a label for hallucinations and delusions associated with Parkinson’s disease psychosis, not exactly the same as dementia-related psychosis, but pretty close.
The study at hand was a randomized withdrawal trial. All study participants had received open-label pimavanserin for 12 weeks. Then those who were eligible were randomized 1:1 to a double-blind, placebo-controlled trial with time to relapse of psychosis as the primary outcome. The FDA presented all the data below at an advisory committee meeting in 2022 [5].
The trial had a prespecified interim analysis with a stopping guideline of p=0.0066. At the interim analysis, the data showed a hazard ratio of 0.35, a 95% confidence interval of (0.17, 0.73), and a p-value of 0.005. The analysis included the 194 participants who had completed the study at the time of the interim; 23 others had been randomized but had not been followed long enough to be included. On the DSMB’s recommendation, the sponsor stopped the study, declaring efficacy. Whether the DSMB looked at the data by type of dementia does not seem to be a matter of public record.
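For readers who want to connect the reported summary statistics to the stopping guideline, here is a minimal back-of-the-envelope check of my own (not the sponsor's analysis); it assumes a normal approximation to the log hazard ratio and recovers its standard error from the width of the published 95% confidence interval.

```python
# My own consistency check (not the sponsor's analysis): recover the
# z-statistic implied by the reported hazard ratio and 95% CI, and compare
# the corresponding p-value with the prespecified stopping guideline.
import math
from scipy.stats import norm

hr, ci_low, ci_high = 0.35, 0.17, 0.73   # interim results reported above
guideline = 0.0066                       # prespecified stopping guideline (p)

# Standard error of log(HR), back-calculated from the width of the 95% CI.
se_log_hr = (math.log(ci_high) - math.log(ci_low)) / (2 * norm.ppf(0.975))
z = math.log(hr) / se_log_hr
p_two_sided = 2 * norm.sf(abs(z))

print(f"log HR = {math.log(hr):.3f}, SE = {se_log_hr:.3f}, z = {z:.2f}")
print(f"implied two-sided p = {p_two_sided:.4f}; "
      f"below the guideline of {guideline}? {p_two_sided < guideline}")
```

The arithmetic reproduces the reported p-value of about 0.005 and confirms that the boundary was crossed; the interesting question is what that prespecified analysis did not examine.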
At the advisory committee evaluating the effect of pimavanserin in dementia-related psychosis, the FDA commented that the overall results appeared to be driven by the small Parkinson’s disease subgroup, but the drug had already been approved for patients with Parkinson’s. The set of other dementias showed no convincing evidence of effect. The largest subgroup was Alzheimer’s disease dementia. Given the importance of Alzheimer’s for public health and the equivocal results in that large (N=123) subgroup, the FDA concluded that the data did not show enough evidence of benefit in the non-Parkinson’s dementia participants to grant the drug a label wider than the one it already had.
So just as a sloppy protocol, operational deficiencies, and violations of the SAP can irritate the FDA, the pimavanserin study is an example in which slavish adherence to a protocol and SAP can be an irritant as well. Beware of feeding your protocol, SAP, and data into AI without thinking about the meaning of what you are asking it to do. Here the prespecified analysis showed a dramatic result, but failing to look into the subgroups, especially those not already covered by a label, must have frustrated the FDA reviewers.
To summarize, here are some motherhood-and-apple-pie principles to prevent you from irritating reviewers and, in addition, to keep yourself in your own good graces:
Believe in the importance of randomization (don’t just give it lip service). That will make you worry about missing data and prevent you from believing that post-randomization subgroups provide causal information.
Work really hard to write a sensible, rigorous, doable protocol and SAP. And if you join a study that already has a flawed SAP, do what you can to correct it before unblinding the data.
List your assumptions clearly. If you are using complicated methodology, make sure that you understand the assumptions implicit in the model. If you are using AI, be especially careful to check that it is not following the rules too literally.
Don’t hide the warts in your study or your analyses. A reviewer will find them – far better for you to expose them and then, if they do not worry you, explain why you find the data convincing if, in fact, you do.
And the hardest principle of all: resist getting seduced by your prior convictions. Be honest to yourself and others about your findings. Remember Tom’s advice to Huck, “a body ain’t got no business doing wrong when he ain’t ignorant and knows better.”
Comments: This paper is based on a talk I gave as part of a Master Class in Statistics at the 2023 CardioVascular Clinical Trials (CVCT) meeting. My thanks to Faiez Zannad, MD, and Stuart Pocock, PhD, for inviting me to participate in the class and to Vijay Kumar, MD, of the FDA for encouraging me to write it up.
References
[1] Groothuis JR, Simoes EA, Levin MJ, Hall CB, Long CE, Rodriguez WJ, Arrobio J, Meissner HC, Fulton DR, Welliver RC, et al. Prophylactic administration of respiratory syncytial virus immune globulin to high-risk infants and young children. The Respiratory Syncytial Virus Immune Globulin Study Group. N Engl J Med. 1993 Nov 18;329(21):1524-30. doi: 10.1056/NEJM199311183292102. PMID: 8413475.
[2] Ellenberg SS, Epstein JS, Fratantoni JC, Scott D, Zoon KC. A trial of RSV immune globulin in infants and young children: the FDA's view. N Engl J Med. 1994 Jul 21;331(3):203-5. doi: 10.1056/NEJM199407213310315. PMID: 8054049.
[3] Simoes EA, Sondheimer HM, Top FH Jr, Meissner HC, Welliver RC, Kramer AA, Groothuis JR. Respiratory syncytial virus immune globulin for prophylaxis against respiratory syncytial virus disease in infants and children with congenital heart disease. The Cardiac Study Group. J Pediatr. 1998 Oct;133(4):492-9. doi: 10.1016/s0022-3476(98)70056-3. PMID: 9787686.
[4] Tariot PN, Cummings JL, Soto-Martin ME, Ballard C, Erten-Lyons D, Sultzer DL, Devanand DP, Weintraub D, McEvoy B, Youakim JM, Stankovic S, Foff EP. Trial of Pimavanserin in Dementia-Related Psychosis. N Engl J Med. 2021 Jul 22;385(4):309-319. doi: 10.1056/NEJMoa2034634. PMID: 34289275.
[5] The information had been on the FDA’s website in the past, but as of 11 April 2025 the documents from the advisory committee were no longer available (or, to be more accurate, I could not find them). So the data I am presenting come from my records – I would have checked their accuracy had I found the data on the website.