The Evolving Role of Statisticians in the Pharmaceutical Industry: Leveraging Advanced Statistical Analytics and Artificial Intelligence
Haoda Fu (Amgen), H. Amy Xia (Amgen)
Highlights:
Over the past century, the role of statisticians in the pharmaceutical industry has evolved—from service analysts to strategic decision drivers who define evidence, quantify uncertainty, design clinical plans, and embed statistical rigor into model-driven decisions across the value chain.
Today, technology is rapidly advancing, and the definition of data is expanding beyond traditional tabular formats to encompass multi-modal sources such as images, text, audio, video, omics, wearables, and real-world data. Guided by sound statistical principles, we are moving from digitization to datafication, to knowledge creation, and ultimately to intelligent decision-making—where statisticians ensure rigor, quality, and trust.
Looking ahead, statisticians will continue to evolve as architects of the analytical ecosystem—integrating AI, automation, and reproducible workflows to accelerate insights while maintaining transparency, interpretability, and regulatory compliance.
Abstract
In today’s pharmaceutical industry, statisticians play a central role in turning large and complex data into reliable evidence and actionable insights. Their work connects data, science, and technology to support faster and more efficient drug discovery and development. With growing access to real-world data, genomics, and digital health information, along with rapid advances in computing and artificial intelligence (AI), the role of the statistician has expanded far beyond traditional boundaries. This paper reviews the evolution of statisticians in the pharmaceutical field, starting from their early focus on sample size justification and data analysis in late-stage clinical trials to their current position as part of key decision-making throughout the entire drug development process—including discovery, clinical trial design, manufacturing, and commercialization. We highlight major changes that supported this shift, such as improvements in statistical computing, new regulatory guidance, and the adoption of advanced methods like adaptive designs, Bayesian approaches, and simulation studies. We also examine how statisticians are using AI and machine learning for drug discovery, and to improve trial efficiency, generate insights from real-world evidence, and support innovation across the value chain. These changes create new opportunities but also require statisticians to develop broader skills in programming, data science, and cross-functional communication. Looking ahead, we believe that statisticians will continue to be at the forefront of innovation in pharmaceutical research. By combining strong statistical thinking with modern tools and technologies, we can lead efforts to deliver better, safer treatments to patients more quickly. This paper offers a forward-looking view on how the profession can continue to grow and lead in a data-driven future.
Key Words and Phrases: Adaptive designs; Bayesian methods; Data science; De novo design; Drug discovery; Real-world evidence.
Short title: Evolving Role of Statisticians in Pharma
1 Introduction
Pharmaceutical statisticians have come a long way over the past half-century, evolving from backroom number-crunchers to essential contributors across the entire drug development spectrum. Once viewed primarily as support staff ensuring regulatory compliance, statisticians today are equal partners in research and development teams, influencing decisions from early drug discovery and clinical development to manufacturing and commercialization (International Council for Harmonisation, 2009, 2020; Chuang-Stein et al., 2010a). This expanded role has been driven by multiple converging forces. Advances in computing and the advent of new data sources (e.g., genomics, real-world clinical data) have enabled innovative statistical methodologies, while the rise of artificial intelligence (AI) and machine learning offers powerful tools to extract insights from electronic information that was previously hard to analyze (International Human Genome Sequencing Consortium, 2001; 1000 Genomes Project Consortium, 2015; U.S. Food and Drug Administration, 2023b; Concato and Corrigan-Curay, 2022; Vamathevan et al., 2019; Harrer et al., 2019). At the same time, the pharmaceutical industry’s external environment has grown more challenging – fewer new therapies are approved each year with increasing costs, and stakeholders demand greater transparency and evidence of value (Wouters et al., 2020). Statisticians have responded by embracing new analytic techniques and stepping into leadership and collaboration roles that were virtually unheard of decades ago (Chuang-Stein et al., 2010a; Senn, 2021). This article explores the trajectory of statisticians’ responsibilities in the pharmaceutical industry, with an emphasis on how advanced analytics and AI are shaping the present and future. We review the historical context that set the stage for today’s trends, analyze key drivers of change (from big data to Bayesian designs to AI) (U.S. Food and Drug Administration, 2019; Chen et al., 2023), discuss current and emerging applications of AI in drug development (Vamathevan et al., 2019), and examine how the statistician’s influence now extends across the pharmaceutical value chain. We also highlight the growing importance of interdisciplinary collaboration – including engagement with regulatory agencies like the FDA – and consider what educational enhancements are needed to prepare the next generation of pharmaceutical statisticians. Ultimately, we aim to demonstrate that statisticians are not only adapting to an evolving landscape but are increasingly leading innovation in pharmaceutical R&D and beyond.
The remainder of this article is structured as follows. Section 2 delves into the historical context of the evolving role of statisticians, setting the foundation for current trends. Section 3 examines the key drivers of change, including advancements in statistical computing, methodological and design innovations, and the emergence of new data types. Section 4 explores the current and emerging applications of AI within pharmaceutical companies and how statisticians’ influence now permeates the entire pharmaceutical value chain. Section 5 concludes with a discussion on the increasing importance of interdisciplinary collaboration and the educational advancements necessary to equip the next generation of pharmaceutical statisticians.
2 Historical Context of Statisticians’ Roles in Pharmaceutical Industry
The use of data and statistics to improve patient outcomes has been a part of healthcare for thousands of years and remains crucial today. An early example of data-driven healthcare is in the Bible’s “Book of Daniel” from 500 BC. King Nebuchadnezzar of Babylon believed a diet of meat and wine would keep his people healthy. However, some young men chose to eat vegetables and drink water for 10 days. They appeared healthier, so the king allowed them to continue their diet. This was an early instance of using an experiment to make a health decision. In the 18th century, James Lind, a ship’s surgeon, conducted one of the first controlled clinical trials. He tested treatments for scurvy and found that oranges and lemons were effective (Lind, 1753).
Modern biostatistics in drug development began in 1946 with the introduction of randomization and controlled trials (Crofton, 2006). Randomization was first introduced in 1923, and Sir Austin Bradford Hill conducted the first randomized controlled trial in 1946, showing that streptomycin was effective for tuberculosis (Bothwell and Podolsky, 2016; Chalmers, 2003). This study demonstrated how randomization, control groups, and statistical testing could guide medical decisions. A significant change came with the 1962 amendments to the U.S. Food, Drug, and Cosmetic Act (Goodrich, 1963), following the thalidomide tragedy. These amendments required the FDA to demand “substantial evidence” from controlled trials to prove a drug’s effectiveness, not just safety. This led drug companies to realize the necessity of statistically designed trials for approval, leading to a surge in hiring statisticians to meet FDA requirements (Rodda et al., 2001). By the late 1960s and 1970s, statisticians were key members of clinical research teams, mainly designing trials, calculating sample sizes and analyzing data for regulatory submissions (Meadows, 2006).
In the 1970s and 1980s, the role of statisticians in pharma grew with new regulatory initiatives. A key development was the FDA’s New Drug Application (NDA) rewrite in the early 1980s, which required a formal statistical review for every new drug application and a statistician as a co-author of clinical trial reports (U.S. Food and Drug Administration, 1988, 1985). These changes solidified statisticians’ roles in the drug approval process. However, they were still seen as technical support, ensuring analyses were correct and compliant. As Rockhold (2000) noted, even after NDA reforms, statisticians mainly executed analyses and calculated sample sizes, rather than shaping study designs or development programs. Most focused on late-phase clinical trials and some manufacturing quality assessments, with little involvement in early research phases or non-clinical areas (Chuang-Stein et al., 2010a).
By the 1990s, several factors increased statisticians’ influence. Pharmaceutical R&D became more global and complex, with larger trials and more data. Regulatory agencies worldwide adopted harmonized standards for trial conduct and statistical practice. The International Conference on Harmonisation (ICH) issued guideline E9: Statistical Principles for Clinical Trials (Guideline, 1999), emphasizing the importance of statistics in trial design, analysis, and interpretation. According to Rockhold (2000), ICH E9 gave statisticians more “leverage and authority in drug development,” highlighting the need for a strong statistical foundation for credible evidence. Statisticians began contributing strategically, advising on clinical programs and study designs. The industry recognized that information is the key output of R&D, boosting the demand for statistical thinking to maximize data value in discovery, preclinical studies, clinical trials, and post-market surveillance.
Another milestone in the 1990s was the rise of powerful statistical software and personal computing, enabling advanced analyses and simulations. Statistical programming languages like SAS became essential tools for pharma statisticians. In the late 1990s and early 2000s, statisticians expanded into new areas: safety data mining for adverse event detection, support for epidemiological studies, and clinical pharmacology modeling (e.g., PK/PD analyses for dose selection). In the 2000s and 2010s, the statistician’s role expanded significantly. The FDA’s 2004 Critical Path Initiative aimed to modernize medical product development science, advocating for innovative statistical approaches (U.S. Food and Drug Administration, 2004). The initiative highlighted challenges like biomarker validation, enrichment trial designs, missing data handling, multiplicity issues, and model-based evidence, all requiring sophisticated statistical input. In the following decades, regulators released guidance documents on adaptive trial designs, non-inferiority trials, multiple endpoints, and real-world evidence, expanding statisticians’ toolkit and responsibilities in clinical development. By the 2010s, statisticians were seen as essential partners in drug R&D, described as “absolutely critical for efficient and effective drug development” and serving as key contributors or consultants in all R&D areas. The role evolved from a support role to a strategic, interdisciplinary one, preparing statisticians to tackle 21st-century challenges, including the big-data revolution and AI integration in pharmaceutical research.
3 Key Catalysts for the Evolution of the Statistician’s Role
Several interrelated factors have accelerated the evolution of statisticians’ responsibilities in the pharmaceutical industry. Key among them are: advances in computing and software that exponentially widened analytic possibilities (such as SAS and R) (Ihaka and Gentleman, 1996; Chambers, 1998; Segreti et al., 2001); the development of innovative statistical methodologies (such as Bayesian methods and adaptive designs) coupled with regulatory encouragement that fostered their adoption (Pallmann et al., 2018; U.S. Food and Drug Administration, 2019; Woodcock and LaVange, 2017; International Council for Harmonisation, 2025); and the emergence of new data types and large datasets (from real-world evidence to genomics and digital health) that demanded novel analytical approaches (Concato and Corrigan-Curay, 2022; U.S. Food and Drug Administration, 2018; Morris and Baladandayuthapani, 2017). These factors together have reshaped what pharmaceutical statisticians do day-to-day. We examine how each of these catalysts has contributed to the deepening and broadening of statisticians’ responsibilities in pharma, and we illustrate how statisticians’ skill sets and influence have grown in response. These catalysts also set the stage for today’s trends in the rise of AI in pharmaceutical research (Liu et al., 2023b; U.S. Food and Drug Administration, 2025b).
3.1 Advances in Statistical Computing and Hardware
Early pharmaceutical statisticians worked in an era of limited computing power, often performing calculations by hand or with basic mechanical aids. The mid-20th century saw the introduction of mainframe computers, but computational resources remained scarce and specialized. This inherently constrained the complexity of analyses that statisticians could practically undertake. Over time, however, revolutions in computing hardware and the advent of statistical software radically transformed the toolkit of the pharmaceutical statistician. By the late 20th century, improvements in processing speed and data storage (following Moore’s Law) (Moore, 1965) enabled routine execution of intensive methods that were previously impractical. In parallel, the development of high-level statistical programming languages and software packages – notably the Statistical Analysis System (SAS) in the 1970s, the S language, and later the open-source R in the 1990s – provided user-friendly platforms to implement complex analyses (Chambers, 1998; Ihaka and Gentleman, 1996). The widespread adoption of these tools in industry meant that statisticians could manage larger datasets and apply more sophisticated models with relative ease. The practice of statistics in pharma changed markedly over 35 years in tandem with advances in computational power (Segreti et al., 2001).
One direct outcome was the rise of simulation-based analysis and design. With greater computing resources, statisticians began to use Monte Carlo simulations to evaluate trial properties and optimize study designs before any patients were enrolled. For example, by the 2000s it became routine to simulate thousands of trial iterations to assess a design’s probability of making correct/incorrect decisions or to model various what-if scenarios for adaptive trials (U.S. Food and Drug Administration, 2019). Such computationally intensive work simply was not feasible in earlier decades. The increasing availability of fast computing also facilitated resampling and modern methods – techniques like the bootstrap (for estimating confidence intervals) and Markov chain Monte Carlo (for Bayesian analysis) gained traction in clinical research once computers could handle the necessary iterative calculations (Efron, 1979; Gelfand and Smith, 1990). The net effect was an expansion in statisticians’ capabilities: rather than being limited to relatively simple trial designs and analyses, they could now explore a much richer design space and fit more complex models to data. Indeed, contemporary statisticians often write extensive code (in SAS, R, or Python, etc.) to manipulate datasets, implement custom analyses, and even create interactive dashboards for data visualization, reflecting a blending of traditional statistical skills with what we now call data science (Chuang-Stein et al., 2010b).
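To make the idea of simulating trial operating characteristics concrete, the following is a minimal sketch rather than a production design tool. It assumes a hypothetical two-arm trial with normally distributed outcomes, 100 patients per arm, and a two-sided test at a fixed critical value of 1.96; the function name and all parameter values are illustrative assumptions, not taken from any cited guidance.

```python
import numpy as np


def simulate_rejection_rate(effect, n_per_arm=100, sd=1.0,
                            n_sims=20000, z_crit=1.96, seed=0):
    """Estimate how often a two-arm trial rejects the null hypothesis.

    effect is the assumed true mean difference (treated minus control).
    All defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Each row is one simulated trial; each column is one patient.
    control = rng.normal(0.0, sd, size=(n_sims, n_per_arm))
    treated = rng.normal(effect, sd, size=(n_sims, n_per_arm))
    diff = treated.mean(axis=1) - control.mean(axis=1)
    se = np.sqrt(treated.var(axis=1, ddof=1) / n_per_arm
                 + control.var(axis=1, ddof=1) / n_per_arm)
    z = diff / se
    # Two-sided test: reject when the statistic is extreme in either direction.
    return float(np.mean(np.abs(z) > z_crit))


# No true effect: the rejection rate estimates the type I error rate.
type_one_error = simulate_rejection_rate(effect=0.0)
# Assumed true effect of 0.4 standard deviations: the rate estimates power.
power = simulate_rejection_rate(effect=0.4)
print(f"type I error ~ {type_one_error:.3f}, power ~ {power:.3f}")
```

The same loop structure extends to what-if scenarios: changing the sample size, effect size, or decision rule and rerunning gives a table of operating characteristics for comparing candidate designs.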
Importantly, better computing did not just change how fast statisticians work – it changed what they work on. Previously, statisticians’ contributions might begin only after data collection (analyzing final trial results), but modern computing power allowed them to influence studies from the planning and design stage onward, running simulations to inform optimal sample sizes, endpoint definitions, and decision criteria. For instance, clinical trial simulation became an established practice for complex trial planning by the 2010s (U.S. Food and Drug Administration, 2019), allowing statisticians to quantify the trade-offs of various design choices under myriad scenarios. As datasets grew from tens of patients in the 1960s to tens of thousands of patients (or millions of observations) in the 21st century, the statistician’s role expanded to include ensuring data integrity, traceability, and reproducibility through efficient programming and validation (Segreti et al., 2001). The long history of success of SAS as a de facto industry standard is one testament to how central computing environments became to pharma statistics. More recently, open-source tools (R and Python in particular) have gained acceptance, further empowering statisticians to use cutting-edge techniques and share reproducible code (Chuang-Stein et al., 2010b). In summary, the dramatic improvements in hardware and the parallel evolution of statistical software over roughly 1950 to the present have been fundamental catalysts for change – transforming the statistician’s role from a manual calculator of p-values to a computational strategist capable of exploring vast design and analytic possibilities (Segreti et al., 2001; Rockhold, 2000).
Looking forward, we believe the next wave of computing advances will continue to shape the statistician’s role. For example, the current trial simulations primarily focus on addressing scientific questions such as family-wise type I error control, power, or the posterior probability of trial success. These simulations are often conducted before running a clinical trial. As computing power continues to grow, we expect statisticians to increasingly leverage real-time data during ongoing clinical trials to run simulations that address not only scientific questions but also operational questions, such as the consequences of opening additional sites to speed up enrollment. Addressing these questions can further lead to optimizing clinical trial operations at each step, conditional on what has already happened in the trial. We envision that statisticians, collaborating with cross-functional teams, will be responsible for designing and implementing such real-time simulations, which will be a key component of the next generation of clinical trials.
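As a rough illustration of the operational question raised above, the sketch below projects the time needed to finish enrollment, conditional on the number of patients still to be enrolled. It assumes enrollment follows a homogeneous Poisson process with the same rate at every site, in which case the waiting time to the k-th patient is Gamma distributed. The site counts, rate, and function name are hypothetical values chosen for illustration only.

```python
import numpy as np


def simulate_days_to_complete(remaining_patients, n_sites,
                              rate_per_site_per_day, n_sims=10000, seed=0):
    """Simulate days until the remaining patients are enrolled.

    Assumes a homogeneous Poisson enrollment process, so the time to the
    k-th arrival follows a Gamma(k, 1 / total_rate) distribution.
    """
    rng = np.random.default_rng(seed)
    total_rate = n_sites * rate_per_site_per_day  # patients per day overall
    return rng.gamma(shape=remaining_patients, scale=1.0 / total_rate,
                     size=n_sims)


# Hypothetical mid-trial snapshot: 200 patients still to enroll.
days_current = simulate_days_to_complete(200, n_sites=20,
                                         rate_per_site_per_day=0.05)
# What-if scenario: open 10 additional sites with the same assumed rate.
days_expanded = simulate_days_to_complete(200, n_sites=30,
                                          rate_per_site_per_day=0.05)

median_current = float(np.median(days_current))
median_expanded = float(np.median(days_expanded))
print(f"median days, 20 sites: {median_current:.0f}")
print(f"median days, 30 sites: {median_expanded:.0f}")
```

A realistic version would replace the constant rate with site-specific rates estimated from the enrollment observed so far, and would account for site start-up delays.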
3.2 Growth of Advanced Statistical Methodologies and Regulatory Encouragement
As computing capabilities grew, so too did the development of novel statistical methodologies for clinical trials. From approximately the 1980s onward, statisticians began proposing innovative trial designs and analysis methods that could make drug development more efficient and informative. Two prominent examples are adaptive trial designs and the increasing use of Bayesian statistical methods.
Innovative designs represented a break from the fixed, one-size-fits-all designs that had dominated clinical research since the standardization of randomized controlled trials in the post-war era. At the same time, the industry has recognized the increasing cost of drug development and the need to improve its efficiency. However, the uptake of such innovations in industry was initially slow – until regulatory bodies, and particularly the FDA, actively encouraged their adoption. Regulatory guidance has been a crucial catalyst in legitimizing and accelerating the use of advanced methods by pharmaceutical statisticians (U.S. Food and Drug Administration, 2019, 2023c; International Council for Harmonisation, 2025). Adaptive designs allow pre-planned modifications to certain aspects of a clinical trial (such as sample size, randomization ratios, or even treatment arms) based on interim analysis of accumulating data. The conceptual appeal of adaptive trials is clear: they can make clinical research more flexible and efficient, potentially finding effective treatments faster or using fewer patients (Pallmann et al., 2018). For example, an adaptive trial might start with multiple dose groups and use interim results to seamlessly drop ineffective doses or reallocate more patients to promising treatments, rather than sticking to a static design. By utilizing ongoing results, adaptive designs can ethically benefit patients (more patients get the better treatments) and scientifically improve the chance of trial success or reduce resources needed. These advantages were recognized in the statistical literature by the 1990s, but early on there was hesitation in the conservative regulatory environment to accept trials that depart from the traditional fixed protocol (Pallmann et al., 2018). This began to change in the 2000s and 2010s.
A milestone was the FDA’s 2010 Draft Guidance on adaptive design, followed by a comprehensive FDA Guidance in 2019 explicitly outlining principles for adaptive trials in drug development (U.S. Food and Drug Administration, 2019). This guidance not only provided industry with a clear roadmap on how to plan and analyze adaptive trials rigorously, but also sent a strong signal that regulators welcome well-justified adaptive approaches. An ICH E20 guideline on adaptive designs for clinical trials is also under development to delineate the principles of adaptive designs and the associated regulatory considerations (International Council for Harmonisation, 2025). Statisticians were central to this shift: they had to develop new statistical methods to ensure, for instance, that making mid-course modifications would not inflate the family-wise type I error rate (the false positive rate). They also engaged in extensive simulations, as recommended by the FDA, to demonstrate operating characteristics of adaptive designs before implementation. As a result of these efforts, adaptive designs are now increasingly common in clinical trials across therapeutic areas (from oncology to cardiology), and pharmaceutical statisticians have expanded responsibilities in designing interim analyses, setting adaptation rules, and liaising with Data Monitoring Committees. Indeed, adaptive methods have moved from an experimental idea to a mainstream tool, catalyzed by regulatory acceptance.
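The concern about type I error inflation can be illustrated with a small simulation. This is a sketch under simplifying assumptions: a hypothetical design with one interim look at half the information and one final analysis, with early stopping for efficacy only. Testing at the unadjusted two-sided critical value of 1.96 at both looks is compared with the standard Pocock constant for two equally spaced looks (approximately 2.178). The function name is illustrative.

```python
import numpy as np


def two_look_type_one_error(z_crit, n_sims=200000, seed=0):
    """Type I error of a two-look design using the same boundary twice.

    Under the null hypothesis, the z-statistics computed from each stage's
    data alone are independent standard normal variables.
    """
    rng = np.random.default_rng(seed)
    z_stage1 = rng.standard_normal(n_sims)
    z_stage2 = rng.standard_normal(n_sims)
    z_interim = z_stage1
    # Equal information per stage, so the cumulative statistic is the
    # normalized sum of the two stage-wise statistics.
    z_final = (z_stage1 + z_stage2) / np.sqrt(2.0)
    # The trial rejects if either look crosses the boundary.
    reject = (np.abs(z_interim) > z_crit) | (np.abs(z_final) > z_crit)
    return float(reject.mean())


naive_error = two_look_type_one_error(z_crit=1.96)    # no adjustment
pocock_error = two_look_type_one_error(z_crit=2.178)  # Pocock boundary
print(f"unadjusted: {naive_error:.3f}, Pocock-adjusted: {pocock_error:.3f}")
```

The unadjusted design rejects a true null hypothesis noticeably more often than the nominal 5%, while the adjusted boundary restores control. Designs with more looks or other adaptations need their own boundaries, which is one reason simulation is used to confirm operating characteristics.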
Bayesian methods have similarly grown in prominence. The Bayesian framework for data analysis offers an intuitive and flexible approach in which evidence is accumulated sequentially, and prior knowledge can be formally incorporated into current trial analysis. For decades, classical (frequentist) statistics dominated drug trials, but Bayesian statistics began gaining traction for problems where traditional methods were less efficient – such as trials in rare diseases or early-phase studies requiring use of prior data, as well as other applications in safety signal detection and evaluation (Xia et al., 2011; Xia and Price, 2014), and meta-experimental design and analysis (Ibrahim et al., 2012). Bayesian analyses can produce direct probability statements about treatment effects (e.g., the probability a drug is better than control), which are appealing to decision-makers, and can allow more continuous learning from data rather than an all-or-nothing hypothesis test. Bayesian methods, such as probability of study success (PrSS) evaluation, have been broadly used for internal decision making (Wang et al., 2013). However, adopting Bayesian approaches in regulated clinical trials required convincing both scientists and regulators of their validity and robustness. A key turning point was in the area of medical devices: in 2010, the FDA’s Center for Devices and Radiological Health released a Guidance for the Use of Bayesian Statistics in Medical Device Trials (U.S. Food and Drug Administration, 2010). This document explicitly acknowledged that Bayesian methods, when properly applied, could reduce required sample sizes or study durations by incorporating prior evidence, as well as offer other benefits in flexibility of trial design. Notably, by formally addressing “Why are Bayesian methods more commonly used now?” and similar questions, the FDA guidance clarified misconceptions and provided best practices for sponsors. 
This endorsement catalyzed a surge of interest in Bayesian designs not only for devices but eventually in drug trials as well. In drug development, Bayesian methods have seen increased use in exploratory Phase II trials, in adaptive dose-finding (e.g., Bayesian dose-escalation methods in oncology), and even in some confirmatory trials with regulatory acceptance (especially in rare disease settings where leveraging external or prior trial data is invaluable). For example, the pivotal Pfizer/BioNTech mRNA COVID-19 vaccine study (BNT162b2) employed a design and analysis framework described as Bayesian (Polack et al., 2020). Notably, in autoimmune disease development, Amgen’s programme for Systemic Lupus Erythematosus (SLE) entered the FDA’s Complex Innovative Trial Design (CID) pilot programme, proposing that the endpoint be evaluated using a Bayesian Hierarchical Model (BHM) with non-informative priors (Food and Drug Administration (FDA), 2021). By 2024, a Lancet review noted that Bayesian statistics offers a flexible and informative approach that facilitates both design and interpretation of trials, and advocated for its broader use in clinical research (Goligher et al., 2024). The authors emphasized that owing to its different conception of probability, the Bayesian paradigm can incorporate evidence in ways that enrich inference and decision-making. It is telling that FDA leadership has also highlighted Bayesian and adaptive designs as promising innovations in the context of modernizing clinical trials (e.g., in discussions around the 21st Century Cures Act, which encouraged the exploration of novel trial designs and analytical methods for speeding therapy approvals) (Concato and Corrigan-Curay, 2022).
Recently, the Food and Drug Administration’s Center for Drug Evaluation and Research (CDER) launched the Bayesian Statistical Analysis (BSA) Demonstration Project to foster the use of Bayesian methods in “simple” phase-III drug trials (e.g., non-adaptive or sequential designs). The programme allows sponsors to use Bayesian analyses, either as the primary or a supplemental analysis, and offers regulatory interaction and methodological support (U.S. Food and Drug Administration, Center for Drug Evaluation and Research, Center for Clinical Trial Innovation (C3TI), 2025). In practice, statisticians’ roles have expanded to include mastering these advanced methodologies, educating project teams and regulators about them, and developing the technical justifications needed for their use. Where a 1970s-era statistician’s toolkit might not have extended far beyond t-tests and chi-squares, a statistician today might design a complex adaptive Bayesian trial with multiple interim looks and dynamic randomization, confident in its theoretical soundness and regulatory acceptability (Goligher et al., 2024; U.S. Food and Drug Administration, 2019). This direction is reinforced by the PDUFA VII requirements (U.S. Food and Drug Administration, 2022b), and the FDA is expected to publish a draft guidance on Bayesian methods by the end of 2025.
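The kind of direct probability statement that Bayesian analysis provides can be sketched with a simple conjugate model. The example below assumes a binary response endpoint, independent Beta(1, 1) priors on the response rate in each arm, and hypothetical trial counts; it is an illustration of the general idea, not the analysis used in any of the trials cited above.

```python
import numpy as np


def prob_treatment_better(resp_trt, n_trt, resp_ctl, n_ctl,
                          prior_a=1.0, prior_b=1.0,
                          n_draws=100000, seed=0):
    """Posterior probability that the treatment response rate exceeds control.

    With a Beta(prior_a, prior_b) prior and binomial data, the posterior of
    each response rate is Beta(prior_a + responders,
    prior_b + non-responders).
    """
    rng = np.random.default_rng(seed)
    p_trt = rng.beta(prior_a + resp_trt, prior_b + n_trt - resp_trt, n_draws)
    p_ctl = rng.beta(prior_a + resp_ctl, prior_b + n_ctl - resp_ctl, n_draws)
    # Fraction of posterior draws in which treatment beats control.
    return float(np.mean(p_trt > p_ctl))


# Hypothetical data: 30/50 responders on treatment, 20/50 on control.
prob_better = prob_treatment_better(30, 50, 20, 50)
print(f"P(treatment rate > control rate | data) ~ {prob_better:.3f}")
```

Replacing the flat prior with one informed by earlier studies shows how prior evidence shifts this probability, which is the mechanism behind the potential sample size savings discussed in the device guidance.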
Another example of methodological innovation is the emergence of master protocols (platform trials, basket trials, umbrella trials) which allow evaluation of multiple therapies and/or multiple diseases within a single trial infrastructure. These designs, which became especially prominent in the 2010s (notably in oncology), require sophisticated statistical coordination – for instance, sharing control groups, dropping or adding treatment arms on the fly, and possibly using Bayesian borrowing of information across sub-studies. Statisticians were instrumental in conceiving these designs, but their broad adoption was again facilitated by regulators. In 2017, Woodcock and LaVange from the FDA authored a New England Journal of Medicine review explaining the value of master protocols and providing a regulatory perspective on how to conduct them rigorously (Woodcock and LaVange, 2017). They illustrated that such designs can accelerate drug development by studying multiple hypotheses under a common protocol, but also cautioned on the statistical complexities that must be managed. Following this, the FDA issued a formal guidance on master protocol trials (in 2018 draft, finalized 2022), further cementing regulatory encouragement (U.S. Food and Drug Administration, 2022a). The net effect of these trends is that statisticians are now far more deeply involved in trial design strategy than before – they are not just answering “How do we analyze the data?” but also “What is the optimal way to design this study to begin with?”. As a 2010 industry review put it, statisticians in pharma have evolved into “full and equal partners with clinical and regulatory scientists” in trial planning and drug development strategy. This cultural shift, partly driven by the need to implement cutting-edge methods properly, means statisticians today often co-lead discussions on a program’s evidence generation plans. 
They ensure that innovative designs like adaptive and Bayesian trials are used appropriately and transparently, satisfying scientific rigor and regulatory standards. In summary, the growth of advanced methodologies – and crucially, the feedback loop of regulatory guidance and endorsement – has been a key catalyst expanding statisticians’ responsibilities. It pushed them into new roles: methodological innovators, architects of novel trial designs, and front-line communicators who articulate the benefits and limitations of these designs to regulators and clinical teams.
3.3 New Data Types and Large Datasets
The modern pharmaceutical landscape is awash with data sources that scarcely existed a few decades ago. In the early decades (1950s-1980s), clinical trial results captured on paper-based Case Report Forms (CRFs) were the primary data source for statisticians. With limited computational tools, analysis was often restricted to basic descriptive statistics and simple statistical methods such as t-tests, ANOVA, and chi-squared tests. Later, longitudinal data from electronic CRFs and databases became more common, allowing for richer analyses of treatment effects over time, and methods such as mixed-effects models and other more complex statistical techniques became standard. Nowadays, companies contend with real-world data from healthcare databases, genomic and other “omics” data from advanced laboratory technologies, and digital health data from wearable sensors and electronic patient devices. The advent of these new data types – often high-volume, high-velocity, and high-variety – has changed the statistician’s job. Statisticians have had to develop and adopt new methodologies for analyzing such data, expand their expertise into realms traditionally outside classical biostatistics, and often collaborate closely with experts in fields like bioinformatics and machine learning. In short, the rise of large, complex data sets has been another catalyst that broadened the statistician’s role from trial-centric analysis to a more holistic “clinical data science” role (Morris and Baladandayuthapani, 2017).
A key area is Real-World Data (RWD) and Real-World Evidence (RWE). RWD are data relating to patient health status and/or the delivery of health care that are routinely collected from a variety of sources. They can include EHRs, claims and billing data, data from product and disease registries, patient-generated data including in home-use settings, and data gathered from other sources that can inform on health status, such as mobile devices. RWE is the clinical evidence regarding the usage and potential benefits or risks of a medical product derived from analysis of RWD (U.S. Food and Drug Administration, 2018; Concato and Corrigan-Curay, 2022). Historically, RWD were not heavily used in regulatory decisions due to concerns about bias and quality. However, efforts to use RWE for regulatory and clinical insights have increased, especially in the U.S. The 21st Century Cures Act of 2016 required the FDA to explore RWE for drug approvals (U.S. Food and Drug Administration, 2018). By 2021-2022, the FDA had issued guidance on using RWE and approved some drugs based on real-world studies (U.S. Food and Drug Administration, 2024a, 2021a, 2023d). Statisticians play a crucial role in analyzing these datasets, dealing with issues like bias, confounding, and missing data, and ensuring data quality. They must also explain to regulators how observational data can approximate randomized trial evidence. This expansion means statisticians now work in outcomes research, safety surveillance, and policy. The FDA’s 2021 guidance on using electronic health records for regulatory decisions further expands their responsibilities, and since 2021 the FDA has published a series of guidance documents on RWD/RWE covering data, design, conduct, and regulatory considerations (U.S. Food and Drug Administration, 2021b, 2023b,d, 2024a).
Another important area is genomics and other “omics” data. Advances like genome sequencing have introduced large-scale data to drug development. These data are complex and require sophisticated modeling. Statisticians have been key in developing tools to analyze these data, contributing to bioinformatics. They design experiments, preprocess data, and develop algorithms to identify important genes or biomarkers. As precision medicine grows, statisticians help identify patient subgroups with genomic markers that predict drug response. They also work on companion diagnostics, linking biomarkers to treatment outcomes. Their role has expanded from asking “Does the drug work on average?” to “For whom does the drug work?” This requires skills in multivariate modeling and machine learning, and collaboration with lab scientists. Statisticians also ensure data validity in new analytical domains (Morris and Baladandayuthapani, 2017).
Digital health data is another new area. Devices like smartphones and wearables collect real-time patient data, creating “digital endpoints” in trials. These endpoints offer a more comprehensive view of patient health. Statisticians validate and analyze these endpoints, addressing challenges like data volume and missing data. They work with clinicians to ensure digital measures correlate with clinical benefits. The FDA has shown interest in digital health technologies, issuing guidance on using digital tools in trials. The COVID-19 pandemic increased the acceptance of digital endpoints. Statisticians now work with data scientists to refine algorithms and design trials with remote data capture.
Finally, underpinning all these new data domains is the rise of AI and machine learning (ML) techniques in drug development. Pharmaceutical companies are increasingly using ML models for tasks ranging from drug discovery (e.g., predicting molecule-target interactions) to patient/site selection and outcome prediction in clinical trials. Statisticians often lead this work or collaborate closely on it, particularly in validating such models. Notably, regulatory agencies have begun to acknowledge AI/ML in submissions. By 2025, the FDA reported seeing over 500 product submissions (across drugs and biologics) that incorporated AI/ML approaches, spanning discovery, trial optimization, and post-market safety analysis (U.S. Food and Drug Administration, 2025a). This marks a significant new responsibility for statisticians: evaluating, and perhaps even developing, predictive algorithms and ensuring they meet appropriate standards of evidence and are free of undue bias. The FDA has encouraged sponsors to employ cutting-edge analytical tools – for example, using ML on real-world data to detect safety signals or to interpret complex endpoints – but with the expectation that they are rigorously assessed (U.S. Food and Drug Administration, 2025b). Statisticians thus find themselves contributing to cross-functional teams, bringing their expertise in validation: for instance, applying principled cross-validation, setting up prospective validation studies for algorithms, and quantifying uncertainty in model predictions (Liu et al., 2023b; Morris and Baladandayuthapani, 2017). In essence, the data science revolution has not obviated the need for statisticians – it has expanded their purview. This breadth is a direct consequence of the influx of novel data types that require novel analytic thinking and methods.
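To make the validation tasks mentioned above concrete, the following is a minimal sketch of k-fold cross-validation combined with a percentile bootstrap interval for the resulting accuracy. It is illustrative only: the data are synthetic, the one-feature threshold classifier is a stand-in for a real model, and all function names are hypothetical rather than taken from any cited work. The bootstrap here resamples out-of-fold results and therefore ignores variability from refitting, so it should be read as a rough uncertainty statement.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy classifier (assumption): cut at the midpoint of the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def out_of_fold_hits(xs, ys, k=5, seed=0):
    """Return 1/0 per observation: was it classified correctly when held out?"""
    hits = [0] * len(xs)
    for fold in k_fold_indices(len(xs), k, seed):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        # The model is refit on the training folds only, never on held-out data.
        cut = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            hits[i] = int((xs[i] > cut) == (ys[i] == 1))
    return hits


def bootstrap_interval(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lower = means[int(alpha / 2 * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic data for illustration only: one biomarker, two outcome classes.
rng = random.Random(1)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in ys]

hits = out_of_fold_hits(xs, ys)
accuracy = statistics.mean(hits)
low, high = bootstrap_interval(hits)
print(f"out-of-fold accuracy {accuracy:.3f}, interval ({low:.3f}, {high:.3f})")
```

Reporting the interval alongside the point estimate is the design choice of interest here: it keeps a single accuracy figure from being over-interpreted.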
4 Emerging AI Technologies in Pharmaceutical Research
Perhaps the most transformative catalyst in recent years has been the rise of artificial intelligence and machine learning in pharmaceutical research. AI technologies are reshaping how data are generated, analyzed, and even how trials are conducted. The following subsections present key trends.
4.1 Expanding Data Definition
In the past, pharmaceutical data was mostly just numbers in tables. Now, AI has expanded what we consider “data” to include things like molecular sequences, medical images, text, and even audio. Machine learning models can now learn from this unstructured information that we couldn’t analyze before. For example, large language models have been trained using diverse text sources like Wikipedia. Wikipedia, once just a simple reference, is now a key dataset for training large language models, showing how text can be turned into valuable scientific data (Devlin et al., 2019; Brown et al., 2020). Similarly, AI in biotechnology has used protein databases to create new functional proteins in a computer (Anishchenko et al., 2021; Watson et al., 2023). This means protein databases, once used for manual searches, are now used for AI-driven protein design, allowing us to create enzymes with specific functions from scratch (The UniProt Consortium, 2023; Berman et al., 2000). These examples show how the idea of “data” is growing and how new data types are driving unexpected advances in pharmaceuticals.
Statisticians play an important role in understanding this flood of digital data. AI gives statisticians the chance to work with complex datasets that were too difficult to analyze before. There’s a clear path for turning raw data into useful insights: (1) Digitalization – turning paper records into digital form; (2) Datafication – organizing these digital records so they can be analyzed; (3) Knowledgefication – finding patterns and insights in the data and answering various what-if questions; and (4) Intelligencefication – using AI to recommend optimal decisions based on that knowledge. We see this happening in pharmaceutical research and development. Big companies have digitized years of clinical trial protocols, patient records, and regulatory documents. Once these documents are digitized, statisticians can start analyzing them, linking trial criteria to outcomes and study designs to success rates. This allows them to ask important questions like, “What makes some trials succeed while others fail?” or “How do certain criteria affect patient enrollment and outcomes?” Recent studies using real-world patient data have shown that many traditional trial restrictions don’t significantly affect outcomes, and relaxing these restrictions could increase the number of eligible patients without harming results (Liu et al., 2021). This is an example of knowledgefication – turning large, unstructured data into insights that can improve trial design.
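As a small illustration of the what-if questions described above, the sketch below counts how many patients in a dataset become eligible when each inclusion criterion is relaxed in turn. The patient records, criterion names, and thresholds are invented for illustration, and the sketch covers only the eligibility count; it does not assess the effect on outcomes, which a real analysis would also need to evaluate.

```python
def eligible(patient, criteria):
    """A patient is eligible only if every criterion (a predicate) is satisfied."""
    return all(rule(patient) for rule in criteria.values())


def eligibility_gain_by_criterion(patients, criteria):
    """Return the baseline eligible count and the gain from dropping each criterion."""
    base = sum(eligible(p, criteria) for p in patients)
    gains = {}
    for name in criteria:
        relaxed = {k: v for k, v in criteria.items() if k != name}
        gains[name] = sum(eligible(p, relaxed) for p in patients) - base
    return base, gains


# Hypothetical criteria and toy records (not from any real trial).
criteria = {
    "age_at_most_75": lambda p: p["age"] <= 75,
    "egfr_at_least_60": lambda p: p["egfr"] >= 60,
    "ecog_at_most_1": lambda p: p["ecog"] <= 1,
}
patients = [
    {"age": 55, "egfr": 70, "ecog": 1},
    {"age": 80, "egfr": 65, "ecog": 0},
    {"age": 60, "egfr": 45, "ecog": 1},
    {"age": 78, "egfr": 50, "ecog": 2},
]

base, gains = eligibility_gain_by_criterion(patients, criteria)
print("eligible under all criteria:", base)
print("additional patients if a criterion is dropped:", gains)
```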
The last step, intelligencefication, is about to happen: using AI to make the best recommendations based on the knowledge gained. In pharmaceutical companies, this could mean AI helping design trials. For example, after learning from many past trials, an AI agent might suggest the best inclusion and exclusion criteria for enrolling patients to best differentiate treatment efficacy and safety, and it can also recommend the best schedule of activities to maximize trial success (Hutson, 2024). We can imagine AI tools that combine information from regulatory documents, scientific publications, conference abstracts, and early experiments to predict what concerns regulators might have about a new drug. Statisticians, with their skills in data analysis and experiment design, will be crucial in checking and using these AI recommendations. By leading the digitalization and analysis of diverse data, and by carefully evaluating AI’s suggestions, statisticians help ensure that the pharmaceutical industry’s new data is turned into reliable knowledge and smart actions. In short, the growth of “data” in pharmaceutical research – from text and images to real-world patient data – is increasing the influence of statisticians both in organizing data and in making decisions, highlighting their role as key players in AI-driven research.
4.2 Go Beyond Traditional Statistical Methods
Modern AI is not just improving traditional statistics; it often surpasses them, opening new scientific areas. A great example is AlphaFold2 by DeepMind, which revolutionized how we predict protein structures. Before, predicting a protein’s 3D shape from its amino acid sequence required expert-crafted features and significant domain knowledge. AlphaFold2 changed this by using deep learning to directly predict structures from sequences, skipping the extensive human interventions. It achieved high accuracy, even without similar known structures, and matched experimental results for many targets. This breakthrough, published in Nature in 2021, showed that AI can learn complex biological patterns from data without needing detailed chemistry or physics rules (Jumper et al., 2021).
Crucially, this shift opens a new “swimming lane” for quantitative scientists with strong mathematical and programming skills. Complex biological and chemical problems that used to be the domain of specialized computational biologists are increasingly being tackled with general-purpose data-driven methods. Statisticians, given their training in rigorous modeling and algorithm development, are well positioned to contribute to this emerging domain, often referred to as “digital biology.” Work in digital biology (such as de novo protein design or ligand generation) often involves advanced mathematics and computation that go beyond classical computational chemistry training. For instance, modern generative models for molecular structures exploit concepts from differential geometry and Lie group theory to enforce physical symmetries (e.g. rotational or translational invariances of molecules) in the learning process. Geometric deep learning frameworks have been developed to handle data on non-Euclidean domains like protein surfaces or molecular graphs, encoding invariances under rotations/reflections by design (Bronstein et al., 2017; Fuchs et al., 2020; Garcia Satorras et al., 2021). Implementing and extending these models requires fluency in linear algebra, group representations, and high-performance computing – skill sets much more akin to those of statisticians or applied mathematicians than to those of traditional wet-lab scientists or computational biologists/chemists. In effect, drug discovery is becoming as much an algorithmic science as an experimental one. This creates ripe opportunities for statisticians to lead methodological innovation in areas like protein engineering and small-molecule drug design, where sophisticated modeling (rather than domain-specific intuition alone) drives breakthroughs.
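The notion of invariance mentioned above can be illustrated with a minimal sketch: pairwise interatomic distances do not change when a structure is rotated or translated, even though the raw coordinates do. The coordinates below are hypothetical, and the sketch shows only an invariant feature computed by hand; the cited networks build such symmetries into their layers, which this example does not attempt to reproduce.

```python
import math


def rotate_about_z(points, angle):
    """Rotate 3D points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def translate(points, shift):
    """Shift every point by the same (dx, dy, dz) vector."""
    dx, dy, dz = shift
    return [(x + dx, y + dy, z + dz) for x, y, z in points]


def pairwise_distances(points):
    """Distances for all unordered pairs, in a fixed index order."""
    n = len(points)
    return [math.dist(points[i], points[j]) for i in range(n) for j in range(i + 1, n)]


# Hypothetical coordinates for a four-atom fragment (arbitrary units).
atoms = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (1.6, 1.2, 0.0), (0.3, 0.9, 1.4)]
moved = translate(rotate_about_z(atoms, math.pi / 3), (2.0, -1.0, 0.5))

same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pairwise_distances(atoms), pairwise_distances(moved))
)
print("coordinates changed:", atoms != moved)
print("pairwise distances preserved:", same)
```

A model fed raw coordinates would have to learn this symmetry from data, whereas a model fed distances (or built to respect the symmetry) gets it for free.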
Beyond AlphaFold, numerous other examples illustrate how algorithmic approaches are reshaping pharmaceutical R&D. For statisticians, each of these advances signals a domain where their expertise can be applied in novel ways: designing the modeling strategy, ensuring rigorous validation, and quantifying uncertainty in predictions. Notably, many such AI-driven discovery techniques emphasize prediction and optimization (e.g. finding a molecule that maximizes a predicted efficacy score) rather than classical inferential statistics. This highlights an important cultural shift that renowned statistician Leo Breiman presciently discussed in his “Two Cultures” essay two decades ago (Breiman, 2001). Breiman argued that much of traditional academic statistics focused on data models and inference under an assumed “true” model, whereas a different culture – exemplified by machine learning – focused on algorithmic models aimed at predictive accuracy. He urged statisticians to embrace this algorithmic approach for complex problems where the goal is often prediction or discovery, not estimating a pre-specified parameter. The current wave of AI in pharma is a testament to Breiman’s point: many breakthroughs (like protein folding or de novo molecule generation) are essentially large-scale prediction problems where flexible algorithms trump analytical formulas. Statisticians who adapt to this mindset – valuing predictive performance and computational experimentation alongside traditional inference – can substantially broaden their impact.
In summary, the rise of AI methods is pushing the boundaries of what quantitative scientists can do in pharmaceutical research. Statisticians equipped with strong coding abilities and mathematical depth are in an excellent position to drive these innovations. They can develop new algorithms, rigorously evaluate AI models, and ensure that these methods are applied soundly. By venturing beyond the confines of traditional statistical methodology – while still upholding standards of rigor and clarity – statisticians can become key players in cutting-edge domains like AI-driven drug discovery, precision medicine, and digital health. Their contributions will complement those of domain specialists, blending data-centric problem-solving with scientific insight to accelerate pharmaceutical progress.
4.3 Broadening Responsibilities and Impact of Statisticians Across the Value Chain
The role of statisticians in the pharmaceutical industry has expanded dramatically in recent years, evolving from a narrow focus on clinical trials to a broad involvement across the entire drug discovery and development lifecycle. Historically, a pharmaceutical statistician’s influence was largely confined to Phase II/III clinical development: designing trials, analyzing efficacy and safety data, and supporting regulatory submissions. Today, statisticians are increasingly embedded in cross-functional teams from early discovery and preclinical research, through manufacturing and quality control, all the way to post-marketing surveillance and health economics. This expansion is driven by the growing recognition that the statistician’s core skill set – quantitative reasoning, experimental design, data interpretation, and uncertainty quantification – is invaluable at every stage where data are generated and decisions are made. Enas and Andersen (2001) presciently noted that statisticians are uniquely trained to improve decision-making “from the very early stages of drug discovery until patients, payers and regulators are satisfied,” essentially advocating for statisticians to become key contributors in all phases of the enterprise. Two decades later, this vision is being realized. Statisticians now collaborate with chemists and biologists in discovery research, optimize processes with engineers in CMC (Chemistry, Manufacturing, and Controls) groups, and partner with physicians and epidemiologists to assess real-world outcomes post-approval. The modern pharmaceutical statistician often serves as a quantitative strategist, not only ensuring analyses are sound but also guiding what data to collect, how to collect it efficiently, and how to interpret it to drive business and scientific decisions.
Beyond drug discovery, the role of statisticians is evolving rapidly as AI technologies become integral to other parts of the value chain. In clinical development, statisticians are leveraging AI to enhance trial design and execution. Natural language processing algorithms are being used to analyze study protocols and electronic health records (Jin et al., 2024). Imaging technologies and digital biomarkers are being developed to support enrollment, particularly in complex oncology trials, by quickly finding eligible patients across extensive health networks. Moreover, AI is being used to simulate or augment control arms in trials through the creation of “digital twins” – virtual patient avatars generated from historical data. This approach augments the data available for analysis, making trials more efficient and ethically palatable. Statisticians play a crucial role in validating these AI models to ensure they accurately represent patient outcomes and maintain scientific and regulatory rigor (Davi et al., 2020; Thorlund et al., 2022). In the manufacturing and supply chain sectors, AI is driving the transition to Pharma 4.0, a new paradigm of smart, data-driven production. Statisticians are collaborating with engineers to implement AI-based process monitoring and control systems. Machine learning models analyze process development data to identify optimal parameters and scaling conditions, accelerating the development process. AI-driven advanced process control systems can make real-time adjustments during production, ensuring critical quality attributes remain within target ranges. The FDA has acknowledged the potential of AI in drug manufacturing, highlighting its ability to reduce development time and waste through improved process design (U.S. Food and Drug Administration, 2023a). Statisticians are essential in deploying these advancements, from designing experiments to train AI models to validating their performance and integrating statistical process control with AI technologies.
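As one hedged illustration of how statistical process control might sit alongside a predictive model, the sketch below computes Shewhart individuals-chart limits from a baseline series and flags later points that fall outside them. The numbers are invented; in practice the monitored series could be a quality attribute itself or the residual between a measured value and a model prediction. The constant 1.128 is the standard d2 factor for moving ranges of size two.

```python
import statistics


def individuals_chart_limits(baseline):
    """Return (LCL, center, UCL) for an individuals chart from in-control data."""
    center = statistics.mean(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    # Estimate sigma from the average moving range (d2 = 1.128 for n = 2).
    sigma_hat = statistics.mean(moving_ranges) / 1.128
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat


def out_of_control(values, lcl, ucl):
    """Indices of observations falling outside the control limits."""
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]


# Hypothetical baseline measurements of a quality attribute (arbitrary units).
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 10.0]
lcl, center, ucl = individuals_chart_limits(baseline)

# Hypothetical new production measurements to monitor.
new_values = [10.1, 9.9, 11.2, 10.0]
flagged = out_of_control(new_values, lcl, ucl)
print(f"limits: ({lcl:.3f}, {ucl:.3f}), flagged indices: {flagged}")
```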
In the realm of commercialization, AI and advanced analytics are empowering statisticians to drive better business decisions (Huanbutta et al., 2024). AI algorithms are used for demand forecasting and inventory optimization, analyzing historical sales and external data to predict drug demand accurately (Dong et al., 2009; Liu et al., 2023a). This helps reduce stockouts and oversupply, optimizing the pharma supply chain. In marketing, AI tools assist in segmenting healthcare providers and patients, tailoring outreach to those most likely to benefit. Predictive analytics guide field sales strategies by integrating data on prescribing habits and patient demographics, enhancing targeting precision (Dong et al., 2009; Manchanda and Chintagunta, 2004). Statisticians collaborate with AI to develop pricing strategies, using machine learning models to analyze market data and recommend optimal pricing (Fazekas et al., 2024). These advancements demonstrate that data-driven decision-making is becoming the norm in pharma, with statisticians translating AI-driven analytics into actionable business insights.
Finally, as the question of data collection for AI arises, statisticians will influence future data strategies too (ICH E9(R1) Expert Working Group, 2019). Traditionally, pharma’s data collection in trials was solely focused on regulatory approval of the molecule at hand. In the future, we envision that companies will deliberately collect data not just to advance the current product, but also to improve the next generation of AI models that assist in drug design and development. This might entail, for instance, designing clinical studies that also create high-quality datasets for machine learning (such as rich biomarker panels or digital sensor data), recognizing that these datasets could inform many programs beyond the original trial. Statisticians will be key in planning such dual-purpose studies, balancing immediate needs with the long-term value of data. Techniques like adaptive sampling and active learning – where the data collected is dynamically guided by algorithmic learning needs – could become part of trial design considerations. By advising on how to gather the most informative data for both human decision-making and machine learning, statisticians ensure that pharmaceutical data resources continuously feed the cycle of innovation.
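To illustrate the idea of letting algorithmic learning needs guide data collection, here is a minimal sketch of uncertainty sampling, one simple form of active learning. It assumes a model has already produced a predicted probability for each candidate measurement; the candidate names and probabilities are hypothetical, and a real trial would weigh such suggestions against ethical, operational, and regulatory constraints.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) prediction; 0 when the model is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_most_uncertain(predicted_probs, budget):
    """Pick the `budget` candidates whose predictions are closest to a coin flip."""
    ranked = sorted(
        predicted_probs,
        key=lambda name: binary_entropy(predicted_probs[name]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for four candidate measurements.
predicted_probs = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.40}
chosen = select_most_uncertain(predicted_probs, budget=2)
print("collect next:", chosen)
```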
In summary, AI is expanding what scientists can do in pharmaceutical research (Vamathevan et al., 2019; Topol, 2019). Solving today’s tough problems often requires complex models and heavy computation. Statisticians with strong coding and math skills are well-positioned to lead these innovations (Cruz Rivera et al., 2020; Liu et al., 2020; Collins et al., 2024). They can create new algorithms, evaluate AI models, and ensure these methods are used correctly. By moving beyond traditional statistics, statisticians can play key roles in AI-driven drug discovery, precision medicine, digital health, manufacturing, and commercialization (Vamathevan et al., 2019; Topol, 2019; Helleckes et al., 2023). Their work will complement that of domain experts, combining data-driven problem-solving with scientific insight to advance pharmaceutical research (Topol, 2019).
5 Conclusion
The role of statisticians in the pharmaceutical industry has undergone a remarkable evolution, expanding in scope, influence, and importance over the past 50 years. From the early days following the 1962 FDA reforms – when a handful of statisticians were brought in to ensure new drugs had statistically sound evidence of efficacy – to the present day where statisticians are at the forefront of AI-driven drug development, the transformation is profound. We have seen how historical milestones, such as regulatory changes (e.g. the 1980s NDA guidelines, ICH E9) and technological advances (the computing revolution, big data), set the stage for statisticians to move from the periphery to the core of decision-making in pharma.
Driving this evolution are key factors such as advances in statistical computing, the proliferation of new data types that demanded novel analytical methods, and the willingness of industry and regulators to embrace innovative statistical designs that can make drug development more efficient. The recent surge of artificial intelligence has further catalyzed a paradigm shift, positioning statisticians as vital contributors to data science initiatives that span discovery through post-market use. This breadth of impact across the value chain – from molecule to market – exemplifies how the statistician’s remit has grown far beyond traditional boundaries.
Crucially, statisticians have not just grown in number or technical capability, but also in their leadership and collaborative roles. They are increasingly recognized as strategic partners who bring a data-driven lens to interdisciplinary teams. Whether it’s guiding a cross-functional team through the design of an adaptive platform trial, negotiating the use of a novel surrogate endpoint with regulators, or explaining to commercial colleagues how an observational study supports a product’s value proposition, statisticians are influencing critical decisions at every step (Woodcock and LaVange, 2017; Prentice, 1989; U.S. Food and Drug Administration, 1992, 2018; Franklin et al., 2023). The statistician often serves as the bridge between the company and regulators on complex methodological issues, an intermediary role that has smoothed the adoption of things like complex innovative trial designs and real-world evidence considerations.
Looking to the future, the trajectory points toward statisticians continuing to be agents of innovation in pharmaceutical R&D. With the ongoing integration of AI, the growth of personalized medicine, and the increasing reliance on real-world data, there will be even greater demand for statisticians who can blend quantitative rigor with creativity and strategic thinking (Collins and Varmus, 2015; U.S. Food and Drug Administration, 2024b; Sadybekov and Katritch, 2023). We can anticipate statisticians playing leading roles in quantitative decision-making at every step of pharmaceutical research. Realizing these opportunities will require concerted effort in training and professional development. As discussed, academia and industry have to work together to equip statisticians with a modern skill set that includes evolving technical skills in advanced modeling, machine learning, statistical computing (in particular high-performance parallel computing), algorithms, mathematical optimization, and software engineering basics (Pitman et al., 2019). The curriculum adjustments and competency development recommended in this paper are intended to future-proof the profession.
In conclusion, the evolving role of statisticians in pharma is a success story of how a profession can adapt and expand to meet new challenges. Statisticians have leveraged advanced analytics and AI not to replace their traditional work, but to augment and elevate it, driving better decisions and outcomes. They have transitioned from behind-the-scenes advisers to frontline leaders ensuring that evidence and data quality remain the bedrock of pharmaceutical innovation. The fruits of this evolution are evident: more efficient trials, more robust evidence of drug benefits and risks, and ultimately, a more informed approach to bringing therapies to the patients who are waiting for us. Our evolving role will continue to be characterized by leadership, innovation, and an unwavering commitment to using data for the betterment of public health.
References
1000 Genomes Project Consortium (2015), A global reference for human genetic variation, Nature, 526, 68–74.
Anishchenko, I., Pellock, S. J., Chidyausiku, T. M., Ramelot, T. A., Ovchinnikov, S., Hao, J., Bafna, K., Norn, C., Kang, A., Bera, A. K., DiMaio, F., Carter, L., Chow, C. M., Montelione, G. T., and Baker, D. (2021), De novo protein design by deep network hallucination, Nature.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. (2000), The Protein Data Bank, Nucleic Acids Research, 28, 235–242.
Bothwell, L. E. and Podolsky, S. H. (2016), The emergence of the randomized, controlled trial, New England Journal of Medicine, 375, 501–504.
Breiman, L. (2001), Statistical Modeling: The Two Cultures, Statistical Science, 16, 199–231.
Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. (2017), Geometric Deep Learning: Going Beyond Euclidean Data, IEEE Signal Processing Magazine, 34, 18–42.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., et al. (2020), Language Models are Few-Shot Learners, in Advances in Neural Information Processing Systems 33, https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Chalmers, I. (2003), Fisher and Bradford Hill: theory and pragmatism?, International journal of epidemiology, 32, 922–924.
Chambers, J. M. (1998), Programming with data: A guide to the S language, Springer Science & Business Media.
Chen, X., He, R., Chen, X., Jiang, L., and Wang, F. (2023), Optimizing dose-schedule regimens with Bayesian adaptive designs: opportunities and challenges, Frontiers in Pharmacology, 14, 1261312.
Chuang-Stein, C., Bain, R., Branson, M., Burton, C., Hoseyni, C., Rockhold, F., Ruberg, S., and Zhang, J. (2010a), Statisticians in the pharmaceutical industry: the 21st century, Statistics in Biopharmaceutical Research, 2, 145–152.
Chuang-Stein, C., Fritsch, K., Smith, B., and Zhang, J. (2010b), The role of statisticians in the pharmaceutical industry in the 21st century, Statistics in Biopharmaceutical Research, 2, 3–8.
Collins, F. S. and Varmus, H. (2015), A New Initiative on Precision Medicine, New England Journal of Medicine, 372, 793–795.
Collins, G. S., Moons, K. G. M., Dhiman, P., Riley, R. D., Beam, A. L., Van Calster, B., Ghassemi, M., Liu, X., Reitsma, J. B., van Smeden, M., et al. (2024), TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods, BMJ, 385, e078378.
Concato, J. and Corrigan-Curay, J. (2022), Real-World Evidence – Where Are We Now?, New England Journal of Medicine, 386, 1680–1682.
Crofton, J. (2006), The MRC randomized trial of streptomycin and its legacy: a view from the clinical front line, Journal of the Royal Society of Medicine, 99, 531–534.
Cruz Rivera, S., Liu, X., Chan, A.-W., Denniston, A. K., Calvert, M. J., Moher, D., et al. (2020), Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension, Nature Medicine, 26, 1351–1363.
Davi, R., Mahendraratnam, N., Chatterjee, A., Dawson, C. J., and Sherman, R. (2020), Informing single-arm clinical trials with external controls, Nature Reviews Drug Discovery, 19, 661–662.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019), “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of NAACL-HLT 2019, pp. 4171–4186.
Dong, X., Manchanda, P., and Chintagunta, P. K. (2009), Quantifying the benefits of individual-level targeting in the presence of firm strategic behavior, Journal of Marketing Research, 46, 207–221.
Efron, B. (1979), Bootstrap Methods: Another Look at the Jackknife, The Annals of Statistics, 7, 1–26.
Enas, G. G. and Andersen, J. S. (2001), Enhancing the value delivered by the statistician throughout drug discovery and development: putting statistical science into regulated pharmaceutical innovation, Statistics in Medicine, 20, 2697–2708.
Fazekas, M., Veljanov, Z., and de Oliveira, A. B. (2024), Predicting pharmaceutical prices: advances based on purchase-level data and machine learning, BMC Public Health, 24, 1888.
Food and Drug Administration (FDA) (2021), CID Case Study: A Study in Patients with Systemic Lupus Erythematosus, Technical report, FDA Complex Innovative Trial Design (CID) Pilot Program, https://www.fda.gov/media/155404/download.
Franklin, J. M., Patorno, E., Desai, R. J., et al. (2023), Emulation of Randomized Clinical Trials With Nonrandomized Database Analyses: Results From 32 Clinical Trials (RCT-DUPLICATE), JAMA, 329, 1375–1385.
Fuchs, F. B., Worrall, D., Fischer, V., and Welling, M. (2020), SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks, in Advances in Neural Information Processing Systems 33, https://proceedings.neurips.cc/.
Garcia Satorras, V., Hoogeboom, E., and Welling, M. (2021), E(n) Equivariant Graph Neural Networks, in Proceedings of the 38th International Conference on Machine Learning, vol. 139 of PMLR, pp. 9323–9332.
Gelfand, A. E. and Smith, A. F. M. (1990), Sampling-Based Approaches to Calculating Marginal Densities, Journal of the American Statistical Association, 85, 398–409.
Goligher, E. C. et al. (2024), Bayesian Statistics for Randomised Trials: A Primer for Clinicians, The Lancet, 403, 483–495.
Goodrich, W. W. (1963), FDA’s Regulation under the Kefauver-Harris Drug Amendments of 1962, Food Drug Cosm. LJ, 18, 561.
Guideline, I. H. T. (1999), Statistical principles for clinical trials. International conference on harmonisation E9 expert working group, Stat. Med., 18, 1905–1942.
Harrer, S., Shah, P., Antony, B. J., and Hu, J. (2019), Artificial Intelligence for Clinical Trial Design, Trends in Pharmacological Sciences, 40, 577–591.
Helleckes, S., Adkins, B., Worth, C., Hackl, M., Ajaz, S. A., Labrador, A., Maurer, M., Ogunnaike, B. A., Åkesson, J., Forbes, P. T., and Abu-Absi, S. F. (2023), Machine learning in bioprocess development: from promise to practice, Trends in Biotechnology, 41, 817–835.
Huanbutta, K., Burapapadh, K., Kraisit, P., Sriamornsak, P., Ganokratanaa, T., Suwanpitak, K., and Sangnim, T. (2024), Artificial intelligence-driven pharmaceutical industry: A paradigm shift in drug discovery, formulation development, manufacturing, quality control, and post-market surveillance, European Journal of Pharmaceutical Sciences, 203, 106938.
Hutson, M. (2024), Cutting to the chase, Nature, 627, S2–S5.
Ibrahim, J. G., Chen, M.-H., Xia, H. A., and Liu, T. (2012), Bayesian meta-experimental design: evaluating cardiovascular risk in new antidiabetic therapies to treat type 2 diabetes, Biometrics, 68, 578–586.
ICH E9(R1) Expert Working Group (2019), Addendum on Estimands and Sensitivity Analysis in Clinical Trials to the Guideline on Statistical Principles for Clinical Trials E9(R1), E9-R1_Step4_Guideline_2019_1203.pdf.
Ihaka, R. and Gentleman, R. (1996), R: a language for data analysis and graphics, Journal of computational and graphical statistics, 5, 299–314.
International Council for Harmonisation (2009), ICH Q8(R2): Pharmaceutical Development, Step 4 Guideline, https://database.ich.org/sites/default/files/Q8_R2_Guideline.pdf.
International Council for Harmonisation (2020), ICH Q9(R1): Quality Risk Management, Adopted Guideline, https://www.ema.europa.eu/en/documents/scientific-guideline/ich-q9-r1-quality-risk-management_en.pdf.
International Council for Harmonisation (2025), ICH E20: Adaptive Designs for Clinical Trials (Step 2b Draft Guideline), https://www.ema.europa.eu/en/documents/scientific-guideline/ich-e20-guideline-adaptive-designs-clinical-trials-step-2b_en.pdf.
International Human Genome Sequencing Consortium (2001), Initial sequencing and analysis of the human genome, Nature, 409, 860–921.
Jin, Q., Wang, Z., Floudas, C. S., et al. (2024), Matching patients to clinical trials with large language models, AI in Precision Oncology.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., et al. (2021), Highly accurate protein structure prediction with AlphaFold, Nature, 596, 583–589.
Lind, J. (1753), A treatise of the scurvy in three parts, Kincaid.
Liu, P., Wang, Z., Liu, N., and Peres, M. A. (2023a), A scoping review of the clinical application of machine learning in data-driven population segmentation analysis, Journal of the American Medical Informatics Association, 30, 1573–1582.
Liu, Q., Huang, R., Hsieh, J., Zhu, H., Tiwari, M., Liu, G., Jean, D., ElZarrad, M. K., Fakhouri, T., Berman, S., Dunn, B., Diamond, M. C., and Huang, S.-M. (2023b), Landscape Analysis of the Application of Artificial Intelligence and Machine Learning in Regulatory Submissions for Drug Development From 2016 to 2021, Clinical Pharmacology & Therapeutics, 113, 771–774.
Liu, R., Rizzo, S., Whipple, S., Pal, N., Lopez Pineda, A., Lu, M., Arnieri, B., Lu, Y., Capra, W., Copping, R., and Zou, J. (2021), Evaluating eligibility criteria of oncology trials using real-world data and AI, Nature, 592, 629–633.
Liu, X., Cruz Rivera, S., Moher, D., Calvert, M., Denniston, A., and the SPIRIT-AI and CONSORT-AI Working Group (2020), Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension, Nature Medicine, 26, 1364–1374.
Manchanda, P. and Chintagunta, P. K. (2004), Responsiveness of physician prescription behavior to salesforce effort: An individual-level analysis, Marketing Letters, 15, 129–145.
Meadows, M. (2006), Promoting safe and effective drugs for 100 years.
Moore, G. E. (1965), Cramming More Components onto Integrated Circuits, Electronics, 38, 114–117, https://archive.computerhistory.org/resources/text/Intel/Moore/Intel_Moore_1965_Article_08_19_65.pdf.
Morris, J. S. and Baladandayuthapani, V. (2017), Statistical contributions to bioinformatics: Design, modelling, structure learning and integration, Statistical modelling, 17, 245–289.
Pallmann, P., Bedding, A. W., Choodari-Oskooei, B., Dimairo, M., Flight, L., Hampson, L. V., Holmes, J., Mander, A. P., Sydes, M. R., Villar, S. S., and Wason, J. (2018), Adaptive designs in clinical trials: why use, what is needed and how to proceed, Journal of the Royal Statistical Society: Series A (Statistics in Society), 181, 403–410.
Pitman, A., Sverdlov, O., and Pearce, L. B. (2019), Mathematical and Statistical Skills in the Biopharmaceutical Industry: A Pragmatic Approach, Chapman and Hall/CRC.
Polack, F. P., Thomas, S. J., Kitchin, N., Absalon, J., Gurtman, A., Lockhart, S., Perez, J. L., Perez Marc, G., Moreira, E. D., Zerbini, C., et al. (2020), Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine, New England journal of medicine, 383, 2603–2615.
Prentice, R. L. (1989), Surrogate Endpoints in Clinical Trials: Definition and Operational Criteria, Statistics in Medicine, 8, 431–440.
Rockhold, F. W. (2000), Strategic use of statistical thinking in drug development, Statistics in medicine, 19, 3211–3217.
Rodda, B., Millard, S. P., and Krause, A. (2001), Statistics and the Drug Development Process, in Applied Statistics in the Pharmaceutical Industry: With Case Studies Using S-Plus, Springer, pp. 3–14.
Sadybekov, A. V. and Katritch, V. (2023), Computational Approaches Streamlining Drug Discovery, Nature, 616, 673–685.
Segreti, A. C., Leung, H. M., Koch, G. G., Davis, R. L., Mohberg, N. R., and Peace, K. E. (2001), Biopharmaceutical Statistics in a Pharmaceutical Regulated Environment: Past, Present, and Future, Journal of Biopharmaceutical Statistics, 11, 347–372.
Senn, S. S. (2021), Statistical Issues in Drug Development, Hoboken, NJ: John Wiley & Sons, 3rd ed.
The UniProt Consortium (2023), UniProt: the Universal Protein Knowledgebase in 2023, Nucleic Acids Research, 51, D523–D531.
Thorlund, K., Dron, L., Park, J., and Mills, E. J. (2022), External control arms in oncology: current use and future directions, Annals of Oncology, 33, 577–585.
Topol, E. J. (2019), High-performance medicine: the convergence of human and artificial intelligence, Nature Medicine, 25, 44–56.
U.S. Food and Drug Administration (1985), New Drug and Antibiotic Regulations; Final Rule (NDA Rewrite), Federal Register 50 FR 7452 (February 22, 1985), https://www.federalregister.gov/citation/50-FR-7452.
U.S. Food and Drug Administration (1988), Guideline for the Format and Content of the Clinical and Statistical Sections of an Application, https://www.fda.gov/media/71436/download.
U.S. Food and Drug Administration (1992), New Drug, Antibiotic, and Biological Drug Product Regulations; Accelerated Approval, Federal Register 57 FR 58942 (21 CFR 314 Subpart H; 21 CFR 601 Subpart E), https://www.govinfo.gov/content/pkg/FR-1992-12-11/pdf/FR-1992-12-11.pdf.
U.S. Food and Drug Administration (2004), Challenge and opportunity on the critical path to new medical products, http://www.fda.gov/oc/initiatives/criticalpath/.
U.S. Food and Drug Administration (2010), Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials, https://www.federalregister.gov/documents/2010/02/08/2010-2596/guidance-for-industry-and-food-and-drug-administration-guidance-for-the-use-of-bayes
U.S. Food and Drug Administration (2018), Framework for FDA’s Real-World Evidence Program, https://www.fda.gov/media/120060/download.
U.S. Food and Drug Administration (2019), Adaptive Designs for Clinical Trials of Drugs and Biologics: Guidance for Industry, FDA Guidance, https://www.fda.gov/media/78495/download.
U.S. Food and Drug Administration (2021a), FDA Approves New Use of Prograf (tacrolimus) Based on Real-World Evidence, https://www.fda.gov/drugs/news-events-human-drugs/fda-approves-new-use-transplant-drug-based-real-world-evidence.
U.S. Food and Drug Administration (2021b), Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products (Draft Guidance), https://www.federalregister.gov/documents/2021/09/30/2021-21315/real-world-data-assessing-electronic-health-records-and-medical-claims-data-to-suppo
U.S. Food and Drug Administration (2022a), Master Protocols: Efficient Clinical Trial Design Strategies to Expedite Development of Oncology Drugs and Biologics (Guidance for Industry), https://www.fda.gov/regulatory-information/search-fda-guidance-documents/master-protocols-efficient-clinical-trial-design-strategies-expedite-development-onc
U.S. Food and Drug Administration (2022b), PDUFA VII (FY 2023–2027): Overview of Commitments and Implementation, https://www.fda.gov/drugs/cder-small-business-industry-assistance-sbia/prescription-drug-user-fee-amendments-pdufa.
U.S. Food and Drug Administration (2023a), Artificial Intelligence in Drug Manufacturing: Discussion Paper, Tech. rep., FDA, https://www.fda.gov/media/165743/download.
U.S. Food and Drug Administration (2023b), Considerations for the Use of Real-World Data and Real-World Evidence to Support Regulatory Decision-Making for Drug and Biological Products: Guidance for Industry, FDA Guidance, https://www.fda.gov/media/171667/download.
U.S. Food and Drug Administration (2023c), Interacting with the FDA on Complex Innovative Trial Designs for Drugs and Biological Products (Guidance/Program Resources), https://www.fda.gov/drugs/development-approval-process-drugs/complex-innovative-trial-designs.
U.S. Food and Drug Administration (2023d), Real-World Evidence Submissions to the Center for Drug Evaluation and Research, https://www.fda.gov/science-research/real-world-evidence/real-world-evidence-submissions-center-drug-evaluation-and-research.
U.S. Food and Drug Administration (2024a), Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products (Final Guidance), https://www.fda.gov/media/152503/download.
U.S. Food and Drug Administration (2024b), Real-World Evidence: Considerations for the Use of Real-World Data in the Assessment of the Effectiveness of Drugs and Biological Products: Guidance for Industry, Tech. rep., CDER/CBER, https://www.fda.gov/media/180013/download.
U.S. Food and Drug Administration (2025a), Artificial Intelligence for Drug Development, https://www.fda.gov/about-fda/center-drug-evaluation-and-research-cder/artificial-intelligence-drug-development.
U.S. Food and Drug Administration (2025b), Considerations for the Use of Artificial Intelligence to Support Regulatory Decision-Making for Drug and Biological Products (Draft Guidance), https://www.fda.gov/media/184830/download.
U.S. Food and Drug Administration, Center for Drug Evaluation and Research, Center for Clinical Trial Innovation (C3TI) (2025), Bayesian Statistical Analysis (BSA) Demonstration Project, https://www.fda.gov/about-fda/cder-center-clinical-trial-innovation-c3ti/bayesian-statistical-analysis-bsa-demonstration-project.
Vamathevan, J., Clark, D., Czodrowski, P., Dunham, I., Ferran, E., Lee, G., Li, B., Madabhushi, A., Shah, P., Spitzer, M., et al. (2019), Applications of machine learning in drug discovery and development, Nature Reviews Drug Discovery, 18, 463–477.
Wang, Y., Fu, H., Kulkarni, P., and Kaiser, C. (2013), Evaluating and utilizing probability of study success in clinical development, Clinical Trials, 10, 407–413.
Watson, J. L., Juergens, D., Bennet, N. R., Trippe, B. L., Yim, J., Tischer, D., et al., and Baker, D. (2023), De novo design of protein structure and function with RFdiffusion, Nature.
Woodcock, J. and LaVange, L. M. (2017), Master Protocols to Study Multiple Therapies, Multiple Diseases, or Both, The New England Journal of Medicine, 377, 62–70.
Wouters, O. J., McKee, M., and Luyten, J. (2020), Estimated Research and Development Investment Needed to Bring a New Medicine to Market, 2009–2018, JAMA, 323, 844–853.
Xia, H. A., Ma, H., and Carlin, B. P. (2011), Bayesian hierarchical modeling for detecting safety signals in clinical trials, Journal of Biopharmaceutical Statistics, 21, 1006–1029.
Xia, H. A. and Price, K. L. (2014), Bayesian Applications for Drug Safety Evaluation, Quantitative Evaluation of Safety in Drug Development: Design, Analysis and Reporting, 251.


