Highlights:
AI will transform many aspects of health care;
Some applications of AI in health care research, including natural language processing, need to be validated before they are widely applied;
Creation of credible real-world evidence regarding the effectiveness or comparative effectiveness of treatments from real-world data requires pre-specification of hypotheses in a protocol, fit-for-purpose real-world data sources, transparency in study design, execution, and reporting, and the embedding of the protocol within a causal inference framework.
In his TED Talk (https://drerictopol.com/can-ai-catch-what-doctors-miss/), Eric Topol opined that “AI will have the most transformative impact in the history of medicine.” Among the examples he discussed was the ability to diagnose apparently unrelated medical conditions from a simple EKG. He also discussed how AI can augment the analysis of retinal scans, colonoscopies, and x-rays to identify key findings that might be missed when reviewed solely by human health care practitioners.
As for generative AI, its many potential benefits include providing automated conversational responses to patient inquiries, which could eventually usher in an era of facilitated self-service for simple conditions (1) and may prove more empathetic and precise than physician communication. (2)
Much current work focuses on using AI to simplify the searching and summarization of the medical literature, as well as on developing AI systems that will automatically record, summarize, and chart patient encounters, freeing the clinician to focus on the patient rather than on completing the electronic medical record. All of this may be informed by the potential to automate knowledge/data mining of real-world data (RWD) to inform research, daily practice, the design of health benefits, and myriad other related activities.
If realized and validated, these advances promise a brave new world of medical practice. But they come with cautions. Generative AI using large language models (LLMs) currently presents challenges, including variability of results, hallucinations (e.g., irrelevant or inaccurate responses), and data security risks. Regulatory agencies around the world, including the FDA (3,4) and EMA (5), have started publishing their thoughts on how to use AI safely and avoid potential pitfalls.
With that as background, let us focus on the use of AI as a tool to augment and speed up the analysis of RWD, which is fundamental to the creation of a “Learning Health Care System”: a system in which the divide between research and clinical practice is narrowed such that the data created by the health system can be routinely (or continuously) analyzed to improve the effectiveness and efficiency of health care delivery and improve outcomes for patients. In a sense, the system can “learn by doing” instead of only learning through formal studies.
For this to happen, it must be shown that AI is at least as accurate as, if not more accurate than, traditional data-analytic approaches, and this will have to be convincingly demonstrated to regulatory bodies, payers, and providers. A recent review published in the New England Journal of Medicine (6) compares traditional statistics with the use of AI. Rather than summarizing, it is perhaps better to quote the article directly:
“Traditional statistical modeling uses careful hands-on selection of measurements and data features to include in an analysis — for example, which covariates to include in a regression model — as well as any transformation or standardization of measurements. Semiautomated data-reduction techniques such as random forests and forward- or backward selection stepwise regression have assisted statisticians in this hands-on selection for decades. Modeling assumptions and features are typically explicit, and the dimensionality of the model, as quantified by the number of parameters, is usually known. Although this approach uses expert judgment to provide high-quality manual analysis, it has two potential deficiencies. First, it cannot be scaled to very large data sets — for instance, millions of images. Second, the assumption is that the statistician either knows or is able to search for the most appropriate set of features or measurements to include in the analysis.
Arguably the most impressive and distinguishing aspect of AI is its automated ability to search and extract arbitrary, complex, task-oriented features from data — so-called feature representation learning. Features are algorithmically engineered from data during a training phase in order to uncover data transformations that are correct for the learning task. Optimality is measured by means of an “objective function” quantifying how well the AI model is performing the task at hand. AI algorithms largely remove the need for analysts to prespecify features for prediction or manually curate transformations of variables. These attributes are particularly beneficial in large, complex data domains such as image analysis, genomics, or modeling of electronic health records. AI models can search through potentially billions of nonlinear covariate transformations to reduce a large number of variables to a smaller set of task-adapted features. Moreover, somewhat paradoxically, increasing the complexity of the AI model through additional parameters, which occurs in deep learning, only helps the AI model in its search for richer internal feature sets, provided training methods are suitably tailored.”
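The contrast the quoted passage draws, an analyst prespecifying a covariate transformation versus an algorithm searching candidate transformations against an objective function, can be illustrated with a deliberately tiny sketch. The data, the menu of transformations, and the objective below are all invented for illustration; real feature representation learning operates over vastly larger search spaces.

```python
import math

def fit_slope(xs, ys):
    """Least-squares slope through the origin: beta = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mse(xs, ys, beta):
    """Objective function: mean squared error of the fitted line."""
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# A tiny, invented menu of candidate feature transformations.
TRANSFORMS = {
    "identity": lambda x: x,
    "square": lambda x: x ** 2,
    "log": lambda x: math.log(x),
}

def best_transform(xs, ys):
    """Search the menu; return the transformation minimizing the objective."""
    scores = {}
    for name, f in TRANSFORMS.items():
        fx = [f(x) for x in xs]
        scores[name] = mse(fx, ys, fit_slope(fx, ys))
    return min(scores, key=scores.get)

# The outcome is generated from the *square* of the covariate; the search
# recovers "square" without an analyst prespecifying it.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0 * x ** 2 for x in xs]
print(best_transform(xs, ys))  # square
```

Deep learning replaces this three-item menu with billions of learned nonlinear transformations, but the logic, candidate features scored by an objective function rather than chosen by hand, is the same.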
Thinking more specifically about the current state of the art in the analysis of RWD, composed predominantly of administrative claims and electronic health record data, has AI really proven to be an advance? The answer thus far is a qualified “No”: current data sources, even though they include millions of individual patients, have a relatively limited number of features. Thus, they can be readily analyzed by traditional methods, and while AI has been shown to outperform traditional methods in a number of examples, the differences have been modest. This may well change in the future as more and more sources of RWD are linked together and the feature space is greatly expanded. These sources may include patient-reported information, the full spectrum of laboratory results including images, environmental data, socioeconomic data, and data from sensors and wearables. As the number of features dramatically increases, traditional methods will not scale. However, AI will still need to address its “black box” problem: the results must be explicable if they are to be accepted by health practitioners and regulators.
One of the more common applications of AI in RWD analysis is “natural language processing” of notes in electronic health records. The vast majority of data in EHRs are not coded, so transforming physician and other narratives into coded data is of paramount importance. The issue here is that data source holders offer little transparency about how this is done, and there is no consensus on standards for doing so. At present, review of notes by qualified human personnel is considered the gold standard, but such a process is time consuming and resource intensive. Significant progress in this area will be required.
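To make concrete why coding free text is hard, consider a deliberately naive dictionary-matching sketch. The term-to-code mapping is invented for illustration (the codes are real ICD-10 categories, but no real system is this simple). Note that the explicitly negated mention of atrial fibrillation is coded anyway; handling negation, abbreviations, and context is exactly where naive matching fails and why human review remains the gold standard.

```python
import re

# Invented, oversimplified term-to-code dictionary for illustration only.
TERM_TO_CODE = {
    "type 2 diabetes": "E11",      # ICD-10: type 2 diabetes mellitus
    "hypertension": "I10",         # ICD-10: essential (primary) hypertension
    "atrial fibrillation": "I48",  # ICD-10: atrial fibrillation and flutter
}

def code_note(note):
    """Return the ICD-10 category codes whose terms appear in the note."""
    text = note.lower()
    return sorted(
        code for term, code in TERM_TO_CODE.items()
        if re.search(r"\b" + re.escape(term) + r"\b", text)
    )

note = ("Patient with long-standing hypertension and Type 2 diabetes; "
        "no atrial fibrillation noted.")
# The negated "no atrial fibrillation" is wrongly coded as present.
print(code_note(note))  # ['E11', 'I10', 'I48']
```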
AI is also being used to create “synthetic data”. RWD sources must protect patients’ identities and personal health information. Routinely, this is accomplished by “de-identification” of the data (e.g., removing names, changing dates of birth, reporting only zip-code-level locations). Interest has grown in creating synthetic data that mimic RWD, i.e., data in which the statistical characteristics of patient subgroups are maintained while the specific values have been changed. Here AI has proven useful; however, while synthetic data can be more readily shared, their acceptability remains controversial.
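The underlying principle can be sketched with a deliberately crude generator and invented numbers: fit subgroup-level statistics, then sample new records from them, so that group statistics are preserved while no synthetic record corresponds to a real patient. Real synthetic-data systems use far richer models (e.g., generative adversarial networks); this sketch shows only the idea.

```python
import random
import statistics

# (subgroup, systolic blood pressure): entirely invented toy values.
real = [
    ("diabetic", 142), ("diabetic", 150), ("diabetic", 138),
    ("nondiabetic", 121), ("nondiabetic", 118), ("nondiabetic", 125),
]

def synthesize(records, n_per_group, seed=0):
    """Sample synthetic records from a normal fit to each subgroup."""
    rng = random.Random(seed)
    groups = {}
    for group, value in records:
        groups.setdefault(group, []).append(value)
    synthetic = []
    for group, values in groups.items():
        mu, sd = statistics.mean(values), statistics.stdev(values)
        synthetic += [(group, rng.gauss(mu, sd)) for _ in range(n_per_group)]
    return synthetic

syn = synthesize(real, n_per_group=1000)
diab = [v for g, v in syn if g == "diabetic"]
# The synthetic subgroup mean tracks the real subgroup mean (about 143),
# while no synthetic record matches a real patient.
print(round(statistics.mean(diab), 1))
```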
A separate, but critically important, consideration involves the emerging criteria for the production of credible real-world evidence (RWE) from RWD. The elements required include:
Fit-for-purpose RWD: Data sources must include the relevant patient populations in sufficient numbers, and must be accurate and complete enough to support hypothesis testing (i.e., of good quality). Currently available RWD sources are quite variable in quality, in regard to both accuracy and sparseness of data.
Applying Good Observational Clinical Research Practices: Good Clinical Practices (GCPs) are spelled out in detail for RCTs and are required if their results are to be accepted by regulatory agencies. No such GCPs exist for observational clinical research. Because existing RWD sources can facilitate multiple analyses that cherry-pick desired results, RWD studies intended to support regulatory decision-making must post pre-specified protocols and analysis plans on public websites and be fully transparent about how the data were acquired, curated, and analyzed. Ensuring transparency in the design and execution of RWD studies is essential to establishing their credibility and trustworthiness.
Employing a Causal Inference Framework: If RWD sources are employed to discover what treatment works and for whom, and how different treatment options compare, the design of these studies must be framed by a causal inference framework such as Target Trial Emulation (TTE). (7) Critical work by Miguel Hernan and colleagues has shown that errors in study design (e.g., misidentification of time zero, non-comparable lengths of follow-up) can lead to conclusions significantly at odds with a randomized controlled trial (RCT). (8) Employing the TTE framework has been demonstrated to enable replication of the results of a large number of RCTs, with remaining differences attributable to design differences and to how closely key variables in RWD sources can emulate those in the RCT. (9)
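The time-zero error can be made concrete with a toy calculation (all numbers invented). If follow-up for treated patients is clocked from diagnosis rather than from treatment initiation, the wait before treatment is "immortal" time: a patient had to survive it to be treated at all, so crediting it to the treated arm inflates apparent survival under treatment.

```python
from statistics import mean

# (treated, days from diagnosis to treatment, days from diagnosis to death)
# Invented toy values for illustration.
patients = [
    (True, 90, 400),
    (True, 60, 500),
    (False, None, 200),
    (False, None, 300),
]

def naive_survival(patients):
    """Biased: the clock starts at diagnosis for both arms, so the
    pre-treatment wait is credited to the treated arm."""
    treated = mean(d for t, w, d in patients if t)
    untreated = mean(d for t, w, d in patients if not t)
    return treated, untreated

def aligned_survival(patients):
    """Time zero for the treated arm is treatment initiation, so the
    immortal pre-treatment interval is excluded."""
    treated = mean(d - w for t, w, d in patients if t)
    untreated = mean(d for t, w, d in patients if not t)
    return treated, untreated

print(naive_survival(patients))    # treated mean 450 vs. untreated 250
print(aligned_survival(patients))  # treated mean 375 vs. untreated 250
```

The apparent treated-arm advantage shrinks from 200 to 125 days once time zero is aligned; in real data such misalignment can reverse a conclusion entirely, which is why TTE requires specifying time zero before analysis.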
In summary, the interplay of AI and RWD is a complicated space in which the credibility of the RWE derived will depend on a number of factors beyond statistical considerations. The use of RWE in medicine depends on its credibility and explicability. While AI holds much promise, its impact must await further developments in standards for fit-for-purpose data and in GCPs for observational studies. AI will also need to address its “black box” problem.
References
1. DA Asch, S Nicholson, ML Berger. Transforming Health Care Through Facilitated Self-Service. N Engl J Med 2019;380(20):1891-1893.
2. JW Ayers, A Poliak, M Dredze, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med. 2023;183(6):589–596. doi:10.1001/jamainternmed.2023.1838
3. U.S. Food and Drug Administration. Using Artificial Intelligence and Machine Learning in the Development of Drug and Biological Products. https://www.fda.gov/media/167973/download; Framework for Regulatory Advanced Manufacturing Evaluation (FRAME). https://www.fda.gov/about-fda/center-drug-evaluation-and-research-cder/cders-framework-regulatory-advanced-manufacturing-evaluation-frame-initiative
4. U.S. Food and Drug Administration. Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. https://www.fda.gov/media/145022/download
5. Heads of Medicines Agencies and European Medicines Agency. Guiding principles on the use of large language models in regulatory science and for medicines regulatory activities. August 29, 2024.
6. DJ Hunter, C Holmes. Where Medical Statistics Meets Artificial Intelligence. N Engl J Med 2023;389(13):1211-1219.
7. HJ Hansford, AG Cashin, MD Jones, et al. Development of the TrAnsparent ReportinG of observational studies Emulating a Target trial (TARGET) guideline. BMJ Open 2023;13:e074626. doi:10.1136/bmjopen-2023-074626
8. BA Dickerman, X García-Albéniz, RW Logan, et al. Avoidable flaws in observational analyses: an application to statins and cancer. Nat Med 2019;25:1601-1606. https://doi.org/10.1038/s41591-019-0597-x
9. R Heyard, L Held, S Schneeweiss. Design differences and variation in results between randomised trials and non-randomised emulations: meta-analysis of RCT-DUPLICATE data. BMJ Medicine 2024;3:e000709. doi:10.1136/bmjmed-2023-000709