AI-generated systematic reviews – are they possible?

AI-assisted research is an exciting prospect – no more laborious keyword searches, scanning abstracts, and organizing PDFs – but does it live up to the promise? I explored this question using Elicit’s Systematic Review tool. I found a Cochrane Review on a topic and then asked Elicit the same question, comparing the two on retrieval of relevant studies, selection of studies to include, extraction of data, assessment of risk of bias, synthesis of evidence, and assessment of evidence quality. The Elicit report reached qualitatively similar conclusions to the Cochrane review for some primary and secondary outcomes. An expert could use the Elicit report to reach a similar high-level understanding as they would have gotten from the Cochrane review, but the most important between-study comparisons were not made within the report. The tool was not able to assess included studies for risk of bias, pool data across studies, or assess evidence quality. Differences were also found in study selection and in the level of detail reported for key findings.

What is a systematic review?

According to Cochrane, “a systematic review attempts to identify, appraise and synthesize all the empirical evidence that meets pre-specified eligibility criteria to answer a specific research question. Researchers conducting systematic reviews use explicit, systematic methods that are selected with a view to minimizing bias, to produce more reliable findings to inform decision making.” This systematic approach lends itself well to replication of systematic reviews (SRs) via LLM, though some aspects of SRs are not currently available from any provider: pooling of data, assessment of heterogeneity, and assessment of bias. Cochrane outlines several standards of practice to ensure results of high enough quality not only to answer an individual’s question, but also to support changes in policy or clinical practice affecting thousands or millions of people. These standards cover each stage of the review process.

How are systematic reviews used?

An SR is written for an audience of clinicians, policymakers, researchers, and the general public, who use it to make informed decisions about the care of patients, to manage their own care, to set policy and clinical guidelines, and to inform future directions for research. A key point here is that the SR supports not only individual decisions, but changes in practice that could affect many more people than the person reading the review. A tool can provide an answer similar to the one reached through the SR process, and be useful to an individual, yet still be insufficient for use in policy or clinical practice, because the SR process was developed in consultation with its audience, who have collectively agreed that the process yields the best results. Unless and until the same thing happens with AI-generated SRs, they meet the bar for individual use but not for use in policy or clinical practice. In this replication, I compare a Cochrane SR with an Elicit report and discuss the differences.

Methods

A Cochrane SR on “Immune checkpoint inhibitors plus platinum‐based chemotherapy compared to platinum‐based chemotherapy with or without bevacizumab for first‐line treatment of older people with advanced non‐small cell lung cancer” was chosen for the replication. I chose this topic because it has been well studied in a systematic fashion, so there is likely to be a large amount of research available and a broad distribution of evidence quality. The information was extracted using a combination of manual effort and LLM assistance. Two different Elicit reports were then created: one from Elicit’s summary of the PDFs cited in the SR (Elicit-PDF), and one from Elicit’s search, screening, and summary of relevant content given the same question as the Cochrane SR (Elicit-catalog). To get the PDFs from the Cochrane SR, DOIs corresponding to the references in the SR were obtained by matching the text references with DOIs through Crossref’s Simple Text Query tool. PDFs associated with each DOI were obtained where freely available. Where the PDF was not freely available and the study was one of the 17 RCTs included in the Cochrane SR, the abstract was pasted into a Google Docs template and saved as a PDF so it could be made available to Elicit. For Elicit-PDF, the report was initiated using the “Start a systematic review” option and the PDFs were uploaded into Elicit; no suggestions from the Elicit catalog were added. For Elicit-catalog, no PDFs were added and the “500 results, 40 studies” option was chosen. Screening and extraction prompts were the same across both versions of the report, except for a screening prompt intended to flag poor-quality results, which was adapted to suit each version. No ranking results were overridden, except in Elicit-catalog, where the Cochrane SR itself was excluded, a review that was erroneously screened in was excluded, and a relevant RCT that was ranked low for reasons that were not immediately apparent was included.
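The reference-to-DOI step above was done through Crossref’s Simple Text Query web tool. For larger reference lists, a script against Crossref’s public REST API can do a similar free-text lookup. The sketch below is an illustration of that approach, not the workflow actually used; the example reference string is made up, and any match returned should still be spot-checked by hand.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"

def crossref_query_url(reference_text, rows=1):
    """Build a Crossref works-API URL for a free-text reference query."""
    params = urllib.parse.urlencode(
        {"query.bibliographic": reference_text, "rows": rows}
    )
    return f"{CROSSREF_WORKS}?{params}"

def lookup_doi(reference_text):
    """Return the DOI of Crossref's top match for a reference, or None.

    The top match is Crossref's best guess, so results should be
    verified manually, just as with the Simple Text Query workflow.
    """
    with urllib.request.urlopen(crossref_query_url(reference_text)) as resp:
        items = json.load(resp)["message"]["items"]
    return items[0].get("DOI") if items else None

# Example call (reference string is illustrative, not from the SR):
# doi = lookup_doi("Some Author 2021 pembrolizumab NSCLC randomized trial")
```

Crossref’s `query.bibliographic` parameter accepts an unstructured citation string, which makes it a reasonable stand-in for pasting references into the Simple Text Query form.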

Results

Top-level results

Elicit-PDF and the Cochrane SR reached qualitatively similar conclusions for some primary and secondary outcomes. The Elicit-catalog report could be used by an expert to reach a similar high-level understanding, but the most important between-study comparisons were not made within the report.

Cochrane

The primary outcomes for most studies were overall survival (OS) and progression-free survival (PFS). Secondary outcomes were objective response rate (ORR) and, for some studies, safety. (OS and PFS were secondary outcomes for some studies.)

Overall survival

  • 65+
    • Hazard ratio (HR) 0.78 for Platinum + ICI vs Platinum alone, 95% CI 0.70 – 0.88, p < 0.001, I2 = 0%, 8 studies, 2093 pooled participants, moderate-certainty evidence.
  • 65-75
    • HR 0.75, 95% CI 0.65 – 0.87, p < 0.001, I2 = 0%, 6 studies, 1406 pooled participants, moderate-certainty evidence. Results were not sensitive to the inclusion of Paz-Ares 2021.
  • 75+
    • HR 0.90, 95% CI 0.70 – 1.16, p = 0.41, I2 = 0%, 4 studies, 297 participants, low-certainty evidence. Results were not sensitive to the inclusion of Paz-Ares 2021.

Authors’ Interpretation

Compared to platinum-based chemotherapy alone, adding ICIs to platinum-based chemotherapy probably leads to higher overall survival and progression-free survival, without an increase in treatment-related adverse events (grade 3 or higher), in people 65 years and older with advanced NSCLC. These data are based on results from studies dominated by participants between 65 and 75 years old. However, the analysis also suggests that the improvements reported in overall survival and progression-free survival may not be seen in people older than 75 years.

Elicit-PDF

Key Findings

  • Overall survival
    • 33 studies showed HRs < 1, suggesting a reduced risk of death with the treatment combination compared to control.
  • Progression-free survival
    • 5 studies reported HR < 0.5
    • 8 studies reported 0.5 ≤ HR < 0.6
    • 7 studies reported 0.6 ≤ HR < 0.7
    • 2 studies reported 0.7 ≤ HR < 0.8
    • 1 study reported HR ≥ 0.8
  • Safety
    • Most studies reported a higher incidence of adverse events in the combination therapy groups compared to chemotherapy alone, but these were generally manageable.
    • Some specific examples were included.

Elicit interpretation

The effectiveness of immune checkpoint inhibitors plus platinum-based chemotherapy in older adults with advanced non-small cell lung cancer remains unclear due to limited representation and lack of specific outcome data for this population in clinical trials.

Elicit-catalog

Key Findings

  • Outcomes
    • Progression-free survival (PFS): 13/15 studies
    • Overall survival (OS): 8/15 studies
    • Objective response rate (ORR): 7/15 studies
  • Safety
    • Grade ≥3 AE data were found for 9/15 studies
    • Treatment discontinuation data were found for 2/15 studies
    • Immune-related AE data were found for 3/15 studies
  • Some specific details for individual papers were given.

Elicit Interpretation

Adding immune checkpoint inhibitors to platinum-based chemotherapy extended progression-free survival by 2 months and improved response rates by 12.8% in older patients, but led to more adverse events.

Elicit-PDF vs Cochrane

Inclusion criteria

Inclusion criteria were similar by definition, though with the Cochrane SR, three independent reviewers made the assessment to control for reviewer error or bias. The Cochrane SR also searched more sources, including clinicaltrials.gov.

Collection of data

Data collection was also similar by definition, but with the Cochrane SR, three independent reviewers did the extraction to correct for error. In addition, the reviewers reached out to labs to ask for unreported data necessary to pool studies, assess study quality, or estimate risk of bias.

Assessment of study quality and risk of bias

The Cochrane SR contained a substantial and detailed discussion of the completeness of data on outcomes (overall survival and progression-free survival), adverse events, age, demographics, mutational burden, and histological subtype.

The Elicit-PDF report mentioned the lack of data on quality of life (a secondary outcome), age, and adverse events, but the mentions were superficial.

Example quotes

SR: “Furthermore, the data available in the included studies were insufficient to address the outcome of treatment-related adverse events (of grade 3 or higher) in people aged 65 years and older, as only one study with potential bias provided data on this outcome. The included studies did not provide data on other secondary outcomes, including time to response, duration of response, and HRQoL.”
Elicit-PDF: “Detailed quality of life data was limited in most studies, representing a gap in our understanding of the impact of these treatments on patients’ overall well-being.”

The Cochrane SR also:

  • assessed risk of bias using the Cochrane Risk of Bias 2 tool, which fed into the certainty-of-evidence rating;
  • assessed evidence quality according to the GRADE framework, as implemented in GRADEpro GDT 2021;
  • assessed whether the pooled sample sizes or number of events were optimal for an outcome, along with the size of the confidence intervals.

The Cochrane SR presented the study quality data in tabular format and mentioned the quality of the evidence throughout the report when specific studies or groups of studies were discussed.

The Elicit-PDF report did not assess the quality of evidence, and thus did not weight evidence or mention evidence quality, apart from a few mentions of data being limited.

Synthesis of data

The Cochrane SR assessed effect heterogeneity according to the recommendations in Deeks 2021, as implemented in Review Manager 2024. Statistical heterogeneity was evaluated using the I2 statistic, and where values of I2 greater than 75% were found, subgroup and sensitivity analyses were performed. Where a reason for heterogeneity was not discovered, data were not pooled and study results were presented in narrative format. Meta-analysis was done on pooled data on an intention-to-treat basis, using a random-effects model, with the DerSimonian and Laird method used to estimate between-study variance. A fixed-effect model was used for primary outcomes in sensitivity analyses, and the generic inverse-variance method was used under the fixed-effect model for time-to-event outcomes. The Mantel-Haenszel method was used for dichotomous outcomes and the inverse-variance method for continuous outcomes. For some rare events, Peto’s odds ratio was used under the fixed-effect model.
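The pooling machinery described above can be made concrete with a small sketch. The code below implements inverse-variance fixed-effect pooling of log hazard ratios, Cochran’s Q and the I2 heterogeneity statistic, and the DerSimonian and Laird random-effects adjustment; the input hazard ratios and standard errors are invented for illustration and are not data from the review.

```python
import math

def pool_log_hrs(log_hrs, ses):
    """Pool per-study log hazard ratios.

    Inverse-variance fixed-effect pooling, Cochran's Q / I2 for
    heterogeneity, then the DerSimonian-Laird estimate of the
    between-study variance (tau^2) for the random-effects model.
    """
    w = [1.0 / se ** 2 for se in ses]
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, log_hrs)) / sw
    # Cochran's Q and the I2 heterogeneity statistic
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_hrs))
    k = len(log_hrs)
    i2 = 100.0 * max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    # DerSimonian-Laird between-study variance
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights fold tau^2 into each study's variance
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * y for wi, y in zip(w_re, log_hrs)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return pooled, se_pooled, i2

# Made-up per-study log(HR)s and standard errors, NOT data from the review
log_hrs = [math.log(0.75), math.log(0.82), math.log(0.70)]
ses = [0.10, 0.12, 0.15]
lhr, se, i2 = pool_log_hrs(log_hrs, ses)
hr = math.exp(lhr)
ci = (math.exp(lhr - 1.96 * se), math.exp(lhr + 1.96 * se))
```

Pooling is done on the log scale because log hazard ratios are approximately normal; exponentiating the pooled estimate and its 95% CI gives results in the same form Cochrane reports (an HR with a 95% CI and an I2 value).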

In the Elicit-PDF report, study results were not pooled statistically and were only discussed narratively.

Elicit-catalog vs Cochrane

The comparison of the Elicit-catalog report to Cochrane was similar, except for the selection criteria. It’s worth noting that the average Cochrane review takes a team of people about 18 months to write. Much of this time is spent on parts of the process that aren’t currently part of an Elicit report, but the search alone can take months to ensure everything is included. In contrast, Elicit-catalog took a few days, most of that spent tweaking the screening and extraction prompts. However, the search isn’t guaranteed to be exhaustive, and an Elicit report has a hard cap on the number of results included due to technical limitations. This number has changed over time but was 500 for screening and 40 for extraction in the version used in this work.

Elicit-PDF vs Elicit-catalog

The Elicit-PDF and Elicit-catalog reports covered similar ground but came to different conclusions. The main high-level difference was that the Elicit-PDF report didn’t find enough data to make a clear recommendation about effects in an older population, whereas the Elicit-catalog report did, but only because it treated 65+ as the older population and didn’t separate out 75+.

The main difference in the substance of the reports was that the Elicit-PDF report discussed results in terms of the effects of the treatments compared in the trials, whereas Elicit-catalog discussed results in terms of how many studies reported an outcome. This could be because the Elicit-catalog selections were more variable in populations, treatments, and study designs, so comparisons had to be made at a coarser level of granularity. In other words, the outcomes in Elicit-PDF were reported as months or percentages, whereas in Elicit-catalog, the result summaries were often given as numbers of studies, not sizes of effects.

Other differences

In Elicit-PDF, two tables of outcomes were presented, corresponding to the primary outcomes – overall survival and progression-free survival – whereas in Elicit-catalog, one table for outcomes was created.

In Elicit-PDF, safety information was not presented in a tabular format, but was discussed narratively, whereas in Elicit-catalog, a safety table was presented and summarized, leading to a better discussion of safety in Elicit-catalog.

In Elicit-PDF, a narrative discussion of age-related effects was presented, divided into efficacy and safety sections, whereas in Elicit-catalog, a shorter section was presented with safety results from a few individual studies.

In Elicit-catalog, a section commenting on some results with different drug combinations was also included.

Conclusion

The primary aim of this work was to recreate a Cochrane SR using Elicit. A secondary aim was to compare the quality of the search and screening with that of a professional SR team. This design allowed me to test the evidence-synthesis and study-selection aspects of Elicit separately. I was not able to recreate a Cochrane SR using Elicit: it currently cannot pool data, or assess evidence quality, effect heterogeneity, or risk of bias, which limits the tool to a narrative summary of individual studies instead of a synthesis of the evidence. This precluded certain comparisons of outcomes across age groups. When Elicit was provided with the reviewer-selected studies, the lack of data pooling produced conclusions that differed significantly in level of detail and evidentiary support, while still offering roughly similar high-level conclusions. When Elicit had to both select and summarize the studies, the comparisons moved another step toward less detail, comparing numbers of papers rather than results within the papers, because the included studies were not as focused on specific outcomes as the expert-curated set. These differences imply the tool is suitable for some purposes and less suitable for others. For individual use, to learn the landscape of a research area and the major questions within it, Elicit performs very well and lets a practitioner or researcher rapidly gain an understanding that would otherwise take days to weeks of reading, note-taking, and synthesizing. However, for professional use, to generate evidence synthesis sufficient to change policy or clinical practice, two things remain missing. The first is that Elicit cannot currently synthesize results as an SR team would, though this seems achievable as the product develops.
The second thing is harder, because it is not internal to Elicit, but rather an aspect of the social structures in which it is embedded. Cochrane SRs change policy and practice because the community has agreed that the SR process is a valid way to generate evidence of sufficient quality to change practice. It’s not inconceivable that Elicit reports may someday serve this purpose, but it won’t be a product update that changes this, it will be the result of a deliberative process within the research and patient community.
