
AI Models May Serve as Scalable Adjunct to Oncology Documentation Workflows
Key Takeaways
- Advanced LLMs demonstrated high sensitivity for error detection in complex oncology documentation, including discrepancies likely to affect management, such as incorrect chemotherapy regimens or discordant laboratory values.
- In simulated discharge summaries, Gemini 2.5 Pro detected 97.8% of injected errors and GPT o4-mini-high detected 87.8%, compared with 47.8% mean detection by oncology specialists.
Study finds Gemini and GPT catch oncology chart errors, improving documentation accuracy and patient safety with clinician oversight.
Large language models (LLMs) may serve as a valuable supplement to oncology documentation workflows by detecting and correcting documentation errors in clinical records, according to a recent study published in JCO Clinical Cancer Informatics.1
In a 2-phase evaluation, investigators assessed the performance of contemporary advanced LLMs, namely Google’s Gemini 2.5 Pro and OpenAI’s GPT o4-mini, in identifying errors within complex hematology/oncology documentation. Across 1000 synthetic clinical scenarios, models demonstrated the ability to detect and, in some cases, correct inconsistencies involving diagnoses, treatment plans, and laboratory data.
“Advanced LLMs can serve as powerful assistants for clinical documentation reviews, substantially reducing the risk of oversight and clinician workload,” Peter May, MD, MPH, and colleagues wrote in the publication.1 “Integrating LLM‐driven error flagging into electronic health record workflows offers a promising strategy for enhancing documentation accuracy, treatment quality, and patient safety in oncology.”
Key Findings: An Ability to Detect and Correct
The study found that the LLMs were able to identify a substantial proportion of documentation errors across simulated oncology cases, with performance exceeding that of human reviewers in several scenarios. Within complex discharge summaries, Gemini 2.5 Pro and GPT o4-mini-high identified 97.8% and 87.8% of injected errors, respectively, compared with a mean detection rate of 47.8% among human oncology specialists. In contrast, Gemma 3 27B, a local LLM, demonstrated lower sensitivity, detecting 35.6% of errors. Error detection included clinically relevant discrepancies that could plausibly influence patient management, such as incorrect chemotherapy regimens or discordant laboratory values.
In addition to detection, LLMs demonstrated partial capability in proposing corrections. However, accuracy varied depending on the complexity of the case and the type of error. Straightforward inconsistencies were more reliably addressed than nuanced clinical ambiguities.
Importantly, the models maintained contextual coherence in most cases, suggesting potential utility as a decision-support adjunct rather than a standalone system. The authors noted that even partial error detection could reduce cognitive burden on clinicians and mitigate risk in high-volume oncology practices.
Study Methodology
The analysis included 2 distinct phases. First, the authors evaluated LLM performance using standardized, synthetic oncology vignettes designed to reflect real-world documentation complexity. Second, the models were tested on more nuanced clinical scenarios to assess generalizability and robustness.
Performance metrics focused on sensitivity for error detection, qualitative accuracy of suggested corrections, and the ability to preserve clinically relevant context. The authors emphasized that oncology documentation presents particular challenges because of multimodal data inputs, evolving treatment regimens, and the need for precise staging and biomarker annotation.
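For readers less familiar with the metric, sensitivity here is the share of deliberately injected errors that a reviewer (model or human) flagged. The following is a minimal sketch of that calculation, not the authors' evaluation code; the `ReviewResult` class, the `sensitivity` function, and the counts in the example are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class ReviewResult:
    """Outcome of one reviewer (model or human) on a set of injected errors."""
    reviewer: str
    injected: int  # number of errors deliberately placed in the documents
    detected: int  # number of those injected errors the reviewer flagged


def sensitivity(result: ReviewResult) -> float:
    """Return the fraction of injected errors that were detected."""
    if result.injected <= 0:
        raise ValueError("injected must be positive")
    if not 0 <= result.detected <= result.injected:
        raise ValueError("detected must be between 0 and injected")
    return result.detected / result.injected


# Hypothetical counts for illustration only; they are not from the study.
example = ReviewResult(reviewer="reviewer_a", injected=20, detected=15)
print(f"{example.reviewer}: {sensitivity(example):.1%}")  # prints "reviewer_a: 75.0%"
```

Sensitivity alone does not capture false positives, which is why the article's later caution about inappropriate flags matters when interpreting these figures.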
Limitations and Clinical Implications
The reliance on synthetic clinical vignettes is a potential limitation of the study, because such vignettes may not fully capture the variability and ambiguity of real-world oncology documentation. External validation in live clinical environments is necessary to establish generalizability. Additionally, performance may vary across different institutional documentation styles and electronic health record systems. The authors also highlighted the need for rigorous evaluation of bias, reproducibility, and data security before implementation into practice.
Overall, the findings highlight the potential of artificial intelligence (AI)-assisted review as a scalable approach to improving patient safety in high-risk oncology settings, where documentation inaccuracies can have downstream consequences for treatment decisions. Specifically, the integration of AI-assisted documentation review could provide an additional safety layer, with potential applications such as automated chart audits and real-time flagging of inconsistencies during documentation.
Despite its promise, the authors cautioned that LLM outputs still require clinician oversight. False positives and inappropriate corrections remain a concern, particularly in cases requiring nuanced clinical judgment. As such, these tools are best conceptualized as augmentative rather than autonomous systems.
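One way to picture an augmentative rather than autonomous design is a review queue in which model-suggested corrections change nothing until a clinician accepts them. The sketch below is an illustration under that assumption only. It does not describe the study's system or any electronic health record product, and the `Flag`, `Decision`, and `apply_reviewed_flags` names, along with the sample note text, are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class Flag:
    """A model-suggested correction awaiting clinician review (hypothetical)."""
    original_text: str
    suggested_text: str
    decision: Decision = Decision.PENDING


def apply_reviewed_flags(note_text: str, flags: list[Flag]) -> str:
    """Apply only the corrections that a clinician has explicitly accepted."""
    for flag in flags:
        if flag.decision is Decision.ACCEPTED:
            # Replace the first occurrence only, to avoid unintended edits.
            note_text = note_text.replace(flag.original_text, flag.suggested_text, 1)
    return note_text


# Invented note text and flags, for illustration only.
note = "Platelets 15 on admission. Weight 70 kg."
flags = [
    Flag("Platelets 15", "Platelets 150", Decision.ACCEPTED),
    Flag("70 kg", "700 kg", Decision.REJECTED),  # inappropriate suggestion
    Flag("on admission", "at discharge"),        # still pending review
]
print(apply_reviewed_flags(note, flags))
```

In this arrangement, rejected and pending suggestions leave the note untouched, so a false positive costs the clinician a review step but does not alter the record.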