Abstract
The study introduces MedCheckLLM, a large language model–driven framework that enhances medical record evaluation through a guideline-in-the-loop approach that integrates evidence-based guidelines into the assessment.
doi:10.2196/53335
Introduction
Large language models (LLMs) have demonstrated enormous potential in assessing complex datasets in health care across many applications [1,2]. One underexplored area is their application to the reliable evaluation of medical documents. The automated evaluation of these documents has the potential to enhance patient safety. The system’s reasoning process must be (1) transparent and comprehensible to human evaluators and (2) guided by established medical guidelines proven to increase patient safety [3].

In this study, we introduce a framework that consists of a multistep approach for medical record evaluation that incorporates guidelines into the evaluation process (ie, guideline-in-the-loop). Our proposed algorithm, MedCheckLLM, is an LLM-driven, structured reasoning mechanism designed to automate the evaluation of medical records against evidence-based guidelines. The guidelines are deterministically accessed and returned to the LLM as input without further model fine-tuning. This strict separation of the LLM and the guidelines is expected to increase the validity and interpretability of the evaluations. The approach’s step-by-step structure could improve transparency in clinical applications. The primary objective of this research is to introduce the conceptual framework and assess its feasibility.
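As an illustration of this separation, the following minimal sketch keeps guidelines in a plain document store that is queried deterministically and only then passed to the model as context. The directory layout, file naming, and prompt wording are hypothetical assumptions for illustration, not the interfaces used in this study.

```python
# Illustrative sketch only: guidelines live outside the model in a plain document
# store, so updating a guideline means replacing a file, not fine-tuning the LLM.
# The directory layout, naming scheme, and prompt wording are assumptions.
from pathlib import Path

GUIDELINE_DIR = Path("guidelines")  # hypothetical store, e.g., guidelines/migraine_without_aura.txt


def fetch_guideline(name: str) -> str:
    """Deterministically return the current guideline text; no LLM is involved."""
    path = GUIDELINE_DIR / f"{name.lower().replace(' ', '_')}.txt"
    return path.read_text(encoding="utf-8")


def build_evaluation_prompt(guideline_text: str, report: str) -> str:
    """Combine the retrieved guideline and the medical report into a single model input."""
    return (
        "Evaluate the following medical report against the guideline below, "
        "using only the guideline as the reference standard.\n\n"
        f"Guideline:\n{guideline_text}\n\nMedical report:\n{report}"
    )
```

Because the model only ever sees the retrieved text, a guideline update or a customized protocol would take effect immediately without any change to the model itself.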
Methods
The MedCheckLLM algorithm begins by extracting a patient’s diagnosis from the medical report. Based on the diagnosis, it suggests an appropriate guideline; a human medical expert makes the final guideline selection. Guidelines are then accessed independently of the LLM’s mechanisms using programmatically built interfaces for guideline retrieval. Subsequently, the guidelines are provided as input to the LLM and are either identified as already formatted as a usable checklist or converted into a checklist. The LLM uses this diagnosis-specific checklist to assess the medical report, with a final verification by a human medical expert. To test this approach, we used expert-validated simulated medical reports (simulated dataset) and physician-generated medical reports (physician dataset). Performance was analyzed for patient histories with headaches using guidelines from the International Headache Society, and on the physician dataset for four further neurological diagnoses (ie, border zone infarction, meningitis, neuromyelitis optica, and subarachnoid hemorrhage). The validity of the method was further analyzed by evaluating generated doctor’s notes with a correct diagnosis compared with doctor’s notes with a false diagnosis. The LLMs GPT-4 and Claude-3 were used for testing.
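The workflow can be read as a sequence of discrete LLM calls wrapped around a deterministic retrieval step and two human checkpoints. The sketch below is a hypothetical orchestration under those assumptions: chat stands in for any chat-completion call (eg, to GPT-4 or Claude-3), fetch_guideline for a programmatic guideline-retrieval interface, and confirm_guideline for the human expert’s selection; the prompts and function names are illustrative, not the exact implementation used in this study.

```python
# Minimal sketch of the MedCheckLLM workflow described above; prompts and helper
# names are illustrative assumptions, not the authors' exact implementation.
from typing import Callable


def evaluate_medical_report(report: str,
                            chat: Callable[[str], str],
                            fetch_guideline: Callable[[str], str],
                            confirm_guideline: Callable[[str], str]) -> str:
    # Step 1: extract the diagnosis stated in the medical report.
    diagnosis = chat(f"Extract the diagnosis stated in this medical report:\n{report}")

    # Step 2: let the LLM suggest a guideline; a human medical expert makes the final choice.
    suggestion = chat(f"Suggest an evidence-based guideline for evaluating a report with the diagnosis: {diagnosis}")
    chosen = confirm_guideline(suggestion)  # human-in-the-loop guideline selection

    # Step 3: retrieve the guideline deterministically, outside the LLM.
    guideline_text = fetch_guideline(chosen)

    # Step 4: detect whether the guideline is already a checklist; if not, convert it.
    answer = chat(f"Is this guideline already formatted as checklist criteria? Answer yes or no.\n{guideline_text}")
    checklist = guideline_text if answer.strip().lower().startswith("yes") else chat(
        f"Convert this guideline into a checklist of diagnostic criteria:\n{guideline_text}")

    # Step 5: evaluate the report item by item; the output is verified by a human expert.
    return chat(
        "For each checklist item, state whether the medical report meets it, with a brief justification.\n"
        f"Checklist:\n{checklist}\nMedical report:\n{report}"
    )
```

Passing the model call and the retrieval interface in as functions keeps the guidelines strictly outside the model, mirroring the separation described in the Introduction.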
Results
We evaluated the medical report analysis conducted by MedCheckLLM for various headache diagnoses. In the simulated dataset, MedCheckLLM (based on GPT-4 and Claude-3; Table 1) extracted the specified diagnosis correctly in 100% of cases from a list of 61 possible diagnoses from The International Classification of Headache Disorders-3 [4]. The model suggested existing evidence-based guidelines for 70.6% (12/17) of medical reports and detected the format of the guidelines as checklists in 100% of cases (N=17). MedCheckLLM accurately evaluated 87% (67/77) of checklist items. Performance on the physician dataset showed an accurate evaluation in 77.4% (24/31) of checklist items (Table 2). It identified incorrect diagnoses, in which the stated diagnosis did not align with the content of the doctor’s letters, in 94.1% (16/17) of cases, while it correctly recognized 100% (N=17) of letters with matching diagnoses.

Table 1. Performance of MedCheckLLM (GPT-4 and Claude-3) on the simulated dataset.

Elements of algorithmic structure | GPT-4 performance, % (n/N) | Claude-3 performance, % (n/N) | Explanation of specific task of each element |
Extracting stated diagnosis | 100 (17/17) | 100 (17/17) | Extract the diagnosis that is stated in the medical report |
Suggestion of existing guidelines | 70.6 (12/17) | 58.8 (10/17) | Suggest a guideline that should be used to evaluate the medical report |
Detection of checklist | 100 (17/17) | 100 (17/17) | Detect whether the accessed guidelines are in a structured checklist-criteria format |
Evaluation of diagnostic criteria (checklist items) | 87 (67/77) | 83.8 (62/74) | Assess whether the criteria listed in the checklist are met in the medical report |
Evaluation of letters with correct diagnosis (clinical descriptions and diagnosis match) | 100 (17/17) | 94.1 (16/17) | Assess whether the diagnosis stated in the medical report aligns with the clinical descriptions |
Evaluation of letters with false diagnosis (clinical descriptions and diagnosis do not match) | 94.1 (16/17) | 94.1 (16/17) | Evaluate whether the diagnosis stated in the medical report fits the clinical descriptions |
Table 2. Performance of MedCheckLLM on the physician dataset for four further neurological diagnoses.a

Element of algorithmic structure | Stroke | Meningitis | Neuromyelitis optica | Subarachnoid hemorrhage
Extracting stated diagnosis | Yes | Yes | Yes | Yes |
Suggestion of existing guidelines | Yes, applicable | Yes, partially applicable | Yes, applicable | Yes, partially applicable |
Creation of checklist, level of detail | Yes, moderate detail | Yes, moderate detail | Yes, thorough detail | Yes, minimal detail |
Evaluation of diagnostic criteria, % (n/N) | 100 (7/7) | 66.7 (4/6) | 87.5 (7/8) | 60 (6/10) |
aThe responses were classified as yes; as applicable or partially applicable; and as minimal, moderate, or thorough detail.
Discussion
The framework of MedCheckLLM represents a promising approach for a comprehensive, guideline-anchored review of electronic health records. It holds the potential to function as a quality assurance framework throughout patient care owing to the strict separation of the LLM and the guidelines, rather than training guidelines into an LLM. The flexibility of this approach allows for immediate implementation of guideline updates or the option to implement customized protocols for subgroups of patients. Because of the checklist-based approach, each item can be verified individually, thus increasing the algorithm’s interpretability, which is crucial in health care settings [5]. Because of the LLM’s subpar guideline suggestion capability, medical experts are integrated at this step to ensure that established guidelines are used. Further research is essential to advance the development of LLM-driven methods for extracting checklists from unstructured guidelines, as well-structured guidelines are crucial for detailed, high-quality checklists. Furthermore, this framework facilitates improved data mining practices in electronic health records [6]. In the future, it is crucial to address privacy concerns to ensure the ethical application of these powerful tools in real-world clinical settings [7-9].

Conflicts of Interest
None declared.
References
1. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. Feb 2023;2(2):e0000198. [CrossRef] [Medline]
2. Schubert MC, Wick W, Venkataramani V. Performance of large language models on a neurology board-style examination. JAMA Netw Open. Dec 1, 2023;6(12):e2346721. [CrossRef] [Medline]
3. Thomassen Ø, et al. The effects of safety checklists in medicine: a systematic review. Acta Anaesthesiol Scand. Jan 2014;58(1):5-18. [CrossRef] [Medline]
4. Headache Classification Committee of the International Headache Society (IHS). The International Classification of Headache Disorders, 3rd edition. Cephalalgia. Jan 2018;38(1):1-211. [CrossRef] [Medline]
5. Amann J, et al. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Med Inform Decis Mak. Nov 30, 2020;20(1):310. [CrossRef] [Medline]
6. Jiang LY, Liu XC, Nejatian NP, et al. Health system-scale language models are all-purpose prediction engines. Nature. Jul 2023;619(7969):357-362. [CrossRef] [Medline]
7. Meskó B. The impact of multimodal large language models on health care’s future. J Med Internet Res. Nov 2, 2023;25:e52865. [CrossRef] [Medline]
8. Dorr DA, Adams L, Embí P. Harnessing the promise of artificial intelligence responsibly. JAMA. Apr 25, 2023;329(16):1347-1348. [CrossRef] [Medline]
9. Wang C, Liu S, Yang H, Guo J, Wu Y, Liu J. Ethical considerations of using ChatGPT in health care. J Med Internet Res. Aug 11, 2023;25:e48009. [CrossRef] [Medline]
Abbreviations
LLM: large language model
Edited by Amaryllis Mavragani; submitted 03.10.23; peer-reviewed by Dillon Chrimes, Peijin Han; final revised version received 02.11.24; accepted 17.11.24; published 24.04.25.
Copyright © Marc Cicero Schubert, Stella Soyka, Wolfgang Wick, Varun Venkataramani. Originally published in JMIR Formative Research (https://formative.jmir.org), 24.4.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.