Abstract
This study evaluates the dynamic clinical reasoning of 4 leading large language models in complex nephrology cases, demonstrating that while Gemini 2.5 Pro achieved the highest reasoning scores and computational efficiency, all tested models excelled at static data synthesis but shared vulnerabilities in formulating nuanced differential diagnoses and in prospective clinical planning.
JMIR Form Res 2026;10:e89726; doi:10.2196/89726
Introduction
While large language models (LLMs) are increasingly being applied in medicine, evaluations of their performance rely largely on static knowledge tests, such as medical licensing examinations, which fail to capture the dynamic, iterative reasoning of real clinical practice [,]. Recent benchmark studies have begun to address this gap; for instance, the MedR-Bench [] framework evaluates medical LLMs across 3 clinical stages using an automated, artificial intelligence (AI)–driven “reasoning evaluator” to score free-text reasoning. However, although automated metrics provide scalability, they cannot fully replace rigorous human verification in medical contexts.
This study aimed to clinically evaluate 4 leading LLMs (GPT-o3 [OpenAI], Gemini 2.5 Pro [Google], DeepSeek-R1 [Hangzhou DeepSeek Artificial Intelligence], and Llama 4 Maverick [Meta]) in nephrology [], a specialty known for its complex, multisystemic pathologies and diagnostic challenges, using a multiagent architecture for temporal workflows []. Rather than assessing clinical reasoning with broad automated metrics, we systematically deconstructed the reasoning process into 9 distinct, scorable cognitive steps mapped to real-world workflows.
Methods
Overview
A detailed description of the methods is provided in Multimedia Appendix 1. Briefly, 4 nephrologists used the Delphi method to select 10 cases that met the inclusion criteria from 104 candidate case reports []. Because permission could not be obtained for 1 case, 9 cases were included in the final analysis.
We developed a clinical reasoning application using Dify [], a no-code AI development platform that enables the creation of AI agents that call LLMs through application programming interface (API) integration. We evaluated 4 LLMs: DeepSeek-R1 (DeepSeek-R1-0528, released May 28, 2025), Gemini 2.5 Pro (preview 03‐25, released March 25, 2025), GPT-o3 (released April 16, 2025), and Llama 4 Maverick (meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8, released July 20, 2025). All were accessed via the together.ai [] API. All 4 LLMs used the same sequential 3-agent architecture, designed to mirror the temporal progression of clinical practice. To systematically evaluate model performance, we deconstructed clinical reasoning into 9 cognitive steps, summarized by agent in the table below.
The LLM outputs were evaluated on July 20, 2025. The primary outcome was the reasoning quality score, measured on a 3-point scale (0=incorrect; 1=reasonable but suboptimal; 2=correct). The 4 nephrologists independently and blindly scored the randomized, deidentified outputs. Results were aggregated by a blinded researcher. Group differences were assessed using the Kruskal-Wallis test, followed by pairwise comparisons with Holm correction for multiple testing. A 2-sided P value <.05 was considered statistically significant. Interrater reliability was assessed using the intraclass correlation coefficient (ICC[2,k]). Full prompts and system details are available in Multimedia Appendix 1.
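As an illustration of the group comparison described above, the following Python sketch computes a Kruskal-Wallis H statistic (with tie correction) and applies the Holm step-down adjustment to a set of pairwise P values. The article does not state which statistical software was used, so the function names (`kruskal_wallis`, `chi2_sf`, `holm_adjust`) are hypothetical and this is not the authors' code; in practice a library routine such as `scipy.stats.kruskal` would normally be used instead.

```python
import math
from itertools import chain


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (integer df >= 1)."""
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term = math.exp(-half)
        total = term
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) plus half-integer gamma terms
    total = math.erfc(math.sqrt(half))
    term = math.exp(-half) * math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (i + 0.5)
    return total


def kruskal_wallis(groups):
    """Return (H, p) for k independent samples, using average ranks for ties."""
    pooled = sorted(chain.from_iterable(groups))
    n = len(pooled)
    ranks, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    rank_sq = sum(sum(ranks[x] for x in g) ** 2 / len(g) for g in groups)
    h = (12 / (n * (n + 1)) * rank_sq - 3 * (n + 1)) / correction
    return h, chi2_sf(h, len(groups) - 1)


def holm_adjust(p_values):
    """Holm step-down adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep values non-decreasing
        adjusted[idx] = running_max
    return adjusted
```

With 4 models there are 6 pairwise comparisons, so under this scheme the smallest raw P value is multiplied by 6, the next smallest by 5, and so on, with each adjusted value constrained to be no smaller than the one before it.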
| Agent | Clinical stage | Input data | Nine cognitive steps and evaluated specific reasoning tasks (user prompts) | Expected output format and constraints (assistant prompts) |
| Agent 1 | Initial clinical assessment | Patient’s chief concern and brief initial clinical findings | | Include age, sex, and time course; present a prioritized differential list with rationale; list purposeful questions or examinations to rule in or out; develop a cost-effective and logical testing plan |
| Agent 2 | Diagnostic refinement | Newly obtained diagnostic test results | | Explain medical significance beyond mere abnormal values; appropriately revise the prioritization of differentials; develop a treatment plan considering both evidence-based medicine and patient factors |
| Agent 3 | Therapeutic evaluation | Information on treatments administered and subsequent clinical outcomes | | List specific measures for evaluating efficacy and side effects; explain the thought process for rapidly reassessing causes and flexibly revising the plan |
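The sequential 3-agent design can be pictured with a short Python sketch. The authors built their application in Dify without code, so this is only a conceptual illustration: `run_case` and `call_model` are hypothetical names, and passing each agent's earlier output forward to the next agent is an assumption about how "sequential" was implemented, not a detail confirmed by the article.

```python
from typing import Callable, List

# Stage names taken from the table above.
STAGES = [
    ("Agent 1", "Initial clinical assessment"),
    ("Agent 2", "Diagnostic refinement"),
    ("Agent 3", "Therapeutic evaluation"),
]


def run_case(stage_inputs: List[str], call_model: Callable[[str], str]) -> List[str]:
    """Run one case through the 3 stages in order.

    stage_inputs: the new clinical information released at each stage.
    call_model:   hypothetical wrapper around an LLM API; takes a prompt
                  string and returns the model's reply.
    """
    if len(stage_inputs) != len(STAGES):
        raise ValueError("expected one input per stage")

    history: List[str] = []   # earlier stages, carried forward (assumption)
    outputs: List[str] = []
    for (agent, stage), new_data in zip(STAGES, stage_inputs):
        current = f"[{agent}: {stage}]\nNew information: {new_data}"
        prompt = "\n\n".join(history + [current])
        reply = call_model(prompt)
        outputs.append(reply)
        history.append(f"{current}\nAssessment: {reply}")
    return outputs
```

Keeping the model call behind a plain callable means the same loop could be run unchanged for each of the 4 models, which mirrors the article's statement that all models used the same architecture.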
Ethical Considerations
Under the Ethical Guidelines for Medical and Biological Research Involving Human Subjects in Japan, this study was exempt from institutional review board review and informed consent requirements, as it exclusively involved the secondary analysis of fully anonymized, publicly available case reports without accessing personal health information.
Results
Overall performance differed significantly among the 4 LLMs. Gemini 2.5 Pro achieved the highest average score (mean 7.57, SD 0.61), followed by GPT-o3 (mean 7.39, SD 0.81), DeepSeek-R1 (mean 7.13, SD 1.03), and Llama 4 Maverick (mean 6.23, SD 0.93). Across all models, Q2 and Q7 were the most challenging tasks, with the lowest mean scores (mean 6.56, SD 1.13, and mean 6.58, SD 0.97, respectively), while the models performed best on Q1 and Q6 (mean 7.50, SD 0.77, and mean 7.50, SD 0.85, respectively). Average scores by model for each clinical reasoning question were also summarized in a heatmap. Gemini 2.5 Pro demonstrated superior or competitive performance, particularly on complex tasks such as Q5 (mean 8.00, SD 0.87), Q6 (mean 7.89, SD 0.85), and Q9 (mean 8.00, SD 0.88). The ICC(2,k) was 0.36 (95% CI 0.24‐0.46). Gemini 2.5 Pro was efficient (mean response time 124.7, SD 16.0 seconds), whereas DeepSeek-R1 incurred the highest computational cost, with the longest response time (mean 249.7, SD 68.6 seconds) and the highest token use (mean 16,105.1, SD 4508.5 tokens) (Figure S1 in Multimedia Appendix 1). In contrast, Llama 4 Maverick had the shortest response time and the lowest token use.
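For readers unfamiliar with ICC(2,k), the following Python sketch shows how the coefficient is obtained from the mean squares of a two-way layout (rated items in rows, raters in columns). It uses the standard two-way random-effects, average-measures formula; the function name `icc_2k` and the example matrix are illustrative only and are not the study's data or the authors' code.

```python
def icc_2k(ratings):
    """ICC(2,k): two-way random effects, absolute agreement, mean of k raters.

    ratings: list of rows, one row per rated item, one column per rater.
    """
    n = len(ratings)        # number of rated items
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-item mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Illustrative matrix: 6 items scored by 4 raters (not the study's data).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2k(example), 2))
```

Because the between-rater mean square appears in the denominator, systematic differences in how strictly individual raters score lower the coefficient even when raters rank the items similarly.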

Discussion
We demonstrate that while LLMs effectively processed clinical data, overall performance varied significantly among models. Gemini 2.5 Pro achieved the highest overall reasoning quality score while maintaining computational efficiency. Our step-by-step evaluation revealed a consistent pattern across all models: while they excelled at information synthesis and test interpretation (questions 1 and 6), they shared specific vulnerabilities in higher-order cognitive tasks, particularly in formulating nuanced differential diagnoses (question 2) and planning optimal interventions (question 7).
Recent benchmarks, such as AMIE, MAI-DxO, and MedR-Bench [-,], have advanced AI evaluation from static examinations toward dynamic clinical workflows. However, these frameworks often assess reasoning using broad, automated metrics. By deconstructing clinical reasoning into a stage-gated, multiagent workflow evaluated by domain experts, our study pinpoints specific cognitive bottlenecks. Our results indicate that LLMs struggle most with the higher-order, divergent skills required for prospective planning []. For example, when a patient with nephrotic syndrome developed sudden-onset nocardiosis, the models failed to adequately pivot their proposed treatment plans. This exposes a critical gap between static knowledge retrieval and the fluid, iterative reality of complex multisystemic specialties like nephrology.
Furthermore, our findings challenge the assumption that bigger models are always better. We observed a nuanced relationship between reasoning quality and computational efficiency: Gemini 2.5 Pro achieved the highest score with a comparatively short response time, whereas DeepSeek-R1 had the longest response time and the highest token use without achieving the highest score. For real-world clinical deployment, where rapid bedside decision-making and sustainable API costs are paramount, the balance between reasoning quality and computational overhead is a practical criterion for model selection.
A primary limitation is the small sample size of 9 nephrology cases evaluated across 9 specific questions, which restricts the generalizability of our findings across diverse, real-world clinical scenarios. However, the cases were rigorously curated via Delphi consensus from 104 candidates, prioritizing diagnostic complexity over breadth. Deconstructing these cases generated 324 expert-scored data points per model (9 cases × 9 questions × 4 raters), and this granular approach may provide sufficient information power to detect statistically significant performance differences. Additionally, our expert-scored evaluation showed limited interrater agreement (ICC[2,k] 0.36), and the study currently lacks extensive statistical validation. Our findings should therefore be interpreted as an early, exploratory assessment of AI clinical reasoning rather than as a definitive, generalizable conclusion.
Ultimately, future model development in the medical field must move beyond static examinations to prioritize dynamic adaptability and prospective clinical planning. By emphasizing targeted cognitive evaluations and computational efficiency over sheer model size, the medical community can ensure the safe and sustainable integration of AI into real-world bedside practice.
Acknowledgments
The authors used Gemini (3.1 Pro) and ChatGPT (GPT-5) to refine the English composition. All generated content was critically reviewed by the authors. The experimental evaluation of the four models was conducted independently of these writing-assistant tools.
Funding
This research was partially funded by the Advanced Medical Personnel Training Program (principal investigator TN) and was supported by the Ministry of Education, Culture, Sports, Science, and Technology of Japan.
Data Availability
The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.
Authors' Contributions
YY conceptualized the study and developed the methodology. HN, SK, TK, and YN curated the data. HK and MO conducted the formal analysis. AH, MN, HM, TN, SM, IM, YI, HO, YS, and NK supervised the study. TN acquired funding. YY drafted the manuscript, and all authors reviewed, edited, and approved the final version.
Conflicts of Interest
None declared.
Multimedia Appendix 1
Delphi case selection process, multiagent system architecture, prompt design, and comprehensive model responses across 9 structured clinical cases.
DOCX File, 2308 KB
References
- McDuff D, Schaekermann M, Tu T, et al. Towards accurate differential diagnosis with large language models. Nature. Jun 2025;642(8067):451-457.
- Tu T, Schaekermann M, Palepu A, et al. Towards conversational diagnostic artificial intelligence. Nature. Jun 2025;642(8067):442-450.
- Qiu P, Wu C, Liu S, et al. Quantifying the reasoning abilities of LLMs on clinical cases. Nat Commun. Nov 6, 2025;16(1):9799.
- Boyle SM, Martindale J, Parsons AS, et al. Development and validation of a formative assessment tool for nephrology fellows’ clinical reasoning. Clin J Am Soc Nephrol. Jan 1, 2024;19(1):26-34.
- Berger A, Khanna S, Berghaus D, Sifa R. Reasoning LLMs in the medical domain: a literature survey. arXiv. Preprint posted online on Aug 26, 2025.
- Humphrey-Murto S, Varpio L, Gonsalves C, Wood TJ. Using consensus group methods such as Delphi and nominal group in medical education research. Med Teach. Jan 2017;39(1):14-19.
- Dify. URL: https://dify.ai/jp [Accessed 2025-12-09]
- together.ai. 2025. URL: https://www.together.ai [Accessed 2025-12-09]
- Nori H, Daswani M, Kelly C, et al. Sequential diagnosis with language models. arXiv. Preprint posted online on Jun 27, 2025.
- Hager P, Jungmann F, Holland R, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. Sep 2024;30(9):2613-2622.
Abbreviations
| AI: artificial intelligence |
| API: application programming interface |
| LLM: large language model |
Edited by Ivan Steenstra; submitted 17.Dec.2025; peer-reviewed by Hasan Kurban, Jiajia Huang; final revised version received 15.Apr.2026; accepted 23.Apr.2026; published 03.Jun.2026.
Copyright© Yuichiro Yano, Hiroaki Kakizaki, Hajime Nagasu, Seiji Kishi, Takeo Koshida, Yoshihito Nihei, Akira Hirano, Masaomi Nangaku, Hirotake Mori, Toshio Naito, Mizuki Ohashi, Shoichi Maruyama, Isao Matsui, Yoshitaka Isaka, Hirokazu Okada, Yusuke Suzuki, Naoki Kashihara. Originally published in JMIR Formative Research (https://formative.jmir.org), 3.Jun.2026.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.

