Evaluating ChatGPT-4’s Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases

doi:10.2196/59267

Published on 26.Jun.2024 in Vol 8 (2024)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/59267, first published 08.Apr.2024.


Takanobu Hirosawa¹; Yukinori Harada¹; Kazuya Mizuta¹; Tetsu Sakamoto¹; Kazuki Tokumasu²; Taro Shimizu¹



Citation

Please cite as:

Hirosawa T, Harada Y, Mizuta K, Sakamoto T, Tokumasu K, Shimizu T
Evaluating ChatGPT-4’s Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases
JMIR Form Res 2024;8:e59267
doi: 10.2196/59267 PMID: 38924784 PMCID: 11237772


This paper is in the following e-collection/theme issue:

Formative Evaluation of Digital Health Interventions; Clinical Informatics; Clinical Information and Decision Making; Decision Support for Health Professionals; Chatbots and Conversational Agents; Artificial Intelligence; mHealth for Diagnosis
