TY  - JOUR
AU  - Hirosawa, Takanobu
AU  - Harada, Yukinori
AU  - Mizuta, Kazuya
AU  - Sakamoto, Tetsu
AU  - Tokumasu, Kazuki
AU  - Shimizu, Taro
PY  - 2024
DA  - 2024/6/26
TI  - Evaluating ChatGPT-4’s Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases
JO  - JMIR Form Res
SP  - e59267
VL  - 8
KW  - decision support system
KW  - diagnostic errors
KW  - diagnostic excellence
KW  - diagnosis
KW  - large language model
KW  - LLM
KW  - natural language processing
KW  - GPT-4
KW  - ChatGPT
KW  - diagnoses
KW  - physicians
KW  - artificial intelligence
KW  - AI
KW  - chatbots
KW  - medical diagnosis
KW  - assessment
KW  - decision-making support
KW  - application
KW  - applications
KW  - app
KW  - apps
AB  - Background: The potential of artificial intelligence (AI) chatbots, particularly ChatGPT with GPT-4 (OpenAI), in assisting with medical diagnosis is an emerging research area. However, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in differential diagnosis lists. Objective: This study aims to assess the capability of GPT-4 in identifying the final diagnosis from differential-diagnosis lists and to compare its performance with that of physicians for case report series. Methods: We used a database of differential-diagnosis lists from case reports in the American Journal of Case Reports, corresponding to final diagnoses. These lists were generated by 3 AI systems: GPT-4, Google Bard (currently Google Gemini), and Large Language Models by Meta AI 2 (LLaMA2). The primary outcome was focused on whether GPT-4’s evaluations identified the final diagnosis within these lists. None of these AIs received additional medical training or reinforcement. For comparison, 2 independent physicians also evaluated the lists, with any inconsistencies resolved by another physician. Results: The 3 AIs generated a total of 1176 differential diagnosis lists from 392 case descriptions. GPT-4’s evaluations concurred with those of the physicians in 966 out of 1176 lists (82.1%). The Cohen κ coefficient was 0.63 (95% CI 0.56-0.69), indicating a fair to good agreement between GPT-4 and the physicians’ evaluations. Conclusions: GPT-4 demonstrated a fair to good agreement in identifying the final diagnosis from differential-diagnosis lists, comparable to physicians for case report series. Its ability to compare differential diagnosis lists with final diagnoses suggests its potential to aid clinical decision-making support through diagnostic feedback. While GPT-4 showed a fair to good agreement for evaluation, its application in real-world scenarios and further validation in diverse clinical environments are essential to fully understand its utility in the diagnostic process.
SN  - 2561-326X
UR  - https://formative.jmir.org/2024/1/e59267
UR  - https://doi.org/10.2196/59267
DO  - 10.2196/59267
ID  - info:doi/10.2196/59267
ER  -