Comparative Analysis of Diagnostic Performance: Differential Diagnosis Lists by LLaMA3 Versus LLaMA2 for Case Reports

doi:10.2196/64844

Original Paper

¹Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan

²Department of General Medicine, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan

³Higashinihonbashinaika Clinic, Tokyo, Japan

⁴Ubie, Inc, Tokyo, Japan

⁵Department of Hospital Medicine, Urasoe General Hospital, Okinawa, Japan

Corresponding Author:

Takanobu Hirosawa, MD, PhD

Department of Diagnostic and Generalist Medicine

Dokkyo Medical University

880 Kitakobayashi, Mibu-cho

Shimotsuga, 321-0293

Japan

Phone: 81 0282861111

Email: hirosawa@dokkyomed.ac.jp

Background: Generative artificial intelligence (AI), particularly in the form of large language models, has developed rapidly. The LLaMA series is popular and was recently updated from LLaMA2 to LLaMA3. However, the impact of the update on diagnostic performance has not been well documented.

Objective: We conducted a comparative evaluation of the diagnostic performance in differential diagnosis lists generated by LLaMA3 and LLaMA2 for case reports.

Methods: We analyzed case reports published in the American Journal of Case Reports from 2022 to 2023. After excluding nondiagnostic and pediatric cases, we input the remaining cases into LLaMA3 and LLaMA2 using the same prompt and the same adjustable parameters. Diagnostic performance was defined by whether the differential diagnosis lists included the final diagnosis. Multiple physicians independently evaluated whether the final diagnosis was included in the top 10 differentials generated by LLaMA3 and LLaMA2.

Results: In our comparative evaluation of the diagnostic performance between LLaMA3 and LLaMA2, we analyzed differential diagnosis lists for 392 case reports. The final diagnosis was included in the top 10 differentials generated by LLaMA3 in 79.6% (312/392) of the cases, compared to 49.7% (195/392) for LLaMA2, indicating a statistically significant improvement (P<.001). Additionally, LLaMA3 showed higher performance in including the final diagnosis in the top 5 differentials, observed in 63% (247/392) of cases, compared to LLaMA2’s 38% (149/392, P<.001). Furthermore, the top diagnosis was accurately identified by LLaMA3 in 33.9% (133/392) of cases, significantly higher than the 22.7% (89/392) achieved by LLaMA2 (P<.001). The analysis across various medical specialties revealed variations in diagnostic performance with LLaMA3 consistently outperforming LLaMA2.

Conclusions: The results reveal that the LLaMA3 model significantly outperforms LLaMA2 in diagnostic performance, with a higher percentage of case reports having the final diagnosis listed within the top 10, within the top 5, and as the top diagnosis. Overall diagnostic performance improved almost 1.5 times from LLaMA2 to LLaMA3. These findings support the rapid development and continuous refinement of generative AI systems to enhance diagnostic processes in medicine. However, these findings should be carefully interpreted for clinical application, as generative AI, including the LLaMA series, has not been approved for medical applications such as AI-enhanced diagnostics.

JMIR Form Res 2024;8:e64844


Keywords

artificial intelligence; clinical decision support system; generative artificial intelligence; large language models; natural language processing; NLP; AI; clinical decision making; decision support; decision making; LLM; diagnostic; case report; diagnosis; generative AI; LLaMA

Artificial Intelligence in Medicine

The concept of artificial intelligence (AI) dates back to the 1950s, when the potential for machines to mimic human intelligence first began to be explored [1: Turing AM. Computing machinery and intelligence. Mind. 1950;59:433-460]. Since then, AI technologies, particularly in areas such as neural networks, natural language processing (NLP), and large language models (LLMs), have advanced substantially. These advancements have been driven by significant computational developments and the vast data available in the digital world. Recently, access to these technologies has also become more straightforward, requiring less specific knowledge and fewer resources.

In the realm of AI, neural networks form a foundational concept. These networks mimic the complex interconnections of neurons in the human brain, featuring synapse-like connections that facilitate dynamic learning and adaptation. Unlike traditional technologies that rely on static algorithms, neural networks are designed to iteratively adjust the connections between nodes [2: Abiodun OI, Jantan A, Omolara AE, Dada KV, Mohamed NA, Arshad H. State-of-the-art in artificial neural network applications: a survey. Heliyon. 2018;4(11):e00938]. NLP enables computers to understand and process human language, facilitating tasks such as text translation, voice command response, and data extraction from complex sources. LLMs, advanced forms of NLP, train on extensive corpora of text to generate coherent and contextually relevant text [3: Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y. A survey of large language models. arXiv:2303.18223. Preprint published online on March 31, 2023]. These technologies have enabled complex models to achieve improved performance and address challenges that traditional approaches cannot handle, such as analyzing large volumes of data to identify patterns that may not be visible to human analysts.

These advancements are now widespread across various sectors, notably in the medical field. Generative AI systems, such as the GPT series developed by OpenAI, Google’s Gemini, and LLaMA, have demonstrated considerable value in research, education, and potential future clinical applications [4: Liu J, Wang C, Liu S. Utility of ChatGPT in Clinical Practice. J Med Internet Res. 2023;25:e48568] [5: Lucas HC, Upperman JS, Robinson JR. A systematic review of large language models and their implications in medical education. Med Educ. 2024;58(11):1276-1285]. They have the potential to support medical professionals, patients, and their families by aiding them in making informed clinical decisions based on comprehensive data analysis.

Generative AI in Medicine

In the medical field, generative AI has been pivotal in advancing diagnostic processes, developing treatment protocols, enabling personalized medicine, and managing patient care [6: Sai S, Gaur A, Sai R, Chamola V, Guizani M, Rodrigues JJPC. Generative AI for transformative healthcare: a comprehensive study of emerging models, applications, case studies, and limitations. IEEE Access. 2024;12:31078-31106]. By analyzing vast datasets, generative AI uncovers patterns not immediately obvious to medical professionals, providing crucial insights that lead to improved patient outcomes. For example, generative AI systems are instrumental in enhancing clinical decision-making, optimizing clinical workflows, and improving patient outcomes [7: Preiksaitis C, Ashenburg N, Bunney G, Chu A, Kabeer R, Riley F, et al. The role of large language models in transforming emergency medicine: scoping review. JMIR Med Inform. 2024;12:e53787]. Specifically, in diagnosis, generative AI enhances the medical interview process by visualizing the patient’s perspective [8: Balas M, Micieli JA. Visual snow syndrome: use of text-to-image artificial intelligence models to improve the patient perspective. Can J Neurol Sci. 2023;50(6):946-947], expands the scope of differential diagnosis lists, and supports clinical reasoning [9: Restrepo D, Rodman A, Abdulnour R. Conversations on reasoning: large language models in diagnosis. J Hosp Med. 2024;19(8):731-735] [10: Cabral S, Restrepo D, Kanjee Z, Wilson P, Crowe B, Abdulnour R, et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern Med. 2024;184(5):581-583].

From LLaMA2 to LLaMA3

The evolution of generative AI systems has been notably rapid, primarily due to their ability to integrate user feedback and continuously update from expanded datasets. This iterative improvement is evident in the progression from GPT-3 to GPT-4, and more recently to GPT-4o and OpenAI o1 [11: OpenAI. GPT-4 technical report. OpenAI. 2023. URL: https://cdn.openai.com/papers/gpt-4.pdf (accessed 2024-10-22)] [12: Gallifant J, Fiske A, Levites Strekalova YA, Osorio-Valencia JS, Parke R, Mwavu R, et al. Peer review of GPT-4 technical report and systems card. PLOS Digit Health. 2024;3(1):e0000417]. Similarly, other systems such as Bard have evolved into more advanced versions such as Gemini and Gemini Advanced [13: Gemini Team Google. Gemini: a family of highly capable multimodal models. arXiv:2312.11805. Preprint published online on December 19, 2023]. In this dynamic landscape, the LLaMA series has also undergone an upgrade, moving from LLaMA2 to LLaMA3 and enhancing its capabilities [14: Meta. Build the future of AI with Meta Llama 3 2024. URL: https://llama.meta.com/llama3/ (accessed 2024-10-22)].

Generative AI in Diagnostics

In diagnostics, generative AI systems have the potential to enhance diagnostic performance. These systems excel at processing and interpreting complex clinical data from diverse sources such as electronic health records, imaging studies, and genomic data. Notably, the GPT series has demonstrated considerable diagnostic performance in medical benchmarks and complex case analyses [15: Kanjee Z, Crowe B, Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78-80]. While significant strides have been made, studies have indicated that other LLMs, such as LLaMA2, require substantial refinement for optimal application in diagnostics [16: Sandmann S, Riepenhausen S, Plagwitz L, Varghese J. Systematic analysis of ChatGPT, Google search and llama 2 for clinical decision support tasks. Nat Commun. 2024;15(1):2050] [17: Han T, Adams LC, Bressem KK, Busch F, Nebelung S, Truhn D. Comparative analysis of multimodal large language model performance on clinical vignette questions. JAMA. 2024;331(15):1320-1321]. Our own study revealed that the diagnostic performance of LLaMA2 was inferior to that of ChatGPT-4 and Gemini for a series of case reports [18: Hirosawa T, Harada Y, Mizuta K, Sakamoto T, Tokumasu K, Shimizu T. Diagnostic performance of generative artificial intelligences for a series of complex case reports. Digit Health. 2024;10:20552076241265215]. This necessitates ongoing development to improve model accuracy and reliability, ensuring that these models meet clinical standards and effectively support diagnostic decision-making.

Study Aims

Despite these advancements, the diagnostic capabilities of updated AI models such as LLaMA3 have not been comprehensively explored. There is a particular lack of comparative studies examining the improvements in diagnostic performance from LLaMA2 to LLaMA3. In this context, our study aims to fill this gap by assessing and comparing the diagnostic performance of LLaMA3 with that of LLaMA2. Specifically, we intend to evaluate their effectiveness in generating differential diagnosis lists for comprehensive case reports. This comparison will clarify the benefits of the generative AI system upgrade and its practical implications for future diagnostics.

Overview

This was an experimental study using publicly available generative AI systems and published case reports. The entire study was conducted at the Department of Diagnostic and Generalist Medicine (General Internal Medicine), Dokkyo Medical University, Japan. The study consisted of four components: preparing case reports, generating differentials with the AI systems, evaluating the differentials, and analysis. The flowchart, including preparing case reports and generating differentials, is shown in Figure 1.

**Figure 1.** The flowchart, including preparing case reports and generating differentials.

Ethical Considerations

We used published case reports; therefore, ethical approval was not applicable.

Case Reports

We used the dataset from our previous research [18: Hirosawa T, Harada Y, Mizuta K, Sakamoto T, Tokumasu K, Shimizu T. Diagnostic performance of generative artificial intelligences for a series of complex case reports. Digit Health. 2024;10:20552076241265215]. We included case reports published in the American Journal of Case Reports from January 2022 to March 2023. We excluded nondiagnostic cases and pediatric cases. These exclusion criteria were adopted from a previous study of a clinical decision support system [19: Graber ML, Mathew A. Performance of a web-based clinical diagnosis support system for internists. J Gen Intern Med. 2008;23 Suppl 1(Suppl 1):37-40]. For the included case reports, we refined the text data for input. This process involved extracting the clinical narrative from each case report up until the stated final diagnosis. We carefully removed any sections that included clinical assessments or subjective interpretations by the authors to minimize the risk of biasing the AI’s output. This editing was designed to ensure that the input to the AI models was focused on clinical information essential for generating accurate differential diagnoses. The final diagnoses were typically written by the authors. The main investigator, TH, conducted this process, which was validated by another coinvestigator, YH. Details of preparing case reports are shown in Multimedia Appendix 1.

Differentials Generated by AIs

We used popular generative AI systems developed by Meta AI, LLaMA3 and LLaMA2, to generate differentials. LLaMA3 offers 8B and 70B versions, while LLaMA2 includes 7B, 13B, and 70B versions. For our study, we used the most capable models, the 70B versions. The main investigator, TH, inputted the same cases into both LLaMA3 and LLaMA2 using the same prompt to generate the top 10 differential diagnosis lists.

Both LLaMA3 and LLaMA2 allowed for several adjustable settings to control the output, including temperature, top-P (nucleus sampling), and max tokens. All parameters were set identically for both models. The temperature was set at a low value of 0.01 to prioritize predictability in the model’s output. This setting reduces the randomness and creativity of the responses, favoring near-deterministic and consistent results, which suits medical diagnostics where accuracy is paramount. The top-P parameter was set at 1, allowing for the broadest selection of words; combined with the low temperature, the output still concentrates on the most probable content. Lastly, the max tokens were limited to 500 to keep the output to concise, relevant differential diagnosis lists. Table 1 summarizes the key characteristics of the methods used to generate differentials, including the adjustable parameters and the prompts. The details of the methods used to generate differentials, including adjustable parameters and system prompts, are shown in Multimedia Appendix 2.

Table 1. Key characteristics of the methods used to generate differentials in this study, including adjustable parameters and prompts.

	LLaMA3	LLaMA2
Developer	Meta AI	Meta AI
Version	70B	70B
Release date	April 2024	July 2023
Access date	May 2024	May 2024
Prompt	“Tell me the top 10 suspected illnesses for the following case: (copy and paste the case)”	“Tell me the top 10 suspected illnesses for the following case: (copy and paste the case)”
Temperature	0.01	0.01
Max tokens	500	500
Top-P	1	1
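
The settings in Table 1 can be written down as a small configuration sketch. The prompt wording and the three parameter values come from Table 1; the payload field names, the model labels, and the placeholder case text are hypothetical, since the source does not state which interface or API was used to access the models.

```python
from dataclasses import dataclass, asdict

# Prompt wording taken from Table 1; "{case}" marks where the case text is pasted.
PROMPT_TEMPLATE = "Tell me the top 10 suspected illnesses for the following case: {case}"


@dataclass(frozen=True)
class GenerationSettings:
    """Sampling settings shared by both models (values from Table 1)."""
    temperature: float = 0.01  # low value: near-deterministic output
    top_p: float = 1.0         # no nucleus truncation of the token distribution
    max_tokens: int = 500      # cap on response length


def build_request(model_name: str, case_text: str,
                  settings: GenerationSettings = GenerationSettings()) -> dict:
    """Assemble a hypothetical request payload; the field names are illustrative."""
    return {
        "model": model_name,
        "prompt": PROMPT_TEMPLATE.format(case=case_text.strip()),
        **asdict(settings),
    }


# Hypothetical model labels and placeholder case text (not from the study dataset).
case = "A 70-year-old man presented with ..."
payloads = [build_request(name, case) for name in ("llama-3-70b", "llama-2-70b")]
```

Keeping the prompt and settings in one shared object is one way to guarantee that the two models differ only in the model name, which is the controlled comparison the study describes.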

Evaluation

Two expert physicians, T Shiraishi and T Suzuki, independently evaluated the differentials. We adopted a binary approach to evaluate whether the final diagnosis was included in the differential diagnosis lists. When the lists included the final diagnosis, its ranking was also evaluated. To ensure consistency and objectivity in evaluations, any discrepancies between the initial assessments by T Shiraishi and T Suzuki were resolved through a consensus meeting involving a third expert physician, KT. To quantify interevaluator agreement, we calculated the Cohen κ statistic. All evaluators were blinded to which AI system generated the differentials to prevent bias. The details of the evaluation methods are shown in Multimedia Appendix 3.
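
The evaluation workflow can be illustrated with a short sketch: two raters make binary calls, disagreements are flagged for the consensus meeting, and agreement is summarized with Cohen κ. This is an illustrative Python version under stated assumptions, not the authors' code (they report using the "irr" package in R); the function names and the toy ratings are hypothetical.

```python
from typing import List, Sequence


def cohen_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters making binary (included / not included) calls."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # share of "included" calls by rater A
    p_b = sum(rater_b) / n  # share of "included" calls by rater B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


def needs_consensus(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> List[int]:
    """Indices of lists on which the two raters disagree (referred to a third rater)."""
    return [i for i, (x, y) in enumerate(zip(rater_a, rater_b)) if x != y]


# Toy ratings for eight hypothetical lists (not study data).
a = [True, True, False, True, False, False, True, True]
b = [True, False, False, True, False, True, True, True]
kappa = cohen_kappa(a, b)
disputed = needs_consensus(a, b)
```

In this toy example the raters agree on 6 of 8 lists, and the two disputed lists are the ones that would go to the consensus meeting.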

Analysis

In this study, diagnostic performance was defined as the inclusion of the final diagnosis in the differential diagnosis lists.

Outcome

We defined the primary outcome as the ratio of cases in which the final diagnosis was included in the top 10 differential diagnosis lists generated by LLaMA2 or LLaMA3. The denominator was the total number of cases. The numerator was the number of cases in which the final diagnosis was included in the lists. The secondary outcomes were the ratios of cases in which the final diagnosis was included in the top 5 differential diagnosis lists and in which it was listed as the top diagnosis. Together, the primary and secondary outcomes constituted overall diagnostic performance. Additionally, interrater reliability between the physicians’ evaluations of the differential diagnosis lists was calculated as the Cohen κ coefficient.
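
The outcome definitions above reduce to a single calculation on the rank of the final diagnosis in each list. The following is a minimal sketch, assuming each case is stored as a 1-based rank or `None` when the final diagnosis was judged absent; the function name and the toy data are hypothetical.

```python
from typing import Optional, Sequence


def inclusion_ratio(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of cases whose final diagnosis appears at rank <= k.

    ranks[i] is the 1-based position of the final diagnosis in the list for
    case i, or None when the evaluators judged that it was not included.
    """
    if not ranks:
        raise ValueError("at least one case is required")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Toy data for five hypothetical cases (not from the study dataset).
toy_ranks = [1, 4, None, 9, 2]
top10 = inclusion_ratio(toy_ranks, 10)  # primary outcome
top5 = inclusion_ratio(toy_ranks, 5)    # secondary outcome
top1 = inclusion_ratio(toy_ranks, 1)    # secondary outcome
```

Because the denominator is always the total number of cases, the three ratios are nested: top 1 can never exceed top 5, which can never exceed top 10.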

Exploratory Analysis

The dataset for this analysis comprised cases sourced from a broad spectrum of medical specialties. Each case report was tagged with one to six relevant medical specialties, ensuring a comprehensive representation of the diverse areas in medicine. These specialties were included as part of the standardized metadata attached to each case report, facilitating an organized and targeted analysis. In this study, we included only those specialties that were tagged in at least 10 different case reports.

The exploratory analysis involved quantifying the number of cases correctly diagnosed within each specialty and calculating the ratio of cases for each specialty where the final diagnosis was included in the top 10 differential diagnosis lists generated by LLaMA3 or LLaMA2. The denominator was the total number of cases for each specialty. The numerator was the number of cases in which the final diagnosis was included in the lists. Additionally, we calculated 95% CIs for each ratio to assess the precision of our estimates.
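
The per-specialty calculation, including the minimum of 10 tagged case reports, can be sketched as follows. The data structure, names, and toy data are hypothetical; the sketch omits the 95% CIs because the source does not state which interval method was used.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Case(NamedTuple):
    specialties: List[str]  # one to six specialty tags per case report
    in_top10: bool          # final diagnosis found in the model's top 10


def ratio_by_specialty(cases: List[Case], min_cases: int = 10) -> Dict[str, float]:
    """Top-10 inclusion ratio per specialty, keeping tags with >= min_cases reports."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for case in cases:
        for tag in set(case.specialties):  # count each tag once per case
            totals[tag] += 1
            hits[tag] += case.in_top10
    return {tag: hits[tag] / n for tag, n in totals.items() if n >= min_cases}


# Toy data (hypothetical): 12 cardiology cases, 3 of which are also tagged surgery.
toy = [Case(["Cardiology"], i % 2 == 0) for i in range(9)]
toy += [Case(["Cardiology", "Surgery"], True) for _ in range(3)]
result = ratio_by_specialty(toy)
```

Since a case can carry several tags, it contributes to every specialty it is tagged with, so the per-specialty counts sum to more than the number of cases.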

Statistical Analysis

Data were analyzed using R software (version 4.2.2; The R Foundation for Statistical Computing). Descriptive statistics for categorical or binary variables were summarized as numbers and percentages. We compared categorical data using the chi-square test to determine statistical significance. All tests were 2-sided, and a P value of <.05 was considered to indicate statistical significance. For assessing interrater reliability, the Cohen κ coefficient was computed using the “irr” package in R. The interpretation of Cohen κ was as follows: a value below 0.40 indicates poor agreement; values between 0.40 and 0.75 suggest fair to good agreement; and values above 0.75 reflect very good to excellent agreement [20: Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions. New York. John Wiley & Sons; 2003].
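
As an illustration of the chi-square comparison, the sketch below tests the counts reported in Table 4 with a plain-Python 2x2 test. It is not the authors' R code, and the function name is hypothetical. The source does not state whether a continuity correction was applied, so the correction is left as an optional flag.

```python
import math
from typing import Tuple


def chi_square_2x2(hits_a: int, hits_b: int, n: int,
                   yates: bool = False) -> Tuple[float, float]:
    """Pearson chi-square test comparing two proportions with equal denominators n.

    Returns (statistic, p_value). With 1 degree of freedom, the upper-tail
    probability of the chi-square distribution equals erfc(sqrt(x / 2)).
    """
    a, b = hits_a, n - hits_a  # model A: included / not included
    c, d = hits_b, n - hits_b  # model B: included / not included
    total = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:  # optional continuity correction
        diff = max(0.0, diff - total / 2)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    stat = total * diff ** 2 / denom
    return stat, math.erfc(math.sqrt(stat / 2))


# Counts reported in Table 4 (LLaMA3 vs LLaMA2, N=392 each).
for label, h3, h2 in [("top 10", 312, 195), ("top 5", 247, 149), ("top 1", 133, 89)]:
    stat, p = chi_square_2x2(h3, h2, 392)
    print(f"{label}: chi2={stat:.1f}, P<.001 is {p < 0.001}")
```

For all three outcomes the P value falls below .001 with or without the continuity correction, in line with the values reported in Table 4.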

Overview Cases

A total of 392 cases were included in this study. LLaMA3 and LLaMA2 generated differentials for all cases. The representative differentials are shown in Table 2. The dataset of cases, differentials by LLaMA3 and LLaMA2, and the final diagnoses are shown in Multimedia Appendix 4. The cases encompassed a broad range of medical specialties, with the highest number of cases in general internal medicine and surgery, as detailed in Table 3.

Table 2. The representative differentials generated by LLaMA3 and LLaMA2.

Final diagnosis	LLaMA3	LLaMA2
Dementia with Lewy bodies	Lewy body dementia Parkinson disease Dementia with Lewy bodies Alzheimer disease Rapid eye movement sleep behavior disorder Neuroleptic malignant syndrome Serotonin syndrome Delirium Frontotemporal dementia Vascular dementia	Parkinson disease Dementia with Lewy bodies Alzheimer disease Vascular dementia Frontotemporal dementia Huntington disease Creutzfeldt-Jakob disease Wernicke-Korsakoff syndrome Mild cognitive impairment Schizophrenia
Infectious myositis	Necrotizing fasciitis Sepsis Cellulitis Myositis Pyomyositis Abscess Osteomyelitis Endocarditis Bacteremia Gas gangrene	Infectious shock Sepsis Bacterial arthritis Muscle abscess Necrotizing fasciitis Cellulitis Pyomyositis Osteomyelitis Septicemia Multiorgan dysfunction syndrome
Wandering spleen	Wandering spleen Splenoptosis Intestinal malrotation Pancreatic heterotopy Gastric volvulus Midgut volvulus Internal hernia Ovarian torsion Uterine fibroid Pelvic inflammatory disease	Chronic pancreatitis Pancreatic neuroendocrine tumor Splenic tumor Intestinal obstruction Chronic appendicitis Pelvic inflammatory disease Endometriosis Ovarian cyst Gastrointestinal stromal tumor Mesenteric ischemia

Table 3. Medical specialties in all cases and in the cases correctly identified by LLaMA3 and by LLaMA2 (final diagnosis included in the top 10 differentials).

Medical specialty^a	All cases (N=392), n	Cases correctly identified by LLaMA3 (N=312), n (%, 95% CI)	Cases correctly identified by LLaMA2 (N=195), n (%, 95% CI)
Surgery	67	50 (74.6, 72.5-77)	34 (50.7, 49-52.5)
General internal medicine	64	55 (85.9, 83.7-88.2)	35 (54.7, 52.9-56.5)
Infectious diseases	55	48 (87.3, 84.8-89.7)	39 (70.9, 68.7-73.1)
Cardiology	49	38 (77.6, 75.1-80)	20 (40.8, 39.1-42.6)
Neurology	42	37 (88.1, 85.3-90.9)	23 (54.7, 52.5-57)
Urology	40	34 (85, 82.1-87.9)	23 (57.5, 55.2-59.8)
Oncology	32	26 (81.3, 78.1-84.4)	19 (59.4, 56.7-62)
Metabolic diseases	32	26 (81.3, 78.1-84.4)	19 (59.4, 56.7-62)
Radiology	29	20 (69, 65.9-72)	16 (55.2, 52.5-57.9)
Critical care medicine	27	22 (81.5, 78.1-84.9)	9 (33.3, 31.2-35.5)
Gastrointestinal	27	21 (77.8, 74.5-81.1)	10 (37, 34.7-39.3)
Hematology	22	17 (77.3, 73.6-80.9)	10 (45.5, 42.6-48.3)
Rheumatology	19	14 (73.7, 69.8-77.5)	12 (63.2, 59.6-66.7)
Nephrology	18	14 (77.8, 73.7-81.9)	8 (44.4, 41.4-47.5)
Respiratory	18	15 (83.3, 79.1-87.6)	11 (61.1, 57.5-64.7)
Obstetrics and gynecology	17	11 (64.7, 60.9-68.5)	8 (47.1, 43.8-50.3)
Endocrinology	16	14 (87.5, 82.9-92.1)	7 (43.8, 40.5-47)
Otolaryngology	13	10 (76.9, 72.2-81.7)	4 (30.8, 27.8-33.8)
Orthopedics	10	6 (60, 55.2-64.8)	4 (40, 36.1-43.9)

^aEach case report was tagged with one to six relevant medical specialties.

Overall Diagnostic Performance

The final diagnosis was included in the top 10 differentials generated by LLaMA3 in 79.6% (312/392) of the cases, compared to 49.7% (195/392) for LLaMA2, indicating a statistically significant improvement (P<.001). Additionally, LLaMA3 showed higher performance in including the final diagnosis in the top 5 differentials, observed in 63% (247/392) of cases, compared to LLaMA2’s 38% (149/392, P<.001). Moreover, the final diagnosis was accurately identified as the top diagnosis by LLaMA3 in 33.9% (133/392) of cases, significantly higher than the 22.7% (89/392) achieved by LLaMA2 (P<.001). The overall diagnostic performance of LLaMA3 and LLaMA2 is shown in Table 4. We observed fair to good agreement between the physicians’ evaluations of the differential diagnosis lists, with a κ coefficient of 0.69 and concordance in 84.2% (660/784) of the evaluated lists.

Table 4. Overall diagnostic performance of LLaMA3 and LLaMA2.

Diagnostic performance	LLaMA3, n/N (%)	LLaMA2, n/N (%)	P value^a
Final diagnosis included in the top 10 differential diagnosis lists	312/392 (79.6)	195/392 (49.7)	<.001
Final diagnosis included in the top 5 differential diagnosis lists	247/392 (63)	149/392 (38)	<.001
Final diagnosis listed as the top diagnosis	133/392 (33.9)	89/392 (22.7)	<.001

^aP value from chi-squared test.

Exploratory Analysis by Medical Specialty

The exploratory analysis across various medical specialties revealed variations in diagnostic performance, with LLaMA3 outperforming LLaMA2 in every specialty analyzed. All specialties showed improvements of more than 10 percentage points from LLaMA2 to LLaMA3, with nonoverlapping 95% CIs, indicating statistically significant enhancements. Specifically, critical care medicine, gastrointestinal, endocrinology, and otolaryngology exhibited remarkable improvements of more than 40 percentage points from LLaMA2. Conversely, infectious diseases, radiology, and obstetrics and gynecology showed the least improvements, with increases of about 10 to 20 percentage points from LLaMA2 to LLaMA3. Other specialties exhibited moderate improvements of 20 to 30 percentage points. Ophthalmology demonstrated the highest accuracy with 71.4% (5/7) of cases correctly identified, followed by otolaryngology at 61.5% (8/13). Lower accuracy was observed in specialties such as rehabilitation medicine at 11.1% (1/9) and rheumatology at 15.8% (3/19). Other specialties such as general internal medicine and surgery showed moderate performance with accuracies of 34.4% (22/64) and 28.4% (19/67), respectively. Table 3 details the breakdown of medical specialties, showing the total number of cases and those correctly identified by LLaMA3 and by LLaMA2. Figure 2 presents a radar chart illustrating the ratio of cases for each specialty where the final diagnosis was included in the top 10 differential diagnosis lists generated by LLaMA3 and LLaMA2.

**Figure 2.** Radar chart illustrating the improvement ratios for the inclusion of the final diagnosis within the top 10 differential diagnosis lists generated by LLaMA2 and LLaMA3, across various medical specialties. Each axis on the radar chart represents a specific medical specialty. The numerical values adjacent to each specialty name reflect the total number of cases analyzed within that specialty, providing context for the observed performance metrics.

Principal Results

This study demonstrated that the LLaMA3 model significantly outperforms LLaMA2 in overall diagnostic performance, showing an almost 1.5-fold improvement. Specifically, the inclusion rate of the final diagnosis in the top 10 differentials rose from approximately 50% to approximately 80%. This substantial enhancement reflects marked advancements within the LLaMA series over a relatively short period.

These enhancements likely come from the implementation of more advanced algorithms and more robust training datasets, highlighting the rapid evolution of generative AI capabilities in medical diagnostics. The significantly higher inclusion rates of the final diagnosis in the top 10, top 5 differentials, and the top diagnosis by LLaMA3 indicate that its model has been finely tuned for greater precision in analyzing complex medical cases. This tuning suggests that LLaMA3 is more adept at incorporating clinical nuances and recognizing a diverse range of symptoms, which is critical for generating accurate differential diagnoses in real-world clinical settings.

Model Bias and Generalizability

Because this study used data from a single journal, the generalizability of the findings may be limited. The cases predominantly represent complex or rare medical scenarios, which might not fully represent routine clinical situations found across diverse health care systems [21: Riley DS, Barber MS, Kienle GS, Aronson JK, von Schoen-Angerer T, Tugwell P, et al. CARE guidelines for case reports: explanation and elaboration document. J Clin Epidemiol. 2017;89:218-235]. This focus could skew the measured performance, suggesting that while LLaMA3 shows promise, its effectiveness in general practice remains to be validated in a more varied clinical context.

Practical Implementation and Challenges

Integrating LLaMA3 into clinical practice presents several challenges that require careful consideration. The foremost is regulatory approval, as generative AI, including the LLaMA series, has not yet been approved for direct clinical applications such as AI-enhanced diagnostics. Regulatory hurdles can significantly delay or impede the practical application of innovative technologies. Furthermore, clinician trust in AI decision-making is vital and requires the AI to be not only effective but also transparent: clinicians must be able to comprehend how decisions are derived to confidently integrate AI recommendations into their workflow.

The computational demands of running sophisticated models such as LLaMA3 also pose a significant challenge. High-performance computing resources, such as graphics processing units (GPUs) or cloud-based solutions, are essential to operate these advanced AI systems effectively, which could limit their deployment in resource-constrained settings.

Future Research and Development

To facilitate the effective integration of AI such as LLaMA3 into health care workflows, ongoing training with real-world data and continuous feedback from clinical use are indispensable. This iterative process will help ensure that the AI remains accurate and adapts to evolving medical standards. Exploring multimodal AI that incorporates text and image data from electronic health records could enhance diagnostic accuracy. Future studies should focus on integrating these systems with routine health care workflows to assess their practical utility and acceptance among health care providers. Additionally, addressing potential biases in AI decision-making and ensuring adherence to ethical health care standards are crucial for gaining acceptance and trust in clinical environments.

Results From Exploratory Analysis

The exploratory analysis across different medical specialties provided a view of LLaMA3’s performance, which varied across fields. For instance, specialties such as critical care medicine showed exceptionally high improvements in diagnostic accuracy with LLaMA3. This finding highlights its effectiveness in processing complex clinical courses.

However, the analysis also uncovered areas with modest improvements. For instance, radiology showed a small improvement, with about a 10% increase from LLaMA2. This result suggests a need for multimodal AI that can process image data in addition to text data [22: Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. Multimodal biomedical AI. Nat Med. 2022;28(9):1773-1784]. Multimodal AI enables the simultaneous processing and understanding of multiple forms of data, including text and images, which is particularly pertinent for enhancing diagnostic accuracy in radiology.

The variability in these improvements highlights the importance of targeted algorithmic training tailored to the specific demands of each medical specialty. Specialized training datasets that encompass the wide range of scenarios encountered in particular fields could be crucial in enhancing the generative AI’s learning curve and improving its utility in clinical practice. The performance of LLaMA3 varies across medical specialties, with notably high improvement ratios in ophthalmology and otolaryngology, likely due to the distinct and well-defined symptoms associated with conditions treated within these fields. Conversely, specialties such as rehabilitation medicine and rheumatology showed lower improvement ratios, attributed to the complexity of the clinical course and immune responses, posing challenges for the current model’s diagnostic algorithms. A significant factor contributing to the variation in performance is the relatively small number of cases available for some specialties.

Strengths

A major strength of this study is the controlled comparison of diagnostic performances using identical cases and standardized parameters, providing a clear assessment of improvements from LLaMA2 to LLaMA3. Additionally, the longitudinal assessment of the LLaMA series offers valuable insights into the developmental course of AI models in medical diagnostics. This is particularly notable when contrasted with findings from other AI systems where no improvement was noted over time [23: Harada Y, Sakamoto T, Sugimoto S, Shimizu T. Longitudinal changes in diagnostic accuracy of a differential diagnosis list developed by an AI-based symptom checker: retrospective observational study. JMIR Form Res. 2024;8:e53985].

Limitations