Performance of Large Language Models in Numerical Versus Semantic Medical Knowledge: Cross-Sectional Benchmarking Study on Evidence-Based Questions and Answers

For example, Katz et al [28] reported that GPT-4's accuracy ranged widely, from 17.42% (n=21) to 74.7% (n=90), across various medical disciplines. In contrast, our study found that GPT-4's accuracy fell within a narrower range, from 53.5% (n=704) to 60.35% (n=1076). This discrepancy could be partially attributed to the differing medical disciplines emphasized in each study, as well as to variations in question structure.

Eden Avnat, Michal Levy, Daniel Herstain, Elia Yanko, Daniel Ben Joya, Michal Tzuchman Katz, Dafna Eshel, Sahar Laros, Yael Dagan, Shahar Barami, Joseph Mermelstein, Shahar Ovadia, Noam Shomron, Varda Shalev, Raja-Elie E Abdulnour

J Med Internet Res 2025;27:e64452