Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study

doi:10.2196/48023

Published on 13.Oct.2023 in Vol 7 (2023)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/48023, first published 09.Apr.2023.

Yasutaka Yanagita¹; Daiki Yokokawa¹; Shun Uchida¹; Junsuke Tawara²; Masatomi Ikusaka¹

Cited by (70)

Journals

Kim J, Kim S, Choi J, Lee Y. Reliability of ChatGPT for performing triage task in the emergency department using the Korean Triage and Acuity Scale. Digital Health 2024;10
Sallam M, Barakat M, Sallam M. A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence–Based Models in Health Care Education and Practice: Development Study Involving a Literature Review. Interactive Journal of Medical Research 2024;13:e54704
Chau R, Thu K, Yu O, Lo E, Hsung R, Lam W. Response to Generative AI in Dental Licensing Examinations: Comment. International Dental Journal 2024;74(4):897
Yokokawa D, Yanagita Y, Li Y, Yamashita S, Shikino K, Noda K, Tsukamoto T, Uehara T, Ikusaka M. For any disease a human can imagine, ChatGPT can generate a fake report. Diagnosis 2024;11(3):329
Pinto V, de Azevedo M, Wroclawski M, Gentile G, Jesus V, de Bessa Junior J, Nahas W, Sacomani C, Sandhu J, Gomes C. Conformity of ChatGPT recommendations with the AUA/SUFU guideline on postprostatectomy urinary incontinence. Neurourology and Urodynamics 2024;43(4):935
Nakao T, Miki S, Nakamura Y, Kikuchi T, Nomura Y, Hanaoka S, Yoshikawa T, Abe O. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study. JMIR Medical Education 2024;10:e54393
Noda M, Ueno T, Koshu R, Takaso Y, Shimada M, Saito C, Sugimoto H, Fushiki H, Ito M, Nomura A, Yoshizaki T. Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study. JMIR Medical Education 2024;10:e57054
Sato H, Ogasawara K. ChatGPT (GPT-4) passed the Japanese National License Examination for Pharmacists in 2022, answering all items including those with diagrams: a descriptive study. Journal of Educational Evaluation for Health Professions 2024;21:4
Kawahara T, Sumi Y. GPT-4/4V's performance on the Japanese National Medical Licensing Examination. Medical Teacher 2025;47(3):450
Zhu L, Mou W, Hong C, Yang T, Lai Y, Qi C, Lin A, Zhang J, Luo P. The Evaluation of Generative AI Should Include Repetition to Assess Stability. JMIR mHealth and uHealth 2024;12:e57978
Jedrzejczak W, Skarzynski P, Raj-Koziak D, Sanfins M, Hatzopoulos S, Kochanek K. ChatGPT for Tinnitus Information and Support: Response Accuracy and Retest after Three and Six Months. Brain Sciences 2024;14(5):465
Bharatha A, Ojeh N, Fazle Rabbi A, Campbell M, Krishnamurthy K, Layne-Yarde R, Kumar A, Springer D, Connell K, Majumder M. Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Varied Levels of Bloom’s Taxonomy. Advances in Medical Education and Practice 2024;15:393
Hirano Y, Hanaoka S, Nakao T, Miki S, Kikuchi T, Nakamura Y, Nomura Y, Yoshikawa T, Abe O. GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination. Japanese Journal of Radiology 2024;42(8):918
Yanagita Y, Yokokawa D, Fukuzawa F, Uchida S, Uehara T, Ikusaka M. Expert assessment of ChatGPT’s ability to generate illness scripts: an evaluative study. BMC Medical Education 2024;24(1)
Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. Journal of Medical Internet Research 2024;26:e60807
Rossettini G, Rodeghiero L, Corradi F, Cook C, Pillastrini P, Turolla A, Castellini G, Chiappinotto S, Gianola S, Palese A. Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study. BMC Medical Education 2024;24(1)
Hsieh C, Hsieh H, Lin H. Evaluating the performance of ChatGPT-3.5 and ChatGPT-4 on the Taiwan plastic surgery board examination. Heliyon 2024;10(14):e34851
Ishida K, Hanada E. Potential of ChatGPT to Pass the Japanese Medical and Healthcare Professional National Licenses: A Literature Review. Cureus 2024
Ishida K, Arisaka N, Fujii K. Analysis of Responses of GPT-4 V to the Japanese National Clinical Engineer Licensing Examination. Journal of Medical Systems 2024;48(1)
Jin H, Lee H, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Medical Education 2024;24(1)
Sallam M, Al-Salahat K, Eid H, Egger J, Puladi B. Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5 and Humans in Clinical Chemistry Multiple-Choice Questions. Advances in Medical Education and Practice 2024;15:857
Fujimoto M, Kuroda H, Katayama T, Yamaguchi A, Katagiri N, Kagawa K, Tsukimoto S, Nakano A, Imaizumi U, Sato-Boku A, Kishimoto N, Itamiya T, Kido K, Sanuki T. Evaluating Large Language Models in Dental Anesthesiology: A Comparative Analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam. Cureus 2024
Yanagita Y, Yokokawa D, Uchida S, Li Y, Uehara T, Ikusaka M. Can AI-Generated Clinical Vignettes in Japanese Be Used Medically and Linguistically? Journal of General Internal Medicine 2024;39(16):3282
Ramgopal S, Varma S, Gorski J, Kester K, Shieh A, Suresh S. Evaluation of a Large Language Model on the American Academy of Pediatrics' PREP Emergency Medicine Question Bank. Pediatric Emergency Care 2024;40(12):871
Song E, Lee S. Comparative Analysis of the Response Accuracies of Large Language Models in the Korean National Dental Hygienist Examination Across Korean and English Questions. International Journal of Dental Hygiene 2025;23(2):267
Liu M, Okuhara T, Chang X, Okada H, Kiuchi T, Khlaif Z. Performance of ChatGPT in medical licensing examinations in countries worldwide: A systematic review and meta-analysis protocol. PLOS ONE 2024;19(10):e0312771
Aster A, Laupichler M, Rockwell-Kollmann T, Masala G, Bala E, Raupach T. ChatGPT and Other Large Language Models in Medical Education — Scoping Literature Review. Medical Science Educator 2024;35(1):555
Chen C, Li X, Luo H. Evaluation of accuracies of large language models in answering clinical questions related to Mediterranean diet on cardiodiabesity. Interdisciplinary Nursing Research 2024;3(3):157
Sugawara Y, Hirakawa Y, Nangaku M. Telemedicine in nephrology: future perspective and solutions. Clinical Kidney Journal 2024;17(Supplement_2):ii1
Ho C, Tian T, Ayers A, Aaron R, Phillips V, Wolf R, Mathioudakis N, Dai T, Klonoff D. Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review. BMC Medical Informatics and Decision Making 2024;24(1)
Miyazaki Y, Hata M, Omori H, Hirashima A, Nakagawa Y, Eto M, Takahashi S, Ikeda M. Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evalution of Accuracy in Text-Only and Image-Based Questions. JMIR Medical Education 2024;10:e63129
Morishita M, Fukuda H, Yamaguchi S, Muraoka K, Nakamura T, Hayashi M, Yoshioka I, Ono K, Awano S. An exploratory assessment of GPT-4o and GPT-4 performance on the Japanese National Dental Examination. The Saudi Dental Journal 2024;36(12):1577
Shen Y, Xu Y, Ma J, Rui W, Zhao C, Heacock L, Huang C. Multi-modal large language models in radiology: principles, applications, and potential. Abdominal Radiology 2024;50(6):2745
Maraqa N, Samargandi R, Poichotte A, Berhouet J, Benhenneda R. Comparing performances of french orthopaedic surgery residents with the artificial intelligence ChatGPT-4/4o in the French diploma exams of orthopaedic and trauma surgery. Orthopaedics & Traumatology: Surgery & Research 2025;111(8):104080
Kaewboonlert N, Poontananggul J, Pongsuwan N, Bhakdisongkhram G. Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study. JMIR Medical Education 2025;11:e58898
Jin H, Kim E. Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist Licensing Examination: Comparison Study. JMIR Medical Education 2024;10:e57451
Patel S, Patel R. Embracing Large Language Models for Adult Life Support Learning. Cureus 2024
Maraqa N, Samargandi R, Poichotte A, Berhouet J, Benhenneda R. Comparaison des performances des internes français de chirurgie orthopédique et de l’intelligence artificielle ChatGPT-4/4o aux examens du diplôme d’études spécialisées de chirurgie orthopédique et traumatologique. Revue de Chirurgie Orthopédique et Traumatologique 2025
Fukushima T, Manabe M, Yada S, Wakamiya S, Yoshida A, Urakawa Y, Maeda A, Kan S, Takahashi M, Aramaki E. Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset. JMIR Medical Informatics 2025;13:e65047
Ramgopal S, Suresh S. Reply to Kleebayoon and Wiwanitkit. Pediatric Emergency Care 2025;41(4):e28
Wang X, Ye H, Zhang S, Yang M, Wang X. Evaluation of the Performance of Three Large Language Models in Clinical Decision Support: A Comparative Study Based on Actual Cases. Journal of Medical Systems 2025;49(1)
Yagahara A, Uesugi M, Yokoi H. Exploration of the optimal deep learning model for english-Japanese machine translation of medical device adverse event terminology. BMC Medical Informatics and Decision Making 2025;25(1)
Yaman Kula A, Durmaz Çelik N, Özben S, Yatmazoğlu Çetin M, Köseoğlu M. Artificial intelligence versus neurologists: A comparative study on multiple sclerosis expertise. Clinical Neurology and Neurosurgery 2025;250:108785
Mane P. Accuracy and creativity analysis of ChatGPT in quantitative aptitude. The International Journal of Information and Learning Technology 2025;42(2):224
Rider N, Li Y, Chin A, DiGiacomo D, Dutmer C, Farmer J, Roberts K, Savova G, Ong M. Evaluating large language model performance to support the diagnosis and management of patients with primary immune disorders. Journal of Allergy and Clinical Immunology 2025;156(1):81
Siddiqui Z, Azeez M, Sohail S, Ahmed J, Madsen D. A preliminary exploration of ChatGPT’s potential in medical reasoning and patient care. Critical Public Health 2025;35(1)
Pinto V, de Bessa J, Gomes C. Response to Letter to the Editor: Conformity of ChatGPT Recommendations With the AUA/SUFU Guideline on Postprostatectomy Urinary Incontinence. Neurourology and Urodynamics 2025;44(5):1221
Wang L, Li J, Zhuang B, Huang S, Fang M, Wang C, Li W, Zhang M, Gong S. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis. Journal of Medical Internet Research 2025;27:e64486
Wu Y, Wu Y, Chang Y, Yu C, Wu C, Sung W, Atoum I. Advancing medical AI: GPT-4 and GPT-4o surpass GPT-3.5 in Taiwanese medical licensing exams. PLOS One 2025;20(6):e0324841
Yodrabum N, Chaisrisawadisuk S, Apichonbancha S, Khaogate K, Noraset T, Sakkitjarung C, Moore M. Artificial Intelligence and Human Expertise in Cleft Lip and Palate Care: A Comparative Study of Accuracy, Readability, and Treatment Quality. Journal of Craniofacial Surgery 2026;37(3/4):425
Fukataki Y, Hayashi W, Nishimoto N, Ito Y, Kuo P. Developing artificial intelligence tools for institutional review board pre-review: A pilot study on ChatGPT’s accuracy and reproducibility. PLOS Digital Health 2025;4(6):e0000695
Liang Z, Kuang Y, Liang X, Liang G, Li Z. A Comparative Study of the Accuracy and Readability of Responses from Four Generative AI Models to COVID-19-Related Questions. COVID 2025;5(7):99
Sato H, Ogasawara K, Sakurai H. Performance Evaluation of 18 Generative AI Models (ChatGPT, Gemini, Claude, and Perplexity) in 2024 Japanese Pharmacist Licensing Examination: Comparative Study. JMIR Medical Education 2025;11:e76925
Jaleel A, Aziz U, Farid G, Zahid Bashir M, Mirza T, Khizar Abbas S, Aslam S, Sikander R. Evaluating the Potential and Accuracy of ChatGPT-3.5 and 4.0 in Medical Licensing and In-Training Examinations: Systematic Review and Meta-Analysis. JMIR Medical Education 2025;11:e68070
Ignjatović A, Anđelković Apostolović M, Stevanović L, Radovanović P, Topalović M, Filipović T, Otašević S. ChatGPT’s progress over time: A longitudinal enhancing biostatistical problem-solving in medical education. Health Informatics Journal 2025;31(3)
Kasagga A, Sapkota A, Changaramkumarath G, Abucha J, Wollel M, Somannagari N, Husami M, Hailu K, Kasagga E. Performance of ChatGPT and Large Language Models on Medical Licensing Exams Worldwide: A Systematic Review and Network Meta-Analysis With Meta-Regression. Cureus 2025
Yanagita Y, Yokokawa D, Ihara S, Yoshida R, Okano Y, Uehara T. Quality Assessment of Large Language Model–Generated Medical Dialogue for Clinical Vignettes: Evaluation Study. JMIR Formative Research 2025;9:e80752
Sezen A, Şahin Özdemir M, Özdemir Y. Comparative Evaluation of ChatGPT and Gemini in Answering Questions on Vaccines and Immunization. Genel Tıp Dergisi 2025;35(5):1011
Hidalgo Guevara J, Hidalgo Guevara A. Effect of Generative Artificial Intelligence Use on Diagnostic Learning in Medical Students: A Quasi-Experimental Study. Salud, Ciencia y Tecnología 2025;5:1564
Saita K, Mine Y, Amano S. What the performance of multimodal LLMs on a national licensing exam teaches us about occupational therapy education. BMC Medical Education 2026;26(1)
Miyamura M, Fujiki G, Kanzaki Y, Tsuda K, Asano H, Morita H, Hoshiga M. Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential. International Medical Education 2026;5(1):9
Foster A, Price N, Brown V, Reed S. Artificial Intelligence in Health Professional Licensing: Performance of ChatGPT-3.5 and GPT-4; Systematic Review and Meta-Analysis. Annals of Pharmacy Education, Safety, and Public Health Advocacy 2022;2(1):176
Hang Y, Wu J, Bai L, Wu M, Yu J, Li L, Piao X. Evaluation of large Language models on pediatric asthma: a comparative study of Claude3-Opus, Gemini 2.0, ChatGPT-4o, and DeepSeek—a cross-sectional questionnaire study. BMC Medical Informatics and Decision Making 2026;26(1)
Erdem O, Canbak T, Acar A, Ceylan E, Çakıt H, Başak F. Guideline-based, but not error-free: Multilingual risks in AI-powered patient counseling on gallstones. International Journal of Medical Informatics 2026;212:106341
Yacobson E, Schleifer Y, Bar-Dov Z, Rap S, Blonder R, Alexandron G. Benchmarking AI on Standard Chemistry Exams: LLMs Still Underperform Compared to High School Students. Journal of Science Education and Technology 2026
Tiller N, Marcon A, Zenone M, Kidd K, Jeukendrup A, Master Z, Caulfield T. Generative artificial intelligence-driven chatbots and medical misinformation: an accuracy, referencing and readability audit. BMJ Open 2026;16(4):e112695
Ozturk C, Keskin T, Baskurt F. Assessing ChatGPT’s responses to office ergonomics and spine health questions. International Journal of Occupational Safety and Ergonomics 2026:1
Beşparmak A, Beşparmak T. Performance of artificial intelligence chatbots in providing feeding management and oral health guidance for children with cleft lip and palate: ChatGPT 5.2 vs Gemini 3 Pro. European Journal of Pediatrics 2026;185(7)
Qiao M, Zhang H, Shen Y, Zhang T, Liu S, Lou Y, Yang L, Song C, Liu J. Undergraduate nursing students' attitudes and needs regarding the use of generative artificial intelligence in professional learning: a qualitative study. Frontiers in Public Health 2026;14

Conference Proceedings

Pattanshetti R, Sidddanagoudra S, Chand S, S P, Hebbar R, Vaishnavi. Assessing the Performance of Large Language Models on the Foreign Medical Graduate Examination (FMGE): Insights from GPT-4 Turbo, Gemini Advanced, and LLaMA 3.1 (70B). 2025 International Conference on Biomedical Engineering and Sustainable Healthcare (ICBMESH)

Citation

Please cite as:

Yanagita Y, Yokokawa D, Uchida S, Tawara J, Ikusaka M
Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study
JMIR Form Res 2023;7:e48023
doi: 10.2196/48023 PMID: 37831496 PMCID: 10612006

This paper is in the following e-collection/theme issue:

Formative Evaluation of Digital Health Interventions
e-Learning and Digital Medical Education
Testing and Assessment in Medical Education
Artificial Intelligence
Theme Issue: ChatGPT and Generative Language Models in Medical Education
