
JMIR Formative Research (JMIR Form Res), ISSN 2561-326X

JMIR Publications, Toronto, Canada

Volume 10, issue 1, e91470

doi: 10.2196/91470

Letter to the Editor

Authors’ Reply: Critical Limitations in Comparing ChatGPT and DeepSeek for Orthopedic Assessment

Chirathit Anusitviwat, MD (1); Sitthiphong Suwannaphisit, MD (2); Jongdee Bvonpanttarananon (1); Boonsin Tangtrakulwanich, MD (1)

(1) Prince of Songkla University, 15 Karnjanavanich Road, Hat Yai, Songkhla, Thailand

(2) Navamindradhiraj University, Bangkok, Thailand

Amanda Iannaccio

Correspondence to Chirathit Anusitviwat, MD, Prince of Songkla University, 15 Karnjanavanich Road, Hat Yai, Songkhla, 90110, Thailand, 66 74451601; pchirathit@gmail.com

Published: 17 March 2026

Manuscript history: 15 January 2026; 14 February 2026; 26 February 2026

Copyright year: 2026

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.

https://formative.jmir.org/2025/1/e75607

https://formative.jmir.org/2026/1/e90242/

Keywords: ChatGPT; large language model; LLM; orthopedic; multiple-choice question; MCQ

We thank you for the useful and constructive comments [1] on our article “Comparing ChatGPT and DeepSeek for Assessment of Multiple-Choice Questions in Orthopedic Medical Education: Cross-Sectional Study” [2]. This reply addresses the concerns raised in the letter to the editor.

Misinterpretation of Reliability Statistics

In our study, we administered the multiple-choice questions (MCQs) to ChatGPT and DeepSeek on separate days. The responses from each of the two large language models (LLMs) were assessed by two assessors. Because the same two assessors evaluated each LLM, the reported Cohen κ coefficients represent within-model interrater reliability, that is, agreement between the two assessors for a given LLM, not agreement between the two LLMs [3]. Therefore, describing these results as agreement between the two models is inaccurate.
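To illustrate what a within-model κ measures, the following minimal sketch computes Cohen κ between two assessors who evaluate the same LLM's responses. The gradings shown are hypothetical and are not data from the study, and the function name `cohen_kappa` is chosen here for illustration only.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical gradings of ONE model's answers (1 = correct, 0 = incorrect).
assessor_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
assessor_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

kappa_within_model = cohen_kappa(assessor_1, assessor_2)
print(round(kappa_within_model, 3))
```

Both input lists describe the same model's responses, so the resulting κ quantifies how consistently the two assessors scored that model. It says nothing about whether a second model gave the same answers; that comparison would require pairing the two models' responses item by item.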

Linguistic Ambiguity and Generalizability

All MCQs used in our study were administered in English; no Thai-language inputs or translations were used. Therefore, the performance differences between the two models reflect their performance on English-language medical questions rather than variability due to translation or non-English linguistic processing.

Reproducibility and Interface Transparency

All models in our study were accessed via web-based user interfaces (UIs), not application programming interfaces. We acknowledge that web-based UIs are subject to updates and lack version control, which limits reproducibility. However, the web-based version of ChatGPT is easy to access and requires no software installation. It also allows quick testing and exploration without technical or cost barriers, making it well suited for nontechnical users and educational studies [4]. Therefore, we used the web-based UI in our study.

Risk of Data Contamination

Although the MCQs used in our study have been in use for more than 5 years, they come from private orthopedic examinations. Thus, we believe that these items are unlikely to appear in public sources. Future research using newly created MCQs may be better suited to assessing the capability or efficacy of LLMs.

Data Reporting Discrepancy

Upon re-examination, we confirm that the correct accuracy for the pelvic and spine injury category (n=19) using the Reason function is 16 of 19, or approximately 84.2%. The value of 68.8% reported in Table 2 was a typographical error. This error has been corrected through a published corrigendum [5].
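The corrected percentage can be checked directly from the counts stated above. This is a minimal sketch; the variable names are illustrative.

```python
# Counts as stated in the reply: 16 correct responses out of 19 items.
correct, total = 16, 19

# Accuracy as a percentage, rounded to one decimal place.
accuracy_pct = round(100 * correct / total, 1)
print(f"{correct}/{total} = {accuracy_pct}%")
```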

Conflicts of Interest

None declared.

Abbreviations

LLM: large language model

MCQ: multiple-choice question

UI: user interface

References

1. Ayas, Acar. Critical limitations in comparing ChatGPT and DeepSeek for orthopedic assessment. JMIR Form Res. 2026;10:e90242. doi: 10.2196/90242

2. Anusitviwat, Suwannaphisit, Bvonpanttarananon, Tangtrakulwanich. Comparing ChatGPT and DeepSeek for assessment of multiple-choice questions in orthopedic medical education: cross-sectional study. JMIR Form Res. 2025 Dec 19;9:e75607. doi: 10.2196/75607. PMID: 41418321

3. McHugh. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). 2012;22(3):276-282. PMID: 23092060

4. Park, Heo, Suh, Shim. Uncover this tech term: application programming interface for large language models. Korean J Radiol. 2025 Aug;26(8):793-796. doi: 10.3348/kjr.2025.0360. PMID: 40736411

5. Anusitviwat, Suwannaphisit, Bvonpanttarananon, Tangtrakulwanich. Correction: comparing ChatGPT and DeepSeek for assessment of multiple-choice questions in orthopedic medical education: cross-sectional study. JMIR Form Res. 2026 Feb 26;10:e92549. doi: 10.2196/92549. PMID: 41747218