<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "journalpublishing.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" article-type="research-article"><front><journal-meta><journal-id journal-id-type="nlm-ta">JMIR Form Res</journal-id><journal-id journal-id-type="publisher-id">formative</journal-id><journal-id journal-id-type="index">27</journal-id><journal-title>JMIR Formative Research</journal-title><abbrev-journal-title>JMIR Form Res</abbrev-journal-title><issn pub-type="epub">2561-326X</issn><publisher><publisher-name>JMIR Publications</publisher-name><publisher-loc>Toronto, Canada</publisher-loc></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">v10i1e84904</article-id><article-id pub-id-type="doi">10.2196/84904</article-id><article-categories><subj-group subj-group-type="heading"><subject>Original Paper</subject></subj-group></article-categories><title-group><article-title>Fine-Tuned Large Language Models for Generating Multiple-Choice Questions in Anesthesiology: Psychometric Comparison With Faculty-Written Items</article-title></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name name-style="western"><surname>H&#x00F6;lzing</surname><given-names>Carlos Ramon</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff1">1</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Meynhardt</surname><given-names>Charlotte</given-names></name><xref ref-type="aff" rid="aff1">1</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Meybohm</surname><given-names>Patrick</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff1">1</xref></contrib><contrib contrib-type="author"><name 
name-style="western"><surname>K&#x00F6;nig</surname><given-names>Sarah</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff2">2</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Kranke</surname><given-names>Peter</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff1">1</xref></contrib></contrib-group><aff id="aff1"><institution>Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, University Hospital W&#x00FC;rzburg</institution><addr-line>Oberd&#x00FC;rrbacher Str. 6</addr-line><addr-line>W&#x00FC;rzburg</addr-line><country>Germany</country></aff><aff id="aff2"><institution>Institute of Medical Teaching and Medical Education Research, University Hospital W&#x00FC;rzburg</institution><addr-line>W&#x00FC;rzburg</addr-line><country>Germany</country></aff><contrib-group><contrib contrib-type="editor"><name name-style="western"><surname>Mavragani</surname><given-names>Amaryllis</given-names></name></contrib></contrib-group><contrib-group><contrib contrib-type="reviewer"><name name-style="western"><surname>Rezigalla</surname><given-names>Assad Ali</given-names></name></contrib><contrib contrib-type="reviewer"><name name-style="western"><surname>Ding</surname><given-names>Liang</given-names></name></contrib></contrib-group><author-notes><corresp>Correspondence to Carlos Ramon H&#x00F6;lzing, MD, Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, University Hospital W&#x00FC;rzburg, Oberd&#x00FC;rrbacher Str. 
6, W&#x00FC;rzburg, 97080, Germany; <email>hoelzing_c@ukw.de</email></corresp></author-notes><pub-date pub-type="collection"><year>2026</year></pub-date><pub-date pub-type="epub"><day>18</day><month>2</month><year>2026</year></pub-date><volume>10</volume><elocation-id>e84904</elocation-id><history><date date-type="received"><day>26</day><month>09</month><year>2025</year></date><date date-type="rev-recd"><day>18</day><month>12</month><year>2025</year></date><date date-type="accepted"><day>24</day><month>12</month><year>2025</year></date></history><copyright-statement>&#x00A9; Carlos Ramon H&#x00F6;lzing, Charlotte Meynhardt, Patrick Meybohm, Sarah K&#x00F6;nig, Peter Kranke. Originally published in JMIR Formative Research (<ext-link ext-link-type="uri" xlink:href="https://formative.jmir.org">https://formative.jmir.org</ext-link>), 18.2.2026. </copyright-statement><copyright-year>2026</copyright-year><license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on <ext-link ext-link-type="uri" xlink:href="https://formative.jmir.org">https://formative.jmir.org</ext-link>, as well as this copyright and license information must be included.</p></license><self-uri xlink:type="simple" xlink:href="https://formative.jmir.org/2026/1/e84904"/><abstract><sec><title>Background</title><p>Multiple-choice questions (MCQs) are widely used in medical education to ensure standardized and objective assessment. 
Developing high-quality items requires both subject expertise and methodological rigor. Large language models (LLMs) offer new opportunities for automated item generation. However, most evaluations rely on general-purpose prompting, and psychometric comparisons with faculty-written items remain scarce.</p></sec><sec><title>Objective</title><p>This study aimed to evaluate whether a fine-tuned LLM can generate MCQs (Type A) in anesthesiology with psychometric properties comparable to those written by expert faculty.</p></sec><sec sec-type="methods"><title>Methods</title><p>The study was embedded in the regular written anesthesiology examination of the eighth-semester medical curriculum with 157 students. The examination comprised 30 single best-answer MCQs, of which 15 were generated by senior faculty and 15 by a fine-tuned GPT-based model. A custom GPT-based (GPT-4) model was adapted with anesthesiology lecture slides, the National Competence-Based Learning Objectives Catalogue (NKLM 2.0), past examination questions, and faculty publications using supervised instruction-tuning with standardized prompt&#x2013;response pairs. Item analysis followed established psychometric standards.</p></sec><sec sec-type="results"><title>Results</title><p>In total, 29 items (14 expert, 15 LLM-generated) were analyzed. Expert-generated questions had a mean difficulty of 0.81 (SD 0.19), point-biserial correlation of 0.16 (SD 0.07), and discrimination index of 0.09 (SD 0.08). LLM-generated items had a mean difficulty of 0.79 (SD 0.18), point-biserial correlation of 0.17 (SD 0.04), and discrimination index of 0.08 (SD 0.11). Mann-Whitney <italic>U</italic> tests revealed no significant differences between expert- and LLM-generated items for difficulty (<italic>P</italic>=.38), point-biserial correlation coefficient (<italic>P</italic>=.96), or discrimination index (<italic>P</italic>=.59). Categorical analyses confirmed no significant group differences. 
Both sets, however, showed only modest psychometric quality.</p></sec><sec sec-type="conclusions"><title>Conclusions</title><p>Supervised fine-tuned LLMs are capable of generating MCQs with psychometric properties comparable to those written by experienced faculty. Given the limitations and cohort-dependency of psychometric indices, automated item generation should be considered a complement rather than a replacement for manual item writing. Further research with larger item sets and multi-institutional validation is needed to confirm generalizability and optimize integration of LLM-based tools into assessment development.</p></sec></abstract><kwd-group><kwd>medical education</kwd><kwd>multiple-choice questions</kwd><kwd>large language models</kwd><kwd>fine-tuning</kwd><kwd>psychometrics</kwd><kwd>assessment</kwd><kwd>anesthesiology</kwd><kwd>artificial intelligence</kwd></kwd-group></article-meta></front><body><sec id="s1" sec-type="intro"><title>Introduction</title><p>Multiple-choice questions (MCQs) are fundamental to the objective assessment of medical students. They allow standardized testing across large cohorts and play a central role in evaluating foundational and applied knowledge [<xref ref-type="bibr" rid="ref1">1</xref>]. However, the development of high-quality MCQs demands not only deep domain knowledge but also significant methodological and didactic expertise [<xref ref-type="bibr" rid="ref2">2</xref>,<xref ref-type="bibr" rid="ref3">3</xref>]. Effective items must balance appropriate difficulty, plausible distractors, minimal cueing, and strong discriminatory power to differentiate between varying levels of student performance [<xref ref-type="bibr" rid="ref4">4</xref>].</p><p>Recent advances in artificial intelligence (AI), particularly large language models (LLMs), offer novel tools for automated question generation. 
For instance, efforts comparing ChatGPT-3.5&#x2013;generated MCQs with expert-written items in neurophysiology revealed similar difficulty levels but lower discriminatory power in LLM-generated questions [<xref ref-type="bibr" rid="ref5">5</xref>]. A systematic review of LLM use in medical MCQ generation found that while LLMs can produce examination-relevant items, many require additional modification due to quality issues [<xref ref-type="bibr" rid="ref6">6</xref>]. Other studies highlight linguistic and structural shortcomings in automatically generated MCQs, particularly regarding distractor plausibility and alignment with instructional content [<xref ref-type="bibr" rid="ref7">7</xref>,<xref ref-type="bibr" rid="ref8">8</xref>].</p><p>Recent domain-specific efforts such as Hypnos [<xref ref-type="bibr" rid="ref9">9</xref>], CDGen [<xref ref-type="bibr" rid="ref10">10</xref>], and the Chinese Anesthesiology Benchmark [<xref ref-type="bibr" rid="ref11">11</xref>] have demonstrated that LLMs can be effectively fine-tuned or benchmarked within anesthesiology. However, these studies primarily focus on domain adaptation and benchmark performance rather than psychometric validation of automatically generated examination items. To address this gap, a GPT-based model was adapted using anesthesia-specific teaching materials, the National Competence-Based Learning Objectives Catalogue in Medicine (NKLM 2.0), past examination items, and faculty publications [<xref ref-type="bibr" rid="ref12">12</xref>]. 
Item development for both expert- and AI-generated questions was systematically mapped to the NKLM 2.0, Bloom&#x2019;s taxonomy, and the local examination blueprint to ensure comprehensive curricular coverage and to allow a fair psychometric comparison.</p><p>This study aimed to evaluate whether a fine-tuned LLM can generate MCQs (Type A) in anesthesiology with psychometric properties comparable to those written by expert faculty.</p></sec><sec id="s2" sec-type="methods"><title>Methods</title><sec id="s2-1"><title>Overview</title><p>This study analyzed the performance of MCQs used in the regular written examinations of anesthesiology in the eighth semester of medical training with 157 students. The examination consisted of 30 items. Half of the items (n=15) were written by senior faculty members, and half (n=15) were generated by a fine-tuned LLM. Nine faculty members from the Department of Anesthesiology, each with at least 10 years of experience, participated in item creation. All had prior training in assessment design through institutional workshops on multiple-choice item writing. In addition, all items were independently reviewed by an educational specialist with a Master of Medical Education degree to ensure adherence to established item-writing principles. Faculty were aware of the study but blinded to the psychometric comparison during data collection.</p><p>Data analysis was performed fully anonymously. The participating students were regular medical students in their eighth semester. They were not informed about the origin of the examination questions and therefore did not know whether an item was generated by faculty or the LLM.</p><p>A customized GPT-based model was developed specifically for this study. The model was built as a domain-adapted instance of GPT-3.5-Turbo, configured to generate single-best-answer MCQs. 
Adaptation followed a supervised instruction-tuning approach: several hundred standardized prompt-response pairs were created using anesthesiology lecture slides, NKLM 2.0, past examination questions, and faculty publications. Faculty publications were included to capture authentic domain phrasing and ensure that the model reflected institution-specific conceptualizations of anesthetic procedures. Previous research shows that faculty development and the use of high-quality source material improve item validity and discrimination [<xref ref-type="bibr" rid="ref13">13</xref>].</p><p>These materials were curated to align with Bloom&#x2019;s taxonomy and national curricular requirements [<xref ref-type="bibr" rid="ref14">14</xref>]. The fine-tuning pipeline can be found in <xref ref-type="supplementary-material" rid="app1">Multimedia Appendix 1</xref>.</p><p>Item analysis followed established psychometric standards. Difficulty was defined as the mean proportion of correct responses (0&#x2010;1). Values between 0.30 and 0.70 are generally considered optimal, those greater than 0.70 indicate easy items, and those less than 0.30 indicate difficult items [<xref ref-type="bibr" rid="ref15">15</xref>,<xref ref-type="bibr" rid="ref16">16</xref>]. The point-biserial correlation was classified as follows: negative correlation (<italic>r</italic>&#x003C;0), very low correlation (0&#x2264;<italic>r</italic>&#x003C;0.10), low correlation (0.10&#x2264;<italic>r</italic>&#x2264;0.20), and acceptable correlation (<italic>r</italic>&#x2265;0.20) [<xref ref-type="bibr" rid="ref15">15</xref>]. The discrimination index (D) was calculated as the difference in difficulty between the upper and lower 27% performance groups, with values of 0.40 and above considered excellent; 0.30&#x2010;0.39, good; 0.20&#x2010;0.29, acceptable; and those less than 0.20, poor [<xref ref-type="bibr" rid="ref15">15</xref>,<xref ref-type="bibr" rid="ref16">16</xref>]. 
Statistical analysis was performed using SPSS Statistics version 27 (IBM Corp). Graphs were created with Prism 9 (GraphPad Software). Nominal variables were summarized as counts and percentages. The Shapiro-Wilk test was used to test for normal distribution. Group comparisons of categorical data were performed with the chi-square test or Fisher exact test if expected frequencies were less than 5. Continuous data were reported as mean and SD values and compared using the Mann-Whitney <italic>U</italic> test. A significance level of <italic>P</italic>&#x2264;.05 was applied.</p><p><xref ref-type="fig" rid="figure1">Figure 1</xref> summarizes the item-generation workflow, including consolidated inputs, supervised instruction-tuning, the custom GPT MCQ generator, and parallel faculty-written items converging into the examination.</p></sec><sec id="s2-2"><title>Ethical Considerations</title><p>The study was submitted to the Ethics Committee of the University of W&#x00FC;rzburg, which confirmed (reference number 2024-258-ka on November 11, 2024) that no formal review was required and that no ethical objections were raised.</p><fig position="float" id="figure1"><label>Figure 1.</label><caption><p>Item-generation workflow. Consolidated inputs (anesthesiology lecture slides, NKLM 2.0, prior examination items, faculty publications) inform supervised instruction-tuning of a custom GPT configured for single-best-answer MCQs to produce 15 AI-generated questions. In parallel, faculty authored 15 questions. AI: artificial intelligence; MCQ: multiple-choice question.</p></caption><graphic alt-version="no" mimetype="image" position="float" xlink:type="simple" xlink:href="formative_v10i1e84904_fig01.png"/></fig></sec></sec><sec id="s3" sec-type="results"><title>Results</title><p>A total of 30 MCQs were analyzed. 
One expert-generated item was excluded from analysis due to a strongly negative discrimination index (&#x2212;0.22) and negative point-biserial correlation (&#x2212;0.20). Its difficulty (<italic>P</italic>=0.86) indicated a ceiling effect, suggesting that most students answered it correctly despite unclear key wording.</p><p>The final dataset therefore included 14 expert-generated and 15 AI-generated items. <xref ref-type="table" rid="table1">Table 1</xref> displays the descriptive metrics for expert- and AI-generated items. Expert-generated items showed a mean difficulty of 0.81 (SD 0.19), a mean point-biserial correlation of 0.16 (SD 0.07), and a mean discrimination index of 0.09 (SD 0.08). AI-generated items had a mean difficulty of 0.79 (SD 0.18), a mean point-biserial correlation of 0.17 (SD 0.04), and a mean discrimination index of 0.08 (SD 0.11). Mann-Whitney <italic>U</italic> tests indicated no significant differences between expert- and AI-generated items with respect to difficulty (<italic>P</italic>=.38), point-biserial correlation (<italic>P</italic>=.96), or discrimination index (<italic>P</italic>=.59; <xref ref-type="fig" rid="figure2">Figure 2</xref>).</p><table-wrap id="t1" position="float"><label>Table 1.</label><caption><p>Overview of question metrics by expert and artificial intelligence (AI).</p></caption><table id="table1" frame="hsides" rules="groups"><thead><tr><td align="left" valign="bottom">Question created by</td><td align="left" valign="bottom">Questions, n</td><td align="left" valign="bottom" colspan="3">Metrics</td></tr><tr><td align="left" valign="bottom"/><td align="left" valign="bottom"/><td align="left" valign="bottom">Minimum</td><td align="left" valign="bottom">Maximum</td><td align="left" valign="bottom">Mean (SD)</td></tr></thead><tbody><tr><td align="left" valign="top">Expert</td><td align="left" valign="top"/><td align="left" valign="top"/><td align="left" valign="top"/><td align="left" 
valign="top"/></tr><tr><td align="left" valign="top">&#x2003;Difficulty</td><td align="left" valign="top">&#x2003;14</td><td align="left" valign="top">&#x2003;0.48</td><td align="left" valign="top">&#x2003;0.99</td><td align="left" valign="top">&#x2003;0.81 (0.19)</td></tr><tr><td align="left" valign="top">&#x2003;Point-biserial correlation</td><td align="left" valign="top">&#x2003;14</td><td align="left" valign="top">&#x2003;&#x2212;0.02</td><td align="left" valign="top">&#x2003;0.20</td><td align="left" valign="top">&#x2003;0.16 (0.07)</td></tr><tr><td align="left" valign="top">&#x2003;Discrimination index</td><td align="left" valign="top">&#x2003;14</td><td align="left" valign="top">&#x2003;0.01</td><td align="left" valign="top">&#x2003;0.24</td><td align="left" valign="top">&#x2003;0.09 (0.08)</td></tr><tr><td align="left" valign="top">AI</td><td align="left" valign="top"/><td align="left" valign="top"/><td align="left" valign="top"/><td align="left" valign="top"/></tr><tr><td align="left" valign="top">&#x2003;Difficulty</td><td align="left" valign="top">&#x2003;15</td><td align="left" valign="top">&#x2003;0.44</td><td align="left" valign="top">&#x2003;0.98</td><td align="left" valign="top">&#x2003;0.79 (0.18)</td></tr><tr><td align="left" valign="top">&#x2003;Point-biserial correlation</td><td align="left" valign="top">&#x2003;15</td><td align="left" valign="top">&#x2003;0.08</td><td align="left" valign="top">&#x2003;0.25</td><td align="left" valign="top">&#x2003;0.17 (0.04)</td></tr><tr><td align="left" valign="top">&#x2003;Discrimination index</td><td align="left" valign="top">&#x2003;15</td><td align="left" valign="top">&#x2003;&#x2212;0.07</td><td align="left" valign="top">&#x2003;0.33</td><td align="left" valign="top">&#x2003;0.08 (0.11)</td></tr></tbody></table></table-wrap><fig position="float" id="figure2"><label>Figure 2.</label><caption><p>Psychometric characteristics of AI- and expert-generated multiple-choice items<italic>.</italic> (A) Item 
difficulty, (B) point-biserial correlation, and (C) discrimination index are displayed for each question. AI: artificial intelligence.</p></caption><graphic alt-version="no" mimetype="image" position="float" xlink:type="simple" xlink:href="formative_v10i1e84904_fig02.png"/></fig><p>Reference ranges are as follows:</p><list list-type="bullet"><list-item><p>Difficulty (<italic>P</italic>)=0.30-0.70 is considered desirable;</p></list-item><list-item><p>Discrimination (<italic>r</italic><sub>pb</sub>)&#x2265;0.20 is considered acceptable;</p></list-item><list-item><p><italic>r</italic><sub>pb</sub>&#x003C;0 indicates flawed items [<xref ref-type="bibr" rid="ref17">17</xref>].</p></list-item></list><p>The categorical distributions of item properties are summarized in <xref ref-type="table" rid="table2">Table 2</xref>.</p><table-wrap id="t2" position="float"><label>Table 2.</label><caption><p>Psychometric characteristics of expert- and LLM<sup><xref ref-type="table-fn" rid="table2fn1">a</xref></sup>-generated items displayed side by side for difficulty, point-biserial correlation, and discrimination index.</p></caption><table id="table2" frame="hsides" rules="groups"><thead><tr><td align="left" valign="bottom"/><td align="left" valign="bottom" colspan="2">Questions, n (%)</td></tr><tr><td align="left" valign="bottom"/><td align="left" valign="bottom">Expert-generated (n=14)</td><td align="left" valign="bottom">LLM-generated (n=15)</td></tr></thead><tbody><tr><td align="left" valign="top">Difficulty categories</td><td align="left" valign="top"/><td align="left" valign="top"/></tr><tr><td align="left" valign="top">&#x2003;Extremely difficult (<italic>P</italic>&#x2264;0.25)</td><td align="left" valign="top">0 (0)</td><td align="left" valign="top">0 (0)</td></tr><tr><td align="left" valign="top">&#x2003;Tends to be difficult (0.25&#x003C;<italic>P</italic>&#x2264;0.4)</td><td align="left" valign="top">0 (0)</td><td align="left" valign="top">0 (0)</td></tr><tr><td 
align="left" valign="top">&#x2003;Optimal difficulty (0.4&#x003C;<italic>P</italic>&#x2264;0.8)</td><td align="left" valign="top">5 (35.7)</td><td align="left" valign="top">6 (40.0)</td></tr><tr><td align="left" valign="top">&#x2003;Very simple (0.8&#x003C;<italic>P</italic>&#x2264;0.9)</td><td align="left" valign="top">0 (0)</td><td align="left" valign="top">3 (20.0)</td></tr><tr><td align="left" valign="top">&#x2003;Extremely simple (<italic>P</italic>&#x003E;0.9)</td><td align="left" valign="top">9 (64.3)</td><td align="left" valign="top">6 (40.0)</td></tr><tr><td align="left" valign="top">Point-biserial correlation categories</td><td align="left" valign="top"/><td align="left" valign="top"/></tr><tr><td align="left" valign="top">&#x2003;Negative correlation (<italic>r</italic>&#x003C;0)</td><td align="left" valign="top">1 (7.1)</td><td align="left" valign="top">0 (0)</td></tr><tr><td align="left" valign="top">&#x2003;Very low correlation (0&#x2264;<italic>r</italic>&#x003C;0.10)</td><td align="left" valign="top">0 (0)</td><td align="left" 
valign="top">1 (6.7)</td></tr><tr><td align="left" valign="top">&#x2003;Low correlation (0.1&#x2264;r&#x2264;0.20)</td><td align="left" valign="top">8 (57.1)</td><td align="left" valign="top">10 (66.7)</td></tr><tr><td align="left" valign="top">&#x2003;Acceptable correlation (<italic>r</italic>&#x2265;0.20)</td><td align="left" valign="top">5 (35.7)</td><td align="left" valign="top">4 (26.7)</td></tr><tr><td align="left" valign="top">Discrimination index categories</td><td align="left" valign="top"/><td align="left" valign="top"/></tr><tr><td align="left" valign="top">&#x2003;Urgent need for revision (D&#x2019;&#x003C;0)</td><td align="left" valign="top">0 (0)</td><td align="left" valign="top">3 (20.0)</td></tr><tr><td align="left" valign="top">&#x2003;Need for revision (0&#x2264;D&#x2019;&#x003C;0.2)</td><td align="left" valign="top">12 (85.7)</td><td align="left" valign="top">11 (73.3)</td></tr><tr><td align="left" valign="top">&#x2003;Check required (0.2&#x2264;D&#x2019;&#x003C;0.3)</td><td align="left" valign="top">2 (14.3)</td><td align="left" valign="top">0 (0)</td></tr><tr><td align="left" valign="top">&#x2003;Potential for improvement (0.3&#x2264;D&#x2019;&#x003C;0.4)</td><td align="left" valign="top">0 (0)</td><td align="left" valign="top">1 (6.7)</td></tr><tr><td align="left" valign="top">&#x2003;Item effectively distinguishes (D&#x2019;&#x2265;0.4)</td><td align="left" valign="top">0 (0)</td><td align="left" valign="top">0 (0)</td></tr></tbody></table><table-wrap-foot><fn id="table2fn1"><p><sup>a</sup>LLM: large language model.</p></fn></table-wrap-foot></table-wrap></sec><sec id="s4" sec-type="discussion"><title>Discussion</title><sec id="s4-1"><title>Principal Findings</title><p>In this study, we compared psychometric properties (difficulty, point-biserial correlation, and discrimination) of MCQs generated by a supervised fine-tuned LLM with those written by expert faculty in an undergraduate anesthesiology examination. 
Although no statistically significant differences were observed, the overall quality of both item sets remained moderate. The point-biserial correlations and discrimination indices suggest that neither set reliably distinguishes higher- from lower-performing students, a finding consistent with previous research indicating that even expert-authored items often underperform in psychometric analyses [<xref ref-type="bibr" rid="ref18">18</xref>]. This pattern aligns with broader evidence in medical education, where cohort studies have demonstrated that AI-generated MCQs often achieve discrimination indices similar to expert-generated items but tend to be easier overall and still require expert review to ensure distractor plausibility and alignment with higher-order learning objectives [<xref ref-type="bibr" rid="ref5">5</xref>,<xref ref-type="bibr" rid="ref19">19</xref>-<xref ref-type="bibr" rid="ref23">23</xref>].</p><p>Supervised adaptation with domain-specific materials likely contributed to the close alignment of psychometric indices between AI and faculty-written items. Other work shows that when AI-mediated question generation is guided by domain content, structured prompts, or instruction tuning, the output more closely resembles faculty items in both difficulty and discrimination [<xref ref-type="bibr" rid="ref7">7</xref>,<xref ref-type="bibr" rid="ref8">8</xref>,<xref ref-type="bibr" rid="ref19">19</xref>,<xref ref-type="bibr" rid="ref24">24</xref>]. Notably, neither item set in our study consistently achieved high point-biserial correlation or discrimination, confirming that generating functionally effective distractors remains a challenge for both experts and LLMs [<xref ref-type="bibr" rid="ref25">25</xref>-<xref ref-type="bibr" rid="ref30">30</xref>]. 
Prior studies have similarly identified that AI items often underperform in assessing higher cognitive levels or using plausible distractors without ambiguity [<xref ref-type="bibr" rid="ref8">8</xref>].</p><p>The absence of psychometric superiority in either group suggests that AI-assisted question generation can produce items of comparable statistical quality to traditional item writing. However, psychometric analysis alone is insufficient for examination quality assurance; human oversight remains essential to safeguard content validity, blueprint alignment, and cognitive level coverage. Studies in high-stakes examination settings show that expert review reduces factual inaccuracies and improves alignment with assessment blueprints [<xref ref-type="bibr" rid="ref31">31</xref>]. Importantly, in our study, the fine-tuned LLM generated all 15 candidate items within a few minutes. While we evaluated only a subset psychometrically, our study demonstrates that domain-adapted LLMs support rapid item drafting at scale. Automatic item generation methods have long promised efficiency gains by expanding item pools from templates rather than crafting each item manually [<xref ref-type="bibr" rid="ref32">32</xref>]. Recent AI studies show that LLM-based MCQ generation can approach human performance while drastically reducing human effort [<xref ref-type="bibr" rid="ref33">33</xref>]. In practice, educators may use LLM throughput to generate large candidate sets and then filter, refine, and align items to the blueprint and cognitive levels, shifting effort from generation toward qualitative review and validation.</p></sec><sec id="s4-2"><title>Limitations and Future Work</title><p>Study limitations include a small item sample size, single-institution administration, and fine-tuning with primarily local teaching resources, which may reduce external validity. 
Cognitive level of items (eg, recall vs application) was not measured, although comparative studies indicate this is an important differentiator between AI- vs expert-generated MCQs [<xref ref-type="bibr" rid="ref31">31</xref>]. Future work should involve larger item pools, multi-institutional validation, and systematic qualitative review of items, including stem clarity, distractor plausibility, and distractor efficiency, as well as cognitive demand. It would also be valuable to compare different fine-tuning or prompt-engineering strategies and to assess students&#x2019; perceptions of AI-generated items [<xref ref-type="bibr" rid="ref34">34</xref>].</p></sec><sec id="s4-3"><title>Conclusion</title><p>This study demonstrates that a supervised fine-tuned LLM can generate MCQs with psychometric properties comparable to those created by experienced faculty. While neither approach consistently produced items with high point-biserial correlation or discrimination, the results indicate that automated question generation can complement traditional item writing in medical education.</p></sec></sec></body><back><fn-group><fn fn-type="conflict"><p>None declared.</p></fn></fn-group><glossary><title>Abbreviations</title><def-list><def-item><term id="abb1">AI</term><def><p>artificial intelligence</p></def></def-item><def-item><term id="abb2">LLM</term><def><p>large language model</p></def></def-item><def-item><term id="abb3">MCQ</term><def><p>multiple-choice question</p></def></def-item><def-item><term id="abb4">NKLM 2.0</term><def><p>National Competence-Based Learning Objectives Catalogue in Medicine</p></def></def-item></def-list></glossary><ref-list><title>References</title><ref id="ref1"><label>1</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>St-Onge</surname><given-names>C</given-names> </name><name name-style="western"><surname>Young</surname><given-names>M</given-names> </name><name 
name-style="western"><surname>Renaud</surname><given-names>JS</given-names> </name><name name-style="western"><surname>Cummings</surname><given-names>BA</given-names> </name><name name-style="western"><surname>Drescher</surname><given-names>O</given-names> </name><name name-style="western"><surname>Varpio</surname><given-names>L</given-names> </name></person-group><article-title>Sound practices: An exploratory study of building and monitoring multiple-choice exams at Canadian undergraduate medical education programs</article-title><source>Acad Med</source><year>2021</year><month>02</month><day>1</day><volume>96</volume><issue>2</issue><fpage>271</fpage><lpage>277</lpage><pub-id pub-id-type="doi">10.1097/ACM.0000000000003659</pub-id><pub-id pub-id-type="medline">32769474</pub-id></nlm-citation></ref><ref id="ref2"><label>2</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>Sideris</surname><given-names>GA</given-names> </name><name name-style="western"><surname>Singh</surname><given-names>A</given-names> </name><name name-style="western"><surname>Catanzano</surname><given-names>T</given-names> </name></person-group><article-title>Writing High-Quality Multiple-Choice Questions</article-title><source>Image-Based Teaching</source><year>2022</year><publisher-name>Springer</publisher-name><fpage>123</fpage><lpage>146</lpage><pub-id pub-id-type="doi">10.1007/978-3-031-11890-6_9</pub-id></nlm-citation></ref><ref id="ref3"><label>3</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Collins</surname><given-names>J</given-names> </name></person-group><article-title>Education techniques for lifelong learning: writing multiple-choice questions for continuing medical education activities and self-assessment 
modules</article-title><source>Radiographics</source><year>2006</year><volume>26</volume><issue>2</issue><fpage>543</fpage><lpage>551</lpage><pub-id pub-id-type="doi">10.1148/rg.262055145</pub-id><pub-id pub-id-type="medline">16549616</pub-id></nlm-citation></ref><ref id="ref4"><label>4</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Caldwell</surname><given-names>DJ</given-names> </name><name name-style="western"><surname>Pate</surname><given-names>AN</given-names> </name></person-group><article-title>Effects of question formats on student and item performance</article-title><source>Am J Pharm Educ</source><year>2013</year><month>05</month><day>13</day><volume>77</volume><issue>4</issue><fpage>71</fpage><pub-id pub-id-type="doi">10.5688/ajpe77471</pub-id><pub-id pub-id-type="medline">23716739</pub-id></nlm-citation></ref><ref id="ref5"><label>5</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Laupichler</surname><given-names>MC</given-names> </name><name name-style="western"><surname>Rother</surname><given-names>JF</given-names> </name><name name-style="western"><surname>Grunwald Kadow</surname><given-names>IC</given-names> </name><name name-style="western"><surname>Ahmadi</surname><given-names>S</given-names> </name><name name-style="western"><surname>Raupach</surname><given-names>T</given-names> </name></person-group><article-title>Large language models in medical education: Comparing ChatGPT- to human-generated exam questions</article-title><source>Acad Med</source><year>2024</year><month>05</month><day>1</day><volume>99</volume><issue>5</issue><fpage>508</fpage><lpage>512</lpage><pub-id pub-id-type="doi">10.1097/ACM.0000000000005626</pub-id><pub-id pub-id-type="medline">38166323</pub-id></nlm-citation></ref><ref id="ref6"><label>6</label><nlm-citation citation-type="journal"><person-group 
person-group-type="author"><name name-style="western"><surname>Artsi</surname><given-names>Y</given-names> </name><name name-style="western"><surname>Sorin</surname><given-names>V</given-names> </name><name name-style="western"><surname>Konen</surname><given-names>E</given-names> </name><name name-style="western"><surname>Glicksberg</surname><given-names>BS</given-names> </name><name name-style="western"><surname>Nadkarni</surname><given-names>G</given-names> </name><name name-style="western"><surname>Klang</surname><given-names>E</given-names> </name></person-group><article-title>Large language models for generating medical examinations: systematic review</article-title><source>BMC Med Educ</source><year>2024</year><month>03</month><day>29</day><volume>24</volume><issue>1</issue><fpage>354</fpage><pub-id pub-id-type="doi">10.1186/s12909-024-05239-y</pub-id><pub-id pub-id-type="medline">38553693</pub-id></nlm-citation></ref><ref id="ref7"><label>7</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Gr&#x00E9;visse</surname><given-names>C</given-names> </name><name name-style="western"><surname>Pavlou</surname><given-names>MAS</given-names> </name><name name-style="western"><surname>Schneider</surname><given-names>JG</given-names> </name></person-group><article-title>Docimological quality analysis of LLM-generated multiple choice questions in computer science and medicine</article-title><source>SN Comput Sci</source><year>2024</year><volume>5</volume><issue>5</issue><fpage>636</fpage><pub-id pub-id-type="doi">10.1007/s42979-024-02963-6</pub-id></nlm-citation></ref><ref id="ref8"><label>8</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Al Shuraiqi</surname><given-names>S</given-names> </name><name name-style="western"><surname>Aal Abdulsalam</surname><given-names>A</given-names> </name><name 
name-style="western"><surname>Masters</surname><given-names>K</given-names> </name><name name-style="western"><surname>Zidoum</surname><given-names>H</given-names> </name><name name-style="western"><surname>AlZaabi</surname><given-names>A</given-names> </name></person-group><article-title>Automatic generation of medical case-based multiple-choice questions (MCQs): A review of methodologies, applications, evaluation, and future directions</article-title><source>BDCC</source><year>2024</year><volume>8</volume><issue>10</issue><fpage>139</fpage><pub-id pub-id-type="doi">10.3390/bdcc8100139</pub-id></nlm-citation></ref><ref id="ref9"><label>9</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Wang</surname><given-names>Z</given-names> </name><name name-style="western"><surname>Jiang</surname><given-names>J</given-names> </name><name name-style="western"><surname>Zhan</surname><given-names>Y</given-names> </name><etal/></person-group><article-title>Hypnos: A domain-specific large language model for anesthesiology</article-title><source>Neurocomputing</source><year>2025</year><month>04</month><volume>624</volume><fpage>129389</fpage><pub-id pub-id-type="doi">10.1016/j.neucom.2025.129389</pub-id></nlm-citation></ref><ref id="ref10"><label>10</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Li</surname><given-names>Y</given-names> </name><name name-style="western"><surname>Zhan</surname><given-names>Y</given-names> </name><name name-style="western"><surname>Yu</surname><given-names>B</given-names> </name><etal/></person-group><article-title>Fine-tuning LLMs for anesthesiology via compositional data generation</article-title><source>IEEE Trans Emerg Top Comput Intell</source><year>2025</year><volume>9</volume><issue>6</issue><fpage>4051</fpage><lpage>4065</lpage><pub-id 
pub-id-type="doi">10.1109/TETCI.2025.3567602</pub-id></nlm-citation></ref><ref id="ref11"><label>11</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Zhou</surname><given-names>B</given-names> </name><name name-style="western"><surname>Zhan</surname><given-names>Y</given-names> </name><name name-style="western"><surname>Wang</surname><given-names>Z</given-names> </name><etal/></person-group><article-title>Benchmarking medical LLMs on anesthesiology: A comprehensive dataset in Chinese</article-title><source>IEEE Trans Emerg Top Comput Intell</source><year>2025</year><volume>9</volume><issue>4</issue><fpage>3057</fpage><lpage>3071</lpage><pub-id pub-id-type="doi">10.1109/TETCI.2024.3502465</pub-id></nlm-citation></ref><ref id="ref12"><label>12</label><nlm-citation citation-type="web"><person-group person-group-type="author"><collab>Medizinischer Fakult&#x00E4;tentag</collab></person-group><source>Nationaler Kompetenzbasierter Lernzielkatalog Medizin</source><year>2025</year><access-date>2026-02-10</access-date><comment><ext-link ext-link-type="uri" xlink:href="https://nklm.de/menu">https://nklm.de/menu</ext-link></comment></nlm-citation></ref><ref id="ref13"><label>13</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Naeem</surname><given-names>N</given-names> </name><name name-style="western"><surname>van der Vleuten</surname><given-names>C</given-names> </name><name name-style="western"><surname>Alfaris</surname><given-names>EA</given-names> </name></person-group><article-title>Faculty development on item writing substantially improves item quality</article-title><source>Adv Health Sci Educ Theory Pract</source><year>2012</year><month>08</month><volume>17</volume><issue>3</issue><fpage>369</fpage><lpage>376</lpage><pub-id 
pub-id-type="doi">10.1007/s10459-011-9315-2</pub-id><pub-id pub-id-type="medline">21837548</pub-id></nlm-citation></ref><ref id="ref14"><label>14</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>Bloom</surname><given-names>BS</given-names> </name><etal/></person-group><source>Taxonomy of Educational Objectives</source><year>1964</year><volume>2</volume><publisher-name>Longmans</publisher-name></nlm-citation></ref><ref id="ref15"><label>15</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Escudero</surname><given-names>EB</given-names> </name><name name-style="western"><surname>Reyna</surname><given-names>NL</given-names> </name><name name-style="western"><surname>Morales</surname><given-names>MR</given-names> </name></person-group><article-title>The level of difficulty and discrimination power of the Basic Knowledge and Skills Examination (EXHCOBA)</article-title><source>Revista Electr&#x00F3;nica de Investigaci&#x00F3;n Educativa</source><year>2000</year><access-date>2026-02-10</access-date><volume>2</volume><issue>1</issue><fpage>2</fpage><comment><ext-link ext-link-type="uri" xlink:href="https://redie.uabc.mx/redie/article/download/15/27/75">https://redie.uabc.mx/redie/article/download/15/27/75</ext-link></comment></nlm-citation></ref><ref id="ref16"><label>16</label><nlm-citation citation-type="report"><person-group person-group-type="author"><name name-style="western"><surname>M&#x00F6;ltner</surname><given-names>A</given-names> </name><name name-style="western"><surname>Schellberg</surname><given-names>D</given-names> </name><name name-style="western"><surname>J&#x00FC;nger</surname><given-names>J</given-names> </name></person-group><article-title>Grundlegende quantitative analysen medizinischer pr&#x00FC;fungen</article-title><year>2006</year><publisher-name>GMS Zeitschrift f&#x00FC;r Medizinische 
Ausbildung</publisher-name></nlm-citation></ref><ref id="ref17"><label>17</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>Moran</surname><given-names>V</given-names> </name></person-group><source>Item and Exam Analysis, in Item Writing for Nurse Educators</source><year>2023</year><publisher-name>Springer International Publishing</publisher-name><fpage>55</fpage><lpage>64</lpage></nlm-citation></ref><ref id="ref18"><label>18</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Rush</surname><given-names>BR</given-names> </name><name name-style="western"><surname>Rankin</surname><given-names>DC</given-names> </name><name name-style="western"><surname>White</surname><given-names>BJ</given-names> </name></person-group><article-title>The impact of item-writing flaws and item complexity on examination item difficulty and discrimination value</article-title><source>BMC Med Educ</source><year>2016</year><month>09</month><day>29</day><volume>16</volume><issue>1</issue><fpage>250</fpage><pub-id pub-id-type="doi">10.1186/s12909-016-0773-3</pub-id><pub-id pub-id-type="medline">27681933</pub-id></nlm-citation></ref><ref id="ref19"><label>19</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Cheung</surname><given-names>BHH</given-names> </name><name name-style="western"><surname>Lau</surname><given-names>GKK</given-names> </name><name name-style="western"><surname>Wong</surname><given-names>GTC</given-names> </name><etal/></person-group><article-title>ChatGPT versus human in generating medical graduate exam multiple choice questions-A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom)</article-title><source>PLOS ONE</source><year>2023</year><volume>18</volume><issue>8</issue><fpage>e0290691</fpage><pub-id 
pub-id-type="doi">10.1371/journal.pone.0290691</pub-id><pub-id pub-id-type="medline">37643186</pub-id></nlm-citation></ref><ref id="ref20"><label>20</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Klang</surname><given-names>E</given-names> </name><name name-style="western"><surname>Portugez</surname><given-names>S</given-names> </name><name name-style="western"><surname>Gross</surname><given-names>R</given-names> </name><etal/></person-group><article-title>Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: A medical education pilot study with GPT-4</article-title><source>BMC Med Educ</source><year>2023</year><month>10</month><day>17</day><volume>23</volume><issue>1</issue><fpage>772</fpage><pub-id pub-id-type="doi">10.1186/s12909-023-04752-w</pub-id><pub-id pub-id-type="medline">37848913</pub-id></nlm-citation></ref><ref id="ref21"><label>21</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Ayub</surname><given-names>I</given-names> </name><name name-style="western"><surname>Hamann</surname><given-names>D</given-names> </name><name name-style="western"><surname>Hamann</surname><given-names>CR</given-names> </name><name name-style="western"><surname>Davis</surname><given-names>MJ</given-names> </name></person-group><article-title>Exploring the potential and limitations of Chat Generative Pre-trained Transformer (ChatGPT) in generating board-style dermatology questions: A qualitative analysis</article-title><source>Cureus</source><year>2023</year><month>08</month><volume>15</volume><issue>8</issue><fpage>e43717</fpage><pub-id pub-id-type="doi">10.7759/cureus.43717</pub-id><pub-id pub-id-type="medline">37638266</pub-id></nlm-citation></ref><ref id="ref22"><label>22</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name 
name-style="western"><surname>Kaya</surname><given-names>M</given-names> </name><name name-style="western"><surname>Sonmez</surname><given-names>E</given-names> </name><name name-style="western"><surname>Halici</surname><given-names>A</given-names> </name><name name-style="western"><surname>Yildirim</surname><given-names>H</given-names> </name><name name-style="western"><surname>Coskun</surname><given-names>A</given-names> </name></person-group><article-title>Comparison of AI-generated and clinician-designed multiple-choice questions in emergency medicine exam: a psychometric analysis</article-title><source>BMC Med Educ</source><year>2025</year><month>07</month><day>1</day><volume>25</volume><issue>1</issue><fpage>949</fpage><pub-id pub-id-type="doi">10.1186/s12909-025-07528-6</pub-id><pub-id pub-id-type="medline">40597998</pub-id></nlm-citation></ref><ref id="ref23"><label>23</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Elzayyat</surname><given-names>M</given-names> </name><name name-style="western"><surname>Mohammad</surname><given-names>JN</given-names> </name><name name-style="western"><surname>Zaqout</surname><given-names>S</given-names> </name></person-group><article-title>Assessing LLM-generated vs. 
expert-created clinical anatomy MCQs: a student perception-based comparative study in medical education</article-title><source>Med Educ Online</source><year>2025</year><month>12</month><volume>30</volume><issue>1</issue><fpage>2554678</fpage><pub-id pub-id-type="doi">10.1080/10872981.2025.2554678</pub-id><pub-id pub-id-type="medline">40884796</pub-id></nlm-citation></ref><ref id="ref24"><label>24</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Emekli</surname><given-names>E</given-names> </name><name name-style="western"><surname>Karahan</surname><given-names>BN</given-names> </name></person-group><article-title>AI in radiography education: Evaluating multiple-choice questions difficulty and discrimination</article-title><source>J Med Imaging Radiat Sci</source><year>2025</year><month>07</month><volume>56</volume><issue>4</issue><fpage>101896</fpage><pub-id pub-id-type="doi">10.1016/j.jmir.2025.101896</pub-id><pub-id pub-id-type="medline">40157013</pub-id></nlm-citation></ref><ref id="ref25"><label>25</label><nlm-citation citation-type="confproc"><person-group person-group-type="author"><name name-style="western"><surname>Bitew</surname><given-names>SK</given-names> </name><etal/></person-group><article-title>Distractor generation for multiple-choice questions with predictive prompting and large language models</article-title><conf-name>European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases</conf-name><conf-date>Sep 18-22, 2023</conf-date><pub-id pub-id-type="doi">10.1007/978-3-031-74627-7_4</pub-id></nlm-citation></ref><ref id="ref26"><label>26</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Baldwin</surname><given-names>P</given-names> </name><name name-style="western"><surname>Mee</surname><given-names>J</given-names> </name><name 
name-style="western"><surname>Yaneva</surname><given-names>V</given-names> </name><etal/></person-group><article-title>A natural-language-processing-based procedure for generating distractors for multiple-choice questions</article-title><source>Eval Health Prof</source><year>2022</year><month>12</month><volume>45</volume><issue>4</issue><fpage>327</fpage><lpage>340</lpage><pub-id pub-id-type="doi">10.1177/01632787211046981</pub-id><pub-id pub-id-type="medline">34753326</pub-id></nlm-citation></ref><ref id="ref27"><label>27</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>Wang</surname><given-names>R</given-names> </name><etal/></person-group><article-title>High-quality distractors generation for human exam based on reinforcement learning from preference feedback</article-title><source>Natural Language Processing and Chinese Computing</source><year>2025</year><publisher-name>Springer Nature Singapore</publisher-name><pub-id pub-id-type="doi">10.1007/978-981-97-9440-9_8</pub-id></nlm-citation></ref><ref id="ref28"><label>28</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Zhang</surname><given-names>L</given-names> </name><name name-style="western"><surname>VanLehn</surname><given-names>K</given-names> </name></person-group><article-title>Evaluation of auto-generated distractors in multiple choice questions from a semantic network</article-title><source>Interactive Learning Environments</source><year>2021</year><month>08</month><day>18</day><volume>29</volume><issue>6</issue><fpage>1019</fpage><lpage>1036</lpage><pub-id pub-id-type="doi">10.1080/10494820.2019.1619586</pub-id></nlm-citation></ref><ref id="ref29"><label>29</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Abdulghani</surname><given-names>H</given-names> </name><name 
name-style="western"><surname>Ahmad</surname><given-names>F</given-names> </name><name name-style="western"><surname>Aldrees</surname><given-names>A</given-names> </name><name name-style="western"><surname>Khalil</surname><given-names>M</given-names> </name><name name-style="western"><surname>Ponnamperuma</surname><given-names>G</given-names> </name></person-group><article-title>The relationship between non-functioning distractors and item difficulty of multiple choice questions: A descriptive analysis</article-title><source>J Health Spec</source><year>2014</year><volume>2</volume><issue>4</issue><fpage>148</fpage><pub-id pub-id-type="doi">10.4103/1658-600X.142784</pub-id></nlm-citation></ref><ref id="ref30"><label>30</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Rezigalla</surname><given-names>AA</given-names> </name><name name-style="western"><surname>Eleragi</surname><given-names>A</given-names> </name><name name-style="western"><surname>Elhussein</surname><given-names>AB</given-names> </name><etal/></person-group><article-title>Item analysis: the impact of distractor efficiency on the difficulty index and discrimination power of multiple-choice items</article-title><source>BMC Med Educ</source><year>2024</year><month>04</month><day>24</day><volume>24</volume><issue>1</issue><fpage>445</fpage><pub-id pub-id-type="doi">10.1186/s12909-024-05433-y</pub-id><pub-id pub-id-type="medline">38658912</pub-id></nlm-citation></ref><ref id="ref31"><label>31</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Law</surname><given-names>AK</given-names> </name><name name-style="western"><surname>So</surname><given-names>J</given-names> </name><name name-style="western"><surname>Lui</surname><given-names>CT</given-names> </name><etal/></person-group><article-title>AI versus human-generated multiple-choice questions for medical 
education: a cohort study in a high-stakes examination</article-title><source>BMC Med Educ</source><year>2025</year><month>02</month><day>8</day><volume>25</volume><issue>1</issue><fpage>208</fpage><pub-id pub-id-type="doi">10.1186/s12909-025-06796-6</pub-id><pub-id pub-id-type="medline">39923067</pub-id></nlm-citation></ref><ref id="ref32"><label>32</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Embretson</surname><given-names>SE</given-names> </name><name name-style="western"><surname>Kingston</surname><given-names>NM</given-names> </name></person-group><article-title>Automatic item generation: A more efficient process for developing mathematics achievement items?</article-title><source>J Educational Measurement</source><year>2018</year><month>03</month><access-date>2026-02-10</access-date><volume>55</volume><issue>1</issue><fpage>112</fpage><lpage>131</lpage><comment><ext-link ext-link-type="uri" xlink:href="https://onlinelibrary.wiley.com/toc/17453984/55/1">https://onlinelibrary.wiley.com/toc/17453984/55/1</ext-link></comment><pub-id pub-id-type="doi">10.1111/jedm.12166</pub-id></nlm-citation></ref><ref id="ref33"><label>33</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>Olney</surname><given-names>AM</given-names> </name></person-group><source>Generating Multiple Choice Questions from a Textbook: LLMs Match Human Performance on Most Metrics</source><year>2023</year><publisher-name>Grantee Submission</publisher-name></nlm-citation></ref><ref id="ref34"><label>34</label><nlm-citation citation-type="book"><person-group person-group-type="author"><name name-style="western"><surname>Ng</surname><given-names>O</given-names> </name><etal/></person-group><source>Student Perspective Matters for GenAI in Question Setting in Medical Education</source><year>2025</year><publisher-name>Medical Science 
Educator</publisher-name><pub-id pub-id-type="doi">10.1007/s40670-025-02396-7</pub-id></nlm-citation></ref></ref-list><app-group><supplementary-material id="app1"><label>Multimedia Appendix 1</label><p>Technical workflow and dataset construction for fine-tuned large language model&#x2013;mediated item generation.</p><media xlink:href="formative_v10i1e84904_app1.pdf" xlink:title="PDF File, 164 KB"/></supplementary-material></app-group></back></article>