JMIR Formative Research (JMIR Form Res), ISSN 2561-326X

JMIR Publications, Toronto, Canada

Article v10i1e86379; DOI: 10.2196/86379

Original Paper

Beyond Area Under the Receiver Operating Characteristic Curve: Evaluating Predictive Performance Metrics Under Class Imbalance in Real-World Clinical Data

Vanessa das Graças José Ventura, MSc, MD^1; Claudio Moisés Valiense de Andrade, BSc, MSc^2; Jussara Marques de Almeida, MSc, PhD^2; Bruno Porto Pessoa, MSc, MD^3; Carísi Anne Polanczyk, MSc, MD, PhD^4,5,6; Guilherme Fonseca do Nascimento, BSc^2; Eric Boersma, MSc, PhD^7; Heloisa Reniers Vianna, MSc, MD^8; Katia de Paula Farah, MSc, MD, PhD^1; Leonardo Chaves Dutra da Rocha, MSc, PhD^2; Marcos André Gonçalves, MSc, PhD^2; Milena Soriano Marcolino, MSc, MD, PhD^5,9,10

^1 Medical School and University Hospital, Universidade Federal de Minas Gerais, Avenida Alfredo Balena, 110, Belo Horizonte, Brazil

^2 Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

^3 Hospital Júlia Kubitschek, Belo Horizonte, Brazil

^4 Department of Medicine Internal, Medical School, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil

^5 Institute for Health Assessment and Translation for Chronic and Neglected Diseases of High RElevance (IATS-CARE), Belo Horizonte, Brazil

^6 Hospital Moinhos de Vento, Porto Alegre, Brazil

^7 Erasmus University Medical Center, Rotterdam, The Netherlands

^8 Hospital Universitário Ciências Médicas, Belo Horizonte, Brazil

^9 Department of Internal Medicine, Medical School, Medical School and University Hospital, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

^10 Telehealth Center, University Hospital, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Javad Sarvestan; Milan Toma; Nadia Abd-Alsabour

Correspondence to Vanessa das Graças José Ventura, MSc, MD, Medical School and University Hospital, Universidade Federal de Minas Gerais, Avenida Alfredo Balena, 110, Belo Horizonte, 30130-100, Brazil, 55 31991314221; nessachesed@yahoo.com.br

Published: 24.6.2026 (2026; e86379)

Article history dates: 01.11.2025; 20.04.2026; 22.04.2026

© Vanessa das Graças José Ventura, Claudio Moisés Valiense de Andrade, Jussara Marques de Almeida, Bruno Porto Pessoa, Carísi Anne Polanczyk, Guilherme Fonseca do Nascimento, Eric Boersma, Heloisa Reniers Vianna, Katia de Paula Farah, Leonardo Chaves Dutra da Rocha, Marcos André Gonçalves, Milena Soriano Marcolino. Originally published in JMIR Formative Research (https://formative.jmir.org), 24.6.2026.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.

Background

Predictive models increasingly support clinical decision-making, although imbalanced outcome distributions are common in health care datasets and can distort performance evaluation. The area under the receiver operating characteristic curve (AUROC) remains the most frequently reported metric, despite its limited ability to reflect clinically meaningful performance under class imbalance.

Objective

This study aimed to examine the influence of metric selection on the clinical interpretation of predictive models in imbalanced real-world health care data.

Methods

This retrospective cohort study included 17,018 hospitalized patients with COVID-19. Two predictive models using extreme gradient boosting (XGBoost) were developed to predict kidney replacement therapy (KRT) and mortality. Model performance was assessed using AUROC, macro-F₁-score, class-specific precision and recall, calibration (curve, slope, and intercept), decision curve analysis, and learning curves. Standard rebalancing strategies were applied exclusively to the training data to evaluate their impact on performance.

Results

KRT occurred in 9.5% of patients and death in 18.0%. Although AUROC values were high (0.928 for KRT and 0.945 for mortality), performance in the minority class was substantially lower. For KRT, precision was 0.539 and recall 0.372; for mortality, precision was 0.725 and recall 0.718. Rebalancing strategies were associated with higher recall for the minority class, but this gain was accompanied by a reduction in precision, with minimal impact on AUROC values. As a result, AUROC remained high despite clinically relevant changes in error distribution between false positives and false negatives. The learning curves showed a plateau-like shape, with stable validation performance across all training set sizes for both outcomes.

Conclusions

AUROC alone is insufficient to evaluate prediction models in imbalanced health care scenarios, even with rebalancing. Routine reporting of class-aware metrics, alongside learning curve analysis, is essential to support robust and clinically meaningful evaluation of predictive models, rather than their direct translation into practice.

Keywords: predictive model; artificial intelligence; learning curve; area under the receiver operating characteristic curve; F1-score; performance metrics

Introduction

Clinical prediction models are increasingly used in health care to support diagnostic, prognostic, and therapeutic decisions [1,2]. Their adoption has expanded with advances in machine learning (ML) and access to large-scale electronic health data, enabling the development of models with the potential to improve risk stratification and personalized care [3]. However, the evaluation of these models frequently relies on metrics that may not reflect real-world clinical usefulness, especially when outcome distributions are imbalanced [4].

The area under the receiver operating characteristic curve (AUROC) is the most commonly reported metric in clinical prediction research [5,6]. Although AUROC is intuitive and threshold-independent, it often overestimates performance in datasets in which one class (eg, survival or absence of disease) predominates, a common characteristic in clinical datasets [7-9]. In such settings, AUROC may suggest high discriminative ability while concealing poor sensitivity for minority-class outcomes, such as death or the need for critical interventions [7-9].

For example, risk prediction scores have long been recommended in clinical practice to guide preventive strategies, particularly in cardiovascular disease. The Framingham score, once widely used for predicting 10-year cardiovascular outcomes, illustrates the risk of relying on AUROC-based metrics [10]. Although it showed acceptable discrimination (C-statistic: 0.763 for men and 0.793 for women) [10,11], the dataset exhibited class imbalance (10.08% women and 18.09% men with outcomes) [10], and the score likely performed better for healthy individuals while failing to identify many at-risk patients early, potentially missing opportunities for interventions that could have improved outcomes [12-14].

For this reason, calibration measures are essential complements to discrimination, since high AUROC values do not necessarily guarantee reliable probability estimates or clinical applicability [6,15,16]. More recent cardiovascular risk prediction models, such as SCORE2, incorporated improved calibration across European populations, but their performance reporting still relies heavily on AUROC [17].

This illustrates a broader issue: even when calibration is addressed, discrimination metrics alone can mask poor identification of minority outcomes, underscoring the need for comprehensive evaluation strategies. Additionally, although the Hosmer-Lemeshow test is a commonly used goodness-of-fit test for logistic regression models, it is less suitable for ML models due to its sensitivity to sample size and arbitrary grouping of predicted probabilities [18].

While these limitations are widely recognized in the ML literature, clinical studies continue to prioritize AUROC in model reporting [19-21]. Most discussions on metric limitations remain either theoretical or based on synthetic datasets [22-25]. As a result, model outputs often lack interpretability and applicability for health professionals, limiting their practical relevance for clinical implementation [26,27]. There are few applied studies using large real-world clinical datasets that demonstrate, in concrete terms, how metric selection affects the identification of high-risk patients and subsequent clinical decisions [4,22,23,27].

Recently, Carriero et al [28] reported the challenges posed by imbalanced datasets in predictive modeling, showing that common strategies for handling class imbalance, such as oversampling and undersampling, may compromise calibration, leading to overestimated risk predictions and systematic bias. These findings highlight the need for evaluation strategies that move beyond AUROC and artificial rebalancing, offering instead a comprehensive assessment of model performance that prioritizes clinical reliability and patient safety.

This study addresses this gap by applying a structured evaluation of predictive model performance in a real-world clinical setting, using a large, multicenter dataset of hospitalized patients with COVID-19 in Brazil. As a case study to illustrate the impact of class imbalance on model evaluation, we developed 2 ML models to predict kidney replacement therapy (KRT) and in-hospital mortality, outcomes with different prevalence levels, and assessed them using metrics that capture different aspects of model performance. Rather than proposing new predictive models, this study focuses on how commonly used performance metrics influence the interpretation of model usefulness in real-world, imbalanced clinical settings. Beyond AUROC, we focused on class-specific precision, recall, and macro-F₁-scores, which, although well-established in data science, remain underutilized in clinical contexts. Additionally, we critically examine how metric selection influences clinical interpretation in imbalanced scenarios, making our findings relevant not only to data scientists but also to health care professionals. In doing so, this study helps bridge the gap between methodological rigor and clinical applicability in predictive model evaluation.

Methods

Study Design

This was a retrospective cohort study. We collected data on consecutive adult patients (aged 18 years and older) with laboratory-confirmed COVID-19 [29] admitted to 1 of 41 participating hospitals in Brazil from March 2020 to August 2022. Details of the cohort have been published elsewhere [30]. Pregnant women, patients undergoing palliative treatment, patients with a history of KRT or already receiving KRT at hospital presentation, and patients transferred from or to another hospital were excluded from this analysis (Figure 1).

Two predictive models were developed and validated: 1 for KRT and 1 for in-hospital mortality. Both outcomes had imbalanced class distributions, but in different proportions, and were used as case studies. These outcomes were selected due to their clinical relevance and prognostic implications in hospitalized patients with COVID-19 [31].

Figure 1.

Flowchart of the patients included in the study. KRT: kidney replacement therapy.

Data Collection

Sociodemographic, clinical, and laboratory data; medications; interventions; and outcomes were extracted from medical records by trained researchers using the REDCap (Research Electronic Data Capture) electronic platform [32,33], hosted at the Telehealth Center of the University Hospital of the Universidade Federal de Minas Gerais [34,35]. An automated data verification algorithm was implemented to ensure data quality, checking for inconsistencies. Any discrepancies were resolved in consultation with the coordinating researchers.

Predictors and Outcome Definition

Candidate predictors were selected based on clinical relevance, prior literature, and data availability (Multimedia Appendices 1 and 2). No automated feature selection was applied, as the study objective was not to maximize predictive performance but to assess how different evaluation metrics behave under identical modeling conditions. The same predictor set was maintained across all experiments to ensure comparability between models and evaluation strategies. KRT was defined as the initiation of dialysis during hospitalization, excluding patients with preexisting chronic dialysis. In-hospital mortality refers to death occurring during hospitalization, as documented in medical records. Patients were classified into binary outcome groups for each end point (KRT vs no KRT; death vs survival).

The Predictive Models

Extreme gradient boosting (XGBoost) was chosen due to its strong performance in structured clinical data, ability to capture nonlinear relationships, native handling of missing values, and favorable calibration properties reported in prior studies [36-38]. Since XGBoost supports missing values natively, no imputation method was used in the primary analysis.

Cross-Validation and Modeling Pipeline

A 10-fold stratified cross-validation strategy was used. In each iteration, 1 fold was held out as the test set, while the remaining 9 folds constituted the training set. Within this training partition, a further split was performed to create a validation subset used exclusively for hyperparameter tuning and model selection. All preprocessing and rebalancing procedures that did not require imputation, which are detailed in the next subsection, were performed strictly within the training data of each fold. The test set remained fully held out and was used only for final performance evaluation, preserving the original data distribution and preventing data leakage.
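The splitting scheme described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the toy labels, and the 10% validation fraction are assumptions, since the paper does not report the size of the inner validation split.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=42):
    """Assign each index to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def nested_split(labels, folds, test_fold, val_fraction=0.1, seed=42):
    """Return (train, validation, test) index lists for one outer iteration."""
    test = sorted(folds[test_fold])
    rest = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
    rng = random.Random(seed + test_fold)
    by_class = defaultdict(list)
    for i in rest:
        by_class[labels[i]].append(i)
    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # val_fraction is an assumed value; the validation split is also stratified.
        n_val = max(1, round(len(indices) * val_fraction))
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val), test

# Toy labels with roughly the KRT prevalence reported in this study (9.5%).
labels = [1] * 95 + [0] * 905
folds = stratified_folds(labels, k=10)
train, val, test = nested_split(labels, folds, test_fold=0)
```

Because the test indices are fixed before any tuning or resampling takes place, every later step can only see the training and validation indices.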

Three key hyperparameters were systematically explored: (1) booster (gbtree, gblinear, and dart), which defines the base learner used to build the ensemble; (2) eta (learning rate), which controls the step-size shrinkage during boosting to prevent overfitting by making the learning process more conservative; and (3) max_depth, which controls the maximum depth of individual trees, thereby regulating model complexity. The complete grid of values evaluated for each hyperparameter is reported in Multimedia Appendix 3.

After selecting the optimal hyperparameter configuration, the model was retrained on the full training data (training + validation) following standard practice and prior work [39,40], as well as the default behavior of widely used libraries such as scikit-learn (GridSearchCV with refit=True). The held-out test fold was then used exclusively for final performance evaluation. This process was repeated across all folds, so that each fold served once as the test set, and the reported results correspond to the average performance across the 10 iterations. This strategy ensures robust performance estimation and minimizes data leakage [41,42]. The overview of the analytical pipeline is presented in Figure 2.

Figure 2.

Overview of the analytical pipeline applied within each cross-validation iteration. *booster, eta, max_depth. CV: cross-validation.

Handling of Class Imbalance

To assess the impact of data imbalance on model performance, both oversampling and undersampling techniques were applied exclusively on the training data. For each method, default resampling parameters were used, without additional tuning, following the standard implementation of each algorithm. Due to the intrinsic operational characteristics of certain resampling methods, a fully balanced (1:1) class distribution was not always achieved. The final class proportions after resampling are presented in Multimedia Appendix 3.

Oversampling techniques included Random Oversampling [43], Adaptive Synthetic [44], Synthetic Minority Oversampling Technique (SMOTE) [45], BorderlineSMOTE [46], SVMSMOTE [47], and KMeansSMOTE [48], which increase minority-class representation through random duplication or synthetic sample generation [43,49]. Undersampling methods included Random Undersampling [44], Redundancy-Based Undersampling [39], e2sc-us (Effective, Efficient, and Scalable Confidence-Based-UnderSampling) [39], Condensed Nearest Neighbor [50], Near Miss 1 [51], and Near Miss 2 [51], which reduce majority-class instances while attempting to preserve relevant decision boundaries [39].
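As an illustration of the simplest of these strategies, the sketch below applies random oversampling to a training partition only. It is a hypothetical plain-Python stand-in, not the library implementation used in the study; the function name and the toy data are assumptions.

```python
import random

def random_oversample(X, y, seed=42):
    """Duplicate minority-class rows at random until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: one feature per row, 2 events among 10 training rows.
X_train = [[0.1], [0.4], [0.35], [0.8], [0.2], [0.9], [0.3], [0.5], [0.7], [0.6]]
y_train = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
X_test, y_test = [[0.15], [0.85], [0.45]], [0, 1, 0]

# Only the training partition is resampled; the test fold keeps its original prevalence.
X_bal, y_bal = random_oversample(X_train, y_train)
```

Synthetic methods such as SMOTE replace the duplication step with interpolation between minority-class neighbors, but the train-only restriction is the same.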

Because most resampling algorithms cannot handle missing values, the MissForest imputation method was incorporated into the pipeline when required, exclusively within the training data for rebalancing experiments [52]. Importantly, the imputation model was fitted exclusively on the training partition within each cross-validation fold and subsequently applied to the corresponding validation and test sets, thereby preventing information leakage.

For each resampling strategy, the entire modeling pipeline, including imputation (when applicable), resampling, and model training, was re-executed within each cross-validation iteration. Hyperparameter tuning was performed from scratch for each resampled dataset, ensuring that model optimization was specific to each data configuration. Details of the hyperparameter search space are provided in Multimedia Appendix 4. No information from the test set was used at any stage of model development, including imputation, resampling, or hyperparameter tuning.

To enable direct comparison across models, a fixed probability threshold of 0.5 was used for classification. This threshold was selected based on its consistent performance across preliminary analyses. All experiments were conducted using a fixed random seed (random_state=42), applied consistently across all stochastic components of the pipeline, including cross-validation splitting, imputation, resampling procedures, and model initialization, ensuring full reproducibility.

Performance Evaluation

Model performance was assessed using a complementary set of metrics capturing global discrimination, class-specific behavior, calibration, and clinical usefulness (Multimedia Appendices 5 and 6). All metrics were computed on the held-out test fold in each iteration and averaged across folds. Specifically, we analyzed global metrics, such as accuracy and AUROC, as well as metrics that are more sensitive to class imbalance, such as macro-F₁ and per-class precision and recall for both the majority and minority classes, and examined the impact of different decision thresholds on these metrics.

We first evaluated global performance using accuracy and AUROC. These metrics summarize overall discrimination across all instances but treat all elements equally, regardless of their class, which inherently biases these metrics toward the majority class in imbalanced datasets [2,4]. Accuracy represents the proportion of correctly classified instances [4], while AUROC quantifies the model’s ability to rank positive cases higher than negative ones across all decision thresholds [6,11]. However, what is considered a “correct” prediction depends on the chosen decision threshold for the risk score, as different thresholds influence the balance between sensitivity and specificity [4].
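The ranking interpretation of AUROC can be made concrete with a small sketch. The scores and labels below are hypothetical and chosen only to show how a model can rank nearly all events above nonevents while still missing most events at a 0.5 cutoff.

```python
def auroc(scores, labels):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at(scores, labels, threshold=0.5):
    """Share of true events whose score reaches the cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    return tp / (tp + fn)

# Hypothetical scores: 3 events followed by 10 nonevents.
scores = [0.60, 0.45, 0.30, 0.35, 0.25, 0.20, 0.15, 0.12, 0.10, 0.10, 0.08, 0.05, 0.02]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

auc = auroc(scores, labels)               # 29 of 30 event-nonevent pairs are ordered correctly
recall = recall_at(scores, labels, 0.5)   # only 1 of 3 events reaches the 0.5 cutoff
```

In this toy case the AUROC is about 0.97 while recall for the event class is one-third, because AUROC depends only on the ordering of scores and not on where the cutoff falls.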

To explicitly capture performance under class imbalance, we additionally reported per-class precision, recall, and F₁-score, as well as macro-F₁, which assigns equal weight to each class regardless of prevalence [50,51,53]. Class-specific precision and recall directly quantify errors for both minority and majority outcomes. For the positive class, recall (sensitivity) reflects the proportion of true cases correctly identified, whereas precision (positive predictive value) reflects the proportion of predicted positives that are true events. The same logic applies to the negative class, where recall corresponds to specificity and precision to negative predictive value [54,55].
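The correspondence between the class-specific metrics and the familiar clinical quantities can be sketched from a single confusion matrix. The counts below are hypothetical (a 10% prevalence outcome) and are not taken from the study.

```python
def per_class_metrics(tp, fp, fn, tn):
    """Precision, recall, and F1 for the positive and the negative class."""
    def prf(hit, false_alarm, miss):
        precision = hit / (hit + false_alarm)
        recall = hit / (hit + miss)
        f1 = 2 * precision * recall / (precision + recall)
        return {"precision": precision, "recall": recall, "f1": f1}

    positive = prf(tp, fp, fn)   # recall = sensitivity, precision = positive predictive value
    negative = prf(tn, fn, fp)   # recall = specificity, precision = negative predictive value
    macro_f1 = (positive["f1"] + negative["f1"]) / 2
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return positive, negative, macro_f1, accuracy

# Hypothetical counts: 100 events and 900 nonevents.
pos, neg, macro_f1, acc = per_class_metrics(tp=40, fp=30, fn=60, tn=870)
```

With these counts, accuracy is 0.91 even though only 40% of events are detected; the macro-F₁ is lower than accuracy because the weak minority-class F₁ carries the same weight as the strong majority-class F₁.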

In addition to these class-specific metrics, we evaluated resampling strategies using TPRGap, a bias-oriented measure that quantifies performance disparity between classes as the absolute difference between their true-positive rates. This metric directly captures classifier favoritism toward the majority class, which may persist even when global performance measures remain high [56]. Finally, to summarize the trade-off between precision and recall in a single indicator, we reported the F₁-score and its macroaveraged form, which assigns equal weight to each class and is therefore robust to outcome imbalance [50,51,54,57].
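As a worked illustration of the TPRGap definition, the sketch below applies it to the mean class recalls that this paper reports for the unresampled KRT model in Table 2. The study itself uses the measure to compare resampling strategies; applying it across cutoffs here is only an assumption made to show the arithmetic.

```python
def tpr_gap(recall_positive, recall_negative):
    """Absolute difference between the true-positive rates (recalls) of the two classes."""
    return abs(recall_positive - recall_negative)

# Mean recalls for KRT vs no KRT reported in Table 2.
gap_at_50 = tpr_gap(0.372, 0.966)   # 50% cutoff
gap_at_10 = tpr_gap(0.837, 0.868)   # 10% cutoff
```

The gap is roughly 0.59 at the 50% cutoff and roughly 0.03 at the 10% cutoff, which quantifies how strongly the default threshold favors the majority class.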

While this perspective is common in the ML literature, it may be less intuitive for health care professionals, who are generally more familiar with metrics such as sensitivity and specificity. By reporting both precision and recall for each class, we provide a more nuanced and clinically interpretable understanding of model performance, especially relevant in the presence of class imbalance [50,51,54,57]. This approach enables assessment not only of how well the model identifies patients at risk but also how confidently it excludes those unlikely to experience the outcome. Therefore, it supports a more comprehensive assessment of predictive usefulness and more informed decision-making in clinical applications.

Primary analyses were conducted using a default probability threshold of 0.5, consistent with standard binary classification practice. To explore clinically relevant trade-offs between missed events and false alarms, we further evaluated precision-recall behavior across varying decision thresholds using precision-recall curves [50,56]. The precision-recall curve was generated by plotting precision against recall at various decision thresholds [54].
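A threshold sweep of this kind can be sketched as follows. The scores and labels are hypothetical, and the function name is an assumption; the cutoffs mirror the 50% to 10% range examined in this study.

```python
def precision_recall_points(scores, labels, thresholds):
    """Return (threshold, precision, recall) for the positive class at each cutoff."""
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        # Precision is undefined when nothing is predicted positive.
        precision = tp / (tp + fp) if tp + fp else float("nan")
        recall = tp / (tp + fn) if tp + fn else float("nan")
        points.append((t, precision, recall))
    return points

# Hypothetical predicted probabilities for 4 events and 6 nonevents.
scores = [0.92, 0.71, 0.55, 0.48, 0.33, 0.27, 0.22, 0.15, 0.12, 0.06]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
points = precision_recall_points(scores, labels, [0.5, 0.4, 0.3, 0.2, 0.1])
```

Recall can only stay the same or rise as the cutoff is lowered, whereas precision may move in either direction from one cutoff to the next, which is why both need to be reported per threshold.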

Model calibration was assessed by plotting predicted against observed probabilities and by testing whether the calibration intercept equals 0 and the calibration slope equals 1. In a well-calibrated model, there is agreement between observed and predicted events, allowing the probability to be interpreted as the confidence in the prediction [58,59]. In addition, the overall accuracy of the predicted probabilities was assessed using the Brier score. The Brier score ranges from 0 to 1, with lower values indicating better probabilistic accuracy [15].
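Because the Brier score depends on outcome prevalence, it helps to compare it with the score of an uninformative model that predicts the prevalence for every patient. The sketch below shows this reference value for the two prevalences given in the abstract (9.5% and 18.0%); the function names are assumptions.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

def reference_brier(prevalence):
    """Brier score of a model that predicts the prevalence for every patient."""
    return prevalence * (1 - prevalence)

ref_krt = reference_brier(0.095)     # about 0.086
ref_death = reference_brier(0.180)   # about 0.148

# Check on toy data: a constant prediction of 0.2 with 1 event in 5 patients.
toy = brier_score([0.2] * 5, [1, 0, 0, 0, 0])
```

For rare outcomes the reference value is already small, so a low Brier score on its own says little about how well events are identified.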

Clinical usefulness was assessed through decision curve analysis, which quantifies net benefit across a range of decision thresholds compared with “treat-all” and “treat-none” strategies [55,57]. While decision curves assess whether model-guided decisions outperform simple strategies, they do not ensure balanced error distribution or detect bias toward the majority class, reinforcing the need for class-specific performance metrics [55,57]. Finally, the learning curves were used as a graphical representation of how a model’s performance evolves as training data are added [60].
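Net benefit, as commonly defined for decision curve analysis, weighs true positives against false positives using the odds of the threshold probability. The sketch below uses that standard formula; the function names are assumptions, and the prevalence is the 9.5% KRT rate reported in this study.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of acting on positive predictions at a given threshold probability."""
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(prevalence, threshold):
    """Net benefit when every patient is treated as positive."""
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

prevalence = 0.095  # KRT prevalence reported in this study
treat_all_curve = [
    (pt, net_benefit_treat_all(prevalence, pt)) for pt in (0.05, 0.095, 0.20, 0.50)
]
# The treat-none strategy has a net benefit of 0 at every threshold.
```

The treat-all curve crosses zero where the threshold equals the prevalence and is negative beyond it, so a model adds value only where its curve lies above both reference strategies.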

Risk-of-Bias Assessment and Reporting

This study adheres to the TRIPOD+AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis + Artificial Intelligence) standards for transparent reporting (Multimedia Appendix 7) [6]. To ensure methodological rigor, we used the PROBAST+AI (Updated Quality, Risk of Bias, and Applicability Assessment Tool for Prediction Models Using Regression or Artificial Intelligence Methods) to assess risk of bias and applicability (Multimedia Appendix 8). The study was considered to have a low risk of bias in all domains (participants, predictors, outcomes, and analysis). However, the lack of external validation of the model should be considered as a point of attention in the domain of analysis. Applicability concerns were judged to be low across all domains [16].

Ethical Considerations

The study was approved by the Brazilian National Research Ethics Committee (Comissão Nacional de Ética em Pesquisa; CAAE 30350820.5.0000.0008) and by the ethics board of each participating hospital. The requirement for individual informed consent was waived due to the pandemic situation and because the analysis used deidentified data based on chart review only.

Results

Overview

The database included 17,018 patients (median age of 60 years, IQR 37‐83; 54.6% were men). The outcome distributions were highly imbalanced (Figure 3). Approximately 9.5% (1617/17,018) of the patients underwent KRT, resulting in an imbalance ratio of 9.5:1. Similarly, 18% (3063/17,018) of the patients died, corresponding to an imbalance ratio of 4.6:1.
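The imbalance ratios follow directly from the reported counts, taking the ratio as the number of patients without the outcome per patient with the outcome. The function name below is an assumption.

```python
def imbalance_ratio(n_events, n_total):
    """Number of majority-class patients per minority-class patient."""
    return (n_total - n_events) / n_events

krt_ratio = imbalance_ratio(1617, 17018)     # 15,401 / 1,617
death_ratio = imbalance_ratio(3063, 17018)   # 13,955 / 3,063
```

Rounded to 1 decimal place, these give 9.5 and 4.6, matching the ratios stated above.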

Figure 3.

Imbalanced outcome class distribution. (A) KRT; (B) in-hospital mortality. KRT: kidney replacement therapy.

Prediction of KRT

The predictive XGBoost model demonstrated high overall performance when considering accuracy (0.910) and AUROC (0.928; Table 1). However, due to class imbalance, the other metrics revealed some important observations that would otherwise be overlooked. Notably, the macro-F₁-score of 0.695 suggests relatively lower performance, specifically for the minority class (KRT=yes; Tables 1 and 2). This is mainly due to the low recall (0.372), indicating that the model struggles to correctly identify a large proportion of actual KRT cases. The precision for this class was 0.539, resulting in an F₁-score of 0.439 (Table 2).

Table 1.

Global metrics at different cutoff thresholds for kidney replacement therapy^a.

Cutoff	Accuracy	AUROC^b	Macro-F₁	Brier
50%	0.910 (0.906‐0.914)	0.928 (0.923‐0.933)	0.695 (0.682‐0.708)	0.066 (0.063‐0.069)
40%	0.907 (0.902‐0.912)	0.930 (0.923‐0.937)	0.711 (0.694‐0.728)	0.062 (0.060‐0.064)
30%	0.901 (0.896‐0.906)	0.930 (0.923‐0.937)	0.731 (0.719‐0.743)	0.062 (0.060‐0.064)
20%	0.890 (0.884‐0.896)	0.930 (0.923‐0.937)	0.742 (0.729‐0.755)	0.062 (0.060‐0.064)
10%	0.865 (0.861‐0.869)	0.930 (0.923‐0.937)	0.731 (0.723‐0.739)	0.062 (0.060‐0.064)

^aData are presented as mean (95% CI).

^bAUROC: area under the receiver operating characteristic curve.

Table 2.

Per-class metrics at different cutoff thresholds for kidney replacement therapy^a.

Cutoff	Precision (KRT^b)	Precision (no KRT)	Recall (KRT)	Recall (no KRT)	F₁ (KRT)	F₁ (no KRT)
50%	0.539 (0.508‐0.570)	0.936 (0.934‐0.938)	0.372 (0.345‐0.399)	0.966 (0.961‐0.971)	0.439 (0.415‐0.463)	0.951 (0.949‐0.953)
40%	0.511 (0.481‐0.541)	0.942 (0.938‐0.946)	0.442 (0.406‐0.478)	0.956 (0.953‐0.959)	0.474 (0.442‐0.506)	0.949 (0.946‐0.952)
30%	0.480 (0.460‐0.500)	0.953 (0.950‐0.956)	0.563 (0.536‐0.590)	0.936 (0.933‐0.939)	0.518 (0.495‐0.541)	0.945 (0.942‐0.948)
20%	0.449 (0.430‐0.468)	0.967 (0.963‐0.971)	0.701 (0.668‐0.734)	0.910 (0.905‐0.915)	0.547 (0.525‐0.569)	0.937 (0.934‐0.940)
10%	0.400 (0.390‐0.410)	0.981 (0.978‐0.984)	0.837 (0.817‐0.857)	0.868 (0.864‐0.872)	0.542 (0.529‐0.555)	0.921 (0.918‐0.924)

^aData are presented as mean (95% CI).

^bKRT: kidney replacement therapy.

The confusion matrices for KRT prediction also highlight this trend: lowering the threshold from 50% to 10% enhanced sensitivity, with a 174.5% increase in true positives (from 51 to 140). However, this adjustment also led to a 284.0% rise in false positives (from 50 to 192; Figure 4A).
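The percentage changes quoted above, and the precision implied by the same counts, can be checked with a few lines of arithmetic. The helper name is an assumption; the counts are those cited from Figure 4A.

```python
def percent_increase(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

tp_change = percent_increase(51, 140)    # true positives, 50% vs 10% cutoff
fp_change = percent_increase(50, 192)    # false positives, 50% vs 10% cutoff

# Precision implied by the same counts: TP / (TP + FP).
precision_at_50 = 51 / (51 + 50)
precision_at_10 = 140 / (140 + 192)
```

The implied precision falls from about 0.50 to about 0.42. These values need not equal the fold-averaged precision in Table 2, since the text does not state how the confusion matrix counts were aggregated across folds.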

Figure 4.

(A) Confusion matrix at different cutoff thresholds to predict kidney replacement therapy. (B) Confusion matrix at different cutoff thresholds to predict in-hospital mortality.

Changing the cutoff threshold affected the model’s precision, recall, and F₁ values. In this context, considering the class of interest, which is KRT, and the default cutoff threshold of 50%, the precision was 0.539, recall was 0.372, and F₁-score was 0.439 (Table 2). Lowering the cutoff threshold to 20% resulted in a precision of 0.449, recall of 0.701, and an improved F₁-score of 0.547 (Table 2). The data presented in Multimedia Appendix 9 elucidate the trade-off between precision and recall, where an increase in precision usually implies a reduction in recall and vice versa.
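The macro-F₁ values in Table 1 can be reproduced as the unweighted mean of the per-class F₁-scores in Table 2. The sketch below checks this for the 50% and 20% cutoffs and also recomputes the KRT F₁-score from the mean precision and recall; the function names are assumptions.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def macro_f1(f1_per_class):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_per_class) / len(f1_per_class)

# Per-class F1 means from Table 2 (KRT, no KRT).
macro_at_50 = macro_f1([0.439, 0.951])
macro_at_20 = macro_f1([0.547, 0.937])

# F1 recomputed from the mean precision and recall at the 50% cutoff.
f1_from_means = f1(0.539, 0.372)
```

The two macro values come out at 0.695 and 0.742, as in Table 1. The recomputed F₁ (about 0.440) differs marginally from the reported 0.439, presumably because the reported value averages per-fold F₁-scores rather than applying the formula to averaged precision and recall.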

The calibration plot shows systematic deviation from the diagonal, with predicted probabilities falling below the diagonal at higher values and above it at lower values, indicating overconfidence at the extremes (Figure 5A). This pattern is further supported by a calibration slope of 0.60 and an intercept of −0.14 (Figure 5B), indicating both overconfident predictions and a slight global overestimation of risk.
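One common way to obtain a calibration slope and intercept is to regress the observed outcome on the logit of the predicted probability. The paper does not state its exact procedure, so the sketch below is an assumption: it fits both parameters jointly by Newton-Raphson, and it uses synthetic expected outcome frequencies built from the reported slope (0.60) and intercept (−0.14) rather than study data.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibration_intercept_slope(y, p, iters=50):
    """Fit y ~ sigmoid(a + b * logit(p)) by Newton-Raphson and return (a, b)."""
    x = [logit(pi) for pi in p]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu            # gradient with respect to the intercept
            g1 += (yi - mu) * xi     # gradient with respect to the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b

# Synthetic example: predictions on a grid, with observed frequencies that are
# less extreme than predicted (slope 0.60) and slightly lower overall (intercept -0.14).
p = [i / 20 for i in range(1, 20)]
observed = [sigmoid(-0.14 + 0.60 * logit(pi)) for pi in p]
intercept, slope = calibration_intercept_slope(observed, p)
```

A slope below 1 means that high predicted risks should be pulled down and low predicted risks pulled up, while a negative intercept means that predicted risks are on average too high.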

Figure 5.

(A) Calibration curve for kidney replacement therapy. (B) Plot showing the calibration slope and intercept for kidney replacement therapy. (C) Calibration curve for death. (D) Plot showing the calibration slope and intercept for death.

The precision-recall curve for each class (Figure 6A) showed that the non-KRT class achieved high precision and recall simultaneously, which is desirable, while the KRT class showed a performance closer to random. In other words, the model is relatively good at recalling non–KRT patients (high specificity), but it struggles to identify the ones who underwent KRT (low sensitivity).

The curve for the class that did not undergo KRT remains close to the upper right corner, indicating that the model can achieve high precision and recall simultaneously. Conversely, the curve for the minority class of interest (underwent KRT) approaches the diagonal, suggesting that the model struggles to balance precision and recall, with performance close to random.

The decision curve analysis (Multimedia Appendix 10) indicates that the proposed model generates a positive net benefit for low to moderate decision thresholds, starting at approximately 0.10 and gradually decreasing to zero as the threshold increases to 0.5. Consequently, the model exhibits a practical benefit for decision thresholds up to 0.5, indicating usefulness in scenarios that tolerate decisions based on relatively moderate predicted probabilities. In contrast, the strategy of treating all cases shows a positive net benefit only at very low thresholds, reaching zero around a threshold of 0.10 and becoming increasingly negative thereafter.

Figure 6.

(A) Precision-recall curves for patients undergoing KRT and not undergoing KRT. (B) Precision-recall curves for death, and no death. KRT: kidney replacement therapy.

Prediction of In-Hospital Mortality

The predictive XGBoost model achieved high accuracy (0.900) and AUROC (0.945; Table 3). However, due to class imbalance, the macro-F₁-score of 0.830 indicates lower performance (Table 3). Specifically, for the minority class (deceased=yes), the precision was 0.725, the recall was 0.718, and the F₁-score was 0.721 (Table 4).

Table 3.

Global metrics at different cutoff thresholds for death^a.

Cutoff	Accuracy	AUROC^b	Macro-F₁	Brier
50%	0.900 (0.896‐0.904)	0.945 (0.939‐0.951)	0.830 (0.825‐0.835)	0.072 (0.069‐0.075)
40%	0.900 (0.895‐0.905)	0.945 (0.939‐0.951)	0.837 (0.830‐0.844)	0.072 (0.069‐0.075)
30%	0.896 (0.891‐0.901)	0.945 (0.939‐0.951)	0.837 (0.829‐0.845)	0.072 (0.069‐0.075)
20%	0.893 (0.887‐0.899)	0.945 (0.939‐0.951)	0.840 (0.831‐0.849)	0.072 (0.069‐0.075)
10%	0.878 (0.870‐0.886)	0.945 (0.939‐0.951)	0.827 (0.815‐0.839)	0.072 (0.069‐0.075)

^aData are presented as mean (95% CI).

^bAUROC: area under the receiver operating characteristic curve.

The confusion matrices for mortality prediction demonstrated a similar precision-recall trade-off (Figure 4B). Reducing the cutoff threshold from 50% to 10% enhanced recall, with true positives increasing by approximately 29.2% (from 202 to 261), but also resulted in a 113.6% increase in false positives (from 88 to 188).

As with KRT, the cutoff threshold affected the precision, recall, and F₁-scores. At a 50% cutoff threshold, the precision was 0.725, the recall was 0.718, and the F₁-score was 0.721 (Table 4). Lowering the threshold to 20% increased the recall (0.880) and improved the F₁-score (0.748), while precision decreased to 0.651 (Table 4 and Multimedia Appendix 4). Similar to KRT, calibration was not satisfactory (slope=0.72; intercept=−0.16), although the Brier score was low (0.072; Figure 5C and D and Table 3).

The precision-recall curves for each class (Figure 6B) followed a similar pattern to that observed for KRT, with the non–death class exhibiting higher precision and recall than the death class. Once again, the model performed well in identifying survivors (high specificity) but demonstrated limited ability to detect patients who died (low sensitivity).

The curve for the non–death class remains close to the upper right corner, indicating that the model can achieve high precision and high recall simultaneously. In contrast, the curve for the death class, which is the minority class and of greater interest, approaches the diagonal (random model).

Similar to KRT, the decision curve for death shows that the net benefit of the strategy of treating all cases is approximately 0.2, while the proposed model achieves a substantially higher net benefit, around 0.8 (Multimedia Appendix 11). This behavior demonstrates that indiscriminate intervention quickly becomes inadequate as the decision threshold increases, while the proposed model maintains its practical usefulness over a substantially wider range of thresholds.
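For readers less familiar with decision curves, the quantity plotted is the net benefit defined by Vickers and Elkin: true positives per patient minus false positives per patient weighted by the odds of the decision threshold. The sketch below uses hypothetical counts (not study data), and the helper names are illustrative only.

```python
def net_benefit(tp: int, fp: int, n: int, threshold: float) -> float:
    """Net benefit = TP/n - (FP/n) * (pt / (1 - pt)) at decision threshold pt."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))


def treat_all_net_benefit(n_events: int, n: int, threshold: float) -> float:
    """Treat-all strategy: every event is a TP and every non-event is an FP."""
    return net_benefit(n_events, n - n_events, n, threshold)


# Hypothetical cohort: 1000 patients, 200 events.
# Hypothetical model result at a 20% threshold: 170 TP and 90 FP.
n, n_events = 1000, 200
for pt in (0.05, 0.10, 0.20, 0.30):
    print(f"pt={pt:.2f} treat-all={treat_all_net_benefit(n_events, n, pt):.3f}")
print(f"model at pt=0.20: {net_benefit(170, 90, n, 0.20):.4f}")
```

In this toy cohort the treat-all net benefit falls to zero when the threshold equals the event rate and becomes negative beyond it, which is the general reason indiscriminate intervention loses value as the threshold rises.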

Table 4.

Per-class metrics at different cutoff thresholds for death^a.

Cutoff	Precision (death)	Precision (no death)	Recall (death)	Recall (no death)	F₁ (death)	F₁ (no death)
50%	0.725 (0.703‐0.747)	0.938 (0.934‐0.942)	0.718 (0.710‐0.726)	0.940 (0.934‐0.946)	0.721 (0.712‐0.730)	0.939 (0.937‐0.941)
40%	0.701 (0.678‐0.724)	0.949 (0.946‐0.952)	0.776 (0.768‐0.784)	0.927 (0.920‐0.934)	0.736 (0.725‐0.747)	0.938 (0.935‐0.941)
30%	0.673 (0.652‐0.694)	0.958 (0.955‐0.961)	0.821 (0.811‐0.831)	0.912 (0.906‐0.918)	0.739 (0.726‐0.752)	0.935 (0.931‐0.939)
20%	0.651 (0.630‐0.672)	0.971 (0.969‐0.973)	0.880 (0.872‐0.888)	0.896 (0.889‐0.903)	0.748 (0.734‐0.762)	0.932 (0.928‐0.936)
10%	0.608 (0.585‐0.631)	0.980 (0.977‐0.983)	0.921 (0.910‐0.932)	0.869 (0.861‐0.877)	0.732 (0.714‐0.750)	0.921 (0.916‐0.926)

^aData are presented as mean (95% CI).
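The per-class F₁ values in Table 4 relate to the macro-F₁ in Table 3 through a simple unweighted mean over the 2 classes. The following sketch copies the Table 4 means and averages them; the variable and function names are illustrative, and because the inputs are rounded means, agreement with Table 3 should be expected only to within rounding.

```python
# Per-class F1 means copied from Table 4: cutoff -> (death, no death).
f1_by_cutoff = {
    "50%": (0.721, 0.939),
    "40%": (0.736, 0.938),
    "30%": (0.739, 0.935),
    "20%": (0.748, 0.932),
    "10%": (0.732, 0.921),
}


def macro_f1(per_class_f1):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


for cutoff, scores in f1_by_cutoff.items():
    print(f"{cutoff}: macro-F1 = {macro_f1(scores):.4f}")
```

Because each class contributes equally regardless of its size, weak minority-class performance pulls the macro-F₁ down even when the majority class is predicted almost perfectly.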

The Influence of Class Imbalance on Prediction

The model for KRT, which exhibited higher class imbalance, achieved higher accuracy and AUROC than the model for mortality. However, precision and recall for the minority class were suboptimal, and on these metrics the KRT model performed worse than the mortality model. This pattern was also reflected in the macro-F₁-score, where the KRT model displayed a larger drop in performance, further highlighting the impact of class imbalance.

It is important to highlight that KRT represents a distinct endpoint from mortality. Although both models used the same set of variables, the features and the importance of each feature vary depending on the endpoint (Multimedia Appendices 12 and 13). Therefore, discrepancies in the performance of the KRT and mortality models should not be solely attributed to differences in class balance.

Learning Curves Analysis

The learning curves show a plateau-like shape, with stable validation performance across all training set sizes for both outcomes (Figures 7A and 7B). This pattern suggests limited change in performance as the training set size increases, indicating that model performance does not substantially improve with additional data.

Figure 7.

(A) Learning curves for different training set sizes for kidney replacement therapy. (B) Learning curves for different training set sizes for death. AUROC: area under the receiver operating characteristic curve; F₁: F₁-score; KRT: kidney replacement therapy.

Impact of Balancing Strategies on Prediction Results and on the Metrics

When evaluating the impact of class rebalancing strategies, we observe a clear and systematic discrepancy between AUROC and metrics that explicitly account for class-specific behavior. As shown in Multimedia Appendices 12 and 13, AUROC remains consistently high across all experimental conditions, exceeding 0.8 in every scenario, regardless of whether rebalancing is applied or whether minority-class performance substantially improves or deteriorates. This stability may give the misleading impression that rebalancing strategies have limited effect on model behavior.

However, a closer inspection using alternative metrics reveals a markedly different picture. In particular, TPRGap provides a direct measure of classifier bias induced by class imbalance, capturing disparities in true-positive rates between classes. Using this metric, several rebalancing techniques substantially reduce bias relative to the unbalanced baseline. For instance, in the death outcome, TPRGap decreases from 0.230 in the unbalanced setting to 0.043 when using Redundancy-Based Undersampling, indicating a pronounced reduction in class-dependent performance disparity. In contrast, AUROC changes only marginally in the same scenario, from 0.945 to 0.941, failing to reflect this improvement.
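One way to make the idea of a true-positive rate gap concrete is sketched below. The exact TPRGap formula is not restated in this section, so the code assumes a simple definition, the absolute difference between the per-class recalls; the function names and toy predictions are hypothetical, and the result is not expected to reproduce the values reported above.

```python
def true_positive_rate(y_true, y_pred, label):
    """Recall for one class: correctly predicted members / all members."""
    members = [p for t, p in zip(y_true, y_pred) if t == label]
    if not members:
        raise ValueError(f"no samples with label {label!r}")
    return sum(1 for p in members if p == label) / len(members)


def tpr_gap(y_true, y_pred, labels=(0, 1)):
    """Assumed definition: absolute difference between per-class recalls."""
    first, second = labels
    return abs(true_positive_rate(y_true, y_pred, first)
               - true_positive_rate(y_true, y_pred, second))


# Hypothetical toy data: 4 events and 16 non-events.
y_true = [1, 1, 1, 1] + [0] * 16
y_pred = [1, 1, 0, 0] + [1] + [0] * 15  # half of events found, 1 false alarm

print(f"TPR (events)     = {true_positive_rate(y_true, y_pred, 1):.4f}")
print(f"TPR (non-events) = {true_positive_rate(y_true, y_pred, 0):.4f}")
print(f"TPR gap          = {tpr_gap(y_true, y_pred):.4f}")
```

Under this assumed definition, a gap near zero means both classes are recognized at similar rates, whereas a large gap signals that the classifier favors one class, typically the majority.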

Similar patterns are observed across other class-aware metrics, particularly positive-class precision, recall, and F₁-score. These metrics exhibit significant sensitivity to rebalancing strategies, capturing both beneficial and harmful effects on minority-class performance. In several cases, rebalancing leads to meaningful gains in recall at the expense of precision, or vice versa, reflecting trade-offs that are critical in clinical decision-making contexts. Yet, AUROC remains largely unchanged, masking these trade-offs and providing little insight into how the classifier’s behavior actually shifts.

An especially illustrative example arises in the death outcome under the Near Miss 2 undersampling strategy. In this case, the positive-class F₁-score drops sharply from 0.721 in the unbalanced model to 0.345, signaling a severe degradation in clinically relevant performance. Despite this substantial decline, AUROC remains comparatively high, decreasing from 0.945 to 0.809. This modest reduction does not adequately reflect the magnitude of the performance loss experienced by the minority class, underscoring the disconnect between AUROC and clinically meaningful outcomes.

Discussion

Principal Findings

In recent years, the rapid expansion of ML applications in health care has led to an increasing number of predictive models being proposed for clinical use. However, many of these studies continue to rely primarily, or even exclusively, on AUROC to report model performance, even in highly imbalanced clinical scenarios. Our findings demonstrate the limitations of this practice, showing that high AUROC values may coexist with poor performance for clinically critical minority outcomes.

Although methodological literature has long acknowledged the limitations of AUROC and accuracy in imbalanced settings, clinical prediction studies frequently continue to emphasize these global metrics. By applying complementary class-specific measures and analyzing the learning curves in a large cohort of over 17,000 hospitalized patients with COVID-19, our study provides practical evidence of how metric selection directly influences clinical interpretation, particularly when outcomes are imbalanced.

For both KRT and in-hospital mortality, AUROC values suggested excellent discrimination. However, class-specific metrics revealed substantial deficiencies in identifying minority outcomes. The KRT model achieved an AUROC of 0.928. However, recall for the KRT class was only 0.372, meaning that the model failed to identify nearly two-thirds of patients who would require dialysis. In a real clinical scenario, such as a hospital without dialysis services, this could result in missed opportunities for early referrals, with serious consequences for patient care. This discrepancy was captured by the macro-F₁-score (0.695), which penalizes imbalanced performance across classes and thus offers a more clinically realistic summary of the model’s performance than AUROC alone.
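The coexistence of a high AUROC with low minority-class recall follows from what each metric measures: AUROC depends only on how cases are ranked, whereas recall depends on where the cutoff falls. The sketch below uses hypothetical scores (not study data) in which events are ranked above almost all non-events but mostly receive probabilities below 50%.

```python
def auroc(y_true, y_score):
    """Probability that a random event outranks a random non-event.

    Computed by comparing every event/non-event pair; ties count as 0.5.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def recall_at(y_true, y_score, cutoff):
    """Share of events whose score reaches the cutoff."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    return sum(1 for s in pos if s >= cutoff) / len(pos)


# Hypothetical toy data: 4 events among 20 patients.
y_true = [1, 1, 1, 1] + [0] * 16
y_score = [0.62, 0.41, 0.33, 0.18,
           0.25, 0.12, 0.10, 0.09, 0.08, 0.07, 0.06, 0.05,
           0.05, 0.04, 0.04, 0.03, 0.03, 0.02, 0.02, 0.01]

print(f"AUROC         = {auroc(y_true, y_score):.3f}")
print(f"recall at 50% = {recall_at(y_true, y_score, 0.5):.2f}")
print(f"recall at 15% = {recall_at(y_true, y_score, 0.15):.2f}")
```

In this toy example, near-perfect ranking yields an AUROC close to 1, yet only 1 of the 4 events is flagged at the 50% cutoff; lowering the cutoff recovers the remaining events without any change in AUROC.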

Calibration analysis further highlighted these limitations. Calibration slopes of 0.60 and 0.72, along with negative intercepts, indicate suboptimal calibration for both outcomes. Although Brier scores were low, this likely reflects strong discrimination combined with outcome imbalance, rather than well-calibrated probability estimates. These findings reinforce that no single metric adequately captures model performance in imbalanced clinical settings.
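The dependence of the Brier score on outcome prevalence can be illustrated with an uninformative reference model that assigns every patient the same probability, equal to the event rate p; its Brier score is p(1 − p). The sketch below computes this for hypothetical cohorts (the event counts are illustrative, not study data).

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def constant_prevalence_brier(n_events, n_total):
    """Brier score of a model that predicts the event rate for everyone."""
    prevalence = n_events / n_total
    y_true = [1] * n_events + [0] * (n_total - n_events)
    return brier_score(y_true, [prevalence] * n_total)


# Hypothetical cohorts of 1000 patients with decreasing event rates.
for n_events in (500, 200, 50):
    score = constant_prevalence_brier(n_events, 1000)
    print(f"prevalence={n_events / 1000:.2f} reference Brier={score:.4f}")
```

The reference score shrinks as the outcome becomes rarer, so a Brier score that looks small in absolute terms is best judged against the value an uninformative model would obtain at the same prevalence.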

Precision-recall analysis further highlighted the limitations of AUROC-based evaluation by prioritizing performance on the minority class, which often represents the clinically most relevant outcome [57]. Decision curve analysis showed net benefit across certain thresholds and illustrated that clinical usefulness may vary substantially across thresholds [61,62], but it did not capture how prediction errors were distributed between classes. By examining class-specific precision and recall across varying thresholds, we directly linked model behavior to real-world clinical trade-offs between underdiagnosis and overdiagnosis. Given that class prevalence directly affects metrics such as precision and recall, particularly in imbalanced settings [11,61,63], relying on AUROC alone may obscure clinically relevant deficiencies in minority-class performance. These results underscore the importance of reporting complementary, class-aware metrics, as no single metric adequately captures model performance across different clinical contexts. Therefore, metric selection should be guided by the intended clinical application. In high-risk settings, maximizing recall may be preferable to avoid missing cases, even at the expense of increased false positives. In contrast, in resource-limited settings, higher precision may be prioritized to reduce unnecessary interventions. These trade-offs cannot be adequately captured by a single metric, reinforcing the need for a multidimensional evaluation approach.

Learning curve analysis provided an additional and complementary perspective on model performance. In general, learning curves may reflect either progressive improvement with increasing data or early convergence in relatively simple prediction tasks [60]. In our study, however, the combination of flat learning curves, persistently low minority-class performance, and stable AUROC values across increasing training set sizes suggests limited gains in performance as more data are added, highlighting that AUROC alone may not capture important aspects of model behavior in imbalanced settings.

This finding has relevant methodological implications. Despite consistently high AUROC values (>0.92), the lack of substantial improvement in performance with increasing training data suggests that discrimination alone may not fully reflect how model performance evolves, particularly for identifying high-risk patients. In this context, AUROC may reflect stable ranking ability driven by dataset characteristics, such as class imbalance, rather than improvements in clinically relevant performance. Therefore, reliance on AUROC alone may lead to overestimation of model performance and potential underrecognition of high-risk patients.

Learning curves offer a complementary tool to assess how model performance changes as additional data are incorporated [60]. When performance remains relatively stable, caution is warranted in interpreting high discrimination metrics as sufficient evidence of model adequacy. Together, these findings reinforce that AUROC alone is insufficient to determine whether a model is suitable for clinical use, particularly in imbalanced scenarios.

International reporting frameworks such as PROBAST+AI and TRIPOD+AI emphasize comprehensive evaluation of predictive model performance and transparency in results presentation but remain focused on global performance measures of discrimination, calibration, and overall clinical usefulness [6,16]. Although these frameworks acknowledge the impacts of class imbalance on outcomes, they do not offer strategies for measuring the redistribution of errors across classes with varying thresholds, nor do they recommend incorporating learning curve analysis into model evaluation [6,16].

Additionally, according to PROBAST+AI, applicability concerns were considered low, as the study population, predictors, and outcomes are consistent with real-world clinical settings. However, this should not be interpreted as evidence supporting clinical use of the models. Despite this apparent applicability, the models demonstrated important limitations in clinically relevant performance, including suboptimal calibration and limited sensitivity for the minority class. This apparent contradiction highlights a key finding of our study: even when models are developed using appropriate data and aligned with clinical contexts, reliance on conventional metrics such as AUROC may obscure critical weaknesses. Therefore, methodological soundness and contextual relevance alone are insufficient to ensure clinically meaningful performance, reinforcing the need for comprehensive, class-aware evaluation frameworks before considering any potential clinical implementation.

Our findings provide a methodological contribution that extends beyond these current standards. Specifically, we demonstrate that high AUROC values were maintained despite limited changes in performance across increasing training set sizes. This observation suggests that AUROC alone may not reflect whether a model has learned clinically meaningful patterns but rather may capture stable discrimination driven by dataset characteristics such as class imbalance, highlighting the need for more transparent and comprehensive reporting.

This has important implications: models may comply with current reporting standards while still underperforming on minority outcomes, which are often the most clinically relevant. Because learning curves are rarely reported, this limitation may go unrecognized in many published models. Incorporating learning curve analysis alongside class-specific metrics can therefore enhance transparency and provide a more robust assessment of model performance.

Resampling techniques, including over- and undersampling, resulted in only modest improvements in minority-class performance and did not resolve the fundamental limitations of AUROC-based evaluation [28,62,64,65]. Even when class distributions were artificially modified, AUROC remained largely insensitive to clinically meaningful changes in recall and precision. This reinforces that rebalancing alone cannot compensate for inappropriate performance metrics.

Our study contributes to the literature by bridging theoretical concerns with practical, real-world application. By evaluating 2 predictive models in a large clinical cohort with varying degrees of outcome imbalance, we demonstrate how metric selection and threshold choice directly influence clinical interpretation and, in turn, clinical decision-making. By evaluating per-class precision and recall across multiple cutoffs and visualizing these relationships through precision-recall curves, we make explicit the trade-offs inherent to real-world model deployment.

These trade-offs are particularly relevant in health care, where underdiagnosing high-risk patients (low recall) may lead to missed interventions, while overdiagnosis (low precision) can result in unnecessary procedures and resource strain [66]. In high-stakes settings, such as intensive care units or emergency triage, prioritizing recall may be appropriate, even at the cost of more false positives, to avoid missing patients at risk of deterioration. On the other hand, in resource-constrained environments, higher precision may be preferable. Together, these findings reinforce that no single metric is sufficient and that different clinical contexts require different operating points and different emphases on recall, precision, or their balance, which is effectively summarized by the macro-F₁.
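The idea that different clinical contexts call for different operating points can be sketched with the death-class values from Table 4. The code below is illustrative only: the floors of 0.85 for recall and 0.70 for precision are hypothetical requirements, not recommendations, and the helper names are not part of the study.

```python
# Death-class means copied from Table 4: cutoff -> (precision, recall).
death_class = {
    0.50: (0.725, 0.718),
    0.40: (0.701, 0.776),
    0.30: (0.673, 0.821),
    0.20: (0.651, 0.880),
    0.10: (0.608, 0.921),
}


def cutoff_for_min_recall(metrics, min_recall):
    """Highest-precision cutoff whose recall meets the floor, or None."""
    eligible = {c: pr for c, pr in metrics.items() if pr[1] >= min_recall}
    if not eligible:
        return None
    return max(eligible, key=lambda c: eligible[c][0])


def cutoff_for_min_precision(metrics, min_precision):
    """Highest-recall cutoff whose precision meets the floor, or None."""
    eligible = {c: pr for c, pr in metrics.items() if pr[0] >= min_precision}
    if not eligible:
        return None
    return max(eligible, key=lambda c: eligible[c][1])


# Hypothetical requirements for 2 different settings.
print("recall-first (recall >= 0.85):", cutoff_for_min_recall(death_class, 0.85))
print("precision-first (precision >= 0.70):",
      cutoff_for_min_precision(death_class, 0.70))
```

With these hypothetical floors, a recall-first setting would select the 20% cutoff and a precision-first setting the 40% cutoff, showing how the same model yields different operating points under different clinical priorities.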

Despite growing methodological awareness [4,6-9,50,64], most applied health care research still relies predominantly on AUROC for model evaluation (Multimedia Appendices 14 and 15) [67]. For example, DynaMed, a widely used evidence-based clinical reference platform, currently lists 26 predictive models specifically developed for COVID-19 (1 diagnostic and 25 prognostic), including outcomes such as severe disease progression, thrombosis, intensive care unit admission, KRT, and mortality (Multimedia Appendix 11) [67]. Notably, 88.5% (23/26) of these models report AUROC as the main performance metric [67].

Additionally, some studies report multiple metrics without adequately contextualizing their relevance or the trade-offs involved [67-69]. Our findings highlight the importance of not only reporting multiple metrics but also interpreting them in relation to clinical context and outcome imbalance.

Therefore, metric selection and learning curve analysis may substantially influence the clinical interpretation of model performance. Choosing evaluation strategies that account for outcome imbalance and clinical priorities is essential to support more rigorous evaluation before potential clinical implementation of predictive models.

Limitations

The methodology focused on a single algorithm (XGBoost), although the observed patterns are not model-specific and reflect broader issues related to class imbalance and performance evaluation. External validation was not feasible due to data availability; however, this does not affect the central methodological contribution of the study, which concerns the interpretation of model performance rather than the generalizability of a specific model. Therefore, our findings should not be interpreted as supporting the clinical use of the models presented, given the lack of external validation and suboptimal calibration. Importantly, the aim of this study was not to develop the best-performing predictive model but to examine how evaluation strategies influence the interpretation of model performance in imbalanced clinical scenarios.

Conclusions

In prediction tasks with imbalanced outcomes, which are common in health care, reliance on accuracy and AUROC alone may obscure clinically important failures. Complementary metrics, including precision, recall, and macro-F₁, provide a more realistic assessment of model performance and should be systematically reported. In addition, learning curve analysis offers insight into a model’s learning dynamics and helps explore how model performance evolves as more training data are incorporated. Together, these approaches support a more comprehensive and clinically meaningful evaluation of predictive models, particularly in imbalanced settings, rather than their direct translation into clinical practice.

Acknowledgments

The authors would like to thank the hospitals that are part of this collaboration for supporting this project. They also thank all the clinical staff at those hospitals, who cared for the patients, and all undergraduate students who helped with data collection. The authors declare the use of generative artificial intelligence (GenAI) in the research and writing process. According to the GAIDeT taxonomy (2025), the following tasks were delegated to GenAI tools under full human supervision: assisted linguistic editing and grammatical review. The GenAI tools used were Grammarly, Gemini 1.5, and ChatGPT-5.2. The responsibility for the final manuscript lies entirely with the authors. GenAI tools are not listed as authors and do not bear responsibility for the final outcomes.

Funding

This study was supported in part by the Minas Gerais State Agency for Research and Development (Fundação de Amparo à Pesquisa do Estado de Minas Gerais–FAPEMIG) (grant APQ-01154-21), National Institute of Science and Technology for Health Technology Assessment (Instituto de Avaliação de Tecnologias em Saúde–IATS)/National Council for Scientific and Technological Development (Conselho Nacional de Desenvolvimento Científico e Tecnológico–CNPq) (grant 408659/2024-6), and the Center for Innovation and Artificial Intelligence for Health (CI-IA Saúde), which is funded by the São Paulo State Research Support Foundation (FAPESP) (2020/09866-4), FAPEMIG (PPE-00030-21), and UNIMED Belo Horizonte. MSM was partially supported by CNPq (311742/2025-4).

Data Availability

The data generated or analyzed during this study are included in this paper and its Multimedia Appendices 1-18. The corresponding author is available to provide additional data regarding this manuscript upon reasonable request.

Authors' Contributions

Conception and design of the work: VGJV, MAG, MSM

Data collection: VGJV, BPP, CAP, HRV, MSM

Data curation: MSM

Data analysis and interpretation: VGJV, CMVA, JMA, GFN, LCDR, MAG, MSM

Drafting the paper: VGJV, CMVA, JMA, LCDR, MAG, MSM

Writing – review & editing: VGJV, CMVA, JMA, BPP, CAP, GFN, EB, HRV, KPF, LCDR, MAG, MSM

Project administration: MSM

Supervision: MSM

Reading and approving the final version of the manuscript: all authors

Conflicts of Interest

None declared.

Abbreviations

AUROC

area under the receiver operating characteristic curve

e2sc-us

Effective, Efficient, and Scalable Confidence-Based-UnderSampling

KRT

kidney replacement therapy

ML

machine learning

PROBAST+AI

Updated Quality, Risk of Bias, and Applicability Assessment Tool for Prediction Models Using Regression or Artificial Intelligence Methods

REDCap

Research Electronic Data Capture

SMOTE

Synthetic Minority Over-Sampling Technique

TRIPOD+AI

Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis + Artificial Intelligence

XGBoost

extreme gradient boosting

References

Collins

Dhiman

Evaluation of clinical prediction models (part 1): from development to external validation

BMJ2024018384e074819

10.1136/bmj-2023-074819

38191193

Steyerberg

Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating2019

2025-08-22

Springer Nature

https://doi.org/10.1007/978-3-030-16399-0

van Smeden

Reitsma

Riley

Collins

Moons

Clinical prediction models: diagnosis versus prognosis

J Clin Epidemiol202104132142145

10.1016/j.jclinepi.2021.01.009

33775387

Adhikari

Normand

Bloom

Shahian

Rose

Revisiting performance metrics for prediction with rare outcomes

Stat Methods Med Res202110301023522366

10.1177/09622802211038754

34468239

Cabot

Ross

Evaluating prediction model performance

Surgery2023091743723726

10.1016/j.surg.2023.05.023

37419761

Collins

Moons

KGM

Dhiman

TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods

BMJ20240416385e078378

10.1136/bmj-2023-078378

38626948

Cartus

Samuels

Cerdá

Marshall

BDL

Outcome class imbalance and rare events: an underappreciated complication for overdose risk prediction modeling

Addiction202306118611671176

10.1111/add.16133

36683137

de Paiva

BBM

Pereira

de Andrade

CMV

Potential and limitations of machine meta-learning (ensemble) methods for predicting COVID-19 mortality in a large inhospital Brazilian dataset

Sci Rep20230311313463

10.1038/s41598-023-28579-z

36859446

Liu

Roemer

Comparison of evaluation metrics of deep learning for imbalanced imaging data in osteoarthritis studies

Osteoarthr Cartil20230931912421248

10.1016/j.joca.2023.05.006

D’Agostino

SrVasan

Pencina

General cardiovascular risk profile for use in primary care: the Framingham Heart Study

Circulation200802121176743753

10.1161/CIRCULATIONAHA.107.699579

18212285

Hosmer

Lemeshow

Sturdivant

Applied Logistic Regression2025-08-221

John Wiley & Sons, Inc

https://doi.org/10.1002/9781118548387

10.1002/9781118548387

Iragorri

Spackman

Assessing the value of screening tools: reviewing the challenges and opportunities of cost-effectiveness analysis

Public Health Rev20183917

10.1186/s40985-018-0093-8

30009081

Arnett

Blumenthal

Albert

2019 ACC/AHA guideline on the primary prevention of cardiovascular disease: a report of the American College of Cardiology/American Heart Association task force on clinical practice guidelines

Circulation2019091014011e596e646

10.1161/CIR.0000000000000678

30879355

US Preventive Services Task ForceKrist

Davidson

Behavioral counseling interventions to promote a healthy diet and physical activity for cardiovascular disease prevention in adults with cardiovascular risk factors: US Preventive Services Task Force recommendation statement

JAMA202011243242020692075

10.1001/jama.2020.21749

33231670

Rufibach

Use of Brier score to assess binary predictions

J Clin Epidemiol201008638938939

10.1016/j.jclinepi.2009.11.009

20189763

Moons

KGM

Damen

JAA

Kaul

PROBAST+AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods

BMJ20250324388e082505

10.1136/bmj-2024-082505

40127903

Hageman

Pennells

Ojeda

SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe

Eur Heart J2021071422524392454

10.1093/eurheartj/ehab309

Van Calster

McLernon

van Smeden

Wynants

Steyerberg

Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative

Calibration: the Achilles heel of predictive analytics

BMC Med20191216171230

10.1186/s12916-019-1466-7

31842878

Liu

Krentz

Curcin

Machine learning based prediction models for cardiovascular disease risk using electronic health records data: systematic review and meta-analysis

Eur Heart J Digit Health20250161722

10.1093/ehjdh/ztae080

39846062

Liu

Laranjo

Klimis

Machine-learning versus traditional approaches for atherosclerotic cardiovascular risk prognostication in primary prevention cohorts: a systematic review and meta-analysis

Eur Heart J Qual Care Clin Outcomes2023062194310322

10.1093/ehjqcco/qcad017

36869800

Andersen

Birk-Korch

Hansen

Monitoring performance of clinical artificial intelligence in health care: a scoping review

JBI Evid Synth2024121221224232446

10.11124/JBIES-24-00042

39658865

Oettl

Pareek

Winkler

A practical guide to the implementation of AI in orthopaedic research, part 6: how to evaluate the performance of AI research?

J Exp Orthop202407113e12039

10.1002/jeo2.12039

38826500

Hicks

Strümke

Thambawita

On evaluation metrics for medical applications of artificial intelligence

Sci Rep20220481215979

10.1038/s41598-022-09954-8

35395867

Megahed

Chen

Megahed

Ong

Altman

Krzywinski

The class imbalance problem

Nat Methods202111181112701272

10.1038/s41592-021-01302-4

34654918

Lever

Krzywinski

Altman

Classification evaluation

Nat Methods201608138603604

10.1038/nmeth.3945

Kelly

Karthikesalingam

Suleyman

Corrado

King

Key challenges for delivering clinical impact with artificial intelligence

BMC Med20191029171195

10.1186/s12916-019-1426-2

31665002

Kocak

Klontzas

Stanzione

Evaluation metrics in medical imaging AI: fundamentals, pitfalls, misapplications, and recommendations

Eur J Radiol Artif Intell2025093100030

10.1016/j.ejrai.2025.100030

Carriero

Luijken

de Hond

Moons

KGM

van Calster

van Smeden

The harms of class imbalance corrections for machine learning based prediction models: a simulation study

Stat Med20250210443-4e10320

10.1002/sim.10320

39865585

Recommendations for national SARS-cov-2 testing strategies and diagnostic capacities: interim guidance, 25 June 2021

World Health Organization2021

2025-08-22

https://iris.who.int/handle/10665/342002

Marcolino

Ziegelmann

Souza-Silva

MVR

Clinical characteristics and outcomes of patients hospitalized with COVID-19 in Brazil: results from the Brazilian COVID-19 registry

Int J Infect Dis202106107300310

10.1016/j.ijid.2021.01.019

33444752

Yang

Wei

Kidney health in the COVID-19 pandemic: an umbrella review of meta-analyses and systematic reviews

Front Public Health202210963667

10.3389/fpubh.2022.963667

Harris

Taylor

Thielke

Payne

Gonzalez

Conde

Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support

J Biomed Inform200904422377381

10.1016/j.jbi.2008.08.010

18929686

Harris

Taylor

Minor

The REDCap consortium: building an international community of software platform partners

J Biomed Inform20190795103208

10.1016/j.jbi.2019.103208

31078660

Soriano Marcolino

Minelli Figueira

Pereira Afonso Dos Santos

Silva Cardoso

Luiz Ribeiro

Alkmim

The experience of a sustainable large scale Brazilian telehealth network

Telemed J E Health2016112211899908

10.1089/tmj.2015.0234

27167901

Bicalho

MAC

Aliberti

MJR

Delfino-Pereira

Clinical characteristics and outcomes of COVID-19 patients with preexisting dementia: a large multicenter propensity-matched Brazilian cohort study

BMC Geriatr202401524125

10.1186/s12877-023-04494-w

38182982

Chen

Guestrin

XGBoost: a scalable tree boosting system

Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Aug 13-17, 2016

San Francisco, CA

785794

10.1145/2939672.2939785

Shwartz-Ziv

Armon

Tabular data: deep learning is not all you need

Inf Fusion202205818490

10.1016/j.inffus.2021.11.011

Wang

Chen

Jin

Che

Prediction of type 2 diabetes risk and its effect evaluation based on the XGBoost model

Healthcare (Basel)2020073183247

10.3390/healthcare8030247

32751894

Wilimitis

Walsh

Practical considerations and applied examples of cross-validation for model development and evaluation in health care: tutorial

JMIR AI202312182e49023

10.2196/49023

38875530

Bradshaw

Huemann

Rahmim

A guide to cross-validation for artificial intelligence in medical imaging

Radiol Artif Intell20230754e220232

10.1148/ryai.220232

37529208

Kohavi

A study of cross-validation and bootstrap for accuracy estimation and model selection

14th International Joint Conference on Artificial Intelligence (IJCAI ’95)

Aug 20-25, 1995

Montreal, Canada

11371145

10.5555/1643031.1643047

Adin

Krainski

Lenzi

Liu

Martínez-Minaya

Rue

Automatic cross-validation in structured models: is it time to leave out leave-one-out?

Spat Stat20240862100843

10.1016/j.spasta.2024.100843

Lemaitre

Nogueira

Aridas

Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning

arXivPreprint posted online on Sep 21, 2016

10.48550/arXiv.1609.06570

Garcia

Bai

ADASYN: adaptive synthetic sampling approach for imbalanced learning

2008 IEEE International Joint Conference on Neural Networks (IJCNN 2008)

Jun 1-8, 2008

Hong Kong, China

13221328

10.1109/IJCNN.2008.4633969

Chawla

Bowyer

Hall

Kegelmeyer

SMOTE: Synthetic Minority Over-sampling Technique

J Artif Intell Res200216321357

10.1613/jair.953

Han

Wang

Mao

Huang

Zhang

Huang

Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning

Advances in Intelligent Computing2005

2026-01-28

Springer Nature

878887

https://doi.org/10.1007/11538059_91

10.1007/11538059_91

Nguyen

Cooper

Kamei

Borderline over-sampling for imbalanced data classification

Int J Knowledge Eng Soft Data Paradigms2011314

10.1504/IJKESDP.2011.039875

Douzas

Bacao

Last

Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE

Inf Sci (Ny)201810465120

10.1016/j.ins.2018.06.056

Survey of resampling techniques for improving classification performance in unbalanced datasets

arXivPreprint posted online on Aug 22, 2016

10.48550/arXiv.1608.06048

Murphy

Machine Learning: A Probabilistic Perspective2012

MIT Press

0262018020

Powers

DMW

Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation

J Mach Learn Technol2020

2026-06-17

3763

https://bioinfopublication.org/files/articles/2_1_1_JMLT.pdf

10.9735/2229-3981

Stekhoven

Bühlmann

MissForest--non-parametric missing value imputation for mixed-type data

Bioinformatics2012011281112118

10.1093/bioinformatics/btr597

22039212

Sokolova

Lapalme

A systematic analysis of performance measures for classification tasks

Inf Process Manag. 2009 Jul;45(4):427-437

10.1016/j.ipm.2009.03.002

Davis

Goadrich

The relationship between precision-recall and ROC curves

Proceedings of the 23rd international conference on Machine learning - ICML ’06

Jun 25-29, 2006

Pittsburgh, PA

233-240

10.1145/1143844.1143874

Vickers

Elkin

Decision curve analysis: a novel method for evaluating prediction models

Med Decis Making. 2006;26(6):565-574

10.1177/0272989X06295361

17099194

Lipton

Elkan

Narayanaswamy

Optimal thresholding of classifiers to maximize F1 measure

Mach Learn Knowl Discov Databases. 2014;8725:225-239

10.1007/978-3-662-44851-9_15

26023687

Steyerberg

Vickers

Cook

Assessing the performance of prediction models: a framework for traditional and novel measures

Epidemiology. 2010 Jan;21(1):128-138

10.1097/EDE.0b013e3181c30fb2

20010215

Huang

Macheret

Gabriel

Ohno-Machado

A tutorial on calibration measurements and calibration models for clinical prediction models

J Am Med Inform Assoc. 2020 Apr 1;27(4):621-633

10.1093/jamia/ocz228

32106284

Alba

Agoritsas

Walsh

Discrimination and calibration of clinical prediction models: users’ guides to the medical literature

JAMA. 2017 Oct 10;318(14):1377-1384

10.1001/jama.2017.12126

29049590

Viering

Loog

The shape of learning curves: a review

IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7799-7819

10.1109/TPAMI.2022.3220744

36350870

Simon

Aliferis

Artificial Intelligence and Machine Learning in Health Care and Medical Sciences: Best Practices and Pitfalls. 2024

2026-04-10

Springer International Publishing

https://doi.org/10.1007/978-3-031-39355-6

10.1007/978-3-031-39355-6

JXC

DhakshinaMurthy

George

Branco

The effect of resampling techniques on the performances of machine learning clinical risk prediction models in the setting of severe class imbalance: development and internal validation in a retrospective cohort

Discov Artif Intell. 2024;4(1):91

10.1007/s44163-024-00199-0

39624046

Dinov

Data Science and Predictive Analytics: Biomedical and Health Applications Using R. 2018

2026-04-10

Springer

https://link.springer.com/book/10.1007/978-3-031-17483-4

Welvaars

Oosterhoff

JHF

van den Bekerom

MPJ

Doornberg

van Haarst

OLVG Urology Consortium, and the Machine Learning Consortium

Implications of resampling data to address the class imbalance problem (IRCIP): an evaluation of impact on performance between classification algorithms in medical data

JAMIA Open. 2023 Jul;6(2):ooad033

10.1093/jamiaopen/ooad033

37266187

van den Goorbergh

van Smeden

Timmerman

Van Calster

The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression

J Am Med Inform Assoc. 2022 Aug 16;29(9):1525-1534

10.1093/jamia/ocac093

35686364

Monaghan

Rahman

Agudelo

Foundational statistical principles in medical research: sensitivity, specificity, positive predictive value, and negative predictive value

Medicina (Kaunas). 2021 May 16;57(5):503

10.3390/medicina57050503

34065637

Clinical criteria

DynaMed. 2025 Aug 22

2026-05-14

https://www.dynamed.com/calculators/#cc-idx

Riley

Pate

Dhiman

Archer

Martin

Collins

Clinical prediction models and the multiverse of madness

BMC Med. 2023 Dec 18;21(1):502

10.1186/s12916-023-03212-y

38110939

Bozkurt

Aşuroğlu

Mortality prediction of various cancer patients via relevant feature analysis and machine learning

SN Comput Sci. 2023;4(3):264

10.1007/s42979-023-01720-5

Multimedia Appendix 1

Potential predictors for patients with COVID-19 undergoing kidney replacement therapy.

Multimedia Appendix 2

Potential predictors for in-hospital mortality in patients with COVID-19.

Multimedia Appendix 3

Final proportions in the training partition after application of each rebalancing strategy.

Multimedia Appendix 4

Hyperparameters evaluated for optimization using extreme gradient boosting (XGBoost).

Multimedia Appendix 5

Definitions and characteristics of the metrics frequently used to evaluate the performance of predictive models, in both machine learning and statistics terminology.

Multimedia Appendix 6

Hypothetical confusion matrix and calculation of performance metrics for binary classification.

Multimedia Appendix 7

Checklist for TRIPOD+AI (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis + Artificial Intelligence).

Multimedia Appendix 8

Checklist for PROBAST+AI (Updated Quality, Risk of Bias, and Applicability Assessment Tool for Prediction Models Using Regression or Artificial Intelligence Methods).

Multimedia Appendix 9

Means of performance metrics at different cutoffs for patients undergoing and not undergoing kidney replacement therapy (KRT).

Multimedia Appendix 10

Decision curve for kidney replacement therapy.

Multimedia Appendix 11

Decision curve for death.

Multimedia Appendix 12

Global and per-class metrics for different rebalancing techniques for kidney replacement therapy.

Multimedia Appendix 13

Global and per-class metrics for different rebalancing techniques for death.

Multimedia Appendix 14

Outcomes and evaluation metrics of predictive scores for patients with COVID-19 based on the DynaMed summary.

Multimedia Appendix 15

Outcomes and evaluation metrics of predictive scores for cardiovascular disease based on the DynaMed summary.

Multimedia Appendix 16

Means of performance metrics at different cutoffs for death and no death.

Multimedia Appendix 17

Feature importance and contribution to the final predictive model of kidney replacement therapy.

Multimedia Appendix 18

Feature importance and contribution to the final predictive model of in-hospital mortality.