This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.
Unplanned patient readmissions within 30 days of discharge pose a substantial challenge in Canadian health care economics. To address this issue, risk stratification, machine learning, and linear regression paradigms have been proposed as potential predictive solutions. Ensemble machine learning methods, such as stacked ensemble models with boosted tree algorithms, have shown promise for early risk identification in specific patient groups.
This study aims to implement an ensemble model with submodels for structured data, compare metrics, evaluate the impact of optimized data manipulation with principal component analysis on shorter readmissions, and quantitatively verify the causal relationship between expected length of stay (ELOS) and resource intensity weight (RIW) value for a comprehensive economic perspective.
This retrospective study used Python 3.9 and streamlined libraries to analyze data obtained from the Discharge Abstract Database covering 2016 to 2021. The study used 2 sub–data sets, clinical and geographical data sets, to predict patient readmission and analyze its economic implications, respectively. A stacking classifier ensemble model was used after principal component analysis to predict patient readmission. Linear regression was performed to determine the relationship between RIW and ELOS.
The ensemble model achieved a precision of 0.49 and a slightly higher recall of 0.68 for readmitted patients, indicating a higher incidence of false positives. The model predicted readmission cases better than other models in the literature. Per the ensemble model, readmitted women aged 40 to 44 years and readmitted men aged 35 to 39 years were the most likely to use resources. The regression tables verified the causality of the model and confirmed the trend that patient readmission is much more costly, for both the patient and the health care system, than a continued hospital stay without discharge.
This study validates the use of hybrid ensemble models for predicting economic cost models in health care with the goal of reducing the bureaucratic and utility costs associated with hospital readmissions. The availability of robust and efficient predictive models, as demonstrated in this study, can help hospitals focus more on patient care while maintaining low economic costs. This study predicts the relationship between ELOS and RIW, which can indirectly impact patient outcomes by reducing administrative tasks and physicians’ burden, thereby reducing the cost burdens placed on patients. It is recommended that changes to the general ensemble model and linear regressions be made to analyze new numerical data for predicting hospital costs. Ultimately, the proposed work hopes to emphasize the advantages of implementing hybrid ensemble models in forecasting health care economic cost models, empowering hospitals to prioritize patient care while simultaneously decreasing administrative and bureaucratic expenses.
An open problem that has arisen in Canadian health care economics is the detrimental cost of unplanned patient readmissions in hospitals. North American hospitals define a patient readmission as the admission of a patient within 30 days after discharge [
One of the ways to help reduce patient readmissions is to adopt a preventive approach [
Although deep learning models were used for risk stratification in health care, they had limited success because of the large amount of data required for training [
After determining whether the patients will be readmitted within the next few days, the economic consequences to both the hospital and the patient will be estimated [
The objectives of the proposed work were threefold. The first and main goal was to implement an ensemble model with individual submodels on the structured data and compare the resulting metrics with those of other models that have also explored patient readmission in a heart-disease context. The second goal was to determine the contribution of optimized data manipulation through principal component analysis (PCA) to solving the problem of shorter-time-frame readmissions. The study also aimed to verify the causal relationship between the expected length of stay (ELOS) and resource intensity weight (RIW) value. Providing an understanding of this relationship in a quantitative and causal manner can allow for an in-depth economic perspective, as opposed to only readmittance within 30 days.
Ultimately, the economic and predictive aspects of this model are intended to provide a view on resource allocation for health institutes to better predict readmittance and improve patient-clinician outcomes [
The study used a systematic methodology with Python 3.9 and streamlined libraries to analyze the data obtained from the Discharge Abstract Database (DAD) covering 2016 to 2021 [
Study workflow: data collection (blue), data preparation and machine learning implementation (orange), and outputs (green). DAD: Discharge Abstract Database; PCA: principal component analysis.
Similar to the study by Baruah [
A clinical preprocessing step was performed to isolate specific criteria and remove any potential confounding variables. Arbitrary admission and discharge dates were chosen based on previous calculations to avoid errors or inconsistencies in the data set. To ensure that the minimum relative admission date was ≥0, dates were shifted to a minimum of January 5 of the corresponding data set year. This adjustment enabled the creation of the “LTORET30Days” columns. For feature selection and dimensionality reduction, PCA was used, as it is a common methodology for high-dimensionality data sets.
Clinical workflow: data collection (blue), data preparation and machine learning implementation (orange), implementations (purple), and outputs (green). LGBM: LightGBM; PCA: principal component analysis.
According to the PCA criterion, the components to use were described by the minimum number of features required to obtain a cumulative variance of at least 80% [
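The component-selection criterion described above can be sketched in a few lines of scikit-learn. The data, the helper function name, and the 80% threshold below are illustrative stand-ins, not the study's actual DAD features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def n_components_for_variance(X, threshold=0.80):
    """Return the minimum number of principal components whose
    cumulative explained variance reaches the threshold."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Index of the first component at which the cumulative variance
    # crosses the threshold, converted to a component count
    return int(np.searchsorted(cumulative, threshold) + 1)

# Toy data standing in for the preprocessed clinical features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
k = n_components_for_variance(X, 0.80)
X_reduced = PCA(n_components=k).fit_transform(StandardScaler().fit_transform(X))
```

The reduced matrix `X_reduced` would then be passed to the downstream classifiers in place of the original feature set.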
LGBM and XGB presented a relative advantage with regard to efficient computation and high accuracy on a wide range of data sets, including those with high dimensionality and categorical features [
Random forest was chosen to improve the interpretability of the model when used in conjunction with PCA [
The ensemble model used in this study was a stacking classifier model with a metamodel (final estimator), which was a logistic regression model [
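A minimal sketch of such a stacking classifier, assuming scikit-learn's `StackingClassifier`. Because the paper's XGB and LGBM submodels require extra dependencies, `GradientBoostingClassifier` stands in for the boosted-tree base learners here, and the imbalanced synthetic data is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data mimicking a readmission-style target
X, y = make_classification(n_samples=500, n_features=12,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("boost", GradientBoostingClassifier(random_state=42)),  # stand-in for XGB/LGBM
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-model
)
stack.fit(X_train, y_train)
acc = stack.score(X_test, y_test)
```

The meta-model is trained on the cross-validated predictions of the base estimators, which is what allows the stack to offset the individual submodels' biases.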
To optimize the performance of each base model, hyperparameter tuning was done using a range of values for each parameter [
In addition, a custom function was used to optimize the final estimator of the stacking model, specifically for the logistic regression component [
Tuned parameters organized according to submodels and estimators.
Model  Parameters 
XGB^{a}  max_depth, n_estimators, and learning_rate 
Random forest  bootstrap and max_depth 
LGBM^{b}  learning_rate, n_estimators, num_leaves, min_child_samples, subsample, max_depth, colsample_bytree, reg_alpha, reg_lambda, and min_data_in_leaf 
Logistic regression (stacking ensemble)  solver, penalty, and C 
^{a}XGB: XGBoost.
^{b}LGBM: LightGBM.
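The tuning over parameter ranges can be sketched with a grid search; the search method (`GridSearchCV`), the grid values, and the data below are illustrative assumptions rather than the study's actual ranges, though the parameter names match the random forest row of the table above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid over the random forest parameters listed in the table
param_grid = {"bootstrap": [True, False], "max_depth": [3, 5, None]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid, scoring="f1", cv=3)
search.fit(X, y)
best = search.best_params_  # best combination found over the grid
```

An analogous search over the logistic regression's `solver`, `penalty`, and `C` would tune the stack's final estimator.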
Statistical analysis was performed to ensure that the model was robust and valid for improving patient outcomes. Three evaluation metrics were used to evaluate the robustness of the model.
Precision is the ratio between the number of true positive observations and the total number of predicted positive observations obtained from the confusion matrix [
Recall is the ratio between the number of true positives and the sum of the number of true positives and number of false negatives [
Balancing the 2 quantities required the use of the F1-score, the harmonic mean of precision and recall.
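The three metrics can be computed directly from a toy set of labels (illustrative values, assuming scikit-learn's metric functions):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# TP=3 (indices 3, 4, 6), FP=1 (index 2), FN=1 (index 5)
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean = 0.75
```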
All the scores for the hyperparametertuned data were plotted on a bar graph to ensure a clear presentation of the data [
To determine the relationship between ELOS and RIW, 2 continuous variables that have been shown to be positively correlated with improved patient outcomes, a linear regression analysis was conducted [
To ensure that the results were not biased by confounding factors, the linear regression analyses were conducted separately for each age group, gender, and readmission column class [
Geographic workflow: data collection (blue), data preparation and regression implementation (orange), and outputs (green). MCC: major complication or comorbidity.
After the data were isolated for individuals aged >18 years and the MCC codes, the Python pandas library was used to condition the data set on covariates. The entire data set was then placed into specific clusters based on this conditioning. First, individuals were clustered according to whether they had the same patient readmission column value, and then they were split by gender. Afterward, each data point was separated into age clusters. There were 2 gender clusters for each of the 2 readmission clusters and 15 age clusters for each of the 4 resulting clusters, resulting in 60 linear regressions. To clarify, the main independent variable was ELOS, and the dependent variable was RIW. The data were split to verify the hypothesis that there is indeed an economic benefit to extending a patient’s length of stay rather than readmitting the patient.
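The clustered regressions can be sketched as a grouped fit over the covariates. The synthetic columns below (`readmitted`, `gender`, `age_group`, `ELOS`, `RIW`) are hypothetical stand-ins for the DAD fields, and the linear relationship is planted for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the conditioned geographical data set
rng = np.random.default_rng(1)
n = 600
df = pd.DataFrame({
    "readmitted": rng.integers(0, 2, n),
    "gender": rng.choice(["F", "M"], n),
    "age_group": rng.choice(["18-24", "25-29", "30-34"], n),
    "ELOS": rng.uniform(1, 20, n),
})
df["RIW"] = 0.3 * df["ELOS"] + rng.normal(0, 1, n)  # RIW driven by ELOS

# One ordinary least squares fit per (readmission, gender, age) cluster
results = []
for (readm, gender, age), grp in df.groupby(["readmitted", "gender", "age_group"]):
    model = LinearRegression().fit(grp[["ELOS"]], grp["RIW"])
    pred = model.predict(grp[["ELOS"]])
    results.append({
        "readmitted": readm, "gender": gender, "age_group": age,
        "slope": float(model.coef_[0]), "intercept": float(model.intercept_),
        "r2": model.score(grp[["ELOS"]], grp["RIW"]),
        "rmse": float(np.sqrt(mean_squared_error(grp["RIW"], pred))),
        "n": len(grp),
    })
summary = pd.DataFrame(results)
```

Each row of `summary` corresponds to one fitted line, mirroring the per-age-group rows of the regression tables reported below.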
This study was exempt from research ethics review, as it was a secondary analysis of research data. As data were received directly from acute care facilities or from their respective health or regional authority or ministry or department of health, facilities in all provinces and territories except Quebec were required to report. The authors do not claim any right to the data, as they are the property of Statistics Canada along with the Abacus Student Network [
The results of the main study are presented in this section. The results for the PCA, feature selection stages, and more data can be found in section B in
The evaluation metrics for the ensemble model were presented using classification reports (
Classification reports for different models.^{a}

Model type and class^{d}  Precision  Recall  F1-score

XGB
0^{b}  0.92  0.99  0.95
1^{c}  0.79  0.31  0.44

Random forest
0  0.93  0.97  0.95
1  0.65  0.39  0.48

LGBM
0  0.96  0.91  0.93
1  0.49  0.68  0.57

Ensemble
0  0.96  0.91  0.93
1  0.49  0.68  0.57

^{a}All of these models have been hyperparameter tuned.
^{b}For all models, class 0 contains n=16,592.
^{c}For all models, class 1 contains n=2079.
^{d}Tuned submodels and tuned ensemble models.
A least squares linear regression model was fitted to the ELOS and RIW value columns of the geographical data set, and a summary of the best-fitted lines was obtained (
Regression lines fitted for women who were readmitted within 30 days, separated by age groups.
Age group (years)  Slope (expected length of stay)  Intercept  R^{2}  RMSE^{a}  F statistic  Sample size, n
18-24  0.196661  –0.038496  0.469363  1.222708  67.339641  76
25-29  0.236855  –0.032451  0.482341  2.067290  128.653235  138
30-34  0.195582  0.009696  0.308630  2.507248  69.299619  154
35-39  0.455326  –1.431780  0.587717  3.356237  213.402169  150
40-44  0.776407  –3.014587  0.573208  6.474491  284.385704  212
45-49  0.355912  –0.681188  0.626838  2.286141  572.132763  341
50-54  0.338375  –0.350636  0.656287  1.246735  990.071951  519
55-59  0.269042  0.021663  0.500513  1.631754  776.589838  775
60-64  0.266451  0.023355  0.401983  1.507652  755.872164  1124
65-69  0.361558  –0.433117  0.562777  1.813054  1821.045541  1415
70-74  0.346215  –0.373987  0.455203  2.111974  1393.857933  1668
75-79  0.280143  –0.088512  0.499735  1.388378  1795.095118  1797
>80  0.280403  –0.140523  0.321200  1.600127  1975.140727  4173
^{a}RMSE: root mean squared error.
The 4 types of ensemble models.
Ensemble model  Tuned submodels (Y^{a} or N^{b})  Tuned LR^{c} (Y or N) 
1  N  N 
2  N  Y 
3  Y  N 
4  Y  Y 
^{a}Y: yes.
^{b}N: no.
^{c}LR: logistic regression.
Comparison of existing literature values.
Author name or literature values  Description of model  Comparison to current literature values with precision, recall, and F1-score
Sharma et al [ 
Sharma et al’s [ 
Sharma et al [ 
Jamei et al [ 
Jamei et al [ 
The following scores were given for the 2layer neural network of Jamei et al [ 
Ho et al [ 
Ho et al [ 
The following scores were present in the readmission stage: a recall score of 80% and a precision score of 76%, although these scores may be higher overall because of the presence of more personalized data, such as specific laboratory results for each patient. Furthermore, Ho et al [
^{a}AUC: area under the curve.
^{b}ANN: artificial neural networks.
Regression lines fitted for men who were not readmitted within 30 days, separated by age groups.
Age group (years)  Slope (expected length of stay)  Intercept  R^{2}  RMSE^{a}  F statistic  Sample size, n
18-24  0.133004  0.603021  0.301397  1.165577  228.362523  528
25-29  0.332606  –0.276038  0.614197  1.609255  718.990166  452
30-34  0.492425  –1.069907  0.697588  2.279228  1521.145371  660
35-39  0.447525  –0.844292  0.653325  2.407347  1902.504340  1010
40-44  0.466519  –0.840117  0.647550  2.075903  3118.868054  1698
45-49  0.380460  –0.308439  0.583264  1.770296  3957.666828  2828
50-54  0.420954  –0.557944  0.626193  2.074092  7922.913006  4730
55-59  0.410799  –0.440937  0.570193  2.223704  9108.272364  6866
60-64  0.421378  –0.502756  0.471093  2.769979  7587.014726  8518
65-69  0.341375  –0.053999  0.514460  2.144153  10,030.832915  9467
70-74  0.327909  –0.015719  0.476918  1.992737  8977.140732  9846
75-79  0.331579  –0.114296  0.447201  2.177262  7031.801936  8692
>80  0.296201  –0.082061  0.269552  2.414906  6006.483269  16,275
^{a}RMSE: root mean squared error.
Regression lines fitted for men who were readmitted within 30 days, separated by age groups.
Age group (years)  Slope (expected length of stay)  Intercept  R^{2}  RMSE^{a}  F statistic  Sample size, n
18-24  0.13304  –0.123251  0.659408  1.936411  159.756983  83
25-29  0.434780  –1.444296  0.563547  3.009955  123.663857  96
30-34  0.205049  0.338259  0.414037  1.963718  109.108782  154
35-39  0.503076  –1.434408  0.597649  3.516315  387.201489  261
40-44  0.319638  –0.426751  0.633530  1.556234  708.053550  410
45-49  0.321520  –0.251261  0.543536  2.238206  883.346712  742
50-54  0.305131  –0.157782  0.546973  1.329511  1546.437115  1281
55-59  0.336327  –0.291609  0.516734  1.844321  1963.077772  1836
60-64  0.387830  –0.541003  0.524773  1.978364  2742.875413  2484
65-69  0.356209  –0.334169  0.526643  1.919789  3103.957220  2790
70-74  0.315150  –0.190539  0.503276  1.686930  2989.908128  2951
75-79  0.330981  –0.242378  0.515653  1.935420  2699.848355  2536
>80  0.320212  –0.272116  0.388353  1.864184  2708.979941  4266
^{a}RMSE: root mean squared error.
Regression lines fitted for women who were not readmitted within 30 days, separated by age groups.
Age group (years)  Slope (expected length of stay)  Intercept  R^{2}  RMSE^{a}  F statistic  Sample size, n
18-24  0.290987  –0.168320  0.650458  1.837020  733.188372  306
25-29  0.340779  –0.378737  0.570807  2.726389  589.170872  445
30-34  0.324827  –0.284349  0.584901  2.048568  789.076632  562
35-39  0.368889  –0.515399  0.631981  1.725110  1102.474089  644
40-44  0.253023  0.147074  0.531883  1.569319  1070.319261  944
45-49  0.324630  –0.232875  0.627709  1.493320  2394.218647  1422
50-54  0.301186  –0.073365  0.478817  1.655429  2019.326558  2200
55-59  0.389959  –0.484006  0.576303  1.886088  4468.179114  3287
60-64  0.339517  –0.190579  0.422514  2.111166  3013.638447  4121
65-69  0.297055  –0.029449  0.504601  1.616991  5405.580063  5309
70-74  0.333896  –0.262753  0.482080  2.016495  5622.958698  6043
75-79  0.348289  –0.358721  0.439302  2.057815  5139.697471  6562
>80  0.266803  –0.034627  0.179332  2.489782  4115.368718  18,835
^{a}RMSE: root mean squared error.
The proposed work aimed to use ensemble models and linear regressions for predicting patient readmissions and analyzing their economic consequences [
Although the study used cuttingedge algorithms for classification and regression, there are several critical notes that must be considered [
Another crucial consideration is the computational cost associated with the clinical and geographical data [
In addition, some features in the geographical data, such as the case mix group diagnosis type, could not be split in the geographical data sets because of their high computational cost. This could lead to omitted variable bias and negatively affect the models’ accuracy [
In this section, the clinical data set results are analyzed and compared with those of other existing models in the literature.
The use of PCA offered several advantages. The selection of the components that describe the minimum number of features required to achieve a cumulative variance of at least 80% proved to be effective in preventing overfitting [
Moreover, PCA eliminated the potential for collinearity, which can create unstable and unreliable estimates of the model parameters [
Furthermore, the implementation of PCA in conjunction with stacked classifiers enabled a higher interpretability of the models [
This study found that although the hyperparametertuned XGB model outperformed its base model, it was still less accurate than the other individual submodels. This result is consistent with a previous study conducted in Alberta that also found that XGB models did not provide substantial information on patient readmissions [
By contrast, both the tuned random forest and LGBM models (
The ensemble model was created to minimize and offset the bias and variance of each of the models in discussion [
Upon analyzing the data, it was observed that the default model configuration, which consisted of default submodels and a default final estimator logistic regression, exhibited high precision (0.92) and recall (0.98) for nonreadmitted patients (class 0). However, its ability to predict readmissions (class 1) was comparatively weaker, as evidenced by the lower F1-score for class 1.
The second configuration, which used default submodels with a tuned logistic regression final estimator, demonstrated an improvement in the F1-score for class 1.
The third configuration, which used tuned submodels with a default final estimator logistic regression, yielded high precision (0.92) and recall (0.99) for class 0. However, its performance in predicting readmissions (class 1) was weaker, with a precision of 0.77 and recall of 0.30, leading to a low F1-score for class 1.
The fourth configuration, in which both the submodels and the final estimator logistic regression were tuned, resulted in the highest F1-score for class 1.
Compared with the submodels, the overall tuned ensemble model is identical to the LGBM model: although recall is favored, its balance between precision and recall for class 1, relative to the other models, is useful in preventing too many false positives.
The results of this study are not comparable with Baruah’s [
Note that this list is not exhaustive and that there may be other studies that potentially use stacking classifier models and show better results. The comparison with other studies shows that the model has the potential to be viable and robust, but more tuning and comparison between submodels need to be performed.
One notable limitation of the clinical data set used in this study was the high class imbalance problem. Specifically, there were considerably more training points for class 0 than for class 1, with n=83,083 for class 0 and n=10,271 for class 1. This issue could have led to the trained model being more prone to producing false negatives than to producing false positives, as it was more familiar with class 0 instances and thus had a tendency to classify more instances as class 0 [
Another limitation of the data set was the encoding of the data, which could have influenced the interpretability and accuracy of the model. Specifically, if the model interpreted the encoded data as ordinal, it could have altered the ordinality of the classifier, thereby influencing the classification results. This limitation could have impacted the ability of the model to identify the most relevant features for predicting patient readmission, reducing its interpretability [
Finally, the data set’s lack of information about the specific principal component that contributed to the accurate prediction of the patient data set was another limitation. This limitation could have constrained the model’s ability to explain how the variables were associated with patient readmission, resulting in a lack of transparency in the model’s predictions and reduced ability to elucidate the rationale behind its decisionmaking process. As such, identifying the principal components that contribute to the accurate prediction of the patient data set is critical to improving the interpretability and reliability of the model.
The study results suggested that the model could potentially establish a causal relationship (albeit with a proper regression type) between ELOS and RIW. The anticipated hypothesis was well supported by the tables presented earlier, indicating the importance of the model. The analysis involved an explicit model of a continuous outcome (RIW) that was affected by a measured continuous variable (ELOS), and the results showed a notable impact. This finding encourages the establishment of causality in the relationship between ELOS and RIW.
The relationship between ELOS and RIW was investigated through a linear regression analysis, which produced the coefficient (slope) from the ELOS variables. The study findings indicated that more resources were expended and more time was spent among women aged 40 to 44 years who were readmitted than among those who were not readmitted. In addition, more resources were expended for men aged 35 to 39 years (
The
However, the root mean squared error and
Many fundamental aspects of both the ensemble model and linear regression remain unexplored.
Therefore, the suggested future implementations for the ensemble model are as follows:
Including unstructured data (such as clinical data and text notes) in analysis by a deep neural network and performing logistic regression on all the models to give individuality to a specific patient [
Using deep learning neural networks as a final estimator for the ensemble model and outputting evaluation metrics [
Adding more submodels and optimizing for computational resources such as space, time, and memory [
The suggested improvements for the linear regression are as follows:
An instrumental variable that measures the relationship between ELOS and a selection decision variable should be implemented. The instrumental variables should only be involved in the selection decision process. Afterward, the relationship between RIW and the selection decision variable should be measured to ensure low omitted variable biases [
Logistic regression (logistic by the coefficients) should be performed to ensure that root mean squared error is minimized and a more accurate relationship between the ELOS and RIW can be derived [
These applications can allow for a more indepth analysis and provide a multifaceted perspective in the fields of ML, econometrics, and health care interventions.
This study’s implications are to validate the use of hybrid ensemble models and to predict economic cost models. The availability of robust and efficient predictive models, such as the one presented in this study, can enable hospitals to focus more on patients and less on the utility and bureaucratic costs associated with their readmission. As demonstrated by the evaluation metrics, the ensemble model plays a critical role in ensuring more precise results overall. By implementing a crowdsourcing approach, the model can also estimate the resources required to control future epidemics in an easier, time-sensitive manner while maintaining low economic costs. This is particularly relevant in decentralized, universal, publicly funded systems such as Canada’s, where high inflation in medical equipment, technologies, and maintenance has been observed in the aftermath of the COVID-19 pandemic.
Predicting the relationship between ELOS and RIW can also indirectly predict patient outcomes by reducing bureaucratic and utility costs, thereby reducing the cost burden placed on patients to implement administrative tasks and on physicians to ensure their execution. The ensemble model also considers the specific disease type, and the encoding process has resulted in the classification data being ordinal in nature, which takes into account patient utility in addition to risk stratification.
The linear regression considered the differences in continuous variables while also allowing for a clear distinction among the clustered groups. Further exploration of the cost-benefit economic model can enable hospitals to ensure lower-cost, patient-friendly outcomes. It is recommended that, after several changes to the general ensemble model and the linear regressions, they be used to analyze new and incoming numerical hospital cost data.
In-depth descriptions and mathematical formalisms for all of the submodels and ensemble models, cumulative variance and principal component analysis results, confusion matrices of all of the previous models, graphs for the linear regression, all the code, and references.
DAD: Discharge Abstract Database
ELOS: expected length of stay
ICD-10: International Classification of Diseases, 10th revision
LGBM: LightGBM
MCC: major complication or comorbidity
ML: machine learning
PCA: principal component analysis
RIW: resource intensity weight
XGB: XGBoost
The authors express their profound gratitude to Adrian Stanley from JMIR for his unwavering support and to Benjamin D Fedoruk from STEM Fellowship for his invaluable assistance with ideation and manuscript preparation. This research would not have been possible without the generous support of the sponsors of the 2022 Inter-University Big Data Challenge, including JMIR Publications, Roche, Statistical Analysis System Institute Inc, Canadian Science Publishing, Digital Science, and Overleaf, whose contributions enabled the authors to conduct this research.
This manuscript received first place in the STEM Fellowship Big Data Challenge Inter-University Innovation Award, which was sponsored by JMIR Publications. JMIR Publications provided APF support for the publication of this paper.
All codes have been made available by the authors in
ER assumed leadership in the administration, drafting, and computational efforts for this manuscript. KN and QG made equal contributions to the programming and algorithm development, demonstrating their expertise and commitment to this project. SR and JP contributed equally to the drafting of the manuscript.
None declared.