This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.
Knee osteoarthritis (OA) is the most common form of OA and a leading cause of disability worldwide. Chronic pain and functional loss secondary to knee OA put patients at risk of developing depression, which can also impair their treatment response. However, no tools exist to assist clinicians in identifying patients at risk. Machine learning (ML) predictive models may offer a solution. We investigated whether ML models could predict the development of depression in patients with knee OA and examined which features are the most predictive.
The primary aim of this study was to develop and test ML models to predict depression in patients with knee OA at 2 years and to validate these models using an external data set. The secondary aim was to identify the most important predictive features used by the ML algorithms.
Osteoarthritis Initiative Study (OAI) data were used for model development and external validation was performed using Multicenter Osteoarthritis Study (MOST) data. Forty-two features were selected, which denoted routinely collected demographic and clinical data such as patient demographics, past medical history, knee OA history, baseline examination findings, and patient-reported outcome measures. Six different ML classification models were trained (logistic regression, least absolute shrinkage and selection operator [LASSO], ridge regression, decision tree, random forest, and gradient boosting machine). The primary outcome was to predict depression at 2 years following study enrollment. The presence of depression was defined using the Center for Epidemiological Studies Depression Scale. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC) and F1 score. The most important features were extracted from the best-performing model on external validation.
A total of 5947 patients were included in this study, with 2969 in the training set, 742 in the test set, and 2236 in the external validation set. For the test set, the AUC ranged from 0.673 (95% CI 0.604-0.742) to 0.869 (95% CI 0.824-0.913), with an F1 score of 0.435 to 0.490. On external validation, the AUC varied from 0.720 (95% CI 0.685-0.755) to 0.876 (95% CI 0.853-0.899), with an F1 score of 0.456 to 0.563. LASSO modeling offered the highest predictive performance. Blood pressure, baseline depression score, knee pain and stiffness, and quality of life were the most predictive features.
To our knowledge, this is the first study to apply ML classification models to predict depression in patients with knee OA. Our study showed that ML models can deliver a clinically acceptable level of performance (AUC>0.7) in predicting the development of depression using routinely available demographic and clinical data. Further work is required to address the class imbalance in the training data and to evaluate the clinical utility of the models in facilitating early intervention and improved outcomes.
Knee osteoarthritis (OA) is the most common form of OA and a leading cause of disability worldwide, with global prevalence estimated at 16% for individuals aged 15 years and over [
Several studies suggest that depression has an adverse impact on OA prognosis, quality of life, pain levels, as well as treatment effectiveness [
Unsurprisingly, patients with knee OA and comorbid depression report lower coping ability, which translates into more frequent medical help-seeking and reduced satisfaction from treatment, including surgical interventions such as knee arthroplasty [
Obtaining adequate mental health support should be of primary importance, as the presence of depressive symptoms is a significant predictor of worsening outcomes [
Identifying patients with depression early would be helpful; however, no such tools currently exist. Although one previous study has tried to predict depression in this patient population, the model was based on conventional statistical methods, had low accuracy (area under the receiver operating characteristic curve [AUC]=0.742, 95% CI 0.622-0.862), and lacked external validation [
The primary objective of this study was to apply ML models to predict depression in patients with knee OA, using routinely available clinical data. We hypothesized that ML models can deliver a clinically acceptable level of performance, defined as an AUC greater than 0.7. Our secondary objective was to identify the most important predictive features used by the ML algorithms to make this prediction.
We used data from the Osteoarthritis Initiative (OAI) database for model development and data from the Multicenter Osteoarthritis Study (MOST) for external validation. Both are publicly available, prospective cohort studies investigating knee OA progression in the US population [
We included patients who attended the baseline and 15-month/24-month follow-ups, with preexisting knee OA (defined as the presence of symptoms and radiographic evidence of OA) or at high risk of developing knee OA (symptoms of pain, stiffness, and swelling). Patients with a history of rheumatoid arthritis, missing data for the depression scale scores at either consultation, missing radiographic data, missing baseline examination findings, or missing patient-reported outcome measures were excluded.
No ethical approval was required for this study owing to the open access nature of the OAI and MOST databases.
Our primary outcome was the development of depression at 2 years following enrollment in the database. Depression was defined using the Center for Epidemiological Studies Depression Scale (CES-D), which is based on the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition formulation of depression, containing 20 questions evaluating the severity of psychosomatic symptoms [
In the MOST, follow-up visits were scheduled at different time points compared with those used in the OAI study, and therefore CES-D scores captured during the 15-month visit were used for external validation.
Variable selection was guided by the literature and clinical relevance as judged by the senior author who is a specialist in the field. To facilitate external validation, equivalent variables had to be available in both the OAI and MOST data sets. In total, there were 2532 baseline variables in the OAI database and 1842 baseline variables in the MOST database; 70 and 66 variables were selected from the respective databases for model development. Variables included information on patient demographics, past medical history, knee OA history, baseline examination findings, and baseline patient-reported outcome measures.
Patient demographics included age, sex, ethnicity, BMI, marital status, living arrangements, current employment, education, and smoking status. Past medical history encompassed the history of heart attack, heart failure, stroke, asthma, chronic obstructive pulmonary disease, peptic ulcer disease, diabetes, kidney disease, and osteoporosis medication. Variables relating to knee OA history consisted of past knee injury, past knee surgery, steroid knee injections, analgesic medication for knee pain, as well as other arthritis medication. Baseline examination findings covered systolic and diastolic blood pressure, medial and lateral tibiofemoral, Kellgren-Lawrence grade, the 20-meter-walk test, the five-times-sit-to-stand test, and baseline CES-D score. Patient-reported outcome measures were the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), Physical Activity Scale for the Elderly (PASE), and 12-item Short-Form Health Survey (SF-12).
Smoking status was stratified according to smoking intensity into light (1-5 pack-year history of smoking), moderate (10-20 pack-years), or severe (>20 pack-years). BMI was grouped into underweight (BMI<18.5 kg/m²), normal weight (BMI 18.5-24.9 kg/m²), overweight (BMI 25-29.9 kg/m²), and obese (BMI≥30 kg/m²), as defined by the World Health Organization [
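The BMI grouping described above can be sketched as a small helper. This is a minimal illustration, not the authors' code: the function name `bmi_category` is hypothetical, and the cut points follow the World Health Organization convention in which a BMI of 30 kg/m² or more counts as obese.

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value (kg/m^2) to a WHO weight category.

    Hypothetical helper for illustration; boundaries are half-open so that
    every non-negative BMI value falls into exactly one category.
    """
    if bmi < 0:
        raise ValueError("BMI must be non-negative")
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal weight"
    if bmi < 30.0:
        return "overweight"
    return "obese"


# The mean BMI values reported for the two cohorts fall into different groups.
print(bmi_category(28.4))  # OAI cohort mean
print(bmi_category(30.4))  # MOST cohort mean
```

Using half-open intervals avoids leaving values such as 24.95 or exactly 30 unclassified, which can happen if the published ranges are applied literally.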
Feature engineering involves the combination of separate variables into a new, “engineered” feature, based on domain expertise and literature evidence. This action decreases the number of separate features and has been shown to improve model performance [
Summary of all features included in the model training.
Feature category | Features |
Patient demographics | Age, sex, BMI, ethnicity, employment status, education status, living alone, marital status, smoking status |
Past medical history and medication | Heart attack, heart failure, stroke, asthma, chronic obstructive pulmonary disease, peptic ulcer disease, diabetes, kidney disease, osteoporosis medication |
Knee osteoarthritis history | Knee arthroscopy, knee meniscectomy, ligament repair, other knee surgery, arthritis of other joints, knee injury, steroid knee injections, analgesic medication for knee osteoarthritis, arthritis medication |
Baseline examination findings | Blood pressure, 20-meter-walk test, five-times-sit-to-stand test, KLGa,b, CES-Dc baseline |
Patient-reported outcome measures | WOMACa,d (Total, Pain score, Stiffness score); SF-12e (Physical components, Mental health component); PASEf |
aSeparate feature for the right and left knee.
bKLG: Kellgren-Lawrence Grade.
cCES-D: Center for Epidemiological Studies Depression Scale.
dWOMAC: Western Ontario and McMaster Universities Osteoarthritis Index.
eSF-12: 12-item Short Form Health Survey.
fPASE: Physical Activity Scale for the Elderly.
Missing values in the OAI data set were addressed by coding them as “unknown” to match the MOST data set. Following this imputation, only patients with all observations completed were included for analysis.
Flowchart summarizing the project timeline and steps of model development. AUC: area under the receiver operating characteristic curve; GBM: gradient boosting machine; LASSO: least absolute shrinkage and selection operator; MOST: Multicenter Osteoarthritis Study; OAI: Osteoarthritis Initiative.
Logistic regression is a statistical model that uses a logit function to predict the probability of an observation belonging to the positive class [
LASSO and ridge regression models are based on the logistic regression model [
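For reference, the standard textbook form of logistic regression and its two penalized variants can be written as follows. The notation (coefficients β, penalty strength λ, n patients, m features, predicted probability π) is introduced here for illustration and is not taken from the article.

```latex
\begin{aligned}
\pi_i &= \Pr(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-(\beta_0 + x_i^{\top}\beta)}} \\
\ell(\beta_0, \beta) &= -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \log \pi_i + (1 - y_i)\log(1 - \pi_i) \Bigr] \\
\text{LASSO:}\quad &\min_{\beta_0,\,\beta}\; \ell(\beta_0, \beta) + \lambda \sum_{j=1}^{m} \lvert \beta_j \rvert \\
\text{Ridge:}\quad &\min_{\beta_0,\,\beta}\; \ell(\beta_0, \beta) + \lambda \sum_{j=1}^{m} \beta_j^{2}
\end{aligned}
```

The absolute-value penalty can shrink some coefficients exactly to zero, so LASSO performs feature selection as a by-product of fitting, whereas the squared penalty in ridge regression shrinks coefficients toward zero without removing them.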
Decision tree is a simple, tree-shaped algorithm, in which each branch of the tree determines a possible decision or course of action [
In GBM, multiple tree-based classifiers are trained to augment each other and to reduce the prediction error [
The overall model performance was evaluated on the previously unseen OAI test set and externally validated using the MOST data set.
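An 80/20 hold-out split of this kind can be sketched with the standard library alone. The example below stratifies by outcome so that both subsets keep roughly the same proportion of depressed patients; whether the authors stratified their split is not stated, so this is an assumption, and `stratified_split` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Synthetic labels with the OAI cohort's size and outcome counts
# (3711 patients, 342 with depression at the 2-year visit).
labels = [1] * 342 + [0] * (3711 - 342)
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With these cohort-level counts the sketch yields subsets of 2969 and 742 patients, the same sizes as the training and test sets reported in the article.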
The primary model performance criterion was the AUC, and we considered an AUC greater than 0.7 to indicate clinically acceptable performance [
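To make the AUC criterion concrete, the sketch below computes it from its pairwise definition: the probability that a randomly chosen depressed patient receives a higher predicted score than a randomly chosen nondepressed patient, with ties counted as one half. This is an illustrative stdlib implementation with a hypothetical function name, not the evaluation code used in the study, and the example scores are made up.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 outcomes; scores: predicted risk for each patient.
    Runs in O(n_pos * n_neg) time, which is fine for illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5  # ties count as half a concordant pair
    return concordant / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc_from_scores([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80]))  # 0.75
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which is why a threshold such as 0.7 is read as a statement about ranking quality rather than about the share of patients correctly classified.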
While ML may provide a valuable predictive tool, the clinical implementation often raises concerns due to the model’s complexity, referred to as the “black-box” problem [
The initial OAI data set included 4796 patients (
Summary of patient flow for both databases. CES-D: Center for Epidemiological Studies Depression Scale.
Key patient demographic and clinical data.
Characteristic | OAIa (n=3711) | MOSTb (n=2236) |
Age, mean (SD) | 61.0 (9.1) | 62.1 (8.1) |
BMI, mean (SD) | 28.4 (4.8) | 30.4 (5.9) |
Sex (female), n (%) | 2149 (57.91) | 1297 (58.01) |
Ethnicity (white), n (%) | 3082 (83.05) | 1932 (86.40) |
Blood pressure (hypertension stage≥1), n (%) | 1847 (49.77) | 1008 (45.08) |
Other arthritis, n (%) | 1454 (39.18) | 1071 (47.90) |
Analgesic medication for knee OAc (any), n (%) | 845 (22.77) | 1804 (80.68) |
KLGd, n (%) | | |
Right knee, grade 1 or higher | 2294 (61.82) | 1180 (52.77) |
Left knee, grade 1 or higher | 2206 (59.44) | 1264 (56.53) |
WOMACe, mean (SD) | | |
Right knee | 10.7 (10.3) | 18.6 (17.5) |
Left knee | 10.7 (10.4) | 18.3 (17.5) |
Baseline CES-Df, mean (SD) | 6.3 (6.0) | 6.7 (6.2) |
Depression at 2-year visit, n (%) | 342 (9.22) | 265 (11.85) |
aOAI: Osteoarthritis Initiative.
bMOST: Multicenter Osteoarthritis Study.
cOA: osteoarthritis.
dKLG: Kellgren-Lawrence Grade.
eWOMAC: Western Ontario and McMaster Universities Osteoarthritis Index.
fCES-D: Center for Epidemiological Studies Depression Scale.
In total, six classification models were trained using all 42 features. The results for each model are summarized in
The accuracy, precision, recall, and F1 scores for the test and validation sets are summarized in
Model performance for the internal test set and external validation set.
Ranka | Model | Test set (OAIb), AUCc (95% CI) | External validation set (MOSTd), AUC (95% CI) |
1 | LASSOe | 0.869 (0.824-0.913) | 0.876 (0.853-0.899) |
2 | GBMf | 0.858 (0.813-0.903) | 0.872 (0.849-0.895) |
3 | Ridge | 0.864 (0.818-0.910) | 0.852 (0.827-0.878) |
4 | Random forest | 0.808 (0.741-0.874) | 0.822 (0.790-0.853) |
5 | Logistic regression | 0.837 (0.786-0.888) | 0.808 (0.775-0.840) |
6 | Decision tree | 0.673 (0.604-0.742) | 0.720 (0.685-0.755) |
aModels are ranked by their performance on the external validation data set.
bOAI: Osteoarthritis Initiative.
cAUC: area under the receiver operating characteristic curve.
dMOST: Multicenter Osteoarthritis Study.
eLASSO: least absolute shrinkage and selection operator.
fGBM: gradient boosting machine.
AUC plot of all models tested on the OAI test set (20% of the initial OAI data set). The test set was not used at any stage of model training. AUC: area under the receiver operating characteristic curve; GBM: gradient boosting machine; LASSO: least absolute shrinkage and selection operator; MOST: Multicenter Osteoarthritis Study; OAI: Osteoarthritis Initiative.
AUC plot of all models externally validated on the MOST data set. AUC: area under the receiver operating characteristic curve; GBM: gradient boosting machine; LASSO: least absolute shrinkage and selection operator; MOST: Multicenter Osteoarthritis Study; OAI: Osteoarthritis Initiative.
Accuracy, precision, recall, and F1 scores for the test set, ranked by the F1 score.
Rank | Model | Accuracy | Precision | Recall | F1 |
1 | LASSOa | 0.902 | 0.467 | 0.515 | 0.490 |
2 | Random forest | 0.923 | 0.628 | 0.397 | 0.486 |
3 | Logistic regression | 0.906 | 0.485 | 0.485 | 0.485 |
4 | GBMb | 0.901 | 0.466 | 0.500 | 0.482 |
5 | Ridge | 0.908 | 0.500 | 0.426 | 0.460 |
6 | Decision tree | 0.895 | 0.429 | 0.441 | 0.435 |
aLASSO: least absolute shrinkage and selection operator.
bGBM: gradient boosting machine.
Accuracy, precision, recall, and F1 scores for the validation set, ranked by the F1 score.
Rank | Model | Accuracy | Precision | Recall | F1 |
1 | LASSOa | 0.889 | 0.528 | 0.604 | 0.563 |
2 | Decision tree | 0.890 | 0.538 | 0.536 | 0.537 |
3 | GBMb | 0.865 | 0.453 | 0.657 | 0.536 |
4 | Random forest | 0.894 | 0.556 | 0.506 | 0.530 |
5 | Logistic regression | 0.886 | 0.344 | 0.698 | 0.461 |
6 | Ridge | 0.895 | 0.593 | 0.370 | 0.456 |
aLASSO: least absolute shrinkage and selection operator.
bGBM: gradient boosting machine.
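The F1 score in the two tables above is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes it from the tabulated precision and recall values; the function name is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the tables above.
print(round(f1_score(0.528, 0.604), 3))  # LASSO, external validation set
print(round(f1_score(0.467, 0.515), 3))  # LASSO, test set
print(round(f1_score(0.500, 0.426), 3))  # ridge, test set
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot reach a high F1 score by excelling at precision or recall alone.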
The most important predictive features identified by LASSO were blood pressure, CES-D score at baseline, total WOMAC score for both knees, and mental and physical components of the SF-12 survey. Blood pressure had the highest coefficient (0.173), followed by the baseline CES-D score (0.126), WOMAC total for the right knee (0.004), and WOMAC total for the left knee (0.003). The mental and physical components of SF-12 had negative coefficients (–0.032 and –0.009, respectively).
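One way to read these coefficients is to exponentiate them into odds ratios. The sketch below does this under the assumption that each coefficient is on the log-odds scale per one-unit change in the feature as the authors encoded it; the encoding and scaling of the features are not given here, so magnitudes should not be compared across features on this basis alone. The dictionary keys are illustrative labels, not variable names from the data sets.

```python
import math

# Coefficients as reported in the text above.
lasso_coefficients = {
    "blood_pressure": 0.173,
    "cesd_baseline": 0.126,
    "womac_total_right": 0.004,
    "womac_total_left": 0.003,
    "sf12_mental": -0.032,
    "sf12_physical": -0.009,
}

# exp(coefficient) is the multiplicative change in the odds of depression
# for a one-unit increase in the feature, holding the others fixed.
odds_ratios = {name: math.exp(coef) for name, coef in lasso_coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "higher" if ratio > 1 else "lower"
    print(f"{name}: OR={ratio:.3f} ({direction} odds per unit increase)")
```

Positive coefficients give odds ratios above 1 and negative coefficients give ratios below 1, which matches the direction of the SF-12 components, where higher scores indicate better health.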
The results of this study demonstrate that it is possible, with high accuracy, to predict depression in patients with knee OA using a variety of routinely collected data such as patient demographics, medical history, examination findings, and patient-reported outcome measures. The developed ML models achieved clinically relevant discrimination between depressed and nondepressed patients, with LASSO identified as the best-performing model, yielding an AUC of 0.876 (95% CI 0.853-0.899) on external validation. The accuracies for external validation were high, ranging from 0.865 (GBM) to 0.895 (ridge), meaning that between 86.5% and 89.5% of all patients were correctly classified. However, the F1 scores ranged from 0.456 (ridge) to 0.563 (LASSO). Low F1 scores despite high accuracy imply that the models can identify patients without depression more accurately than those with depression. This is likely due to class imbalance in the data set, which is a common problem in medical research that results in predictive modeling bias toward the majority [
While ML may provide a valuable predictive tool, its clinical implementation often raises concerns due to model complexity, commonly referred to as the “black-box” problem [
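The effect of class imbalance on accuracy can be illustrated with the cohort-level outcome counts reported in the patient characteristics table. The sketch below computes the accuracy of a trivial classifier that labels every patient as nondepressed; the function name is hypothetical and the calculation is an illustration, not part of the original analysis.

```python
def majority_baseline_accuracy(n_total: int, n_positive: int) -> float:
    """Accuracy obtained by always predicting the more frequent class."""
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / n_total


# Depression at the 2-year visit: 342 of 3711 (OAI) and 265 of 2236 (MOST).
print(round(majority_baseline_accuracy(3711, 342), 3))  # OAI
print(round(majority_baseline_accuracy(2236, 265), 3))  # MOST
```

Such a classifier would already be correct for roughly 9 in 10 patients while detecting no cases of depression at all, giving a recall and F1 score of 0 for the depressed class. This is why accuracy alone says little here and why AUC and F1 are the more informative measures.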
ML predictive models have an important role in augmenting clinical judgment, and when compared with standard predictions, they produce more accurate and less variable risk estimates [
The advantage of our models lies in their simplicity as they rely on easily accessible clinical information. In addition, LASSO identified only 6 features to be crucial for prediction, making the model more practical. Blood pressure is routinely measured by primary health care practitioners, and WOMAC, SF-12, and CES-D scores are commonly used patient-reported outcome measures [
To the best of our knowledge, this is the first study applying ML to predict depression in patients with knee OA. One previous study attempted to develop a prediction model based on logistic regression using conventional statistical methods [
Diagnosis of depression is challenging in clinical practice, and ML models have been previously applied to predict illness in different patient populations [
Our study is strengthened by the use of a large patient cohort for model development, testing, and validation. The list of input features was carefully curated, with selection based on literature evidence, domain expertise, and data completeness. In addition, our predictive models were externally validated and performed well in an independent cohort, demonstrating their generalizability and potential for clinical application. Notably, LASSO identified only six features to be crucial for prediction, which showcases the simplicity of our method and the ease with which this tool could be used in a clinical setting.
Several limitations should be addressed in future research. First, the study sample used for model development might not be representative of a general population of patients with knee OA. The prevalence of depressed patients in the training set was 9.2%, which is much lower than the 20% rate previously suggested by the literature [
This is the first study to apply ML classification models to predict depression in patients with knee OA using routinely collected patient data. The LASSO model offered the highest quality of prediction, with an AUC of 0.876 (95% CI 0.853-0.899) on external validation. The advantages of our method include the use of a large patient cohort and routinely collected data, as well as external validation on an independent data set. This tool offers a potential opportunity to assess a patient’s risk of future depression, facilitating early intervention. Further research is required to establish where such a tool would fit within the care pathway, and while the harmful effects of depression on knee OA are well documented, it will be necessary to confirm that early detection and management of depression in this population leads to the expected improvement in outcomes.
AUC: area under the receiver operating characteristic curve
CES-D: Center for Epidemiological Studies Depression Scale
GBM: gradient boosting machine
LASSO: least absolute shrinkage and selection operator
ML: machine learning
MOST: Multicenter Osteoarthritis Study
OA: osteoarthritis
OAI: Osteoarthritis Initiative
PASE: Physical Activity Scale for the Elderly
SF-12: 12-item Short Form Health Survey
WOMAC: Western Ontario and McMaster Universities Osteoarthritis Index
MAA is funded by the Imperial College President’s PhD Scholarship.
ZN, MAA, KM, and GGJ were involved in setting out the project aim and methodology. ZN conducted the literature search and wrote the original draft. ZN, MAA, and KM contributed to data curation and analysis. MAA and GGJ contributed to study design. GGJ supervised the conduct of the study and reviewed and edited the manuscript. All authors had access to the raw data and have approved the final manuscript.
None declared.