This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.
Stroke, a cerebrovascular disease, is one of the major causes of death. It places significant health and financial burdens on both patients and health care systems. One of the important risk factors for stroke is health-related behavior, which is becoming an increasingly important focus of prevention. Many machine learning models have been built to predict the risk of stroke or to automatically diagnose stroke, using predictors such as lifestyle factors or radiological imaging. However, no models have been built using lab test data.
The aim of this study was to apply computational methods using machine learning techniques to predict stroke from lab test data.
We used the National Health and Nutrition Examination Survey data sets with three different data selection methods (ie, without data resampling, with data imputation, and with data resampling) to develop predictive models. We used four machine learning classifiers and six performance measures to evaluate the performance of the models.
We found that accurate and sensitive machine learning models can be created to predict stroke from lab test data. Our results show that the data resampling approach performed the best compared to the other two data selection techniques. Prediction with the random forest algorithm, which was the best algorithm tested, achieved an accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and area under the curve of 0.96, 0.97, 0.96, 0.75, 0.99, and 0.97, respectively, when all of the attributes were used.
The predictive model, built using data from lab tests, was easy to use and had high accuracy. In future studies, we aim to use data that reflect different types of stroke and to explore the data to build a prediction model for each type.
Stroke is a neurological deficit caused primarily by acute focal injury of the central nervous system due to a vascular cause. It is a major cause of disability and death worldwide [
In 2019, the American College of Cardiology/American Heart Association released the Guideline on the Primary Prevention of Cardiovascular Disease. The guideline recommends a comprehensive assessment and examination of patients at risk of developing arterial blockages that may lead to a heart attack, a stroke, and possibly death [
Besides assessing known risk factors for stroke, scientists are trying to develop lab tests that can predict stroke. One of the major advantages of using lab test results for prediction is that lab tests are commonly collected in clinical settings, and the information is often well documented in patients’ records. In this study, we explored data-driven approaches using supervised machine learning models to predict the risk of stroke from different lab tests.
Several studies have been able to identify independent laboratory tests that are correlated with stroke using descriptive statistical analysis. Sughrue et al [
These studies demonstrate the value of lab test results for predicting stroke. Our study aimed to leverage lab test results to build machine learning models for stroke prediction. We prepared the data sets using three data selection techniques. For each data selection technique, we then applied four individual machine learning classifiers to build prediction models, and we measured the performance of each model using six different performance measures. Our results indicate that the data resampling technique outperformed the other two data selection techniques, particularly for the decision tree and random forest classifiers.
We used 10-fold cross-validation for training and testing. Models were trained with four different machine learning classifiers, and six performance measures were used to assess their performance. Detailed descriptions of the data sets, classifiers, and performance metrics are given below.
Flow diagram of the study methodology. NHANES: National Health and Nutrition Examination Survey.
The NHANES survey was conducted to examine the health and nutritional status of adults and children in the United States; “NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation” [
Participant selection and prevalence of stroke in the National Health and Nutrition Examination Survey (NHANES).
List of the data attributes.
Featurea | Units |
Age | Years |
Gender | N/Ab |
Albumin, urine | μg/mL |
Creatinine, urine | mg/dL |
White blood cell count | 1000 cells/μL |
Lymphocytes | 1000 cells/μL |
Monocytes | 1000 cells/μL |
Segmented neutrophils | 1000 cells/μL |
Eosinophils | 1000 cells/μL |
Basophils | 1000 cells/μL |
Red blood cell count | Million cells/μL |
Hemoglobin | g/dL |
Hematocrit | % |
Mean cell volume | fL |
Mean cell hemoglobin | pg |
Mean corpuscular hemoglobin concentration | g/dL |
Red cell distribution width | % |
Platelet count | 1000 cells/μL |
Mean platelet volume | fL |
Cotinine, serum | ng/mL |
Red blood cell folate | ng/mL |
aAll data types were numeric, except for “gender,” which was nominal.
bN/A: not applicable; this type of data did not have units.
Several different machine learning algorithms can handle a binary classification problem. In this study, we used four machine learning algorithms: naïve Bayes, BayesNet, J48 (Java implementation of C4.5 algorithm), and random forest. The performance of the algorithms was evaluated and compared for stroke prediction using lab test results as features. Details of the algorithms are as follows:
The J48 algorithm creates a tree based on the C4.5 algorithm with pruning.
The random forest algorithm creates a forest of random trees and outputs the mode of the classes created by individual trees.
The naïve Bayes algorithm creates a classifier based on the naïve Bayes method, which assumes that all attributes are independent.
The BayesNet algorithm creates a classifier based on non–naïve Bayes, which does not assume that all attributes are independent.
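The four classifiers above can be approximated in scikit-learn as a rough sketch (the abbreviation list suggests the study itself used WEKA, so these are illustrative analogues, not the authors' exact configurations; scikit-learn has no Bayesian-network classifier, and J48 is approximated by a CART tree with the entropy criterion):

```python
# Illustrative scikit-learn analogues of the study's WEKA classifiers.
# Note: BayesNet has no direct scikit-learn equivalent (a library such as
# pgmpy would be needed), so only three of the four are sketched here.
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    # assumes all attributes are independent
    "Naive Bayes": GaussianNB(),
    # entropy criterion loosely approximates C4.5/J48 splitting
    "Decision tree (J48-like)": DecisionTreeClassifier(criterion="entropy"),
    # ensemble of random trees; prediction is the majority vote
    "Random forest": RandomForestClassifier(n_estimators=100),
}
```

Each of these exposes the same `fit`/`predict` interface, so they can be swapped into a single evaluation loop.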
In the cross-validation approach, the data set is divided into several equal portions; 5-fold and 10-fold cross-validation, in which the data are divided into 5 and 10 portions, respectively, are the most commonly used [
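The 10-fold procedure can be sketched as follows, with synthetic data standing in for the NHANES lab-test features (the feature count matches Table 1, but all values here are made up):

```python
# Sketch: 10-fold cross-validation, as used in the study, on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 21))            # 21 attributes, mirroring Table 1
y = rng.integers(0, 2, size=200)          # binary stroke / non-stroke label

# Stratified folds keep the class ratio roughly equal in every fold,
# which matters for an imbalanced outcome such as stroke.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(len(scores))  # one accuracy value per fold -> 10
```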
Model accuracy was evaluated based on the following measures: recall or sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and area under the curve (AUC) (or area under the receiver operating characteristic [ROC] curve) to compare the four classifiers. Details of these measures are as follows:
Sensitivity, also known as recall or true positive rate, is the number of true positives divided by the number of true positives plus the number of false negatives. It is the likelihood that the patient has a high risk of stroke [
Specificity, also known as the true negative rate, is the proportion of individuals classified as nonstroke to the total number of actual nonstroke cases. It is the likelihood that a patient who does not have a risk of stroke will have a negative result [
PPV, also known as precision, is the number of true positives divided by the number of true positives plus the number of false positives. It is the proportion of individuals who have suffered a stroke to the total number of participants classified as having a risk of stroke [
NPV is the percentage of negative tests in patients who are free from the disease or the proportion of individuals who have not suffered a stroke to the total number of participants classified as not having a risk of stroke [
Overall accuracy is the number of correctly classified instances over the total size of the data set [
The AUC is the area under the ROC curve, which is constructed by plotting the true positive rate against the false positive rate (1 − specificity) [
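All of these measures except the AUC can be computed directly from the four cells of a confusion matrix; the counts below are illustrative, not the study's:

```python
# Sketch: the reported measures computed from an illustrative confusion matrix.
tp, fn, tn, fp = 97, 3, 96, 4             # made-up counts for demonstration

sensitivity = tp / (tp + fn)              # recall / true positive rate
specificity = tn / (tn + fp)              # true negative rate
ppv = tp / (tp + fp)                      # precision
npv = tn / (tn + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# The AUC additionally requires ranked prediction scores, not just counts
# (e.g., sklearn.metrics.roc_auc_score on predicted probabilities).
print(round(sensitivity, 2), round(specificity, 2))  # 0.97 0.96
```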
We also examined the Pearson correlation coefficient of each independent predictor to investigate the relationship between each lab test and the risk of stroke.
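For a binary outcome, this amounts to correlating each feature column with the 0/1 stroke label; a minimal NumPy sketch on made-up numbers (not study data):

```python
# Sketch: Pearson correlation between one lab-test feature and the stroke
# label. The age and label values below are illustrative only.
import numpy as np

age    = np.array([45, 60, 72, 38, 80, 55, 67, 50], dtype=float)
stroke = np.array([ 0,  0,  1,  0,  1,  0,  1,  0], dtype=float)

# corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(age, stroke)[0, 1]        # older participants -> positive r
```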
In the NHANES data sets, 608 participants suffered from a stroke from 2011 to 2015. The median age of participants who had a stroke was 51 years for both men and women. The numbers of men and women who had a stroke were 220 (36.2%) and 190 (31.3%), respectively; 198 (32.6%) participants did not reveal their gender identity.
After the data collection process, the data were analyzed in three ways: without data resampling, with data imputation, and with data resampling. Data resampling techniques were used to tackle data imbalance problems in the data sets. These sampling techniques are widely used in machine learning–based prediction models in different areas [
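For the imputation approach, missing lab values can be filled with a simple per-feature statistic; a minimal sketch using scikit-learn's `SimpleImputer` with mean imputation (the study does not specify its exact imputation method, and the values below are illustrative):

```python
# Sketch: mean imputation of missing lab values (illustrative data).
import numpy as np
from sklearn.impute import SimpleImputer

# Two made-up lab columns (e.g., hemoglobin, platelet count) with gaps.
X = np.array([[14.2, np.nan],
              [13.1, 250.0],
              [np.nan, 310.0]])

# Each NaN is replaced by the mean of its column.
X_imp = SimpleImputer(strategy="mean").fit_transform(X)
print(bool(np.isnan(X_imp).any()))  # False
```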
In the third analysis, we resampled the data. After resampling, the prediction accuracy improved significantly for both the decision tree and random forest models, but only slightly for the naïve Bayes and BayesNet models.
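The paper does not name its exact resampling method; one common way to tackle this kind of class imbalance is random oversampling of the minority (stroke) class, sketched here with NumPy only on made-up data:

```python
# Sketch: random oversampling of the minority class to balance the data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # illustrative feature matrix
y = np.array([1] * 10 + [0] * 90)         # 10% stroke, 90% non-stroke

# Draw minority-class rows with replacement until both classes match.
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
idx = np.concatenate([np.arange(y.size), extra])

X_bal, y_bal = X[idx], y[idx]
print(int((y_bal == 1).sum()), int((y_bal == 0).sum()))  # 90 90
```

Oversampling duplicates minority rows rather than discarding majority rows, so no information is lost; the trade-off is a higher risk of overfitting to the repeated cases.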
Results of three data analysis techniques.
Technique and classifier | Accuracy | Sensitivity | Specificity | PPVa | NPVb | AUCc |
Without data resampling | | | | | | |
Naïve Bayes | 0.82 | 0.34 | 0.88 | 0.27 | 0.91 | 0.76 |
BayesNet | 0.82 | 0.38 | 0.89 | 0.37 | 0.90 | 0.88 |
Decision tree | 0.83 | 0.33 | 0.87 | 0.14 | 0.95 | 0.73 |
Random forest | 0.86 | 0.55 | 0.86 | 0.01 | 0.99 | 0.87 |
With data imputation | | | | | | |
Naïve Bayes | 0.81 | 0.32 | 0.88 | 0.25 | 0.91 | 0.74 |
BayesNet | 0.86 | 0.53 | 0.92 | 0.54 | 0.92 | 0.85 |
Decision tree | 0.88 | 0.61 | 0.91 | 0.46 | 0.95 | 0.74 |
Random forest | 0.90 | 0.89 | 0.90 | 0.33 | 0.99 | 0.85 |
With data resampling | | | | | | |
Naïve Bayes | 0.82 | 0.33 | 0.88 | 0.29 | 0.90 | 0.74 |
BayesNet | 0.87 | 0.53 | 0.93 | 0.57 | 0.92 | 0.85 |
Decision tree | 0.93 | 0.76 | 0.95 | 0.72 | 0.96 | 0.86 |
Random forest | 0.96 | 0.97 | 0.96 | 0.75 | 0.99 | 0.97 |
aPPV: positive predictive value.
bNPV: negative predictive value.
cAUC: area under the curve.
Performance comparison among three data selection techniques for the decision tree model. AUC: area under the curve; NPV: negative predictive value; PPV: positive predictive value.
Performance comparison among three data selection techniques for the random forest model. AUC: area under the curve; NPV: negative predictive value; PPV: positive predictive value.
Pearson correlation coefficient values of independent predictors.
Independent predictor of stroke | Pearson correlation coefficient (r) |
Age | 0.26 |
Gender | 0.13 |
Red cell distribution width (%) | 0.18 |
Lymphocytes (%) | 0.15 |
Red blood cell folate (ng/mL) | 0.13 |
Segmented neutrophils (%) | 0.12 |
Hemoglobin (g/dL) | 0.11 |
Red blood cell count (million cells/μL) | 0.11 |
Hematocrit (%) | 0.09 |
Lymphocytes (1000 cells/μL) | 0.08 |
Segmented neutrophils (1000 cell/μL) | 0.07 |
The results above show that our models have the potential to perform stroke prediction using lab test data. The random forest model was the best classifier when combined with the data resampling technique.
Also, several observations can be made from the results in
Correlations between these individual lab tests and stroke have been reported in several studies. However, this is the first study to use all of these attributes together to build a prediction model with machine learning algorithms. Our results show that a prediction model built with the random forest algorithm can achieve an accuracy of 0.96.
Machine learning applications are becoming more widely used in the health care sector. The prediction of stroke using machine learning algorithms has been studied extensively. However, no previous work has explored the prediction of stroke using lab tests. The results of several laboratory tests are correlated with stroke. Building a prediction model that can predict the risk of stroke from lab test data could save lives. In this study, we created a prediction model using the random forest algorithm and achieved a 96% accuracy rate. The model can be integrated with electronic health records to provide a real-time prediction of stroke from lab tests. Because of the nature of the data, we could not predict the type of stroke: hemorrhagic or ischemic. In future studies, we aim to use data that provide information about different types of stroke to build prediction models for each type.
AUC: area under the curve
CDC: Centers for Disease Control and Prevention
MPV: mean platelet volume
NCHS: National Center for Health Statistics
NHANES: National Health and Nutrition Examination Survey
NLR: neutrophil-to-lymphocyte ratio
NPV: negative predictive value
PPV: positive predictive value
RDW: red cell distribution width
ROC: receiver operating characteristic
SEQN: sequence number
WEKA: Waikato Environment for Knowledge Analysis
EMA conducted the research design, data collection, and data analysis and wrote the original draft. AA assisted with the literature review of the lab tests. JL revised and edited the original draft and provided guidance throughout the whole research process. This study received no external funding.
None declared.