Predicting Overweight and Obesity Status Among Malaysian Working Adults With Machine Learning or Logistic Regression: Retrospective Comparison Study

Background: Overweight or obesity is a primary health concern that leads to a significant burden of noncommunicable disease and threatens national productivity and economic growth. Given the complexity of the etiology of overweight or obesity, machine learning (ML) algorithms offer a promising alternative approach in disentangling interdependent factors for predicting overweight or obesity status.
Objective: This study examined the performance of 3 ML algorithms in comparison with logistic regression (LR) to predict overweight or obesity status among working adults in Malaysia.
Methods: Using data from 16,860 participants (mean age 34.2, SD 9.0 years; n=6904, 41% male; n=7048, 41.8% with overweight or obesity) in the Malaysia’s Healthiest Workplace by AIA Vitality 2019 survey, predictor variables, including sociodemographic characteristics, job characteristics, health and weight perceptions, and lifestyle-related factors, were modeled using the extreme gradient boosting (XGBoost), random forest (RF), and support vector machine (SVM) algorithms, as well as LR, to predict overweight or obesity status based on a BMI cutoff of 25 kg/m2.
Results: The area under the receiver operating characteristic curve was 0.81 (95% CI 0.79-0.82), 0.80 (95% CI 0.79-0.81), 0.80 (95% CI 0.78-0.81), and 0.78 (95% CI 0.77-0.80) for the XGBoost, RF, SVM, and LR models, respectively. Weight satisfaction was the top predictor, and ethnicity, age, and gender were also consistent predictor variables of overweight or obesity status in all models.
Conclusions: Based on multi-domain online workplace survey data, this study produced predictive models that identified overweight or obesity status with moderate to high accuracy. The performance of the ML-based and logistic regression models was comparable when predicting obesity among working adults in Malaysia.


Introduction
Overweight and obesity are global health issues that are increasingly recognized as major public health concerns in low- and middle-income countries. In Malaysia, 1 in 2 adults, particularly those of working age (ie, aged 30 to 65 years), is either overweight or obese [1]. This is concerning, as obesity prevalence is rising at a very high rate (3.3%) in this country [2]. The increase in overweight and obesity is related to increases in noncommunicable diseases, the mortality rate, and health care costs, as well as decreases in productivity and economic growth [2][3][4][5].
Obesity is a chronic, relapsing, multifactorial disease that is attributable to individual or biological, psychological, sociocultural, local, and global environmental factors [6][7][8]. As obesity is largely preventable, understanding the determinants of and risk factors for obesity is important for the development of population-based strategies to prevent obesity. Identifying individuals at high risk of obesity enables early intervention to modify obesity risk factors. Conventional statistical methods, such as generalized linear or regression models with a low number of predictor variables, have been successful in identifying obesity [9]. However, given the complexity of the etiology of obesity, regression modeling may not be adept at disentangling nonlinear and interdependent relationships among factors for obesity prediction.
Machine learning (ML) is an advanced data analytical method that uses fine-tuned algorithms to characterize and predict outcomes by learning from data without being explicitly programmed to do so. As health data become more available and accessible, ML techniques are increasingly used to perform complex tasks in obesity research, such as classifying and predicting obesity at individual and group levels [10][11][12]. ML techniques have advantages over regression modeling, as they are data driven and do not necessitate a priori assumptions, such as normality, linearity, and the absence of multicollinearity. In addition, ML techniques are capable of handling high-dimensional and complex data sources beyond numeric sources, and therefore may be able to provide new insights into unexplored predictor variables [9,13]. Thus, ML techniques are likely to be more accurate than regression models in obesity prediction [14].
A wide range of ML-based algorithms incorporating various predictors and risk factors, training set sizes, and degrees of implementation have been used to predict adult obesity [11,14]. The reported accuracy of ML algorithms to predict adult obesity as a binary outcome ranges broadly, from 0.59 to 0.97 for overall accuracy [15][16][17][18][19][20][21][22][23][24] and 0.51 to 0.99 for the area under the curve (AUC) [15,19,20,23,24]. A review suggested that ML-based models predicted childhood and adolescent obesity much better than linear regression [13]. However, studies that have compared the performance of different ML algorithms with regression in adult obesity have reported mixed findings. Some evidence suggests superior performance for ML models compared to regression models [19,21], while some suggests similar or inferior performance [15,17,18,23]. These inconsistencies may partly be due to data quality, variable selection, and the use of different approaches to model fitting, parameter tuning, and validation among studies.
The Malaysia's Healthiest Workplace by AIA Vitality survey is a large, observational online survey of the health and well-being of Malaysian employees [25]. Since 2017 (with the exceptions of 2020 and 2021, because of the COVID-19 pandemic), this annual online workplace survey has collected comprehensive information on Malaysian employees' sociodemographic characteristics, physical and mental health, smoking and alcohol habits, physical activity, diet, musculoskeletal health, and work environment as a database to inform workplace interventions and improve productivity [25]. In this study, we propose an ML-based model to predict overweight and obesity status among employees in Malaysia based on multi-domain variables collected in this large survey. We evaluated the performance of 3 ML algorithms and compared them with logistic regression for the prediction of overweight and obesity status. We hypothesized that ML algorithms would outperform logistic regression models in predicting overweight and obesity status based on BMI.

Study Design and Data
This is a retrospective study of predictive model derivation using data from the Malaysia's Healthiest Workplace by AIA Vitality 2019 survey. This online survey, commissioned by AIA Malaysia and delivered in partnership with RAND Europe, was administered between May and August 2019. The survey, which has taken place annually in Malaysia from 2017 to 2019, aimed to determine workplace productivity and multi-domain factors that influence workplace productivity. Employees from small, medium, and large organizations were invited to answer a 40-minute employee survey questionnaire about their general health, lifestyle behaviors, mental health status, and work environment. The study rationale and methodology have been discussed in detail elsewhere [26][27][28].
The initial data set comprised data submitted by 17,595 participants from 230 companies. We initially included 16,931 participants resident in Malaysia for whom body weight and height data were available; women were included only if they were not pregnant. Participants with (1) body weight more than 200 kg, (2) height more than 200 cm, or (3) BMI values of more than 60 kg/m2 or less than 14 kg/m2 were deemed to have implausible values and were excluded from analysis. After excluding 71 participants who reported implausible weight, height, or BMI, the final data set included 16,860 of the initial 17,595 participants (95.8%) (Figure 1).

Ethics Approval
The use of the data was approved by the Research Ethics Committee Universiti Kebangsaan Malaysia (JEP-2020-707). As the obtained pooled data were anonymized and deidentified, informed consent from the participants was not required. The study results were presented following the reporting guidelines and recommendations for ML [29,30].

Data Preprocessing
An overview of data preprocessing and model development is illustrated in Figure 1. Data preprocessing involved the selection of participants and variables (features) followed by mean substitution of missing data, one-hot encoding of categorical variables, and min-max scaling for data normalization.
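The two numeric preprocessing steps named above, mean substitution and min-max scaling, can be sketched as follows. This is a minimal illustration in plain Python rather than the R code used in the study; the function names and the example column are hypothetical and not taken from the survey data.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric column with one missing answer
sleep_hours = [6.0, None, 8.0, 7.0]
imputed = impute_mean(sleep_hours)   # the missing value becomes 7.0
scaled = min_max_scale(imputed)      # [0.0, 0.5, 1.0, 0.5]
print(imputed, scaled)
```

The text does not state whether the imputation mean and scaling range were computed on the full sample or on the training partition only; the sketch leaves that choice to the caller.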

Outcome Variable
The outcome of interest was overweight or obesity status, defined as a BMI of 25 kg/m2 or more [31]. This was calculated by dividing the self-reported body weight (in kg) by the squared height (in m2). The cutoff of 25 was chosen as Southeast Asians are reported to have higher body fatness at a lower BMI than Europeans [32,33] and are therefore predisposed to elevated cardiovascular risk factors and other adverse effects of obesity at lower BMI ranges (23 kg/m2 to 25 kg/m2), as observed in local studies [34,35]. Further, a recent study suggested that a BMI of 24.8 kg/m2 is an optimal BMI cutoff to define obesity among Malaysian adults based on percentage of body fat [36].
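The outcome definition, together with the plausibility limits given under Study Design and Data, can be written out as a short sketch. The function names are hypothetical, and height is assumed to be recorded in centimeters.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight (kg) divided by squared height (m2)."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def is_plausible(weight_kg, height_cm):
    """Apply the exclusion limits: weight <= 200 kg, height <= 200 cm, BMI 14-60."""
    value = bmi(weight_kg, height_cm)
    return weight_kg <= 200 and height_cm <= 200 and 14 <= value <= 60


def has_overweight_or_obesity(weight_kg, height_cm, cutoff=25.0):
    """Binary outcome: BMI at or above the cutoff (25 kg/m2 in this study)."""
    return bmi(weight_kg, height_cm) >= cutoff


print(round(bmi(70, 170), 1))               # 24.2
print(has_overweight_or_obesity(70, 170))   # False
print(has_overweight_or_obesity(80, 170))   # True
print(is_plausible(250, 170))               # False
```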

Predictor Variables
Initially, the data set consisted of 556 predictor variables. A total of 473 variables that contained redundant information or text information with more than 20% missing or nonapplicable data were removed from the data set. The reduced data set included 83 variables that were grouped into the following 4 main domains: sociodemographic characteristics, job characteristics, status perception, and lifestyle-related behaviors (the list of predictor variables is included in Multimedia Appendix 1).
Categorical variables (n=16) were one-hot encoded into binary variables. For instance, weight satisfaction was assessed by a categorical question that prompted participants to select 1 of 3 statements that best described how they felt about their current body weight. The participants indicated whether they (1) were happy with their weight, (2) were not happy with their weight but had no intention of losing or gaining weight, or (3) wanted to change their weight. This categorical variable was subsequently encoded into 3 binary variables (ie, "weight_satisfaction_1," "weight_satisfaction_2," and "weight_satisfaction_3"). Finally, prediction models were trained and tested on the final 165 normalized variables. A total of 120 (73%) of these 165 predictor variables were binary (yes/no) variables.
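The encoding of the weight satisfaction item can be illustrated with a minimal sketch. The helper name `one_hot` is hypothetical; the resulting column names follow the pattern quoted above.

```python
def one_hot(value, levels, prefix):
    """Turn one categorical answer into a set of 0/1 indicator variables."""
    return {f"{prefix}_{level}": int(value == level) for level in levels}


# A participant who chose statement 3 ("wanted to change their weight")
encoded = one_hot(3, levels=(1, 2, 3), prefix="weight_satisfaction")
print(encoded)
# {'weight_satisfaction_1': 0, 'weight_satisfaction_2': 0, 'weight_satisfaction_3': 1}
```

Applying this to each of the 16 categorical variables is what expands the variable count beyond the original 83.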

Model Development
The R (version 3.6.1; R Software Foundation) package "caret" (version 6.0-90) was used for model training and validation [37]. Based on a random 70:30 split, a total of 11,803 participants, including 4934 (41.8%) with overweight or obesity, were used to train the model. The remaining 30% of the participants (5057/16,860) were used to predict the obesity outcome during model validation.
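A random 70:30 partition of this kind can be sketched as below. This is a plain shuffled split in Python, offered as an illustration only: the exact sampling routine and seed used in the study are not stated here, and the seed value and function name are assumptions.

```python
import random


def train_test_split(ids, train_fraction=0.7, seed=2019):
    """Shuffle participant ids reproducibly and cut them into two partitions."""
    rng = random.Random(seed)          # fixed seed so the split can be repeated
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


train_ids, test_ids = train_test_split(range(100))
print(len(train_ids), len(test_ids))   # 70 30
```

Keeping the test partition untouched until the final evaluation is what allows the reported metrics to be read as out-of-sample estimates.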
Three supervised, nonlinear ML classifiers were applied, namely extreme gradient boosting (XGBoost), random forest (RF), and a support vector machine (SVM). XGBoost is a tree-based ensemble algorithm that uses a boosting method to create multiple decision trees sequentially. The algorithm combines the predictions of weak decision trees to produce a more robust final model. Built on the gradient boosting framework, XGBoost is a popular learning algorithm due to its high predictive power and efficiency in handling continuous and categorical data using relatively low computational power [38]. RF is also an ensemble method but uses a bagging method to train multiple decision trees in parallel using random selection of predictors. The final model merges predictions from each decision tree to predict a class [39]. Finally, SVMs use a kernel-based algorithm to construct a decision boundary or hyperplane that best separates the data into 2 classes in n-dimensional space. SVMs use extreme cases, also known as support vectors, to create an optimal hyperplane that has the maximum margin between the vectors [40].
In this study, logistic regression (LR) was compared with the 3 ML models. Logistic regression is a part of the generalized linear model and is the conventional classifier for categorical outcome responses. The algorithm assumes a linear relationship between the predictor variables and the log odds (logit) of obesity as the outcome in this study. All predictor variables were included in the model, regardless of statistical significance, to maintain comparability across models. The goodness of fit of the logistic regression model was demonstrated by a McFadden R2 value of 0.3452 and a Nagelkerke R2 value of 0.3452. The probability produced by the logistic regression was subsequently assigned to a binary outcome (overweight/obese or not), based on the customary probability cutoff point of 0.5.
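The chain from linear predictor to binary label, and the standard definition of the McFadden pseudo-R2, can be made concrete with a small sketch. The coefficients and log-likelihood values below are invented for illustration and are not estimates from this study.

```python
import math


def predicted_probability(intercept, coefficients, features):
    """Linear predictor on the log-odds scale, mapped to a probability."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


def classify(probability, cutoff=0.5):
    """Assign the positive class when the probability reaches the cutoff."""
    return int(probability >= cutoff)


def mcfadden_r2(loglik_model, loglik_null):
    """McFadden pseudo-R2: 1 minus the ratio of fitted to null log-likelihood."""
    return 1.0 - loglik_model / loglik_null


# Hypothetical model with two predictors; log odds = -1.0 + 2.0*1 + 0.5*0 = 1.0
p = predicted_probability(-1.0, [2.0, 0.5], [1, 0])
print(round(p, 3), classify(p))                 # 0.731 1
print(round(mcfadden_r2(-65.0, -100.0), 2))     # 0.35
```

A probability of 0.5 corresponds to log odds of 0, so the 0.5 cutoff is equivalent to classifying on the sign of the linear predictor.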
The details of the package, functions, and parameters used in this study are presented in Multimedia Appendix 2. Using a grid search approach, the best combinations of parameters were employed for each algorithm. All models were tuned using 10-fold cross-validation repeated 3 times. Using the varImp function of the caret library, model-specific metrics were used to identify the best-performing predictors. To present the relative ranking of each predictor, the measures of importance for all models were scaled to have a maximum value of 100.
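The resampling and tuning scheme described above can be outlined in plain Python. This sketch only generates the resampling indices, enumerates a parameter grid, and rescales importance scores; it does not fit any model. The grid values, seed, and raw importance scores are hypothetical (the actual settings are those in Multimedia Appendix 2).

```python
import itertools
import random


def repeated_kfold(n_samples, k=10, repeats=3, seed=1):
    """Yield (train_idx, valid_idx) pairs for k-fold CV repeated several times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)                       # new random fold assignment
        folds = [order[i::k] for i in range(k)]
        for held_out in range(k):
            valid = folds[held_out]
            train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
            yield train, valid


def parameter_grid(**options):
    """Enumerate every combination of the candidate parameter values."""
    names = list(options)
    for combo in itertools.product(*(options[name] for name in names)):
        yield dict(zip(names, combo))


def scale_importance(raw):
    """Rescale importance scores so that the top predictor equals 100."""
    top = max(raw.values())
    return {name: 100.0 * value / top for name, value in raw.items()}


splits = list(repeated_kfold(n_samples=50))
grid = list(parameter_grid(max_depth=[2, 4], eta=[0.1, 0.3]))  # hypothetical grid
print(len(splits), len(grid))   # 30 4
print(scale_importance({"weight_satisfaction_1": 8.0, "age": 2.0}))
```

With 10 folds repeated 3 times, each candidate parameter combination is scored on 30 held-out folds, and the combination with the best average score is retained.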

Model Evaluation
The final trained models were saved and restored for prediction using a separate test data set (n=5057) and for comparison with other models. Classification metrics were obtained from the confusion matrix (confusionMatrix) embedded in the caret package. A prediction of overweight or obesity status was considered a positive prediction. Performance was assessed by 4 main metrics (the first 3 metrics are limited in their discriminating power in selecting the best classifier [41], but they are the most common metrics used in the literature and are therefore presented for comparison with other studies): (1) accuracy, the proportion of correct predictions divided by the total number of instances evaluated; (2) sensitivity (also known as the true positive rate), the proportion of actual positives (ie, overweight or obese status) that were correctly predicted; (3) specificity (also known as the true negative rate), the proportion of actual negatives (ie, no overweight or obese status) that were correctly predicted; and (4) AUC, which represents a tradeoff between sensitivity and specificity and served as the main metric for model evaluation. AUC is extracted from the receiver operating characteristic (ROC) curve, which is the plot of the true positive rate (ie, sensitivity) against the false positive rate (ie, 1-specificity). An AUC above 0.5 indicates that the model distinguishes positives (ie, subjects with overweight or obesity) from negatives better than chance. In general, an AUC of 0.7 to <0.8 is considered acceptable, 0.8 to <0.9 excellent, and 0.9 or above outstanding predictive performance [42]. The ROC curves and corresponding AUCs were computed and plotted with the pROC package.
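The 4 metrics can be computed from first principles as in the sketch below. The labels and predicted probabilities are toy values, not study data, and the AUC is calculated through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case), which equals the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from the confusion matrix cells."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc(y_true, scores):
    """Share of positive-negative pairs ranked correctly (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = overweight or obesity, 0 = not
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [int(s >= 0.5) for s in scores]   # 0.5 probability cutoff

print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'sensitivity': 0.75, 'specificity': 0.75}
print(auc(y_true, scores))   # 0.9375
```

Unlike the first 3 metrics, the AUC does not depend on the chosen probability cutoff, which is one reason it is suited to serve as the main comparison metric.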
The performance metrics of all predictive models are presented as point estimates with 95% CIs. For accuracy, sensitivity, and specificity, 95% CIs were calculated assuming a Gaussian distribution of the proportion. For AUCs, 95% CIs were derived through resampling with the bootstrap percentile method with 2000 repetitions. Model comparisons were made based on the 95% CIs of the 4 performance metrics.
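Both interval types can be sketched in a few lines. The normal-approximation interval follows the Gaussian assumption stated above; the bootstrap uses simple resampling with replacement and the percentile method, which may differ in detail from the defaults of the R package used in the study. The count of correct predictions, the seed, and the toy data are hypothetical.

```python
import math
import random


def wald_ci(successes, total, z=1.96):
    """95% CI for a proportion under a normal (Gaussian) approximation."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


def auc(y_true, scores):
    """Share of positive-negative pairs ranked correctly (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, scores, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample cases, recompute AUC, take quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:                      # both classes must be present
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[round((alpha / 2) * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical: 3692 correct predictions among the 5057 test participants
print(wald_ci(3692, 5057))

# Toy labels and scores, not study data
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
print(bootstrap_auc_ci(y_true, scores))
```

Comparing models by whether such intervals overlap is a conservative, informal check rather than a formal hypothesis test of the difference between models.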

Study Characteristics
The analysis included 16,860 participants, of whom 41% (n=6904) were male and 41.8% (n=7048) had overweight or obese status. The male participants were significantly older, and the distributions for ethnicity, education level, marital status, occupation, individual monthly income, and obesity status were also significantly different by sex (P<.001 for all; Multimedia Appendix 3).
Table 1 presents the predictive performance of the ML and logistic regression models. Among the 4 models, the RF and LR models had lower sensitivity but higher specificity. While XGBoost exhibited the best mean accuracy and AUC, overall accuracy was similar across all models based on the 95% CIs. The ROCs of the 4 models are illustrated in Figure 2.
[...] sensitivity than the models for male participants. Overall accuracy and AUC were similar across all 4 models, with the 2 algorithms showing no sex-specific differences in predictive performance.

Model Comparisons
The ranking of the most important predictors of the models is summarized in Figure 3. In order of importance, the top 4 predictor variables for the XGBoost ML model were weight satisfaction, ethnicity, age, and gender. For the LR model, the top predictor variables were weight satisfaction, physical health, age, and diet satisfaction.
Table 2. AUC: area under the curve.

Principal Results
This study applied various ML models and compared their performance to the performance of a conventional logistic regression model in predicting overweight or obesity status among working adults in Malaysia. Our results showed that ML and logistic regression had similarly acceptable or excellent predictive performance, as assessed by the metrics of accuracy (values ranged from 70% to 75%) and AUC (values ranged from 0.78 to 0.81), for both the overall and sex-specific models.

Comparison With Prior Work
Our findings, based on data collected annually as part of a large-scale online survey of employees, compare favorably to those of a recent study by Thamrin et al [23] that also used a large Southeast Asian sample (N=618,898), in Indonesia. That study employed logistic regression, classification and regression trees, and a naive Bayes classifier for obesity prediction based on data for sociodemographic characteristics, diet, physical activity, lifestyle behaviors, and health status from the Indonesian Basic Health Research periodic survey. The study reported accuracy between 70.8% and 72.2% and an AUC between 0.75 and 0.80, which is comparable to the performance of our models (mean accuracy 71%-73.3% and AUC 0.78-0.81). While there is no definite standard for acceptable accuracy, the models in our study recorded accuracy greater than 70% and AUC greater than 0.7, which is better than the accuracy and AUC of past models that used novel predictors, including genetics [20,24], detailed dietary intake [18,21], and objectively measured physical activity [15].
In this study, the overall performance of the ML models, namely XGBoost, RF, and SVM, was found to be similar to logistic regression, as indicated by the overlapping 95% CIs. This corroborates the findings of a systematic review of 71 studies, which concluded that ML did not offer greater performance benefits than logistic regression for clinical prediction models [43]. Specifically, for obesity prediction, Ferdowsy et al [17] employed 8 algorithms, in addition to logistic regression, in a data set that included 21 well-established risk factors for obesity, such as diet, physical activity, lifestyle behaviors, and disease history. Their study recorded the highest accuracy (97%) with the logistic regression model, which outperformed ML algorithms including k-nearest neighbor, RF, a multilayer perceptron, an SVM, a naive Bayes classifier, adaptive boosting, a decision tree, and a gradient boosting classifier for obesity prediction [17]. Kim et al [18] modeled the effects of 7 dietary factors on overweight or obesity status using data from the Korea National Health and Nutrition Examination Survey. That study showed that the predictive accuracy of logistic regression (0.62486) was higher than that of decision trees (0.54026) and similar to that of a deep neural network model (0.62496). Taken together, comparative studies that deal with a small number of strong predictor variables [15,17,18,23] suggest that regression models are likely to perform as well as, if not better than, ML models in obesity prediction.
Another possible reason for this similar performance is that the observed relationships among the significant predictors of obesity in this sample may appear linear on the log-odds scale. Hence, logistic regression was not disadvantaged by assuming linearity in these predictors. In this study, we employed 3 nonlinear ML classifiers because many variables, including intrapersonal and socioeconomic factors that affect body weight, such as age, sex, and gender, are nonlinear in nature [44]. However, it could be hypothesized that these nonlinear ML algorithms may have been less proficient at modeling the present data set because the data mostly consisted of binary variables (120/165, 73%).
It is important to acknowledge that different ML algorithms may fit and perform differently when used with different data sets. Guided by previous findings that obesity determinants are different for men and women [19,45], we developed separate, sex-stratified models for overweight or obesity status prediction. However, the predictive accuracy of the sex-specific models was similar to the overall or combined models. This suggests that separate prediction models for each sex are not warranted in this Malaysian adult population.
In terms of predictor variables, weight satisfaction appears to be a consistent, novel predictor in all predictive models, together with such well-established risk factors for obesity as ethnicity, age, and gender. Weight satisfaction is an attitudinal component of body image, which reflects individuals' feelings and thoughts about their weight [46]. The variable "weight_satisfaction_1," which represents satisfaction or contentment with current body weight, appears to have had the most influential power in the trained model to predict overweight or obesity status (Figure 3). This novel finding is consistent with previous studies showing that self-perception of body weight is an important determinant of weight management behaviors and lifestyle practices [47]. However, the relationships between weight satisfaction and weight-related behaviors are complex and multifaceted. Depending on sex, race, ethnicity, accuracy of weight perceptions, and psychological factors, weight satisfaction may promote positive diet and physical activity behaviors or lead to maladaptive or unhealthy weight-control or dieting behaviors [48][49][50]. As weight satisfaction and dissatisfaction appear to be mostly stable in adulthood [51,52], we posit that this subjective variable may be cognitively easier and more reliable to report than body weight and height among adults. This finding supports the usefulness of including weight satisfaction as a proxy for actual weight status in studies and e-surveys, where anthropometry measurements may not be available or feasible.

Strengths and Limitations
To the best of our knowledge, this is the first study to employ ML to predict overweight or obesity status in an adult working population in Malaysia. This study used rich data from a large annual survey that included a wide, multi-domain set of predictor variables in working adults with a broad range of ages (18 to 88 years) and occupations. Another strength of the study lay in its employment of advanced ML classifiers with careful cross-validation (to avoid model overfitting) and parameter optimization. The variable importance technique afforded novel insights into significant factors that are correlates of overweight or obesity status in a Malaysian working population.
This study was also limited in several ways. First, the study findings do not allow inference of the temporality or causality of the observed predictor-obesity relationships due to the use of a cross-sectional design. However, the findings suggest putative variables that could be explored using novel model interpretation techniques such as Shapley additive explanations [53] and could be considered for further testing in longitudinal or trial settings. Second, mislabeling of obesity was likely, due to the reliance on self-reported body weight and height to derive BMI as a surrogate measure of general obesity. Notably, the prevalence of individuals with overweight or obesity in this study (7048/16,860, 41.8%) was lower than the national prevalence of 50.1% [1]. Such errors, or noise, may have reduced the performance of the models. Therefore, the current findings represent conservative estimates of predictive accuracy. Finally, we acknowledge that the generalizability of our models is limited, as validation was based on testing data that came from the same sample. Validating the models with an external data set would more closely approximate the real performance of the prediction models. Future work is needed to confirm the external validity and reproducibility of the models in other data sets, such as the Malaysia's Healthiest Workplace surveys from 2018 or later.

Conclusions
Using a multi-domain set of predictors from a large online employee survey, we constructed models that were able to predict overweight or obesity status in a Malaysian working population with moderate to high accuracy. Weight satisfaction was the most prominent factor, followed by ethnicity, age, and gender, in differentiating individuals with overweight or obese status. Among the 3 ML models (XGBoost, RF, and SVM), XGBoost had the highest accuracy and AUC, but the overall performance of all ML-based models was similar to the logistic regression model for obesity prediction.
This study is complementary to and extends the growing literature showing that ML may be used to predict overweight or obesity status based on online survey data with reasonable accuracy. Besides unveiling distinctive factors that influence weight status in this Asian population, this work also produced potential models or algorithms that can be used to screen for overweight or obesity status in community settings, especially when body weight and height data are not available. A natural progression of this study would be to test the performance of the produced models in an external data set to establish the external validity of the findings.