Heart Rate Measurement Accuracy of Fitbit Charge 4 and Samsung Galaxy Watch Active2: Device Evaluation Study

Background: Fitness trackers and smart watches are frequently used to collect data in longitudinal medical studies. They allow continuous recording in real-life settings, potentially revealing previously uncaptured variabilities of biophysiological parameters and diseases. Adequate device accuracy is a prerequisite for meaningful research.
Objective: This study aims to assess the heart rate recording accuracy in two previously unvalidated devices: Fitbit Charge 4 and Samsung Galaxy Watch Active2.
Methods: Participants performed a study protocol comprising 5 resting and sedentary, 2 low-intensity, and 3 high-intensity exercise phases, lasting an average of 19 minutes 27 seconds. Participants wore two wearables simultaneously during all activities: Fitbit Charge 4 and Samsung Galaxy Watch Active2. Reference heart rate data were recorded using a medically certified Holter electrocardiogram. The data of the reference and evaluated devices were synchronized and compared at 1-second intervals. The mean, mean absolute error, mean absolute percentage error, Lin concordance correlation coefficient, Pearson correlation coefficient, and Bland-Altman plots were analyzed.
Results: A total of 23 healthy adults (mean age 24.2, SD 4.6 years) participated in our study. Overall, and across all activities, the Fitbit Charge 4 slightly underestimated the heart rate, whereas the Samsung Galaxy Watch Active2 overestimated it (−1.66 beats per minute [bpm]/3.84 bpm). The Fitbit Charge 4 achieved a lower mean absolute error during resting and sedentary activities (seated rest: 7.8 vs 9.4; typing: 8.1 vs 11.6; laying down [left]: 7.2 vs 9.4; laying down [back]: 6.0 vs 8.6; and walking slowly: 6.8 vs 7.7 bpm), whereas the Samsung Galaxy Watch Active2 performed better during and after low- and high-intensity activities (standing up: 12.3 vs 9.0; walking fast: 6.1 vs 5.8; stairs: 8.8 vs 6.9; squats: 15.7 vs 6.1; resting: 9.6 vs 5.6 bpm).
Conclusions: Device accuracy varied with activity. Overall, both devices achieved a mean absolute percentage error of just <10%. Thus, they were considered to produce valid results based on the limits established by previous work in the field. Neither device reached sufficient accuracy during seated rest or keyboard typing. Thus, both devices may be eligible for use in respective studies; however, researchers should consider their individual study requirements.


Introduction
Background Wearables such as smart watches and fitness trackers enable data recording in real-life settings, where biomedical signals cannot be easily captured with conventional or clinical devices. They can provide unobtrusive, economic, high-resolution, longitudinal recording capabilities for various signals, including accelerometer and photoplethysmogram (PPG) data [1,2]. This makes them particularly interesting for use in longitudinal medical studies and biomedical research. Observational studies especially benefit from the unobtrusive and longitudinal recording characteristics of fitness trackers and wearables across a variety of medical disciplines [3][4][5][6][7][8][9]. In addition to clinical studies, fitness trackers are also usable in other applications. These include activity feedback, activity promotion, weight management, disease monitoring, disease diagnostics, stress and sleep monitoring, and health care surveillance [10][11][12][13][14][15].
An important prerequisite for the use of wearables in studies and connected applications is adequate accuracy and, thus, data quality. Without sufficient validation, the reliability of the recorded data is unknown, which is the case for many modern end consumer devices. Thus, stringent upfront validation is a necessity not only for meaningful research but also for prospective applications.

Related Work
The Fitbit Charge HR series (Fitbit) is one of the most frequently validated devices. The study by Lee [16] reviewed the first device generation with 10 college students under free-living conditions. Each participant was asked to carry out normal daily activities for 8 hours, and the heart rate (HR) was recorded and evaluated every minute against a Polar HR chest strap monitor. The study concluded that the device was not accurate, with a mean absolute percentage error (MAPE) of 9.17% (SD 10.9%) when worn on the nondominant hand. Brazendale et al [17] evaluated the HR measurements of 39 children. The evaluation was performed on a per-minute basis, and the MAPE was reported as 6.9%. Thus, the authors concluded that wearable fitness trackers provide HR measurements comparable with a criterion field-based measure.
The data of 50 intensive care unit patients monitored over 24 hours were used for evaluation by Kroll et al [15]. They recorded HR values every 5 minutes and identified a median difference of 1 beat per minute (bpm) between the derived HR of the fitness tracker and the electrocardiogram (ECG)-derived HR.
A higher comparison frequency was chosen by Jo et al [18], who measured and compared the HR every second. A total of 24 participants completed a 77-minute protocol comprising several activities, including cycling, walking, jogging, running, and other sports exercises. A 12-lead ECG served as the criterion device. The authors reported a mean bias of −8.8 bpm and concluded that the device by Fitbit does not satisfy the validity criteria, particularly during higher exercise intensities.
The second release of the Fitbit Charge HR series was evaluated by Reddy et al [19], Thomson et al [20], and Benedetto et al [21], yielding different results on the device accuracy. To the best of our knowledge, the only validation study for Fitbit Charge 3 was performed by Muggeridge et al [22], who stated that the Fitbit Charge 3 performed well only during resting and walking-like conditions but otherwise assessed the accuracy to be overall poor.
To the best of our knowledge, no validation studies on the Samsung Galaxy Watch Active series exist as of today. However, other Samsung smart watches have been validated in the past. The measurements of the Samsung Gear S were investigated by Wallen et al [23], with 22 participants in rest, walking, running, and cycling. Out of a total of 4 devices, Samsung Gear S demonstrated the greatest variability in HR measurements. Shcherbina et al [24] examined the Samsung Gear S2 among 6 other devices with 60 participants from diverse backgrounds, performing a range of activities, including sitting, walking, running, and cycling. Of the validated devices in the study, Samsung Gear S2 showed the highest overall error, particularly during sitting. In another study by El-Amrawy and Nounou [25], the device also showed the lowest accuracy compared with 17 other devices and a clinical pulse oximeter as a criterion device.

Objective
The validation study presented here was conducted as an initial groundwork for a large-scale observational study in obstetrics. As a pilot study, we aim to assess the performance of the selected devices in a healthy population to initially determine eligibility for longitudinal medical studies in general. We were particularly interested in the performance and accuracy of HR measurements, which are analyzed in detail in the following sections. This work is the first to validate the Fitbit Charge 4 and Samsung Galaxy Watch Active2 (Samsung Group).

Overview
Details on the participants, experimental procedure, used devices, validation metrics, processing, and evaluation are outlined in the following sections. Where applicable and possible, we adhered to several common grounds, guidelines, and best practices for wearable HR validation, which have been published in the more recent past [2,12,26].

Ethics Approval
The study was approved by the ethics committee of Friedrich-Alexander Universität Erlangen-Nürnberg (106_13 B). The participants provided informed consent to participate.

Recruitment
Recruitment was conducted via mailing lists and direct contact. We were unable to perform a power calculation for sample size estimation as the selected devices have not been investigated in the past, and thus, no information on effect sizes or variances was available. Consequently, we aimed at a sample size of approximately 20 to 25 participants, which is in line with previous HR evaluation studies [18,19,22,27,28]. The exclusion criterion was a major underlying medical condition affecting the participants' physical capability or increasing the risk of injury. Eligibility was assessed using the Physical Activity Readiness Questionnaire [29]. Ultimately, 23 participants were recruited.

Devices and Gold Standard
We aimed to investigate the accuracy of Fitbit Charge 4 and Samsung Galaxy Watch Active2. Fitbit Charge 4 was released in March 2020. According to the manufacturer, it has a battery runtime of up to 7 days, which makes it particularly interesting for longitudinal studies [30].
The Galaxy Watch Active2 is a smart watch running the Tizen OS operating system. Its PPG sensor uses 8 photodiodes [31]. The smart watch is available in several sizes and editions; our study used the 40-mm version without Long-Term Evolution (LTE) connectivity. Furthermore, the device is able to record ECGs. It is also possible to derive blood pressure measurements through the PPG sensor; this requires an upfront calibration against a blood pressure cuff, which is recommended to be repeated every 4 weeks [32,33]. Tizen OS is extendable, and documentation of several application programming interfaces (APIs) for creating custom applications and interacting with the device is available on the web [34]. The available functions are not limited to reading HR and RR intervals but also allow raw data access to nearly all built-in sensors, particularly the PPG sensor. These features make the device interesting for medical studies, as researchers can apply their own data processing algorithms.
A Mind Media NeXus-10 MKI (Mind Media BV) was used as the gold standard. The ECG Holter device is a certified class IIa medical device (EU). Data are transferred in real time via Bluetooth to a computer running the manufacturer-supplied Biotrace+ software (Mind Media) [35], which displays and allows the export of HR, heart rate variability, and ECG data.

Study Procedure
The study was conducted in an indoor laboratory environment on 7 different days between July 30, 2020, and September 21, 2020. As the data recording was conducted during summer without air conditioning, ambient temperatures were comparably high for Northern Bavaria, causing sweaty skin surfaces in some cases. Sweat can induce additional noise, cause electrode detachment, or affect the PPG signal measurement of wearable devices.
After receiving information on the study procedure and aims, participants filled out the activity readiness questionnaire and the respective consent forms. Participants were then supplied with the 2 wearable devices and were asked to place 1 device on each arm, ensuring that the sensor was in good contact with the skin and that the devices were fitted comfortably. The study adviser determined which device should be placed on which arm, and placement was balanced between the left and right arms across all participants. As the Fitbit app has a setting to specify whether the device is worn on the dominant or nondominant arm, this setting was configured accordingly by the study adviser based on the participant information. No such setting exists for the Galaxy Watch Active2. Subsequently, the electrodes of the Mind Media NeXus-10 MKI were placed in a lead II position. To reduce noise, electrodes were placed on the torso, not the extremities. As the Holter device was equipped with a handbag-like body strap, it was hung over the shoulder of participants to increase freedom of movement. This positioning method was supported by the manufacturer. Recording started at least 30 seconds after placement of the electrodes, ensuring sufficient time for adaptation of both the wearables and the ECG algorithms.
The participants conducted an experimental protocol covering 10 consecutive tasks. Each task lasted between 1 and 2 minutes. The protocol was designed for a total length of 15 minutes. The chosen activities originate from activity recommendations for pregnant women, who are the prospective target group in our anticipated larger study. Participants were asked to conduct activities at their own pace to resemble activities as they would be conducted under free-living conditions by the target group. With transitions between the individual study protocol phases, the recordings had an average duration of 19:03 minutes. We tried to minimize transition or relaxation phases between activities to ensure that the respective HR levels were similar between adjacent activities. If minor slack times (usually <10 seconds) occurred between activities (eg, because of instructions by the study adviser or a move of position between activities), these slack times were not included in the individual activity analysis. Our overall goal was to start with resting and sedentary activities (seated rest, typing, laying down [left], and laying down [back]), then continually increase HR using low-intensity (standing up and walking at a slow pace) and high-intensity (walking at brisk pace, climbing stairs, and squat workout) activities. The full list of activities and tasks is presented in Table 1.

[Table 1. Activities and tasks of the study protocol; only the entry for the final resting phase is recoverable.]
• Task: Sit down directly after the workout, relax your breathing, and remain without motion.
• Aim: Assess drastic changes in heart rate from high activity to rest.

Fitbit Charge 4
To increase the sampling frequency of the Fitbit Charge 4, the device was set to the training mode before the first study run. This produced a HR measurement every 1 to 5 seconds, thus resulting in a sampling frequency between 0.2 and 1 Hz. As the fitness tracker is linked to a user account in Fitbit's cloud, the data were accessed through the Fitbit Web API using representational state transfer queries and the Postman software.
The API only provides access to HR measurements, and no PPG raw data or RR intervals are provided.
To compare data on a per-second basis, the HR values required upsampling. When not provided with a HR measurement every second, missing values were imputed using the next available HR value.
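The imputation described above corresponds to a backward fill. The following is a minimal Python sketch of that idea, not the authors' implementation; the function name `upsample_next_value` and the example readings are hypothetical, and time stamps are assumed to be whole seconds since the start of the recording.

```python
from typing import Dict, List, Optional


def upsample_next_value(samples: Dict[int, float], start: int, end: int) -> List[Optional[float]]:
    """Build a 1 Hz series for the seconds start..end (inclusive).

    Seconds without a measurement take the next available value
    (backward fill). Seconds after the last measurement stay None.
    """
    series: List[Optional[float]] = []
    next_value: Optional[float] = None
    # Walk backwards so the most recently seen value is always the "next" one.
    for second in range(end, start - 1, -1):
        if second in samples:
            next_value = samples[second]
        series.append(next_value)
    series.reverse()
    return series


# Hypothetical readings: second since recording start -> HR in bpm
readings = {0: 72.0, 3: 74.0, 5: 75.0}
filled = upsample_next_value(readings, 0, 6)
print(filled)
```

Seconds 1 and 2 take the value measured at second 3, and second 4 takes the value from second 5; the final second has no later measurement and is left empty.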

Samsung Galaxy Watch Active2
A custom application for Samsung's Tizen OS was developed. The Human Activity Monitor API was used to retrieve HR and RR intervals. All retrieved data were saved in JSON format to files and downloaded to a computer. The Human Activity Monitor API provides HR and RR interval data with a sampling rate of 25 Hz (the callback function passed to the humanactivitymonitor.start function is called every 40 milliseconds). However, the data are inconsistent: the HR changes more frequently than physiologically explainable; that is, the API provides up to 5 HR changes (from 86 to 85 to 86 to 85 to 84) within a time frame as small as 300 milliseconds. At the same time, the reported RR interval occasionally remains unchanged over periods >10 seconds. Thus, we decided to sample the HR and RR interval data at 1 Hz. A minority of the data was sampled at a higher frequency and manually downsampled to 1 Hz.
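The text does not state how the higher-frequency recordings were reduced to 1 Hz. As one plausible sketch in Python, the last record within each whole second can be kept. The field names `timestamp_ms`, `hr`, and `rr` are assumptions made for this illustration and are not taken from the Tizen API or from the study's files.

```python
import json
from typing import Dict, List


def downsample_to_1hz(records: List[dict]) -> List[dict]:
    """Keep the last record observed within each whole second."""
    latest: Dict[int, dict] = {}
    for rec in sorted(records, key=lambda r: r["timestamp_ms"]):
        second = rec["timestamp_ms"] // 1000
        latest[second] = rec  # later records in the same second overwrite earlier ones
    return [
        {"second": s, "hr": latest[s]["hr"], "rr": latest[s]["rr"]}
        for s in sorted(latest)
    ]


# Hypothetical excerpt of a saved JSON file (assumed layout)
raw = json.loads("""[
  {"timestamp_ms": 0,    "hr": 86, "rr": 700},
  {"timestamp_ms": 40,   "hr": 85, "rr": 700},
  {"timestamp_ms": 960,  "hr": 86, "rr": 700},
  {"timestamp_ms": 1000, "hr": 85, "rr": 705},
  {"timestamp_ms": 1960, "hr": 84, "rr": 705}
]""")
one_hz = downsample_to_1hz(raw)
print(one_hz)
```

Averaging all values within a second would be an equally defensible choice; taking the last value simply avoids producing HR values that the device never reported.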

Mind Media NeXus-10 MKI
The criterion device recorded ECG data at a sampling frequency of 256 Hz. Furthermore, it contained internal peak detection algorithms, also providing derived HR and RR interval data at 32 Hz. As stated before, data were transferred via Bluetooth from the Holter device to a computer running Mind Media Biotrace+. The data were then exported from the Mind Media Biotrace+ software as a CSV file. Activity sections were recorded and annotated by the study adviser during the study execution in software running on a laptop computer.
Owing to the nature of the study protocol, some activities were prone to noise. Particularly during squats and stair climbing, ECGs were sometimes noisy, and the manufacturer-supplied software was apparently unable to correctly identify R peaks, resulting in erratic and evidently wrong HR and RR interval data.
To cope with this issue, the raw criterion ECG was reprocessed in Python using the ECG function from the BioSPPy library [36]. Subsequently, the detected R peaks were manually reviewed and corrected by a human annotator. We then used a self-developed function to extract the HR from the RR intervals.
Finally, the data were downsampled to 1 Hz. If >1 HR value occurred within 1 second (ie, when the HR was >60 bpm), the respective values were averaged.
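The self-developed function is not reproduced in the text. The Python sketch below illustrates the general principle under stated assumptions: R-peak times are given in seconds, each RR interval yields HR = 60 / RR in bpm, and each value is assigned to the second in which the closing R peak occurred. The function name `hr_from_r_peaks` and the peak times are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def hr_from_r_peaks(peak_times_s: List[float]) -> Dict[int, float]:
    """Derive HR (bpm) from R-peak times and average it per second.

    Seconds in which no R peak closes an interval (HR below 60 bpm)
    are absent from the result and would need separate handling.
    """
    per_second: Dict[int, List[float]] = defaultdict(list)
    for previous, current in zip(peak_times_s, peak_times_s[1:]):
        rr = current - previous
        if rr <= 0:
            continue  # skip duplicated or unsorted peaks
        per_second[int(current)].append(60.0 / rr)
    # Average all HR values that fall into the same second
    return {sec: sum(vals) / len(vals) for sec, vals in sorted(per_second.items())}


# Hypothetical, manually corrected R-peak times in seconds
peaks = [0.0, 0.5, 1.0, 1.4, 1.9]
hr_1hz = hr_from_r_peaks(peaks)
print(hr_1hz)
```

In this example, second 1 contains 3 intervals (0.5, 0.4, and 0.5 seconds), so its HR is the mean of 120, 150, and 120 bpm.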

Data Exclusion
Although manual data processing was applied to ensure high data quality, some recorded criterion ECGs were too noisy and unusable for comparison. Data were excluded if adequate criterion device recordings were unavailable but not if the measured data of the examined devices were evidently incorrect, as this situation could also appear in real-life use.
As a result, data of 3 participants (IDs 6, 8, and 23) had to be completely excluded. In addition, data of 2 participants (IDs 5 and 7) were excluded for the squat and stair-climbing activities. After the squat activity, the electrodes of 2 participants (IDs 2 and 7) detached, and thus, no subsequent data were available. A detailed overview of the conducted manual data correction and excluded activities of individual participants is provided in Table 2.

Data Synchronization
As the software of the validated fitness trackers is mostly closed source, their exact time measurement and determination mechanism are unknown. Furthermore, the on-device signal processing may cause additional delays. Therefore, we did not rely on exact time stamps for device synchronization but instead used another synchronization technique.
Synchronization of the signals was performed on the previously downsampled signals of all 3 devices (1 Hz, ie, 1 HR value per second). We synchronized each validated device with our HR reference by determining the time shift that maximized the Pearson correlation coefficient (PCC) between the 2 signals. The measurements were then shifted by the respectively determined time delays. This provided very similar and, in many cases, equal results to a shift through cross-correlation but showed better visual and metric results in a minority of edge cases.
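A shift search of this kind can be sketched in Python as follows. This is an illustration only: the search window (`max_lag`), the minimum overlap, and the synthetic signals are assumptions, as the text does not report the range of shifts that was examined.

```python
import random
from math import sqrt
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant series
    return sxy / sqrt(sxx * syy)


def best_lag(reference: Sequence[float], device: Sequence[float],
             max_lag: int = 30, min_overlap: int = 10) -> Tuple[int, float]:
    """Return (lag, r) such that device[t + lag] best matches reference[t].

    A positive lag means the device signal trails the reference.
    """
    best_shift, best_r = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        ref = reference[max(0, -lag):]
        dev = device[max(0, lag):]
        n = min(len(ref), len(dev))
        if n < min_overlap:
            continue
        r = pearson(ref[:n], dev[:n])
        if r > best_r:
            best_shift, best_r = lag, r
    return best_shift, best_r


# Synthetic 1 Hz signals: the device trails the reference by 3 seconds
rng = random.Random(0)
reference = [70 + rng.uniform(-10, 10) for _ in range(120)]
device = [reference[0]] * 3 + reference[:-3]
lag, r = best_lag(reference, device, max_lag=10)
print(lag, round(r, 3))
```

After the search, the device series would be shifted by the returned lag before any error metric is computed.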
Absolute error analysis was conducted using the mean absolute error (MAE) and MAPE as key metrics. We defined MAE as the average absolute distance between the HR of the validated device and the criterion device. MAPE is the mean absolute difference between the device and reference values, expressed as a percentage of the reference value. The limits of agreement and mean error (bias) were derived from Bland-Altman plots, which also visually aided in the interpretation of the results. Correlation analysis was performed using the Lin concordance correlation coefficient (CCC), as suggested by Sartor et al [12,41,42]. PCC was additionally reported for completeness but not analyzed.
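The metrics named above can be computed as in the following Python sketch. It assumes that bias is defined as device minus reference (consistent with the negative bias reported for underestimation), that the limits of agreement are bias ± 1.96 SD of the differences, and that the CCC uses population (1/n) moments; the text does not spell out these details, and the HR values are made up for illustration.

```python
from math import sqrt
from statistics import mean
from typing import Dict, Sequence


def agreement_metrics(reference: Sequence[float], device: Sequence[float]) -> Dict[str, float]:
    """MAE, MAPE, Bland-Altman bias and limits of agreement, and Lin CCC."""
    n = len(reference)
    diffs = [d - r for r, d in zip(reference, device)]  # device minus reference

    mae = mean(abs(e) for e in diffs)
    mape = 100 * mean(abs(e) / r for e, r in zip(diffs, reference))

    bias = mean(diffs)
    sd = sqrt(sum((e - bias) ** 2 for e in diffs) / (n - 1))  # sample SD of differences

    mr, md = mean(reference), mean(device)
    var_r = sum((r - mr) ** 2 for r in reference) / n
    var_d = sum((d - md) ** 2 for d in device) / n
    cov = sum((r - mr) * (d - md) for r, d in zip(reference, device)) / n
    ccc = 2 * cov / (var_r + var_d + (mr - md) ** 2)

    return {
        "mae": mae,
        "mape": mape,
        "bias": bias,
        "loa_lower": bias - 1.96 * sd,
        "loa_upper": bias + 1.96 * sd,
        "ccc": ccc,
    }


# Made-up, already synchronized 1 Hz HR values in bpm
reference_hr = [60, 70, 80, 90]
device_hr = [62, 69, 83, 88]
m = agreement_metrics(reference_hr, device_hr)
print({k: round(v, 3) for k, v in m.items()})
```

Unlike the PCC, the CCC penalizes a constant offset between device and reference through the squared mean difference in the denominator, which is why it is the more suitable agreement measure here.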

Participants
In total, 23 healthy individuals participated in the study (n=10, 43% women and n=13, 57% men). The demographics and details of the participants are shown in Table 3. Most participants were university students and staff members. Given the location of the university, Fitzpatrick skin type 2 was overrepresented (3×type 1, 15×type 2, 2×type 3, 1×type 4, and 2×type 5).

HR Measurement
The key results of this validation study are summarized in Table 4. In total and across the entire experiment duration (ie, all activities), both devices achieved very similar values for MAE, MAPE, and PCC. Although the Fitbit Charge 4 slightly underestimated the HR by −1.66 bpm (bias), the Samsung Galaxy Watch Active2 overestimated the HR by 3.84 bpm (bias).
In resting and sedentary activities (seated rest, typing, and laying down) and slow walking, the Fitbit Charge 4 achieved lower absolute and absolute percentage error rates. During standing up and all other physical activities, the Samsung Galaxy Watch Active2 outperformed the Fitbit Charge 4.
A particularly high bias (ie, mean difference) was observed for the Fitbit Charge 4 during standing up (−7.95 bpm) and squats (−12.52 bpm). The Samsung Galaxy Watch Active2's highest bias was measured during typing (8.63 bpm) and laying down on the left side (6.01 bpm).
The limits of agreement of the Samsung Galaxy Watch Active2 are particularly broad during activities 1 to 5. The cause was an absent or excessively high recorded HR in participant 20 during these activities, where the device recorded an average HR of 146, 148, 181, and 176 bpm. This HR trend is displayed in Figure 1. If this participant was excluded from the data analysis, the metrics improved drastically: MAE and MAPE were consistently lower than those of the Fitbit device for activities 1 to 5, and both the width of the limits of agreement and the bias were reduced substantially.
CCC was consistently higher in the Samsung Galaxy Watch Active2. Both devices achieved particularly low scores (<0.250) during typing and slow walking. The resting phase yielded the highest CCC of any individual activity for both devices.
Bland-Altman plots for both devices are shown in Figures 2 and 3. A large cluster of points in the top-right section of Figure 3 is particularly noticeable. These data points are a result of the previously mentioned mismeasurement of the Samsung Galaxy Watch Active2. If participant 20 is excluded from the data set, the respective cluster disappears from the plot.

Comparison With Previous Work
Our study aimed to evaluate 2 consumer wearable devices in healthy participants over a range of activities. The results from our study indicate that both devices achieved a MAPE <10%. Although no previous work exists for the Samsung Galaxy Watch Active2, our results are somewhat in line with previous validation trials for the Fitbit Charge series.
A previous evaluation of the Fitbit Charge 3 by Muggeridge et al [22] used a notably different experimental protocol, emphasizing strenuous activities (with a focus on treadmill running, sprinting, and cycling). The authors report an overall MAPE of 7.37% (as compared with 9.74% in our study) and note that the device underestimates the HR, with a bias of −7 bpm (here, −1.66 bpm). Overall, the study states that the Fitbit device performs poorly during high-intensity activities and results in a higher error in that area. In our study, the Fitbit device's mean bias was highest while climbing stairs and squatting, with a bias of −3.99 bpm and −12.52 bpm, respectively.
Among studies on the Fitbit Charge 2, underestimations of the HR have been reported several times [19,21,43]. The study by Baek et al [44] reported this underestimation only in the <100 bpm category and an overestimation in the >120 bpm category. With respect to the MAPE, the study by Reddy et al [19] reported a value of 11.33%, and the study by Nelson et al [43] reported a value of 5.96% across all activities. Our measured CCC of 0.805 across all activities is lower than the CCC of 0.906 reported by Nelson et al [43] over a 24-hour period.

Measurement Validity
Different definitions of validity exist in the literature. Some prior studies have used an error rate of ±5% as a limit, as it "approximates a widely accepted standard for statistical significance [...]" [24] and is "widely accepted" [19]. A limit of ±10% is established by various organizations and institutions and has been equally used by other validation studies [43]. The latter value is also proposed by the previously mentioned validation guidelines [12] and is thus used for reference in our work.
Similarly, different interpretations of correlation coefficients have been used in the literature. Because the definitions vary widely, with the threshold for a weak or poor correlation set at <0.2, <0.50, or <0.9 depending on the study [23,43,45], we refrain from using an exact definition.
In our study and across all activities, both devices achieved a MAPE <10% and, by this definition, produced valid results. With respect to individual activities, neither device produced valid results for seated rest and typing activities. Furthermore, the Fitbit Charge 4 did not record valid data for the standing up and squat activities, and the Samsung Galaxy Watch Active2 produced invalid results for laying down in either of the 2 evaluated positions.

Participants and Demographic Structure
Our study mainly included healthy young participants aged between 20 and 36 years. Wearables may provide different validation results for older participants, particularly with respect to their skin properties and changes in the PPG curve. Furthermore, as most of our participants were local university students in Central Europe, Fitzpatrick skin types 1 to 3 were overrepresented in our study.

Selected Activities
The overall duration of individual activities was rather short, mostly because our aim was to set a low burden for study participation. Although the HR of all participants increased during the study duration (especially during the second half of the study), some participants may require a longer activity duration for optimal HR adaptation. A shorter activity duration makes the collected data less meaningful and results in a lower number of recorded data points, thus reducing the statistical weight of the results.
As all activities were conducted consecutively and without breaks, splits between individual recorded activities always resulted in minor transitional phases. Some participants may have reacted faster to the instructions of the study adviser than others. This led to additional time slack between individual activities and may have caused a slight spillover between the metrics of 2 consecutive activities.

Laboratory Conditions and Environmental Factors
Although we aimed to replicate real-life activities as much as possible, our study was still conducted in a laboratory setting. Real-life use patterns may differ from those in our study and, as such, may have an impact on the accuracy of the investigated devices. Furthermore, our study was mostly conducted during warm summer days, and our laboratory was not equipped with air conditioning. Sweat is known to have an influence on ECG electrode conductance. It may also have an impact on PPG measurements by the examined wearable devices.

Data Annotation and Exclusion
Owing to various influencing factors (mainly ECG electrode loss, heat, selected activities, and other unknown skin factors), less data than anticipated were ultimately included in our study (20/23, 87% participants). A solid baseline (ground truth) was of the utmost importance in our study. Our manual data annotation of the criterion device data underlines this effort. As the annotation affects only the criterion device, it has no impact on the data recorded by the evaluated devices and, therefore, on future studies.
For participants 2, 5, and 7, only a subset of activities was included in our statistical analysis (Table 2). Although the respective individual activity metric averages reported in Table 4 do not include data for the respective activities, we did not exclude these individual participants for the overall metrics. This may lead to a minor bias toward resting and sedentary activities, as activities with higher physical intensity were more prone to noise and, thus, data exclusion. Metrics only show minor changes if the data of these participants are excluded from the overall metric. The overall Fitbit Charge 4 MAE changed from 8.589 to 8.614 upon exclusion, and the Samsung Galaxy Watch Active2 MAE increased from 8.13 to 8.429.
The inclusion of the data of participant 20 is debatable. A main argument for potential exclusion is that the data are clearly erroneous, and such data would likewise be excluded in a study setting. On the other hand, faulty recordings may also occur in real-life settings. Excluding the data would lead to a positive bias in favor of the Samsung Galaxy Watch Active2 and, thus, to a nonobjective comparison. Therefore, we decided to include these data.

Conclusions
We evaluated 2 previously unvalidated wearable devices by conducting a study featuring various activities and 23 participants. Throughout the entire experimental procedure, both devices achieved a MAPE just below 10% and thus presented acceptable HR measurement capabilities. The Fitbit Charge 4 outperformed the Samsung Galaxy Watch Active2 during resting and sedentary activities, and the Samsung device was more accurate during high-intensity activities. Neither device reached sufficient accuracy during seated rest and keyboard typing.
Our study was a precursor to a larger interdisciplinary study in obstetrics. Researchers should consider the intended use of wearable devices when reviewing validation studies and evaluate the respective findings against their full requirements. This applies not only to the experimental design but also to other aspects. Accuracy may not be the only decisive factor. Features such as raw data access, battery runtime, or additional sensors may be equally relevant for individual research.