Atrial Fibrillation Detection With an Analog Smartwatch: Prospective Clinical Study and Algorithm Validation

Background Atrial fibrillation affects approximately 4% of the world’s population and is one of the major causes of stroke, heart failure, sudden death, and cardiovascular morbidity. It can be difficult to diagnose when asymptomatic or in the paroxysmal stage, and its natural history is not well understood. New wearables and connected devices offer an opportunity to improve on this situation. Objective We aimed to validate an algorithm for the automatic detection of atrial fibrillation from a single-lead electrocardiogram taken with a smartwatch. Methods Eligible patients were recruited from 4 sites in Paris, France. Electrocardiograms (12-lead reference and single lead) were captured simultaneously. The electrocardiograms were reviewed by independent, blinded board-certified cardiologists. The sensitivity and specificity of the algorithm to detect atrial fibrillation and normal sinus rhythm were calculated. The quality of single-lead electrocardiograms (visibility and polarity of waves, interval durations, heart rate) was assessed in comparison with the gold standard (12-lead electrocardiogram). Results A total of 262 patients (atrial fibrillation: n=100, age: mean 74.3 years, SD 12.3; normal sinus rhythm: n=113, age: 61.8 years, SD 14.3; other arrhythmia: n=45, 66.9 years, SD 15.2; unreadable electrocardiograms: n=4) were included in the final analysis; 6.9% (18/262) were classified as Noise by the algorithm. Excluding other arrhythmias and Noise, the sensitivity for atrial fibrillation detection was 0.963 (95% CI lower bound 0.894), and the specificity was 1.000 (95% CI lower bound 0.967). Visibility and polarity accuracies were similar (1-lead electrocardiogram: P waves: 96.9%, QRS complexes: 99.2%, T waves: 91.2%; 12-lead electrocardiogram: P waves: 100%, QRS complexes: 98.8%, T waves: 99.5%). P-wave visibility accuracy was 99% (99/100) for patients with atrial fibrillation and 95.7% (155/162) for patients with normal sinus rhythm, other arrhythmias, and unreadable electrocardiograms. The absolute values of the mean differences in PR duration and QRS width were <3 ms, and more than 97% were <40 ms. The mean difference between the heart rates from the 1-lead electrocardiogram calculated by the algorithm and those calculated by cardiologists was 0.55 bpm. Conclusions The algorithm demonstrated great diagnostic performance for atrial fibrillation detection. The smartwatch’s single-lead electrocardiogram also demonstrated good quality for physician use in daily routine care. Trial Registration ClinicalTrials.gov NCT04351386; http://clinicaltrials.gov/ct2/show/NCT04351386


Introduction
Atrial fibrillation is the most common form of arrhythmia in the world-it affects 8 million people in Europe and 5 million people in the United States [1]. Experts believe that the number of patients with atrial fibrillation will increase in the next few years to one-quarter of middle-aged adults in the United States and in Europe [2,3]. Despite improvements in its management, atrial fibrillation remains one of the major causes of stroke, heart failure, sudden death, and cardiovascular morbidity in the world [3,4].
Atrial fibrillation is associated with a 5-fold increase in the risk of stroke [5][6][7][8][9]. Large randomized controlled trials [10,11] are underway to evaluate stroke prevention by using anticoagulation treatment in patients with subclinical atrial fibrillation. Asymptomatic patients represent 32.6% to 39.4% of patients in large international registries [12,13]. Thus subclinical atrial fibrillation could represent approximately one-third of the atrial fibrillation population [14] and is admittedly found frequently among older adults [15,16].
Traditionally, atrial fibrillation is diagnosed using an electrocardiogram (ECG), a Holter monitor worn for 24 to 48 hours, an event recorder monitoring heart activity for several weeks, or implanted pacemakers or defibrillators with an atrial lead. There is trade-off in efficacy of detection of paroxysmal atrial fibrillation between short-and long-term monitoring.
Given the increasing number of patients with asymptomatic atrial fibrillation, new simple and efficient diagnostic devices are important supplements to traditional methods to allow early diagnosis [17], screening [18], or management [19]. It is therefore crucial that these devices be evaluated in clinical studies in comparison with standard 12-lead electrocardiography [20].
We aimed to validate the diagnostic performance (ie, classification into atrial fibrillation or normal sinus rhythm) and safety of a single-channel ECG smartwatch (ScanWatch, Withings) and its associated software (Scan Monitor, Withings Inc), in the detection of atrial fibrillation in comparison with reference diagnoses made by independent blind cardiologists using simultaneously recorded 12-lead ECG.

Study Design and Population
We conducted a prospective nonrandomized open-label comparative multicenter study. Cardiology in-patients and out-patients were consecutively screened from 4 sites in Paris, France. Inclusion criteria were male or female patients aged 18 years or older with atrial fibrillation or sinus rhythm, recruited with a 1:1 ratio. Patients with pacemakers, who were physically incapable of wearing a watch on their wrist, with linguistic or mental incapacity that precluded signing a written informed consent form, or vulnerable individuals (as defined by French regulation) were not included. Eligible patients were informed and provided consent prior to any study-related procedure.

Data Collection
For each patient, simultaneous 30-second single-lead ECGs were recorded with ScanWatch with embedded software (Scan Monitor) and 12-lead ECGs recorded with Schiller Cardiovit FT1 electrocardiograph (CE marked and FDA cleared [21]).
ScanWatch uses 3 dry electrodes to record a 30-second single-lead ECG that is similar to lead I of a traditional 12-lead ECG. A real-time signal captured with the watch is streamed to a smartphone app (Health Mate, Withings; for Android and iOS), stored, exported in PDF format, and classified into 4 categories-atrial fibrillation, normal sinus rhythm, noise, or other-by the proprietary algorithm. The classification is performed on 30-second recordings using features related to the visibility of P waves and the irregularity of R-R intervals. Algorithm classifications were kept on Withings servers, and investigators were blinded from it. Data from the 12-lead ECG were exported using DICOM/HL7 ECG Waveform Export software (SemaServer, version 19.02; Schiller) in accordance with manufacturer's instructions. Patient information and data were collected and reported by site staff on the study case report form.

Data Review
The 12-lead reference ECGs and single-lead smartwatch ECGs were independently reviewed by blinded, board-certified cardiologists. Each recording was reviewed by 3 reviewers. The reviewers were instructed to classify each recording into one of the following categories: (1) normal sinus rhythm, (2) atrial fibrillation, (3) supraventricular tachycardia, (4) abnormal rhythm, such as frequent premature atrial contractions, frequent premature ventricular contractions, atrial flutter, ventricular tachycardia, ventricular fibrillation, second-degree atrioventricular block type I, second-degree atrioventricular block type II, third-degree atrioventricular block, and other, or (5) uninterpretable (ie, a classification cannot be made as the strip is not adequate for reading). If there were multiple rhythms, reviewers reported all the rhythms, then classified the recording in one of the above classes with justification. If there was a discrepancy between reviewers' classifications, the diagnosis of the majority was retained. For the 3 primary rhythms (normal sinus rhythm, atrial fibrillation, and supraventricular tachycardia), if classifications differed, the final classification was determined by consensus.
Reviewers' classifications were compared to the algorithm's automatic classification of the single-channel strips collected with the smartwatch to assess atrial fibrillation detection performance. The software classified the smartwatch strips as normal sinus rhythm, atrial fibrillation, Noise, or other arrhythmia (supraventricular tachycardia and other abnormal rhythm reference diagnoses were pooled).
To assess the quality of the single-lead ECG signals generated by the smartwatch and whether the smartwatch could be used by cardiologists in clinical practice, cardiologists assessed the visibility and polarity of P, QRS, and T waves; measured the durations of PR, QRS, and QT intervals; and measured heart rate. An additional reviewer selected a well-defined beat to be later used by the reviewers for secondary outcome measures.
Diagnoses made by the reviewers from the single-channel strips generated by the smartwatch were compared with those from the 12-lead ECG.

Statistical Analysis
The primary outcomes were sensitivity and specificity of atrial fibrillation detection. Sensitivity and specificity were calculated, first, by considering all available categories, and second, by excluding Other and Noise. The reason for the second calculation is that the device is primarily intended to discriminate between atrial fibrillation and normal sinus rhythm. Calculated sensitivities and specificities are reported with their lower confidence interval bounds; their positive and negative likelihood ratios were also calculated.
Continuous variables were expressed as mean and standard deviation, or median and range, while categorical variables were expressed as numbers and proportions. Exact 95% confidence intervals of proportions were calculated with the Clopper-Pearson method. Results for the 4 classes (atrial fibrillation, normal sinus rhythm, Noise, and Other) are presented with a 4×4 confusion matrix.
Baseline characteristics were compared using the t test for sample means for normal distributions, Mann-Whitney test for sample medians otherwise, and Fisher exact test for proportions. A Shapiro-Wilk test was used to test normality.
Cardiologist agreement was measured using Cohen κ and average accuracy in comparison with the consensus diagnosis.
Sample size was calculated for sensitivities and specificities >0.9 and a statistical power >90%. All statistical tests were 2-sided with a statistical significance threshold at .05. Analyses were performed with Python (version 3.6.8) scikit-learn (version 0.23.2), statsmodels (version 0.12.2), and scipy (version 1.5.2) toolkits.

Ethics
This study was conducted in compliance with Good Clinical Practice and 1964 Declaration of Helsinki and subsequent amendments [22]. The study was approved by the French Comité de Protection des PersonnesCPP Sud-Est IV (19.06.28.65727) and registered (ClinicalTrials.gov, NCT04351386).

Population Characteristics
Between December 2019 to April 2021, 283 patients were enrolled in the study; however, 10 patients prematurely discontinued their participation, and 11 patients were excluded, which resulted in an analysis set of 262 patients ( Figure 1). Patients characteristics (men: n=160, women: n=102; age: mean 67.7 years, SD 14.8, BMI: mean 27.5, SD 5.7 kg/m 2 ) are presented in Table 1.

Automatic Atrial Fibrillation Detection Performance
Supraventricular tachycardia was not diagnosed by any of the independent cardiologist reviewers using 12-lead ECGs. No consensus meetings were needed. The average accuracy of the 3 cardiologists was 0.92 (mean Cohen κ=0.88).
Four reference ECG were labeled as Noise, 2 of which were classified as Noise by ECG Monitor, and 2 as atrial fibrillation. ECG Monitor classified 6.9% (18/262) of the recordings performed with the watch as Noise, but none was from a patient with normal sinus rhythm (Table 2). When considering all 4 categories, the sensitivity to detect atrial fibrillation was 0.77 (95% CI lower bound 0.675), and the specificity was 0.965 (95% CI lower bound 0.912). Of the 113 normal sinus rhythm diagnoses based on the 12-lead reference ECG, only 1 (0.85%) was classified as Noise by the algorithm, and 13 (13.0%) patients diagnosed with atrial fibrillation based on the 12-lead reference ECG were classified as Noise by the algorithm. When excluding the categories other arrhythmia and Noise from the calculation, the sensitivity was 0.963 (95% CI lower bound 0.894), and the specificity was 1.000 (95% CI lower bound 0.967). Inconclusive measurements (Other and Noise) occurred more frequently in patients >65 years old (odds ratio [OR] 4.34, 95% CI 1.25-15.09; P=.02) and in patients with previously diagnosed atrial fibrillation (OR 3.75, 95% CI 1.43-9.87; P=.007). While not statistically significant, in-patients and patients with hypertension or valvular heart disease tended to have more inconclusive ECGs.

Diagnostic Accuracy
The sensitivity for detecting atrial fibrillation was 0.89; the specificity was 0.912 (Table 3). The average accuracy between the 3 cardiologists reading single-channel recordings from the smartwatch was 0.785, and the average Cohen κ was 0.675, which reflected strong agreement between reviewers.

Interval Durations
Except for QT intervals, mean difference absolute values for difference in PR duration and QRS width were less than 3 ms and more than 97% of these differences were less than 40 ms (Table 5).

Heart Rate
For reviewers' assessments of heart rate, 1-lead versus 12-lead ECG had the smallest difference (mean 0.03 bpm, SD 0.46). The largest difference was observed between the heart rate calculated by the algorithm compared with that by cardiologists based on 12-lead ECG (mean 0.63 bpm, SD 7.06). The mean difference between heart rates calculated by the algorithm and those by cardiologists based on 1-lead ECG was 0.55 bpm (SD 6.46) (Figure 2

Study Strengths
Biases were limited by recruiting consecutive patients, taking the recordings simultaneously, centralizing ECG review with independent blinded reviewers, and standardizing the review process. Participants were in-and out-patients from cardiology services with multiple comorbidities. None was familiar with the device beforehand.

Performance for Automatic Atrial Fibrillation Detection by Scan Monitor Software
Only 4 reference ECGs were deemed unreadable; this may occur for multiple reasons especially movements during the measurements.
The algorithm's ability to discriminate between atrial fibrillation and normal sinus rhythm was calculated both considering all 4 initial categories (sensitivity 0.770; specificity 0.965) and excluding Other and Noise signals (sensitivity 0.963; specificity 1.000).
Misclassifications for patients normal sinus rhythm or atrial fibrillation were very rare. In particular, the rate of false positives was 2.7% (3/113) for normal sinus rhythm, and no patients with atrial fibrillation were classified as patients with normal sinus rhythm by the algorithm. Most false negatives were misclassifications of other arrhythmias, while false positives were mostly misclassifications of patients with normal sinus rhythm or atrial fibrillation as Other. This is unfortunately associated with the algorithm development process being optimized to accurately identify atrial fibrillation and distinguish atrial fibrillation from normal sinus rhythm. Therefore, automatic classification reliability on patients with arrhythmias other than atrial fibrillation may be decreased.
These results are quite similar to those from studies on other wearable devices, such as the Apple Watch (Apple Inc) [23], the Kardia Band (AliveCor) [24], and devices by other manufacturers [25][26][27][28] ( Table 6). The Apple study recruited, by far, the most patients: almost twice the number included in our study and 4 times that in [24]. All the devices had high rates of noisy or poor-quality signals (6.9% to 16.6%). For data loss, there was only 1 case (1/283, 0.4%) with ScanWatch, none (0/169, 0%) in the Kardia band study [24], and 7.6% (46/602) in Apple Watch study. Moreover, ScanWatch and Kardia band classified 11.3% (24/213) and 17.2% (29/169), respectively, of atrial fibrillation and normal sinus rhythm signals as other arrhythmias (or as unclassified or inconclusive); Apple Watch had only 2.2% (13/602) of such errors, and consequently, had higher sensitivities and specificities than ScanWatch and Kardia band. Overall, Kardia band was less accurate than ScanWatch and Apple Watch, with ScanWatch being more accurate than the Apple Watch for detecting normal sinus rhythm.
Only sparse data were found for other manufacturers. For Samsung's ECG monitor, 16.8% of ECG recordings were considered either inconclusive or of poor quality [26], similar to Kardia band and Apple Watch; sensitivity was 98.1%, and specificity was 100%. Similarly, Fitbit ECG app performances were 98.7% and 100% for correctly detecting atrial fibrillation and normal sinus rhythm, respectively [28]. Similar results were published for Amazfit [27,29]. Table 6. Performances of commercially available devices with electrocardiogram (ECG) sensors.
In a registry-based study [30] conducted in 136 and 211 European cardiology centers, major ECG abnormalities were present in the majority of patients with heart failure. Out of 1460 patients with heart failure, 1222 had major ECG abnormalities with various patterns across the heart failure types [30], and the Euroheart Failure survey showed that ECG abnormalities were present in 98% of patients with heart failure [31]. Older age and being male were associated with an increased risk of ECG abnormalities, as were history of advanced heart valve disease, chronic kidney disease, signs of heart failure decompensation, and use of diuretics and anticoagulants [31].
ECG abnormalities, such as abnormal rhythm, PR duration >250 ms, QRS interval ≥120 ms, and pathological Q waves can be found in most patients with multiple severe underlying diseases and risks factors. Similar patterns were observed in our study (Table 7). Although patients' medications were not recorded in our study, specific heart failure ECG abnormality risk factors were recorded (sex, age, valvular heart disease, kidney disease). Patients with heart failure have multiple comorbidities that affect their ECGs, and such a population can present ECG abnormalities that are too complex for our algorithm-this represents an extreme worst-case scenario for the device's algorithm-these patients alone represent 21.4% (56/262) of the study cohort.

1-Lead ECG Quality Assessment
To the best of our knowledge, this study proposes for the first time several criteria for device signal quality assessment without bias (quantitative assessment) that are based on criteria taken from clinical practice.
In total, cardiologists declared 8.8% (23/262) of the signals as uninterpretable, and 2 patients with atrial fibrillation were classified as normal sinus rhythm (2/100). These misclassifications may be explained by artifacts. However in some cases (eg, paroxystic atrial fibrillation is more difficult to diagnose with a 30-second recording), a confirmatory ECG can improve the reliability of the diagnosis.
Accuracies in the assessment of T-wave, P-wave, and QRS-complex visibilities and polarities were high (over 96%, except for T wave: 91%). The accuracy of PR, QRS, QT interval durations was good. In particular, the standard deviation of the differences fell below 20 ms for PR and QRS time intervals, which is less than half the length of the smallest graduation (1 mm) on a standard paper ECG trace (40 ms). The standard deviation of QT interval difference was slightly higher (SD 26.3 ms).
Because Apple used a different method to assess ECG waveform quality, a comparison of ECG quality between the devices was not possible.
Comparison of cardiologist-measured heart rate on the single-lead and lead I of the 12-lead reference ECG, yielded a 0 bias and a standard deviation of the difference of 0.5 bpm.
The standard deviation between automatic heart rate calculation by ScanWatch and heart rate measurements by cardiologists on lead I of a reference ECG was 7 bpm. This difference is a consequence of the different calculation methods between reviewers (mean over 10 seconds) and the algorithm (median heart rate over 30 seconds). Given the mean heart rate (mean 82.35, SD 19.9 bpm), the mean error between the heart rate measured by the device and that measured by the reviewers on the reference ECG (mean 0.63, SD 7.09 bpm) was considered acceptable (the accuracy of the detected heart rate shall be ±10% or ±5/min, whichever is greater [32]).

Limitations
Despite good diagnostic performance in discriminating between atrial fibrillation and normal sinus rhythm, arrhythmias other than atrial fibrillation were poorly identified: only 51.1% (23/45) of these signals were correctly classified. This poor performance was expected since, by design, the algorithm was trained to specifically identify atrial fibrillation and normal sinus rhythm. Flutter signals were not negligible in our population (13/45) and were misclassified as atrial fibrillation by the algorithm: they are not easily identifiable on lead I, even for a cardiologist. Nevertheless, it is encouraging that 80.0% (36/45) of Other signals were not classified as normal sinus rhythm. Unfortunately, patients' medications at the time of the measurements were not recorded: the study lacks information about drugs that may affect cardiac rhythm.
The population enrolled in the study may not reflect the real-world use of the device: the study was conducted in cardiology services, but the device is intended for home use, some of the recordings were performed while patients were supine rather than seated, and no patients were familiar with the device before measurements, which may have increased the amount of noisy data collected.
Reduced accuracy of the machine learning classification algorithm in patients with multiple risk factors and comorbidities may be explained by the fact that data sets used to develop the algorithm did not include such specific populations. As we mentioned, these patients commonly present ECG abnormalities that may affect the algorithm's performance.
Finally, only a single measurement was recorded per patient in order to limit potential bias and test the device in the worst-case scenario. In real-use conditions, a diagnosis from a cardiologist may improve using a second recording, and recording quality may also improve with patient's practice.

Conclusions
ScanWatch, with its embedded software (Scan Monitor), was able to provide high-quality ECG traces that are adequate for clinical diagnosis of atrial fibrillation (ie, accurately discriminate between atrial fibrillation and normal sinus rhythm). Additional studies and additional machine learning work will be needed to increase the algorithm's performance in distinguishing atrial fibrillation from other arrhythmias.
Such a device offers clinicians the ability to remotely monitor patients at risk of atrial fibrillation in their daily practice with a simple, accurate, and noninvasive wearable device.