%0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e58097 %T Concordance Between Survey and Electronic Health Record Data in the COVID-19 Citizen Science Study: Retrospective Cohort Analysis %A Crull,Elizabeth %A O'Brien,Emily C %A Antiperovitch,Pavel %A Asfaw,Kirubel %A Beatty,Alexis L %A Djibo,Djeneba Audrey %A Kaul,Alan F %A Kornak,John %A Marcus,Gregory M %A Modrow,Madelaine Faulkner %A Olgin,Jeffrey E %A Orozco,Jaime %A Park,Soo %A Peyser,Noah %A Pletcher,Mark J %A Carton,Thomas W %K electronic health records %K self-report %K COVID-19 %K data accuracy %K data validation %K EHR %K cohort %K cohort analysis %K real-world data %K concordance %K internet-based %K portal %K participant %K report %K reported %D 2025 %7 28.7.2025 %9 %J JMIR Form Res %G English %X Background: Real-world data reported by patients and extracted from electronic health records (EHRs) are increasingly leveraged for research, policy, and clinical decision-making. However, the extent to which these 2 data sources agree with each other is not always clear. Objective: This study aimed to evaluate the concordance between variables reported by participants enrolled in an electronic cohort study and data available in their EHRs. Methods: Survey data from COVID-19 Citizen Science, an electronic cohort study, were linked to EHR data from 7 health systems, comprising 34,908 participants. Concordance was evaluated for demographics, chronic conditions, and COVID-19 characteristics. Overall agreement, sensitivity, specificity, positive predictive value, negative predictive value, and κ statistics with 95% CIs were calculated. Results: Of 34,017 participants with complete information, 62.3% (21,176/34,017) reported being female, and 62.4% (21,217/34,017) were female according to EHR data. The median age was 57 (IQR 42‐68) years. Overall, 81.6% (27,744/34,017) of participants reported being White, and 79.5% (27,054/34,017) were White according to EHR data. 
In addition, 9.2% (3,124/34,017) of participants reported being Hispanic, and 6.6% (2,249/34,017) were Hispanic according to EHR data. Statistically significant discordance between data sources was detected for all demographic characteristics (P<.05) except the female category (P=.57) and the American Indian and Alaska Native (P=.21) and “other” race categories (P=.33). Statistically significant discordance was detected for the 2 COVID-19 traits and all baseline medical conditions except diabetes (P=.17). The largest absolute difference between data sources was for COVID-19 vaccination, which was 48.4% according to the EHR and 97.4% according to participant report. Overall agreement was high for all demographic characteristics, although chance-corrected agreement (κ) and sensitivity were lower for the “other” race category (κ=0.31, sensitivity=26.6%), Hispanic ethnicity (κ=0.82, sensitivity=74%), and current smoker status (κ=0.54, sensitivity=49.4%). Specificity and negative predictive value (NPV) were higher than the corresponding sensitivity and positive predictive value (PPV) for all baseline medical conditions. Sleep apnea had the highest sensitivity of all medical conditions (83.5%), and anemia had the lowest (32.8%). Chance-corrected agreement (κ) was highly variable for baseline medical conditions, ranging from 0.26 for anemia to 0.71 for diabetes. Overall and chance-corrected agreement between data sources for the COVID-19 traits, infection (84.6%, κ=0.34) and vaccination (51.0%, κ=0.05), was lower than for all other evaluated traits. The sensitivity for COVID-19 infection was 32.2%, and the sensitivity for COVID-19 vaccination was 49.7%. Although PPV for COVID-19 vaccination was 99.9%, the NPV was 5%. 
Conclusions: Results suggest the need for improvements to point-of-care capture of patient demographic traits and COVID-19 infection and vaccination history, patient education about their medical conditions, and linkage to external data sources in EHR-only pragmatic research. Further, these results indicate that additional work is required to integrate and prioritize participant-reported data in pragmatic research. Trial Registration: ClinicalTrials.gov NCT05548803; https://clinicaltrials.gov/study/NCT05548803 %R 10.2196/58097 %U https://formative.jmir.org/2025/1/e58097 %U https://doi.org/10.2196/58097