This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.
The use of web-based methods to collect population-based health behavior data has burgeoned over the past two decades. Researchers have used web-based platforms and research panels to study myriad topics. Cleaning data prior to statistical analysis is an important step for ensuring the integrity of web-based survey data. However, the data cleaning processes used by research teams are often not reported.
The objectives of this manuscript are to describe the use of a systematic approach to clean the data collected via a web-based platform from panelists and to share lessons learned with other research teams to promote high-quality data cleaning process improvements.
Data for this web-based survey study were collected from a research panel that is available for scientific and marketing research. Participants (N=4000) were panelists recruited either directly or through verified partners of the research panel, were aged 18 to 45 years, were living in the United States, had proficiency in the English language, and had access to the internet. Eligible participants completed a health behavior survey via Qualtrics. Informed by recommendations from the literature, our interdisciplinary research team developed and implemented a systematic and sequential plan to inform data cleaning processes. This included the following: (1) reviewing survey completion speed, (2) identifying consecutive responses, (3) identifying cases with contradictory responses, and (4) assessing the quality of open-ended responses. Implementation of these strategies is described in detail, and the Checklist for E-Survey Data Integrity is offered as a tool for other investigators.
Data cleaning procedures resulted in the removal of 1278 out of 4000 (31.95%) response records, which failed one or more data quality checks. First, approximately one-sixth of records (n=648, 16.20%) were removed because respondents completed the survey unrealistically quickly (ie, <10 minutes). Next, 7.30% (n=292) of records were removed because they contained evidence of consecutive responses. A total of 4.68% (n=187) of records were subsequently removed due to instances of conflicting responses. Finally, a total of 3.78% (n=151) of records were removed due to poor-quality open-ended responses. Thus, after these data cleaning steps, the final sample contained 2722 responses, representing 68.05% of the original sample.
Examining data integrity and promoting transparent reporting of data cleaning procedures are imperative for web-based survey research. Ensuring high data quality both prior to and following data collection is important. Our systematic approach helped eliminate records flagged as being of questionable quality. Data cleaning and management procedures should be reported more frequently, and systematic approaches should be adopted as standards of good practice in this type of research.
The use of web-based methods to collect population-based data has burgeoned over the past two decades [
Web-based platforms and research panels are useful tools for recruiting and collecting information from large participant samples in a relatively short amount of time. More recently, these have become an alternative method for data collection due to COVID-19 pandemic restrictions (eg, social distancing). For example, after the start of the COVID-19 pandemic, one company with a research panel that allows researchers to reach potential participants via a web-based platform reported a 400% increase in the number of researchers using their platform [
Only a few researchers have published their recommendations to improve the integrity of web-based survey data, and a combination of different strategies is advised [
In this paper, our study team shares our experiences in data cleaning to improve data quality and integrity from a web-based survey that recruited participants via a research panel. Our interdisciplinary team used a systematic and detailed data cleaning approach prior to the analyses. The goal for this paper is to describe our team’s process and to share lessons learned, including a checklist developed by the team that other research teams could use or adapt to guide their data cleaning process.
Data for this study were collected from panelists, either directly or via verified partners, of a research panel available for scientific and marketing research. The goal of our web-based survey was to examine human papillomavirus (HPV) and HPV vaccine knowledge, beliefs, attitudes, health care experiences, vaccine uptake, vaccination intentions, and other health behavior constructs, as well as information sources, preparedness for shared decision-making, and preferences that could help inform future HPV vaccination educational interventions for age-eligible individuals. Our interdisciplinary research team was composed of individuals with academic training in biostatistics, public health, nursing, psychology, epidemiology, and behavioral oncology.
Recruitment occurred from February 25 to March 24, 2021. The target sample for the study was 4000 individuals aged 18 to 45 years, stratified with equal recruitment by the cross-tabulation of age (18-26 years vs 27-45 years) and sex at birth (male vs female). Participation was limited to individuals who were panelists, directly or through verified partners of the research panel; were aged 18 to 45 years; were living in the United States; were proficient in the English language; and had access to the internet. Our interdisciplinary team aimed to recruit a sample that was representative of the geographic as well as racial and ethnic characteristics in the United States. Florida residents and those within our cancer center’s catchment were oversampled (ie, 500 Florida residents and 3500 residents of other states) with the aim of informing future research and outreach activities. The survey was pretested by three individuals who completed the paper-based version of the survey to estimate the completion time and provide feedback on survey item wording. Based on pretesting, it was estimated that the survey would take approximately 30 minutes to complete, depending on the sequential flow of the survey (ie, skip and contingency question patterns for some individuals).
The one-time survey was programmed in Qualtrics XM [
Survey schema illustrating branching logic, survey title, and number of items.
The Scientific Review Committee at Moffitt Cancer Center and the Institutional Review Board of record (Advara) reviewed the study and approved it as exempt (Pro00047536).
Data quality is often defined in relation to aspects such as accuracy, completeness, validity, and conformity [
To identify a range of survey completion durations, a group of six individuals completed the survey. As previously mentioned, prior to survey launch, three individuals who were naïve to the survey items completed paper-and-pencil drafts of the survey to evaluate how long it might take potential participants to complete the questions and to provide feedback on item wording. Based upon the amount of time it took these individuals to complete the survey, our team anticipated that it would take participants an average of 30 minutes to complete the survey. However, we recognized that it could take some respondents less or more time depending on their responses and the corresponding skip logic that was programmed into the survey. For example, an individual who responded that they had received the HPV vaccine would receive items relevant to prior vaccine receipt, whereas an individual who reported that they had not received the HPV vaccine would receive questions about intentions to receive the HPV vaccine. Similarly, respondents with children would receive additional questions about HPV vaccination intentions for their children, whereas childless respondents would not. Following finalization of the survey and the Qualtrics programming, an additional three team members completed the electronic (ie, Qualtrics) version of the survey.
Completion times of the Qualtrics-programmed survey within our team ranged from 5 minutes (mindlessly and quickly clicking through the survey without actually reading the items) to 10 minutes (reading and answering quickly, but legitimately) to 28 minutes (attending to the items and reading thoroughly). Based on these test runs, consideration of the survey length and skip logic, and our best judgment, the team decided that a 10-minute (600-second) cutoff was the shortest realistic time in which respondents could complete the survey while legitimately reading the items (see Table S1 in
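For illustration, a duration screen of this kind requires only a few lines of code. The following Python sketch uses pandas with hypothetical column names (respondent_id, duration_seconds); our actual screening was performed in SAS, and Qualtrics typically exports each record's completion time in a "Duration (in seconds)" field that would serve as the duration variable here.

```python
import pandas as pd

# Hypothetical export with one row per survey record; Qualtrics typically
# provides completion time in a "Duration (in seconds)" column.
df = pd.DataFrame({
    "respondent_id": [101, 102, 103, 104],
    "duration_seconds": [312, 645, 1804, 599],
})

CUTOFF_SECONDS = 600  # 10-minute minimum judged realistic by the team

# Flag and drop "speeders" who finished faster than the cutoff.
speeders = df["duration_seconds"] < CUTOFF_SECONDS
df_clean = df.loc[~speeders].reset_index(drop=True)
```

In this toy data, respondents 101 and 104 fall under the cutoff and would be removed before the remaining checks are applied.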
Consecutive identical responses (ie, “straight-lining” [
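One way to operationalize a straight-lining check is to test whether a respondent gave the identical answer to every item of a multi-item scale. The Python sketch below uses hypothetical item names (belief_1 through belief_5); our own implementation used SAS code, which is provided as a multimedia appendix.

```python
import pandas as pd

# Hypothetical items from one multi-item scale; a real survey would
# repeat this check for each scale presented in a grid format.
scale_items = ["belief_1", "belief_2", "belief_3", "belief_4", "belief_5"]

df = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "belief_1": [4, 2, 3],
    "belief_2": [4, 5, 3],
    "belief_3": [4, 1, 3],
    "belief_4": [4, 4, 3],
    "belief_5": [4, 3, 3],
})

# A record straight-lines the scale if all items share a single value.
df["straight_lined"] = df[scale_items].nunique(axis=1) == 1
flagged_ids = df.loc[df["straight_lined"], "respondent_id"].tolist()
```

Here, respondents 1 and 3 answered every item identically and would be flagged, whereas respondent 2 would not.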
The team identified survey items that could indicate logical contradictions or extremely rare cases by carefully reviewing survey items and assessing patterns of responses for logical consistency. Depending on the survey content of a particular project, the types and numbers of questions used for assessment of conflicting answers might be different. For example, surveys might include the same question in two different locations of the survey (eg, age) to check for potential contradictory responses. During our data cleaning process, we decided to examine respondents who had indicated all of the following: (1) they were married or widowed or divorced, (2) they did not self-identify as asexual, (3) they reported that they had not ever had sexual intercourse (ie, vaginal, anal, or oral sex), and (4) they responded that they were a parent of one or more children. This group of cases was selected because we believe that it is an extremely unlikely scenario (ie, that one would be married or have a history of being married and have a child or children while reporting that they had never had sexual intercourse and did not self-identify as asexual) that is most likely due to careless answering. Records meeting those criteria were identified and removed using code in SAS software (version 9.4; SAS Institute Inc) [
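The logic of this check reduces to a conjunction of the four criteria. The Python sketch below illustrates it with hypothetical variable names and codings; the SAS code our team actually used is provided as a multimedia appendix.

```python
import pandas as pd

# Hypothetical codings of the four items used in the contradiction check.
df = pd.DataFrame({
    "respondent_id": [1, 2],
    "marital_status": ["married", "widowed"],
    "asexual": [False, False],
    "ever_had_sex": [False, True],
    "is_parent": [True, True],
})

# A record is flagged as contradictory when all four conditions hold:
contradictory = (
    df["marital_status"].isin(["married", "widowed", "divorced"])
    & ~df["asexual"]
    & ~df["ever_had_sex"]
    & df["is_parent"]
)
df_clean = df.loc[~contradictory]
```

Respondent 1 meets all four criteria and would be removed; respondent 2, who reported having had sexual intercourse, would be retained.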
Two team members independently assessed the quality of open-ended responses by checking all open-ended variables and identifying gibberish (ie, unintelligible responses), nonsensical responses (eg, responses that did not make sense in the context of the question asked), and patterns of identical responses within and across records (eg, the exact same response to multiple open-ended items). Our team completed this in two steps. The first reviewer conducted a visual examination of cases that contained gibberish and duplicate responses; these records were flagged and removed. The second reviewer then (1) identified nonsensical responses, (2) identified irrelevant responses, and (3) checked for repetitive patterns within and across records (ie, to identify whether different records had the same response patterns, because this could indicate that the same person may have completed multiple surveys). To do this, a team member (ie, the first reviewer) exported survey records from SAS to a Microsoft Excel spreadsheet. Another team member (ie, the second reviewer) located each open-ended variable column and sorted that column to inspect each of the responses provided, row by row. The team member also inspected the records column by column to identify patterns of open-ended responses across variables. Records that met the criteria outlined above were flagged. The same procedure was repeated for each open-ended variable until all open-ended variables in the codebook had been inspected. When the second reviewer had questions about whether responses were nonsensical, irrelevant, or repetitive, the research team discussed and resolved them by consensus. Survey records flagged by at least one of these three checks were subsequently removed from the data set.
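Portions of such a review can be pre-flagged automatically, although in our study the final determinations were made by human reviewers. The Python sketch below (hypothetical column names and deliberately crude heuristics) illustrates two of the checks: identical text pasted into multiple open-ended items, and keyboard-run gibberish. Any automated flag of this kind should still be confirmed by manual review.

```python
import pandas as pd

open_ended_cols = ["why_vaccinate", "info_sources"]  # hypothetical items

df = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "why_vaccinate": ["To protect against cancer", "asdf qwer", "Good idea"],
    "info_sources": ["My doctor", "asdf qwer", "Good idea"],
})

# Check 1: the exact same text pasted into every open-ended item.
same_answer_everywhere = df[open_ended_cols].nunique(axis=1) == 1

# Check 2: crude gibberish screen (keyboard runs or vowel-free strings).
def looks_gibberish(text: str) -> bool:
    t = text.lower()
    return "asdf" in t or "qwer" in t or not any(v in t for v in "aeiou")

gibberish = df[open_ended_cols].apply(lambda col: col.map(looks_gibberish)).any(axis=1)

df["flag_open_ended"] = same_answer_everywhere | gibberish
```

In this toy data, respondent 2 is flagged by both heuristics and respondent 3 by the duplicate-text check, while respondent 1 passes both.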
We conducted descriptive statistics to characterize our sample before and after the quality assessment procedures (
Steps to ensure quality of responses leading to final analytic sample.
Data quality assessment step | Records removed (of N=4000), n (%) | Records left, n (%)
Original sample | 0 (0) | 4000 (100) |
Step 1: survey duration | 648 (16.20) | 3352 (83.80) |
Step 2: consecutive identical responses | 292 (7.30) | 3060 (76.50) |
Step 3: contradictory responses | 187 (4.68) | 2873 (71.83) |
Step 4: quality of open-ended responses | 151 (3.78) | 2722 (68.05) |
Descriptive characteristics of the original and final samples.
Characteristic | Original sample (N=4000), n (%)a | Final sample (N=2722), n (%)a
Age (years) | |
18-26 | 2000 (50.00) | 1381 (50.73)
27-45 | 2000 (50.00) | 1341 (49.27)
Sex at birth | |
Female | 2000 (50.00) | 1523 (55.95)
Male | 2000 (50.00) | 1199 (44.05)
Race | |
White | 2752 (68.80) | 1934 (71.05)
Black or African American | 506 (12.65) | 314 (11.54)
Other | 726 (18.15) | 470 (17.27)
Missing | 16 (0.40) | 4 (0.15)
Ethnicity | |
Hispanic | 719 (17.98) | 447 (16.42)
Non-Hispanic | 3252 (81.30) | 2266 (83.25)
Missing | 29 (0.73) | 9 (0.33)
| |
Yes | 3719 (92.98) | 2529 (92.91)
No | 263 (6.58) | 189 (6.94)
Missing | 18 (0.45) | 4 (0.15)
Education | |
High school or less | 983 (24.58) | 661 (24.28)
Some college or associate’s degree | 1152 (28.80) | 870 (31.96)
Bachelor’s degree | 1000 (25.00) | 757 (27.81)
Graduate school | 848 (21.20) | 429 (15.76)
Missing | 17 (0.43) | 5 (0.18)
Income (US $) | |
0-19,999 | 521 (13.03) | 331 (12.16)
20,000-49,999 | 917 (22.93) | 673 (24.72)
50,000-74,999 | 765 (19.13) | 558 (20.50)
75,000-99,999 | 649 (16.23) | 456 (16.75)
≥100,000 | 1069 (26.73) | 658 (24.17)
Missing | 79 (1.98) | 46 (1.69)
Marital status | |
Married | 2158 (53.95) | 1403 (51.54)
Other | 1826 (45.65) | 1317 (48.38)
Missing | 16 (0.40) | 2 (0.07)
Employment status | |
Employed | 2973 (74.33) | 1975 (72.56)
Unemployed | 415 (10.38) | 310 (11.39)
Other | 596 (14.90) | 433 (15.91)
Missing | 16 (0.40) | 4 (0.15)
Sexual orientation | |
Straight | 3242 (81.05) | 2225 (81.74)
Other | 654 (16.35) | 441 (16.20)
Missing | 104 (2.60) | 56 (2.06)
| |
No | 729 (18.23) | 457 (16.79)
Yes | 3248 (81.20) | 2259 (82.99)
Missing | 23 (0.58) | 6 (0.22)
| |
No | 2153 (53.83) | 1596 (58.63)
Yes | 1832 (45.80) | 1123 (41.26)
Missing | 15 (0.38) | 3 (0.11)
Region | |
Midwest | 811 (20.28) | 583 (21.42)
Northeast | 680 (17.00) | 435 (15.98)
South | 1576 (39.40) | 1072 (39.38)
West | 933 (23.33) | 632 (23.22)
aPercentages may not total 100% due to rounding.
We described the use and systematic application of four steps to examine the quality of responses and to clean data from a web-based survey completed by individuals who were part of a research panel, directly or through verified partners. There are several other strategies and techniques to screen and clean data (eg, missing data, stability of response patterns, outliers, and maximum long strings, among others [
Other researchers have faced similar challenges when screening and filtering out records with low-quality data collected from web-based surveys. Recently, researchers have reported discarding as much as three-quarters of collected data [
We hope that our step-by-step process encourages other research teams to systematically evaluate the integrity of their web-based survey data and use approaches to appropriately manage their data. Certainly, with the increased use of web-based surveys, it is imperative to evaluate data integrity and promote reporting transparency. A recent systematic review (n=80 studies) found that only 5% of the reviewed, published, web-based survey studies reported implementing checks to identify fraudulent data [
This paper adds to the literature an applied, systematic example of data screening and management procedures that allow investigators to assess the quality of responses and eliminate invalid, fraudulent, or low-quality records. With the growing body of literature describing the application of quality assessment techniques and data cleaning approaches, this paper contributes an empirical example that could serve as a resource for other investigators and help streamline their data cleaning procedures. We have created a checklist as a tool for future studies (
Cleaning the data from our web-based survey completed by panelists was a multistep and time-consuming process. However, after having invested time and effort into these quality assessment and data cleaning steps, we are more confident about the integrity of the remaining data in our final analytic sample. The final sample for manuscripts resulting from these survey data may vary depending on scientific goals and data analysis decisions (eg, handling of missing data).
There were several key lessons learned from this experience. First, screening and quality checks should be in place both before and after collecting data from web-based survey platforms and research panelists. In future web-based surveys, our team plans to include attention checks and additional items to assess conflicting answers, with the hope of both deterring careless or fraudulent responses and identifying those that occur. An example of an attention check is one published by Chandler and colleagues [
Web-based survey data collection and the use of research panels will likely continue in the future. Certainly, there are pros and cons to collecting web-based survey data and recruiting participants from research panels. Developing a rigorous plan that spans survey inception, development, administration, and statistical analysis; using multiple strategies for data quality checks and cleaning; and devoting sufficient time and attention can effectively improve data cleaning and management practices and, consequently, the integrity of web-based survey data.
Steps to develop and pretest an electronic survey:
Provide clear instructions to participants and survey programmers
Test skip and branching logic (ie, rules to jump to other items)
Display items in a simple and logical way
Display scales as individual items rather than in a table format
Reduce the number of open-ended questions
Reduce the use of complex fill-in tables
Pretest the electronic survey in its final format for ease of administration and understanding (ideally with target population)
Pretest the electronic survey for completion time
Other (ie, other ways to tailor this checklist depending on needs and availability of data): ___________________________
Steps to prevent fraudulent responses (pre–data collection):
Add attention checks (ie, ways to identify inattentive respondents)
Add CAPTCHA or reCAPTCHA tasks
Add speeder checks (ie, ways to identify fast respondents)
Add items that can be used to verify responses or assess contradictions
Collect IP address, geolocation, device, and browser used to access the survey
Enable settings available within the web-based survey application to prevent multiple submissions and detect bots, among other issues
Choose a platform that adheres to data privacy and compliance
Other (ie, other ways to tailor this checklist depending on needs and availability of data): ___________________________
Steps to assess data integrity (post–data collection):
Assess participant survey duration
Check ranges of variables and examine responses that are clearly implausible
Identify consecutive identical responses
Identify contradictory responses
Examine quality of open-ended responses
Check IP address, geolocation, device, and browser information to identify multiple entries
Other (ie, other ways to tailor this checklist depending on needs and availability of data): ___________________________
Descriptive information for Step 1—survey duration.
Scales used to examine consecutive identical responses (Step 2).
SAS code for Step 2 (consecutive identical responses).
SAS code for Step 3 (conflicting responses).
American Statistical Association
Checklist for Reporting Results of Internet E-Surveys
human papillomavirus
principal investigator
South Carolina Clinical and Translational Science
This study was supported with funding from a Moffitt Center for Immunization and Infection Research in Cancer Award (principal investigator [PI]: SMC) and a Moffitt Merit Society Award (PI: SMC). This work has been supported, in part, by both the Participant Research, Interventions, and Measurement Core and the Biostatistics and Bioinformatics Shared Resource at the H. Lee Moffitt Cancer Center & Research Institute, a comprehensive cancer center designated by the National Cancer Institute and funded, in part, by a Moffitt Cancer Center Support Grant (P30-CA076292). NCB also acknowledges support from the South Carolina Clinical and Translational Science (SCTR) Institute at the Medical University of South Carolina. The SCTR Institute is funded by the National Center for Advancing Translational Sciences of the National Institutes of Health (grant UL1TR001450). The contents of this manuscript are solely the responsibility of the authors and do not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.
The research team wishes to thank the following individuals: Katharine Head and Monica Kasting for their contribution to survey design, and Lakeshia Cousin, Aldenise Ewing, and Kayla McLymont for testing and timing early versions of the study survey.
SMC designed the study and obtained funding. JW performed data analyses. SMC, NCB, JW, and MA conceived the idea for this manuscript. MA wrote the manuscript with input from all authors. NCB, JW, CDM, CKG, STV, KJT, JYI, ARG, and SMC critically reviewed and contributed to the final version of this manuscript.
NCB served as an ad hoc reviewer in 2020 for the American Cancer Society, for which she received sponsored travel during the review meeting and a stipend of US $300. NCB received a series of small awards for conference and travel support, including US $500 from the Statistical Consulting Section of the American Statistical Association (ASA) for Best Paper Award at the 2019 Joint Statistical Meetings and the US $500 Lee Travel Award from the Caucus for Women in Statistics to support attendance at the 2018 Joint Statistical Meetings. NCB also received a Michael Kutner/ASA Junior Faculty Travel Award of US $946.60 to attend the 2018 Summer Research Conference of the Southern Regional Council on Statistics and travel support of US $708.51 plus a registration waiver from the ASA to attend and chair a session for the 2017 Symposium on Statistical Inference. Currently, NCB serves as the Vice President for the Florida Chapter of the ASA and Section Representative for the ASA Statistical Consulting Section, and on the Scientific Review Board at Moffitt Cancer Center. Previously, NCB served as the Florida ASA Chapter Representative and as the mentoring subcommittee chair for the Regional Advisory Board of the Eastern North American Region of the International Biometrics Society. JYI has received consulting fees from Flatiron Health Inc. ARG is a member of Merck & Co, Inc, advisory boards and her institution receives funds from Merck & Co, Inc, for research. SMC is an unpaid advisory board member of HPV Cancers Alliance.