This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.
The COVID-19 pandemic is a substantial public health crisis that negatively affects human health and well-being. As a result of being infected with the coronavirus, patients can experience long-term health effects called long COVID syndrome. Multiple symptoms characterize this syndrome, and it is crucial to identify these symptoms as they may negatively impact patients’ day-to-day lives. Breathlessness, fatigue, and brain fog are the 3 most common continuing and debilitating symptoms that patients with long COVID have reported, often months after the onset of COVID-19.
This study aimed to understand the patterns and behavior of long COVID symptoms reported by patients on the Twitter social media platform, which is vital to improving our understanding of long COVID.
Long COVID–related Twitter data were collected from May 1, 2020, to December 31, 2021. We used association rule mining techniques to identify frequent symptoms and establish relationships between symptoms among patients with long COVID in Twitter social media discussions. The highest confidence level–based detection was used to determine the most significant rules with 10% minimum confidence, 0.1% (0.001) minimum support, and a lift greater than 1.
Among the 30,327 tweets included in our study, the most frequent symptoms were brain fog (n=7812, 25.8%), fatigue (n=5284, 17.4%), breathing/lung issues (n=4750, 15.7%), heart issues (n=2900, 9.6%), flu symptoms (n=2824, 9.3%), depression (n=2256, 7.4%) and general pains (n=1786, 5.9%). Loss of smell and taste, cold, cough, chest pain, fever, headache, and arm pain emerged in 1.6% (n=474) to 5.3% (n=1616) of patients with long COVID. Furthermore, the highest confidence level–based detection successfully demonstrates the potential of association analysis and the Apriori algorithm to establish patterns to explore 57 meaningful relationship rules among long COVID symptoms. The strongest relationship revealed that patients with lung/breathing problems and loss of taste are likely to have a loss of smell with 77% confidence.
There are very active social media discussions that could support the growing understanding of COVID-19 and its long-term impact. These discussions enable a potential field of research to analyze the behavior of long COVID syndrome. Exploratory data analysis using natural language processing methods revealed the symptoms and medical conditions related to long COVID discussions on the Twitter social media platform. Using Apriori algorithm–based association rules, we determined interesting and meaningful relationships between symptoms.
COVID-19, a transmissible disease caused by the SARS-CoV-2 virus, has become a substantial public health crisis that negatively affects people’s health and well-being. Most people with COVID-19 recover entirely within weeks. However, some people still experience symptoms after their initial recovery, even those who had mild symptoms with their initial infection. Others develop new symptoms related to their COVID-19 illness. These people sometimes describe themselves as “long haulers” [
Social media has become a substantial part of our lives. People use it to connect with others and share their thoughts, emotions, and experiences about any current topic, often without revealing their identity [
According to the World Health Organization clinical case definition [
There are, however, very active ongoing social media discussions that could support the growing understanding of the illness and its long-term impact. These discussions provide the opportunity to access publicly available data from multiple individuals on Twitter to analyze long COVID symptoms. However, manually discovering the knowledge in a large volume of unstructured texts is increasingly problematic. Hence, automated natural language processing (NLP) methods have been introduced to do this task effectively and accurately [
Association rules are considered a useful tool as they offer the possibility to conduct intelligent diagnoses, extract invaluable information, and build important knowledge quickly and automatically while identifying relationships within and between variables [
This study sought to achieve 2 goals. The first goal was to identify the symptoms and medical conditions related to long COVID that were discussed on the Twitter social media platform. The second goal was to determine the patterns of symptoms and their associations. By accomplishing these objectives, this work will ultimately help physicians identify the behavior of patients with long COVID. This paper provides new ideas for symptom mining and reveals the internal relationship between symptoms and their application value. Thus, the work has theoretical and practical implications.
We collected worldwide, long COVID–related, and English-language tweets between May 1, 2020, and December 31, 2021, to create our data set of about 1 million tweets. We used the
We reduced the data set to 127,848 tweets by limiting the population to those with COVID-19. To do this, we refined the tweets to ensure that all the tweets reflect personal experiences with long COVID. We first considered tweets containing the pronoun “I” and the word “covid” as we wanted to extract tweets from people with COVID-19 or long COVID. Subsequently, we removed tweets containing words that explain users’ opinions, as many people discuss long COVID without necessarily having COVID-19. The set of the words or phrases we considered is listed in
The list of words that explain users’ opinions rather than their experience.
Word or phrase | Tweets (N=148,672), n (%) |
“opinion” | 641 (0.43) |
“I believe” | 1194 (0.8) |
“I think” | 6861 (4.61) |
“I feel” | 2006 (1.35) |
“may be” OR “maybe” OR “might” | 7582 (5.1) |
“perhaps” | 750 (0.5) |
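The two-step filter described above (keep first-person tweets that mention COVID, then drop tweets containing opinion wording) can be sketched as follows. This is a minimal illustration, not the study's code: the function name `is_personal_experience`, the case-insensitive whole-word matching, and the sample tweets are all assumptions.

```python
import re

# Opinion markers taken from the table above; how they were matched in the
# study is not stated, so case-insensitive whole-word matching is assumed.
OPINION_PHRASES = ["opinion", "i believe", "i think", "i feel",
                   "may be", "maybe", "might", "perhaps"]

FIRST_PERSON = re.compile(r"\bi\b", re.IGNORECASE)   # standalone pronoun "I"
COVID = re.compile(r"covid", re.IGNORECASE)          # also matches "COVID-19"
OPINION = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in OPINION_PHRASES) + r")\b",
    re.IGNORECASE,
)


def is_personal_experience(tweet: str) -> bool:
    """Keep tweets with 'I' and 'covid' that contain no opinion marker."""
    has_first_person = FIRST_PERSON.search(tweet) is not None
    mentions_covid = COVID.search(tweet) is not None
    has_opinion = OPINION.search(tweet) is not None
    return has_first_person and mentions_covid and not has_opinion


# Hypothetical sample tweets, for illustration only.
tweets = [
    "I had covid in March and I still get brain fog",
    "I think long covid is underreported",
    "Long covid clinics are opening across the country",
    "Maybe I never had covid at all",
]
kept = [t for t in tweets if is_personal_experience(t)]
```

Only the first sample tweet survives: the second and fourth contain opinion markers, and the third has no first-person pronoun.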
List of long COVID symptoms identified by different sources in the literature.
No. | Symptom | Mayo Clinic | NHSa | CDCb | WHOc | Singh and Reddy [ |
1 | Extreme tiredness (fatigue) | ✓ | ✓ | ✓ | ✓ | ✓ |
2 | Shortness of breath or difficulty breathing | ✓ | ✓ | ✓ | ✓ | ✓ |
3 | Cough | ✓ | ✓ | ✓ | ✓ | ✓ |
4 | Joint pain | ✓ | ✓ | ✓ | ✓ | |
5 | Chest pain or tightness | ✓ | ✓ | ✓ | ✓ | ✓ |
6 | Problems with memory and concentration (“brain fog”) | ✓ | ✓ | ✓ | ✓ | ✓ |
7 | Difficulty sleeping (insomnia) | | ✓ | ✓ | ✓ | ✓ |
8 | Muscle pain | ✓ | | ✓ | ✓ | ✓ |
9 | Headache | ✓ | ✓ | ✓ | ✓ | ✓ |
10 | Fast or pounding heartbeat | ✓ | ✓ | ✓ | ✓ | ✓ |
11 | Loss of smell | ✓ | ✓ | ✓ | ✓ | ✓ |
12 | Loss of taste | ✓ | ✓ | ✓ | ✓ | ✓ |
13 | Depression or anxiety | | | | | |
14 | Fever | ✓ | ✓ | | ✓ | |
15 | Dizziness (light-headedness) | ✓ | ✓ | ✓ | ✓ | ✓ |
16 | Worsened symptoms after physical or mental activities | ✓ | | ✓ | | |
17 | Pins-and-needles feeling | | ✓ | ✓ | ✓ | |
18 | Tinnitus and earaches | | ✓ | | ✓ | ✓ |
19 | Diarrhea | | ✓ | ✓ | ✓ | |
20 | Stomach aches | | ✓ | ✓ | | |
21 | Loss of appetite | | ✓ | | | ✓ |
22 | Sore throat | | ✓ | | | |
23 | Rash | | | ✓ | | |
24 | Mood changes | | | ✓ | | |
25 | Changes in menstrual period cycles | | | ✓ | ✓ | |
26 | Abdominal pain | | | | ✓ | ✓ |
27 | Neuralgias | | | | ✓ | |
28 | Allergies | | | | ✓ | |
29 | Body pain | | | | | ✓ |
30 | Nausea | | | | | ✓ |
31 | Weakness | | | | | ✓ |
32 | Numbness | | | | | ✓ |
aNHS: National Health Service.
bCDC: Centers for Disease Control and Prevention.
cWHO: World Health Organization.
Preprocessed word corpus of stemmed symptoms.
Group | Symptoms |
Brain fog | “brain fog,” “brain,” “fog,” “memori,” “mental,” “rememb,” “concentr,” “mind,” “remind,” and “focus” |
Fatigue | “fatigu,” “tire,” and “exhaust” |
Lung | “lung,” “breathless,” and “breath” |
Cannot walk | “cant walk,” “struggl walk,” “unabl walk,” “couldnt walk,” “bare walk,” “unaid walk,” and “stair walk” |
Depress | “depress,” “mood,” “stress,” and “anxieti” |
Lose weight | “lose weight” and “loss weight” |
Insomnia | “cant sleep” and “insomnia” |
Diarrhea | “diarrhea” and “diarrhoea” |
Dizziness | “dizz” and “lighthead” |
Heart | “heart,” “heart palpit,” “tachycardia,” “dysautonomia,” and “arrhythmia” |
Others | “headach,” “neck,” “arm,” “muscl pain,” “cough,” “chest pain,” “flu,” “joint pain,” “pain,” “rash,” “fever,” “loss smell,” “loss tast,” “cold,” “earach,” “vomit,” “chill,” “nausea,” “faint,” “gain weight,” “trauma,” “bodi,” “bleed,” “appetit,” “sore throat,” “pin needl,” “numb,” “tinnitu,” “buzz,” “hairfall,” “nose,” “stomach,” “menstrual,” and “abdomin” |
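Given tokens that have already been stemmed, a grouped corpus like the one above can be applied as a simple lookup. The sketch below is illustrative only: it includes just four of the groups from the table, and the names `SYMPTOM_GROUPS` and `symptom_groups`, as well as the exact-match rule on single stems and adjacent stem pairs, are assumptions.

```python
# Subset of the grouped stems from the table above (illustration only).
SYMPTOM_GROUPS = {
    "brain_fog": {"brain fog", "brain", "fog", "memori", "mental",
                  "rememb", "concentr", "mind", "remind", "focus"},
    "fatigue": {"fatigu", "tire", "exhaust"},
    "lung": {"lung", "breathless", "breath"},
    "insomnia": {"cant sleep", "insomnia"},
}


def symptom_groups(stemmed_tokens):
    """Return the set of symptom groups mentioned in one stemmed tweet."""
    # Two-word stems such as "cant sleep" are matched against adjacent pairs.
    pairs = {" ".join(p) for p in zip(stemmed_tokens, stemmed_tokens[1:])}
    terms = set(stemmed_tokens) | pairs
    return {group for group, stems in SYMPTOM_GROUPS.items() if terms & stems}


found = symptom_groups(["exhaust", "cant", "sleep", "brain", "fog"])
```

Each tweet then reduces to a set of group labels, which is the form needed later when tweets are treated as transactions.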
Data preprocessing was mainly used to clean the raw data by following specific steps to achieve better results for further evaluations. We preprocessed our initial data to ensure quality by developing a user-defined preprocessing function based on
The preprocessing plan was as follows. First, we removed hashtags and their content (eg, #COVID-19), @user mentions, and URLs from the texts because they did not contribute to the text analysis. We also removed all non-English characters (non–American Standard Code for Information Interchange characters) because the study focused on analyzing tweets in English. We then removed repeated words and stop words identified by
Stemming reduced inflected words to their word stem, base, or root form, whereas tokenization was used to split each sentence into smaller units (tokens), such as individual words.
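The cleaning steps above can be sketched with the Python standard library alone. This is a hedged illustration rather than the study's preprocessing function: the name `preprocess`, the tiny `STOP_WORDS` set, and the reading of "repeated words" as "keep only the first occurrence within a tweet" are assumptions. Stemming is deliberately left out; a stemmer such as NLTK's `PorterStemmer` could be applied to the returned tokens.

```python
import re

# Tiny illustrative stop word list (assumption); a real list is much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "i", "my", "is", "was",
              "to", "of", "in", "it", "still", "have", "had"}


def preprocess(tweet: str) -> list:
    """Clean one tweet and return its tokens (no stemming applied)."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)               # URLs
    text = re.sub(r"[@#]\w+", " ", text)                    # @mentions, #hashtags
    text = text.encode("ascii", "ignore").decode("ascii")   # non-ASCII characters
    text = re.sub(r"[^a-z\s]", "", text)                    # punctuation, digits
    tokens = text.split()                                   # whitespace tokenization

    cleaned, seen = [], set()
    for token in tokens:
        # Drop stop words and any word already seen in this tweet.
        if token in STOP_WORDS or token in seen:
            continue
        seen.add(token)
        cleaned.append(token)
    return cleaned


tokens = preprocess("I can't sleep!! Still can't breathe 😷 #LongCovid @user https://t.co/abc")
```

Removing the apostrophe turns "can't" into "cant", which is consistent with stems such as "cant sleep" in the symptom corpus.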
Word cloud of the list of 73 words after stemming.
Data collection and preprocessing process.
Time series plot for originally obtained data and the data considered for the study.
To measure the sentiment expressed via Twitter on long COVID, we used sentiment analysis, a specific type of NLP, computational linguistics, and text analysis [
Classification of the sentiment scores.
Measure | Positive class | Negative class | Neutral class |
Sentiment scores of all posts, % | 53.1 | 45.2 | 1.66 |
Sentiment scores of posts with at least one long COVID symptom, % | 49.8 | 48.7 | 1.53 |
At this stage, we calculated the sentiment polarity of each cleaned and preprocessed tweet using the
Symptoms often appear as more than 1 word in texts, so identifying meaningful 2-word symptoms was a specific task in this study. Many valuable text analyses are based on the relationships between words, examining which words tend to follow others immediately or co-occur. We therefore analyzed the relationship between the 2 words of each bigram in the tweets and identified which long COVID symptoms appear as a combination of words. We used the
To identify the meaningful biwords, we looked for collocations, that is, phrases of more than 1 word whose words co-occur in a given context more often than would be expected from their individual frequencies. We used several bigram-association measures [
1. Pointwise mutual information (PMI): the PMI score for 2 words,
The main intuition is that it measures how much more likely the words are to co-occur than to occur independently. However, the main disadvantage of this method is that it is very sensitive to rare word combinations. We handled this issue by applying a frequency filter to the words.
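For reference, the standard definition of PMI for a bigram can be written as follows. The symbols $w_1$, $w_2$, $c(\cdot)$ (a count), and $N$ (the total number of bigrams) are notation assumed here for illustration, and the logarithm base (2 below) only rescales the score.

```latex
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)},
\qquad
P(w_1, w_2) \approx \frac{c(w_1 w_2)}{N}, \quad
P(w_i) \approx \frac{c(w_i)}{N}
```

A score of 0 corresponds to independence. Because the estimate divides by $c(w_1)\,c(w_2)$, a pair of rare words seen together only once or twice can receive a very high score, which is the sensitivity that a frequency filter guards against.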
2. Two-tailed
versus
Where
The test statistic is as follows:
where,
We also avoided the sensitivity to rare cases by filtering the biwords according to the frequencies. Normality assumption is one of the disadvantages of using the
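A common textbook form of the t test for collocations is sketched below as an assumption about the general shape of such a test, not as the study's exact notation. Each of the $N$ bigram positions is treated as a Bernoulli trial that succeeds when the bigram $w_1 w_2$ occurs; $c(\cdot)$ denotes a count, and $\bar{x}$, $\mu$, and $s^2$ are symbols introduced here.

```latex
H_0 : P(w_1, w_2) = P(w_1)\,P(w_2)
\quad \text{versus} \quad
H_1 : P(w_1, w_2) \neq P(w_1)\,P(w_2)

t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}},
\qquad
\bar{x} = \frac{c(w_1 w_2)}{N}, \quad
\mu = \frac{c(w_1)}{N} \cdot \frac{c(w_2)}{N}, \quad
s^2 = \bar{x}\,(1 - \bar{x})
```

Here $\bar{x}$ is the observed proportion of the bigram, $\mu$ is the proportion expected under independence, and $s^2$ is the Bernoulli sample variance.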
3. Chi-square test: The null hypothesis of the chi-square test assumes that words (
The observed frequencies (
Bigram contingency table for a bigram (
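The three association measures can be computed from the four cells of a 2×2 bigram contingency table. The sketch below is a minimal, standard-library illustration under stated assumptions: the function name `bigram_association`, the cell names `o11` to `o22`, the `min_count` frequency filter, and the toy token list are all hypothetical, and the t statistic uses the Bernoulli variance $\bar{x}(1-\bar{x})$.

```python
import math
from collections import Counter


def bigram_association(tokens, w1, w2, min_count=1):
    """Score the bigram (w1, w2) with PMI, a t statistic, and chi-square."""
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)                                  # number of bigram positions
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)     # bigrams starting with a word
    second_counts = Counter(b for _, b in pairs)    # bigrams ending with a word

    o11 = pair_counts[(w1, w2)]                     # w1 followed by w2
    if o11 < max(min_count, 1):                     # frequency filter
        return None
    o12 = first_counts[w1] - o11                    # w1 followed by another word
    o21 = second_counts[w2] - o11                   # another word followed by w2
    o22 = n - o11 - o12 - o21                       # neither
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22

    # PMI: log ratio of observed to expected joint probability.
    pmi = math.log2(o11 * n / (row1 * col1))

    # t statistic: observed proportion against the proportion under independence.
    x_bar = o11 / n
    mu = (row1 / n) * (col1 / n)
    variance = x_bar * (1 - x_bar) / n
    t = (x_bar - mu) / math.sqrt(variance) if variance > 0 else float("inf")

    # Chi-square for a 2x2 table, closed form.
    denom = row1 * row2 * col1 * col2
    chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else float("nan")
    return {"pmi": pmi, "t": t, "chi2": chi2}


# Toy stemmed corpus (hypothetical), 12 tokens and 11 bigram positions.
tokens = ["brain", "fog", "and", "fatigu", "brain", "fog",
          "and", "cough", "sore", "throat", "and", "fatigu"]
scores = bigram_association(tokens, "brain", "fog")
```

In this toy corpus "brain" is always followed by "fog" and "fog" is always preceded by "brain", so the off-diagonal cells are 0 and the chi-square statistic equals the number of bigram positions.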
In many applications, implications between different situations naturally arise. We refer to these implications as associations. These associations can be discovered and quantified using relational knowledge. “Relational knowledge identifies how concepts/entities are related and how concepts and their relations are defined or described by models” [
ARM [
Describing these relationships among symptoms can help define long COVID syndrome (LCS) for future research and patient care. These patterns expose the combinations of symptoms that co-occur, as it is helpful to know how 1 symptom or set of symptoms is associated with others. An association rule between a set of symptoms X and a set of symptoms Y is expressed in the form X → Y. It is interpreted as “patients with symptoms X are likely to have symptoms Y.” Generally, the effectiveness of discovered rules is measured in terms of support, confidence, and lift.
1. Support: Support indicates how frequently the item set appears in the data set.
2. Confidence: Confidence is the percentage of all transactions satisfying X that also satisfy Y.
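3. Lift: Lift is the ratio of the confidence of the rule X → Y to the support of Y. A lift >1 indicates that X and Y occur together more often than would be expected if they were independent; it quantifies the degree to which the 2 occurrences depend on one another and makes these rules potentially helpful in predicting the consequences in future data sets.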
Thus, based on the analysis of long COVID symptoms, we can mine the association rules among symptoms and quantify their characteristics, such as confidence, support, and lift.
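As a worked check of how the three measures fit together, the standard relations confidence(X → Y) = support(X ∪ Y) / support(X) and lift(X → Y) = confidence(X → Y) / support(Y) can be applied to the rounded values reported for rules R5 and R6 in the list of identified symptom rules. The variable names below are assumptions, and the inputs are the 4-decimal values from that table, so the result only approximates the reported lift.

```python
# Rounded values as reported for R5 (loss_taste -> loss_smell)
# and R6 (loss_smell -> loss_taste) in the rules table.
support_both = 0.0272          # support of {loss_taste, loss_smell}
conf_taste_to_smell = 0.5871   # confidence of R5
conf_smell_to_taste = 0.5111   # confidence of R6
reported_lift = 11.0173        # lift reported for both rules

# Support of each single symptom follows from confidence = support_both / support(X).
support_taste = support_both / conf_taste_to_smell
support_smell = support_both / conf_smell_to_taste

# Lift divides the confidence by the support of the consequent.
lift_r5 = conf_taste_to_smell / support_smell
lift_r6 = conf_smell_to_taste / support_taste

print(round(support_taste, 4), round(support_smell, 4))
print(round(lift_r5, 2), round(lift_r6, 2))   # both about 11.03
```

The two lifts coincide because lift is symmetric in X and Y, which matches the identical lift reported for R5 and R6; the small gap from the reported 11.0173 is within what rounding the support to 4 decimals can produce.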
Information was extracted for a total of 1,084,398 individual tweets, of which 34,022 had reported long COVID symptoms (
First 5 biwords selected from each method.
Rank | Pointwise mutual information | Chi-square test | |
1 | (runni, nose) | (brain, fog) | (brain, fog) |
2 | (pin, needl) | (mental, health) | (glandular, fever) |
3 | (shortterm, memori) | (chronic, fatigue) | (runni, nose) |
4 | (glandular, fever) | (tast, smell) | (mental, health) |
5 | (hay, fever) | (viral, fatigue) | (sore, throat) |
Since some biwords have a similar medical meaning, we grouped similar words into categories for analysis (
Word cloud of the list of 44 identified symptoms.
Relative frequency of symptoms in patients with long COVID.
The 15 most frequent long COVID symptoms by time.
Our study considered each tweet as a single transaction coming from a single individual. We applied the ARM algorithm to these symptom transactions to construct frequent item sets, that is, item sets whose support meets a user-specified threshold, and then to identify symptom rules. We set a minimum support threshold value above 0.001, a “confidence” threshold of 0.1 or 10%, and a “lift” greater than 1 for positively correlated rules. We discovered 57 significant rules for the data that included symptom-only information and presented them in
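The thresholding described above can be illustrated with a small pure-Python Apriori sketch. It is not the study's implementation: the function names `frequent_itemsets` and `mine_rules`, the toy transactions, and the toy support threshold of 0.2 (chosen because the example has only 10 transactions, whereas the study used 0.001) are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all item sets meeting min_support."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those that meet the support threshold.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join frequent k-item sets into (k+1)-item candidates, then prune any
        # candidate that has an infrequent k-item subset (Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def mine_rules(frequent, min_confidence, min_lift=1.0):
    """Generate rules X -> Y with confidence >= min_confidence and lift > min_lift."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                x = frozenset(antecedent)
                y = itemset - x
                confidence = support / frequent[x]
                lift = confidence / frequent[y]
                if confidence >= min_confidence and lift > min_lift:
                    rules.append((tuple(sorted(x)), tuple(sorted(y)),
                                  support, confidence, lift))
    # Highest confidence first.
    return sorted(rules, key=lambda rule: rule[3], reverse=True)


# Toy data: 1 tweet = 1 transaction = 1 set of symptom labels (hypothetical).
transactions = [
    {"loss_taste", "loss_smell"}, {"loss_taste", "loss_smell", "fatigue"},
    {"loss_taste"}, {"fatigue"}, {"brain_fog"}, {"brain_fog", "fatigue"},
    {"loss_smell"}, {"lung"}, {"lung", "fatigue"}, {"brain_fog"},
]
itemsets = frequent_itemsets(transactions, min_support=0.2)
rules = mine_rules(itemsets, min_confidence=0.1, min_lift=1.0)
```

With these toy inputs, the only frequent pair is loss of taste with loss of smell, which yields one rule in each direction; sorting by confidence mirrors the highest confidence level–based ordering used to rank rules.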
The highest confidence level–based detection was used to determine the most significant rules. Among the top 12 rules that have confidence >0.3, loss of smell and loss of taste were the most common consequent symptoms, followed by lung/breathing problems and fatigue. If a patient had lung/breathing problems and loss of taste, there was a 77% confidence that they had loss of smell. Similarly, patients with fatigue and loss of taste also had loss of smell as a consequent symptom. The top 12 rules are visualized in
List of identified symptom rules.
Rule (R) | Antecedents | Consequents | Support | Confidence | Lift |
R0 | (loss_taste, lung) | (loss_smell) | 0.0028 | 0.7748 | 14.5400 |
R1 | (loss_taste, fatigue) | (loss_smell) | 0.0028 | 0.7368 | 13.8281 |
R2 | (loss_smell, lung) | (loss_taste) | 0.0028 | 0.7107 | 15.3196 |
R3 | (loss_smell, fatigue) | (loss_taste) | 0.0028 | 0.6614 | 14.2564 |
R4 | (loss_taste, brain_fog) | (loss_smell) | 0.0018 | 0.5978 | 11.2192 |
R5 | (loss_taste) | (loss_smell) | 0.0272 | 0.5871 | 11.0173 |
R6 | (loss_smell) | (loss_taste) | 0.0272 | 0.5111 | 11.0173 |
R7 | (loss_smell, brain_fog) | (loss_taste) | 0.0018 | 0.4911 | 10.5847 |
R8 | (heart, pain) | (lung) | 0.0012 | 0.3684 | 2.3522 |
R9 | (heart, brain_fog) | (lung) | 0.0032 | 0.3542 | 2.2617 |
R10 | (ache) | (fatigue) | 0.0047 | 0.3471 | 1.9921 |
R11 | (fatigue, cough) | (lung) | 0.0012 | 0.3083 | 1.9686 |
R12 | (heart, fatigue) | (lung) | 0.0021 | 0.3014 | 1.9246 |
R13 | (headache) | (fatigue) | 0.0062 | 0.2675 | 1.5354 |
R14 | (brain_fog, lung) | (heart) | 0.0032 | 0.2574 | 2.6915 |
R15 | (heart) | (lung) | 0.0223 | 0.2328 | 1.4861 |
R16 | (lung, pain) | (fatigue) | 0.0016 | 0.2308 | 1.3245 |
R17 | (insomnia) | (fatigue) | 0.0013 | 0.2229 | 1.2791 |
R18 | (muscle_pain) | (fatigue) | 0.0019 | 0.2159 | 1.2392 |
R19 | (muscle_pain) | (ache) | 0.0019 | 0.2159 | 15.8929 |
R20 | (cough) | (lung) | 0.0061 | 0.2126 | 1.3572 |
R21 | (lung, cough) | (fatigue) | 0.0012 | 0.1989 | 1.1417 |
R22 | (chest_pain) | (lung) | 0.0053 | 0.1891 | 1.2075 |
R23 | (fever) | (fatigue) | 0.0046 | 0.1871 | 1.0737 |
R24 | (pain) | (fatigue) | 0.0102 | 0.1736 | 0.9962 |
R25 | (cold) | (flu) | 0.0064 | 0.1721 | 1.8483 |
R26 | (lung, pain) | (heart) | 0.0012 | 0.1683 | 1.7597 |
R27 | (muscle_pain) | (lung) | 0.0015 | 0.1667 | 1.0641 |
R28 | (weak) | (fatigue) | 0.0019 | 0.1633 | 0.9374 |
R29 | (trauma) | (brain_fog) | 0.0012 | 0.1586 | 0.6157 |
R30 | (ache) | (lung) | 0.0021 | 0.1553 | 0.9918 |
R31 | (fatigue, pain) | (lung) | 0.0016 | 0.1548 | 0.9886 |
R32 | (ache) | (pain) | 0.002 | 0.1505 | 2.5553 |
R33 | (heart, lung) | (brain_fog) | 0.0032 | 0.1422 | 0.5521 |
R34 | (lung) | (heart) | 0.0223 | 0.1421 | 1.4861 |
R35 | (brain_fog, lung) | (fatigue) | 0.0017 | 0.1421 | 0.8155 |
R36 | (joint_pain) | (brain_fog) | 0.0016 | 0.142 | 0.5514 |
R37 | (ache) | (headache) | 0.0019 | 0.1383 | 6.0025 |
R38 | (ache) | (muscle_pain) | 0.0019 | 0.1383 | 15.8929 |
R39 | (cough) | (fatigue) | 0.004 | 0.1371 | 0.7871 |
R40 | (arm_pain) | (fatigue) | 0.0021 | 0.135 | 0.7749 |
R41 | (muscle_pain) | (pain) | 0.0012 | 0.1326 | 2.2512 |
R42 | (brain_fog, fatigue) | (lung) | 0.0017 | 0.1274 | 0.8134 |
R43 | (muscle_pain) | (weak) | 0.0011 | 0.1212 | 10.533 |
R44 | (weak) | (lung) | 0.0014 | 0.1203 | 0.7684 |
R45 | (fatigue, lung) | (heart) | 0.0021 | 0.1198 | 1.2525 |
R46 | (pain) | (lung) | 0.0069 | 0.1165 | 0.7436 |
R47 | (ache) | (fever) | 0.0015 | 0.1141 | 4.6563 |
R48 | (joint_pain) | (pain) | 0.0013 | 0.113 | 1.9195 |
R49 | (lung) | (fatigue) | 0.0173 | 0.1107 | 0.6356 |
R50 | (depress) | (brain_fog) | 0.0081 | 0.1095 | 0.425 |
R51 | (headache) | (lung) | 0.0025 | 0.1087 | 0.6942 |
R52 | (loss_smell, loss_taste) | (lung) | 0.0028 | 0.1041 | 0.6647 |
R53 | (depress) | (fatigue) | 0.0076 | 0.1024 | 0.5877 |
R54 | (loss_smell, loss_taste) | (fatigue) | 0.0028 | 0.1017 | 0.5837 |
R55 | (joint_pain) | (ache) | 0.0012 | 0.1014 | 7.4676 |
R56 | (fatigue, lung) | (brain_fog) | 0.0017 | 0.1008 | 0.3912 |
Association rules visualization. R: rule.
The symptoms associated with LCS are still poorly understood. Analyzing social media conversations of patients related to long COVID allows us to understand the frequency and relationship between symptoms. Based on a large amount of Twitter social media data related to LCS, we performed the following 2 tasks in this paper. First, we identified the symptoms and medical conditions related to long COVID that were discussed on the Twitter social media platform. Second, we determined the patterns of symptoms and their associations.
Brain fog, fatigue, and breathing/lung issues were the 3 most common symptoms identified by the analysis. The literature sources verified these reported symptoms [
We have used a novel data source, Twitter, and multiple NLP and machine learning techniques to explore the symptoms described by a large undifferentiated population. Different NLP methods, such as sentiment analysis, keyword extraction, and lemmatization, were used to extract information and symptoms from the unstructured text data. Subsequently, we used ARM concepts [
Future research can build on this methodology with clinical data sources such as electronic medical records, adding individual covariates such as sex, age, location, and comorbidities. This research can also be further extended to detect and predict the consequences of a given set of symptoms using word popularity detection methods.
This study is based on web-based Twitter data with limited patient-level variables. No information about the demographics of the tweet authors was available. Furthermore, we only considered patients who shared their experiences with the public in English, as filtering English-language tweets from multilingual tweets is computationally intensive. Another limitation of the study is that results can be affected by misinformation or false conversations on the Twitter platform.
There is a caveat for the confidence metric in the ARM technique when a negative correlation exists between the 2 sets, for instance, ∼
The most frequent symptoms in our study included brain fog, fatigue, breathing/lung issues, heart issues, flu symptoms, and depression. General pains, loss of smell and taste, cold, cough, chest pain, fever, headache, and arm pain emerged in 1.6% to 5.3% of patients with long COVID. Furthermore, the highest confidence level–based detection with 10% minimum confidence and 0.1% (0.001) minimum support successfully demonstrates the potential of association analysis and the Apriori algorithm to establish patterns to explore 57 meaningful relationship rules among long COVID symptoms. In this study, to identify the positively correlated symptoms, we only considered the rules with a lift greater than 1. The results revealed that patients with lung/breathing problems and loss of taste are likely to have loss of smell with 77% confidence.
ARM: association rule mining
LCS: long COVID syndrome
NLP: natural language processing
NLTK: Natural Language Toolkit
PMI: pointwise mutual information
This study was funded through a Canadian Institutes of Health Research Operating Grant: Emerging COVID-19 Research Gaps and Priorities Funding Opportunity.
None declared.