This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.
There were an estimated 100,306 drug overdose deaths between April 2020 and April 2021, a three-quarter increase from the prior 12-month period. There is an approximate 6-month reporting lag for provisional counts of drug overdose deaths from the National Vital Statistics System, and the highest level of geospatial resolution is at the state level. By contrast, public social media data are available close to real-time and are often accessible with precise coordinates.
The purpose of this study is to assess whether county-level overdose mortality burden could be estimated using opioid-related Twitter data.
International Classification of Diseases (ICD) codes for poisoning or exposure to overdose at the county level were obtained from CDC WONDER. Demographics were collected from the American Community Survey. The Twitter Application Programming Interface was used to obtain tweets that contained any of the 36 terms with drug names. An unsupervised classification approach was used for clustering tweets. Population-normalized variables and polynomial population-normalized variables were produced. Furthermore,
Modeling overdose mortality with normalized demographic variables alone explained only 7.4% of the variability in county-level overdose mortality, whereas this was approximately doubled by the use of specific demographic and Twitter data covariates based on a backward selection approach. The highest adjusted
Social media data, when transformed using certain statistical approaches, may add utility to the goal of producing closer to real-time county-level estimates of overdose mortality. Prediction of opioid-related outcomes can be advanced to inform prevention and treatment decisions. This interdisciplinary approach can facilitate evidence-based funding decisions for various substance use disorder prevention and treatment programs.
Overdose from substance misuse remains a serious threat to public health in the United States. Concerns relating to overdose-related mortality have risen since the World Health Organization declared COVID-19 a global pandemic on March 11, 2020, given the negative effects of the pandemic on mental health and its potential cooccurrence with substance use disorder (SUD) [
Evidencing the ongoing severity of the national opioid public health crisis, a retrospective, multicenter study of emergency departments in Alabama, Colorado, Connecticut, North Carolina, Massachusetts, and Rhode Island from January 2018 to December 2020 found that while there was a 14% decline in all-cause emergency department visits, there was a 10.5% increase in overdose-related visits and a 28.5% increase in opioid overdose rates [
The National Poison Data System (NPDS) currently collects and monitors self-reported accidental and intentional poison exposures for use by epidemiologists, state and federal agencies, and health practitioners. However, only a small amount (<5%) of NPDS-generated alerts represent incidents of public health significance [
Importantly, the lag time for reporting SUD burden may be decreased by using natural language processing applied to high-sample social media data in infodemiology and infoveillance approaches (ie, the science of distribution and determinants of information in an electronic medium) and using these covariates to build predictive models of extant SUD burden data [
With the overdose burden growing during the COVID-19 pandemic, there is a pressing need to assess the utility of novel public health surveillance approaches that can help identify individual and community-level variations in SUD burden, specifically mortality from an overdose. Here, the objective of this retrospective infodemiology study was to incorporate demographic data with geospatially tagged social media data from Twitter to conduct an experimental modeling exercise for generating predictions of county-level overdose death rates.
To carry out the study objective, this interdisciplinary infodemiology study was conducted in five phases: (1) data collection of tweets associated with SUD-related keywords and slang terms, (2) characterization of tweet themes using unsupervised machine learning, (3) geospatial aggregation, (4) mathematical transformations of spatial patterns, and (5) statistical modeling to assess potential predictive value for overdose mortality. A visual summary of the methodology is provided in Figure S1 in
Publicly available social media posts were retrospectively collected from Twitter in October 2021 using the Twitter Academic Application Programming Interface (API). The Twitter Academic API is a product track that includes access to all API v2 endpoints to help academic researchers use Twitter data. Compared to other APIs made available by Twitter, the Academic API can obtain larger volumes of posts in a retrospective query, though the output remains a subsample of posts that are randomly selected from the larger population of posts with user-defined specifications (eg, keywords and time frames). Based on prior studies that have identified and characterized self-reported SUD behavior by users on Twitter and an unclassified Drug Enforcement Administration intelligence report on drug slang code words, a group of keywords specific to opioid and other controlled substance drug names and slang terms were used for data collection (see
List of keywords used to obtain Twitter data used in model building.
Drug class | Drug name | Slang term |
Opioid products | Morphine, Oxycodone, Vicodin, Oxymorphone, Codeine | Pain killer, Morph, Demmies, Dillies, Oxy, Miss Emma, Vikes, O bomb, Octagons, Captain Cody, Percs, Oxycet, Hibilly heroin, Oxycotton, Oxy 80s, Sizzurp, Purple drank, Blue heaven, Doors and floors, Rushbo, Waston-387 |
Other controlled substances | Xanax, Adderall, Ecstasy or MDMA^a | Xannax, Adderal, Happy pills |
Synthetic opioids | Fentanyl | Goodfella, China white, Fetty, FettyShine, Murder 8, Tango and Cash |
^a MDMA: 3,4-methylenedioxy-methamphetamine.
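As a rough illustration of keyword-based collection, the sketch below filters already-downloaded tweet texts against a small subset of the keywords in the table above. The function name `matches_keyword` and the sample texts are hypothetical. The study issued its query through the Twitter Academic API rather than filtering locally, so this shows only the matching logic, not the authors' pipeline.

```python
import re

# A small, illustrative subset of the keywords listed in the table above.
KEYWORDS = ["fentanyl", "oxycodone", "percs", "purple drank", "china white"]

# Word boundaries prevent matches inside longer words (e.g. "perception").
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matches_keyword(text: str) -> bool:
    """Return True if the text contains any keyword as a whole word or phrase."""
    return PATTERN.search(text) is not None


# Hypothetical example texts, not taken from the study corpus.
sample_texts = [
    "Where can I find Percs tonight",
    "My perception of the game changed",
    "sipping purple drank again",
]
kept = [t for t in sample_texts if matches_keyword(t)]
```

Matching on whole words reduces, but does not remove, false positives such as the sports-related use of "vikes", which is why a later topic-modeling step is still needed.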
The Biterm Topic Model (BTM) was used for unsupervised topic modeling of the corpus of tweets generated from our data collection of specific opioid and drug-related keywords. We used BTM to cluster keyword-containing tweets and to omit irrelevant topics; the backward selection approach used for model building then eliminated clusters of tweets with no statistical association with county-level variation in overdose mortality. Topics output by BTM are based on the cooccurrence of words (ie, “biterms”) within a corpus of tweets and are particularly useful when exploring new topics, identifying new trends, and characterizing user-generated content in the absence of the training data required for supervised machine learning approaches.
BTM is based on the Dirichlet distribution, with equivalent shape parameters and a prespecified
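The notion of a biterm can be made concrete with a minimal sketch, assuming simple whitespace tokenization: every unordered pair of words that cooccur in one short text is a biterm, and BTM models the corpus-wide collection of these pairs. The helper names `extract_biterms` and `biterm_counts` and the two example texts are hypothetical. The sketch covers only biterm extraction, not the topic inference step.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair cooccurring in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def biterm_counts(corpus):
    """Count biterms across a corpus of short texts."""
    counts = Counter()
    for text in corpus:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        counts.update(extract_biterms(tokens))
    return counts


# Hypothetical short texts, not taken from the study corpus.
corpus = ["buy xanax online", "buy drugs online"]
counts = biterm_counts(corpus)
```

Because biterms are pooled over the whole corpus, word cooccurrence statistics stay usable even when each individual tweet contains only a handful of words.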
Description and example keywords obtained from the first round Biterm Topic Model (BTM).
BTM topic number | Corpus percent | Topic description | Example keywords | Example tweet |
Topic 1 | 4.0% | Common Spanish terms; drug related keywords | “drugs”; “Xanax”; “online”; “amidon”; “pour” | “ |
Topic 2 | 11.9% | Common Spanish terms; unclear topic | “que”; “de”; “mas”; “como”; “con” | “ |
Topic 3 | 9.1% | Terms relating to the Vikings National Football League team | “vikes”; “Vikings”; “season”; “captain”; “team” | “
Topic 4 | 8.8% | Drug and drug slang terms | “perc”; “drank”; “hydros”; “Xanax”; “sizzurp” | “ |
Topic 5 | 1.8% | Common Turkish terms | “bhi”; “ko”; “pedido”; “ka”; “hai” | “ |
Topic 6 | 4.0% | Common Indonesian terms | “di”; “sa”; “kalo”; “tapi”; “juga”; “jadi” | “
Topic 7 | 15.2% | Drug and alcohol-related slang terms | “drinking”; “blue”; “butler”; “cash”; “commons” | “ |
Topic 8 | 7.7% | Real estate terms; Hindi terms | “floors”; “jodhpur”; “kitchen”; “use”; “walls” | “ |
Topic 9 | 32.9% | General online drug selling terms | “buy”; “drugs”; “online”; “money”; “think” | “ |
Topic 10 | 4.7% | Crime-related terms | “police”; “murder”; “hospital”; “blood”; “justice”; “donate” | “
Tweets from topics featuring drug-related terms (ie, topics 1, 4, 7, and 9) were used for further geospatial and statistical analyses (although modeling conducted with all 10 topics is presented in Table S1 in
The estimated actual rates of death were missing for 44.6% (1403/3143) of US counties. In order to create an estimate of the death data for these counties, the values of neighboring spatial features were used to impute values for the counties with missing data. Using the space-time pattern mining toolbox in ArcGIS (Esri), the estimated actual rates of death due to drug overdose were imputed for counties with missing data [
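The idea of imputing from neighboring spatial features can be sketched as follows. This is a simplified illustration that assumes the fill value is the mean of adjacent counties with observed rates; the text does not state which fill method was configured in ArcGIS, and the function `impute_from_neighbors`, the county labels, and the adjacency lists are all hypothetical.

```python
from statistics import mean


def impute_from_neighbors(rates, neighbors):
    """Fill missing county rates with the mean of observed neighboring rates.

    rates: dict mapping county id -> rate, or None when missing.
    neighbors: dict mapping county id -> list of adjacent county ids.
    Only originally observed values are used, so imputed values never
    propagate into other imputations.
    """
    imputed = dict(rates)
    for county, value in rates.items():
        if value is None:
            observed = [
                rates[n]
                for n in neighbors.get(county, [])
                if rates.get(n) is not None
            ]
            if observed:
                imputed[county] = mean(observed)
    return imputed


# Hypothetical rates per 100,000 and adjacency, for illustration only.
rates = {"A": 20.0, "B": None, "C": 30.0, "D": None}
neighbors = {"B": ["A", "C"], "D": ["B"]}
filled = impute_from_neighbors(rates, neighbors)
```

In this toy example county B receives the mean of A and C, while county D stays missing because its only neighbor has no observed value.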
While Twitter data included posts from 2012-2021, the estimated actual death rates due to drug overdose (including imputed data) used in the model were from 2017-2019, inclusive. Linear regression was used for predictive modeling to facilitate the interpretability of the analytical output. Specifically, a series of initial models that included normalized demographic variables, normalized BTM topic counts, and normalized
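Backward selection by AIC for a linear regression can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: it uses the Gaussian form of the AIC up to an additive constant, n ln(RSS/n) + 2k, drops one covariate at a time while the AIC decreases, and runs on synthetic data with hypothetical covariate names.

```python
import numpy as np


def ols_aic(X, y):
    """AIC (up to an additive constant) of an OLS fit with an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1]  # number of estimated coefficients
    return n * np.log(rss / n) + 2 * k


def backward_select(X, y, names):
    """Drop covariates one at a time for as long as the AIC improves."""
    keep = list(range(X.shape[1]))
    best = ols_aic(X[:, keep], y)
    while keep:
        trials = [
            (ols_aic(X[:, [j for j in keep if j != drop]], y), drop)
            for drop in keep
        ]
        aic, drop = min(trials)
        if aic >= best:
            break  # no single removal lowers the AIC
        best = aic
        keep.remove(drop)
    return [names[j] for j in keep], best


# Synthetic data: two informative covariates and one pure-noise covariate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
names = ["median_age", "topic9_z", "noise"]  # hypothetical labels
full_aic = ols_aic(X, y)
selected, best_aic = backward_select(X, y, names)
```

The procedure is greedy, so it can miss the globally best subset, but it keeps the final model small and its coefficients directly interpretable.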
The ratio of predicted to actual rates of death due to drug overdose was calculated for each county using the model with the highest R-squared produced by this study (see Table S2 in
As this study did not involve people, medical records, human tissues, or any other personally identifiable information, institutional review board approval was not required.
There were 28,400 geospatially identifiable Twitter posts containing the designated keywords, published between December 2012 and September 2021. This corpus was divided into 10 topics using the BTM algorithm. The overall UMass coherence score was –1505, within a typical range for coherence for topic modeling of short-form texts. The BTM coherence score was –1281 for topic 1, –1318 for topic 2, –1447 for topic 3, –1643 for topic 4, –1599 for topic 5, –1389 for topic 6, –1722 for topic 7, –1618 for topic 8, –1593 for topic 9, and –1505 for topic 10. These scores are consistent with those reported for high-term corpuses in seminal work used to relay the cohesiveness of output from BTM [
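For readers unfamiliar with the metric, the standard UMass coherence of a topic sums, over ordered pairs of its top words, the log ratio of smoothed document cooccurrence to document frequency. The sketch below implements that textbook definition with add-one smoothing. The function name, tokenization, and example documents are assumptions, and the number of top words and other settings used in the study are not stated in the text.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence: sum over word pairs of log((D(wi, wj) + 1) / D(wj)).

    top_words should be ordered from most to least probable in the topic;
    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(d.lower().split()) for d in documents]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom:  # skip words that never occur, to avoid division by zero
                joint = doc_freq(top_words[i], top_words[j])
                score += math.log((joint + 1) / denom)
    return score


# Hypothetical documents, not taken from the study corpus.
documents = ["buy xanax online", "buy drugs online", "vikings season team"]
score = umass_coherence(["buy", "online"], documents)
```

Scores closer to zero indicate top words that cooccur more often, and the magnitude grows with the number of word pairs summed.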
The average actual rate of death due to drug overdose (21.62 per 100,000) was comparable with the average predicted rate of death due to drug overdose (22.00 per 100,000). In addition, the average ratio of predicted overdose to actual overdose was 1.14. Carver County in Minnesota had the highest ratio of predicted to actual rates (4.16). While the highest actual rate of death due to drug overdose was in Cabell County in West Virginia (126.73 per 100,000), the highest predicted rate of death due to drug overdose was in Queen Anne's County in Maryland (37 per 100,000), indicating a regression toward the mean for modeled output. The highest difference of 25.3 per 100,000 between predicted and actual rates was observed in Charlotte County, Florida. The actual and predicted rates of death due to drug overdose, and the ratio between the two for each county, are visualized in
A model with normalized demographic covariates alone explained only 7.4% of the spatial variability in overdose mortality, which was not improved by the use of polynomial terms. However, the model with the highest coefficient of determination did include several demographic terms. These included 2 racial covariates, suggesting that modeling spatial patterns for overdose mortality would optimally take into account race-based disparities at the community level. Specifically, in this final model, Asian race and Hispanic ethnicity were significantly negatively associated with county-level burdens of overdose mortality, indicating that areas with a relatively greater burden of overdose mortality do not include high concentrations of these racial and ethnic groups.
Model prediction of drug overdose death rates using stepwise Akaike information criterion (AIC).
Number | Initial model | Final model variables | Model AIC | Model adjusted R-squared | Model P value |
1 | Normalized demographic variables | Median age; Female population; Asian population; Hispanic population | 8658.0 | 0.074 | <.001 |
2 | Normalized geospatial z scores (topics 1, 4, 7, and 9) | Topic 1 z scores; Topic 4 z scores; Topic 9 z scores | 8706.0 | 0.048 | <.001 |
3 | Normalized BTM topic counts | Topic 7 count; Topic 9 count | 8787.2 | 0.001 | .12 |
4 | Normalized demographic variables; Normalized geospatial z scores (topics 1, 4, 7, and 9) | Median age; Female population; Asian population; Hispanic population; Topic 1 z scores; Topic 9 z scores | 8548.6 | 0.131 | <.001 |
5 | Normalized demographic variables; Normalized BTM topic counts (topics 1, 4, 7, and 9) | Median age; Female population; Asian population; Hispanic population; Topic 1 count; Topic 9 count | 8651.1 | 0.079 | <.001 |
6 | Normalized geospatial z scores; Normalized BTM topic counts (topics 1, 4, 7, and 9) | Topic 1 z scores; Topic 4 z scores; Topic 7 z scores; Topic 9 z scores; Topic 1 count | 8705.1 | 0.049 | <.001 |
7 | Normalized demographic variables; Normalized geospatial z scores; Normalized BTM topic counts (topics 1, 4, 7, and 9) | Median age; Female population; Asian population; Hispanic population; Topic 1 z scores; Topic 4 z scores; Topic 9 z scores; Topic 4 count | 8546.8 | 0.133 | <.001 |
8 | Polynomial normalized demographic variables | Median age; Female population; Male population; Hispanic population; Black population; White population; American Indian population; Other race population | 8668.2 | 0.071 | <.001 |
9 | Polynomial normalized geospatial z scores (topics 1, 4, 7, and 9) | Topic 1 z scores; Topic 4 z scores; Topic 9 z scores | 8739.1 | 0.029 | <.001 |
10 | Polynomial normalized BTM topic counts (topics 1, 4, 7, and 9) | Topic 1 count; Topic 9 count | 8784.9 | 0.003 | .04 |
11 | Polynomial normalized demographic variables; Polynomial normalized geospatial z scores (topics 1, 4, 7, and 9) | Median age; Female population; Male population; Hispanic population; Black population; White population; American Indian population; Native Hawaiian or Pacific Islander population; Other race population; Multiple race population; Topic 1 z scores; Topic 9 z scores | 8596.2 | 0.110 | <.001 |
12 | Polynomial normalized demographic variables; Polynomial normalized BTM topic counts (topics 1, 4, 7, and 9) | Median age; Female population; Male population; Hispanic population; Black population; White population; American Indian population; Other race population; Topic 1 count; Topic 9 count | 8662.4 | 0.075 | <.001 |
13 | Polynomial normalized geospatial z scores; Polynomial normalized BTM topic counts (topics 1, 4, 7, and 9) | Topic 1 z scores; Topic 4 z scores; Topic 9 z scores; Topic 9 count | 8737.2 | 0.031 | <.001 |
14 | Polynomial normalized demographic variables; Polynomial normalized geospatial z scores; Polynomial normalized BTM topic counts (topics 1, 4, 7, and 9) | Median age; Female population; Male population; Hispanic population; Black population; White population; American Indian population; Native Hawaiian or Pacific Islander population; Other race population; Multiple race population; Topic 1 z scores; Topic 9 z scores; Topic 9 count | 8593.7 | 0.112 | <.001 |
Model prediction of drug overdose death rates using normalized demographic variables, normalized z scores from geospatial analyses, and normalized Biterm Topic Model topic counts (topics 1, 4, 7, and 9; model adjusted
Coefficients | Estimate | SE | P value |
Intercept | –18.39 | 9.26 | .047 |
Median age | 0.57 | 0.07 | <.001 |
Female population | 38.96 | 17.73 | .03 |
Asian population | –49.75 | 10.01 | <.001 |
Hispanic population | –8.73 | 2.60 | <.001 |
Topic 1 z scores | –3.63 | 0.48 | <.001 |
Topic 4 z scores | –0.43 | 0.27 | .11 |
Topic 9 z scores | 3.39 | 0.33 | <.001 |
Topic 4 count | 4.19×10^4 | 2.04×10^4 | .04 |
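To show how a fitted linear equation of this kind would be applied to a new cross-section of data, the sketch below plugs the tabulated estimates into a prediction function. Several points are assumptions rather than facts from the text: that the three Topic rows are z-score terms, that the Topic 4 count estimate is 4.19×10^4, that population covariates are expressed as proportions of the county population, and that median age is in years. The county feature values are entirely hypothetical.

```python
# Estimates copied from the coefficient table; see the assumptions above.
COEFFICIENTS = {
    "intercept": -18.39,
    "median_age": 0.57,
    "female_population": 38.96,
    "asian_population": -49.75,
    "hispanic_population": -8.73,
    "topic1_z": -3.63,
    "topic4_z": -0.43,
    "topic9_z": 3.39,
    "topic4_count": 4.19e4,  # assumed exponent; superscript lost in the source
}


def predict_rate(features):
    """Predicted overdose deaths per 100,000 from a linear combination."""
    rate = COEFFICIENTS["intercept"]
    for name, value in features.items():
        rate += COEFFICIENTS[name] * value
    return rate


# Hypothetical county: covariate scales are assumed, not taken from the study.
county = {
    "median_age": 40.0,
    "female_population": 0.50,
    "asian_population": 0.02,
    "hispanic_population": 0.10,
    "topic1_z": 0.0,
    "topic4_z": 0.0,
    "topic9_z": 0.0,
    "topic4_count": 0.0,
}
predicted = predict_rate(county)
```

Because every term is additive, updating only the Twitter-derived covariates for a new time window shifts the prediction without refitting the demographic terms.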
Rates of death due to drug overdose by county, United States. (A) Actual rates of death due to drug overdose (2017-2019; imputed values for counties with missing data); (B) Predicted rates of death due to drug overdose using geocoded Twitter data; (C) Ratio of predicted to actual rates of death due to drug overdose.
This study computed models for rates of overdose mortality by incorporating mathematically transformed spatial distributions based on geotagged social media posts from Twitter with SUD-related keywords. In our final model, the average predicted county-level overdose mortality was similar to the actual county-level rate of overdose mortality, 23.29 per 100,000 residents and 22.00 per 100,000 residents, respectively, and the average ratio of predicted to actual mortality was 1.22 (compared to 1.14 for modeling with the full range of topics; Table S2 in
Regression-based model fitting generates beta coefficients, so that prediction follows predetermined patterns using a priori inputs; it may therefore be preferable to epidemiologists when compared to black-box approaches such as neural networks or ensemble methods. In other words, this approach yields a set of disease burden estimates and, despite threats to accuracy, provides a baseline against which increases and decreases can be recorded, especially from social media covariates that can be updated in frequent temporal cross-sections. Previous studies have used similar techniques to predict opioid-related outcomes, including the use of demographic variables and medications dispensed to predict 2-year overdose risk for individuals on chronic opioid therapy [
Generally, results demonstrate the potential benefit of using social media data as a supplement to demographic data for enabling earlier detection of overdose mortality, potentially as granular as the census tract level, with the added benefit of improved goodness of fit. Specifically, we observed that an approach involving social media data, geospatial statistics, and mathematical transformations produced about double the model coefficient of determination when compared to an approach without these data and methods. Though a backward selection approach using user-generated Twitter data covariates to model real-world public health statistics of overdose mortality has its limitations (discussed below), the incorporation of these data nevertheless appears to add utility for analyses that aim at high-resolution, short-term prediction. The utility of this approach may also be strengthened by integrating statistical approaches to detect aberrations, such as the Early Aberration Reporting System, though public reports suggest that these have thus far been applied only to observed, rather than modeled, data [
During the COVID-19 pandemic, a rapid rise in opioid-related overdose deaths was observed, but reporting on this data lagged behind that of those experiencing SUD [
Findings from this exploratory study are subject to a number of limitations. First, this modeling exercise was conducted with existing overdose mortality data, whereas the purported benefit is to use future rounds of social media data for closer to real-time prediction. For this reason, the practical utility of this study would lie in the linear equation generated by the final model, which does not explain a high proportion of variability. Therefore, the value added by this study is primarily in the approach demonstrated (which can be iterated upon with more comprehensive or explanatory coefficients) rather than the linear equation itself. However, it should be noted that the authors intentionally sacrificed the predictive power of the model for the interpretability of covariate-specific beta coefficients, given the utility of a linear model for inputting future cross-sections of real-time data. Notably, we observed state-level heterogeneity in model performance, indicating that surveillance efforts leveraging a linear modeling approach should consider computing sets of beta coefficients for states separately. Data collection may have been limited by the list of search terms used in this study, which may not be complete due to the continued addition and deletion of slang terms used among those using these drugs, though the list used in this study was based on existing literature and a Drug Enforcement Administration intelligence report. Second, the outcome variable used in this study represents 3 years of overdose deaths (2017-2019), and the Twitter data predictors were collected over 9 years (2012-2021) to enable sufficient sample sizes for modeling algorithms. Though the 2 periods intersect, the purported benefit of closer to real-time modeling is limited in that Twitter data patterns do not immediately precede outcome data.
Optimal temporality was not feasible in this study due to the sample size required to detect geospatial patterns after aggregation to over 3000 county bins, which is why Twitter data predictors were derived from a lengthy period of data collection. Third, estimated actual death rates due to drug overdose were imputed for counties with missing data using space-time pattern mining tools. Imputation can lead to narrower CIs with an underestimation of standard errors and an overestimation of test statistics. Additionally, this study normalized using the total county-level population rather than the number of Twitter users, as the total number of users from a given county is not made available by Twitter, and measures of this variation from Twitter APIs are unreliable due to the limitations of API calls, which produce a corpus that falls short of the sample required to assess this variation. Finally, though the model incorporates demographic predictors, the difficulties in attributing demographic characteristics to near real-time data (ie, social media posts) represent a challenge for discrepant trends across demographic groups and, practically, for updating a priori inputs that enable socioculturally sensitive prediction.
The model described in this study uses a relatively novel approach involving unsupervised topic modeling and geocoded social media posts and shows the initial feasibility of using infodemiology principles to generate near-real-time predictions of overdose mortality from keyword-based Twitter activity. The results from this study, though exploratory, coupled with additional data-driven research, can facilitate evidence-based funding decisions for statewide programs that can positively impact a wide array of SUD prevention approaches, ranging from naloxone availability to prescription drug monitoring programs.
Supplementary material.
List of International Classification of Diseases (ICD) Codes Associated With Overdose.
AIC: Akaike information criterion
API: Application Programming Interface
BTM: Biterm Topic Model
NPDS: National Poison Data System
SUD: substance use disorder
This study was funded by the National Institute on Drug Abuse (award 1R21DA050689-01).
The data sets generated or analyzed during this study are available from the corresponding author upon reasonable request.
RC contributed conceptualization, formal analysis, investigation, methodology, project administration, software expertise, supervision, validation, writing of the original draft, and review and editing of drafts. VP contributed data curation, visualization of results, formal analysis, methodological expertise, software expertise, and writing of the original draft. AC contributed to writing the original draft and reviewed subsequent drafts. T McMann contributed conceptualization, project administration, visualization of results, and writing of the original draft. ZL provided methodological and software expertise. T Mackey contributed conceptualization, funding acquisition, project administration, and supervision, and reviewed and edited manuscript drafts.
T McMann, T Mackey, and ZL are employees of the startup company S-3 Research LLC. S-3 Research is a startup funded and currently supported by the National Institutes of Health – National Institute on Drug Abuse through a Small Business Innovation and Research contract for social media research and technology commercialization. The authors report no other conflict of interest associated with this manuscript and have not been asked by any organization to be named on or to submit this manuscript.