Background: Real-time air pollution monitoring is a valuable tool for public health and environmental surveillance. In recent years, there has been a dramatic increase in air pollution forecasting and monitoring research using artificial neural networks. Most prior work relied on modeling pollutant concentrations collected from ground-based monitors and meteorological data for long-term forecasting of outdoor ozone (O3), oxides of nitrogen, and fine particulate matter (PM2.5). Given that traditional, highly sophisticated air quality monitors are expensive and not universally available, these models cannot adequately serve those not living near pollutant monitoring sites. Furthermore, because prior models were built based on physical measurement data collected from sensors, they may not be suitable for predicting the public health effects of pollution exposure.
Objective: This study aimed to develop and validate models to nowcast the observed pollution levels using web search data, which are publicly available in near real time from major search engines.
Methods: We developed novel machine learning–based models using both traditional supervised classification methods and state-of-the-art deep learning methods to detect elevated air pollution levels at the US city level by using generally available meteorological data and aggregate web-based search volume data derived from Google Trends. We validated the performance of these methods by predicting 3 critical air pollutants (O3, nitrogen dioxide, and PM2.5) across 10 major US metropolitan statistical areas in 2017 and 2018. We also explore different variations of the long short-term memory model and propose a novel search term dictionary learner-long short-term memory model to learn sequential patterns across multiple search terms for prediction.
Results: The top-performing model was a deep neural sequence model long short-term memory, using meteorological and web search data, and reached an accuracy of 0.82 (F1-score 0.51) for O3, 0.74 (F1-score 0.41) for nitrogen dioxide, and 0.85 (F1-score 0.27) for PM2.5, when used for detecting elevated pollution levels. Compared with using only meteorological data, the proposed method achieved superior accuracy by incorporating web search data.
Conclusions: The results show that incorporating web search data with meteorological data improves the nowcasting performance for all 3 pollutants and suggest promising novel applications for tracking global physical phenomena using web search data.
Web-based crowd surveillance has been used to track emergent risks to public health [- ]. Most commonly, these efforts involve the collection of web-based search queries to document acute changes in the incidence or symptom occurrence of primary infectious disease agents, such as influenza [ - ], Ebola [ ], dengue fever [ ], and COVID-19 [ ]. These methods have the potential to provide public health and medical professionals with benefits over traditional health surveillance and environmental epidemiology in their ability to capture both personal exposures and response dynamics at more sensitive spatial and temporal scales [ ].
Despite the promise of these approaches for infectious diseases, only a limited number of studies have examined how crowd surveillance approaches can be used to track environmental exposures and, less frequently, responses to noninfectious environment-mediated disease processes [- ]. The global burden of disease attributable to outdoor and indoor air pollution has been quantified by recent efforts and has increased public awareness of the severity of this public health crisis worldwide [ ]. Therefore, urban air pollution provides a key test case for the evaluation of web-based surveillance approaches for noninfectious environmental risks. The web-based surveillance approach is distinct from traditional approaches for measuring urban air pollution exposure. Therefore, it could possibly serve as a substitute to or complement the existing approaches. Traditional indicators of air pollution exposure, namely, concentrations measured at ambient monitoring sites, are widely used to assess the health effects associated with air pollution in epidemiological studies. However, the use of ambient monitoring measurements as surrogates of exposure may result in the misclassification of health responses and potential risks, especially for those not living near pollutant monitoring sites [ - ]. Moreover, ambient monitoring, by design, provides information on measured outdoor pollutant concentrations and may not necessarily reflect accurate personal exposures for individuals spending most of their time indoors or for those with preexisting biological susceptibility to air pollution. Several recent studies have focused on using smartphones within distributed air pollution sensing networks, where users record and upload local air pollution conditions to crowd-generated, geospatially refined pollution maps [ - ]. These studies demonstrate the feasibility of web-based crowd-generated participation in projects predicted on urban air pollution awareness.
To the best of our knowledge, few studies have investigated the feasibility of using web search data to produce accurate “nowcasts” of urban air pollution levels in real time. Conducting accurate predictions using web search data is a challenging task with 2 major challenges. The first is the selection of search terms to comprehensively capture people’s responses. Several approaches have been proposed to select search terms. For example, some studies preliminarily prepare keywords related to the target disease and then use these keywords to filter the search terms, which is often difficult because finding related keywords could be difficult for some diseases or be costly when conducting for multiple diseases. The second is the selection of the appropriate models. Although the literature on data-driven nowcasting methods for estimating infectious disease activity is well developed from an epidemiological standpoint, the machine learning methods used lag behind the state-of-the-art methods. The nowcasting models introduced to date mainly use variations of regularized linear regressions or, less often, random forests (RFs) or support vector machines. From a machine learning perspective, the problem of disease activity estimation is most suited to a more sophisticated and time series–specific model architecture. Because of the growing volume of recorded environment-mediated disease data, the use of recurrent neural networks (RNNs) and, more specifically, their variants long short-term memory (LSTM) and gated recurrent unit networks is increasingly feasible. The vanilla LSTM model makes predictions solely relying on the time series of the search activity while ignoring the semantic information in the search query phrases. Previous studies have pointed out that search queries could be semantically related, and ignoring their correlation would lead to a decrease in model performance [, ]. Recent advances in natural language processing have led to the development of a technique called word embeddings to represent the semantic information in phrases, and fine-tuning of word embeddings has been encouraged for downstream tasks (Wu, Y, unpublished data, September 2016) [ - ]. However, there is still a lack of knowledge on incorporating both the semantic information of search queries and time series of search activities to make predictions.
In this study, we investigate web search data as an important source of a web-based crowd-based indicator. As web search data are free and broadly accessible, we posit that they could serve as a scalable means of tracking urban air pollution exposures and corresponding population-level health responses. To measure search interest, we used the freely accessible Google Trends service, which reports aggregate search volume data at a city-level geographical resolution. For this analysis, we use known health end point terms and topics, such as “difficulty breathing,” and observations (eg, “haze”) suggested by public health researchers, augmented by automatic term expansion based on semantic and temporal correlations, to estimate the levels of search activities related to air pollution, and ultimately to predict whether the pollution levels were elevated [, ].
Compared with existing air pollution classification models, this study explores the use of web search anomalies as an auxiliary signal to detect air pollution. We compared our approach with the state-of-the-art physical sensor–based models that incorporate various pollutant covariates such as historical pollutant concentrations and meteorological data . Using web search data for prediction introduces several challenges, including an unclear relationship between search interest and pollution levels and the trade-off between model complexity and convergence for the inclusion of web search data in a data-deficient scenario.
In summary, our contributions are as follows:
- We proposed a novel search term dictionary learner-LSTM (DL-LSTM) model to learn sequential patterns from broad historical records of web search data for air pollution nowcasting.
- We compared the DL-LSTM models with a variety of baseline models on the efficacy of using web search data to indicate exposure to a noninfectious environmental stressor (ie, air pollution) and demonstrate that the proposed models are effective across different experimental settings.
- We evaluated the efficacy of combining web search data and meteorological data for air pollution prediction and showed that the inclusion of web search data improves the prediction accuracy and provides a promising substitute when historical pollutant data are unavailable.
We now describe the methodology. First, we formalize our problem setting, then describe the data, and then introduce our modeling approaches.
We formalized this task as a classification problem and adapted state-of-the-art machine learning models. We constructed a multivariate autoregressive model and an RF model fit on historical air pollutant concentrations as well as search and meteorological data as baseline models. We evaluated the performance of our proposed models (described below) in comparison with the baselines in terms of prediction accuracy and other standard classification prediction metrics.
The data available to the public are not individually identifiable and therefore analysis does not involve human subjects. The International Review Board (IRB) recognizes that the analysis of de-identified, publicly available data does not constitute human subjects research and therefore does not require IRB review.
We collected daily air pollutant concentration data as well as temperature and relative humidity in the 10 largest US. metropolitan statistical areas (MSAs) from January 2007 to December 2018. We focused on 3 air pollutants: ozone (O3), nitrogen dioxide (NO2), and fine particulate matter (PM2.5). The in-situ pollutant concentrations and meteorological data such as temperature, relative humidity, and dew point temperature were retrieved from the US Environmental Protection Agency, Air Quality System, and AirNow database. To create a single daily pollutant concentration for each city, we used the median pollutant concentration from all available monitoring sites within each city to avoid outlier bias.
We collected the daily search frequency of pollution-related terms from Google Trends for the same 12-year period and cities. We created a curated list of 152 pollution-related terms based on our previous air pollution epidemiology studies and in reviewing the environmental health literature [, - ], and we downloaded the reports of trending results terms using PyTrends [ ]. For each PyTrends request, we downloaded the search history of pollution-related terms over a 6-month window with 1 overlapping month for calibration. PyTrends provided us with a search frequency scaled on a range of 0 to 100 based on a topic’s proportion to all searches on all topics. Because of the PyTrends restriction, we downloaded the reports of trending results multiple times, and the search frequencies were scaled separately in each 6-month window, which required us to calibrate the search frequency for the 12-year period. We calibrated the search frequencies by joining the search logs on the overlapping periods (1 out of 6 months) for intercalibration [ ].
We investigated the available input features from meteorological data (temperature and relative humidity), historical pollutant concentrations, and web search data ().
|Input feature||Feature transformation|
|Meteorological data (Meta)|
|Pollutant concentration (Pold)|
aMet: meteorological data.
bTemp_max: maximum temperature
cTemp_mean: mean temperature
dPol: pollutant concentration.
eDay t-7,..., t-1: days preceding the prediction day t.
Missing Data Imputation and Normalization
Smoothing and interpolation are simple and efficient data imputation methods , and we applied linear interpolation to fill the missing data in historical pollutant concentration, temperature, and humidity, with a rolling window size of 3. To fill in the missing data in infrequent search terms for which Google Trends does not return a count, we used random numbers close to 0 (e-10~e-5). We normalized all the input features to standard scores by subtracting their mean values and dividing them by the respective SDs.
Search Term Expansion
As web-based search queries may reflect individual exposure to ambient air pollution, the seed terms were mostly related to symptoms, observations, and emission sources (Table S1 in). However, because an exhaustive list of user queries was not available, reliance on only expert-generated seed words may result in poor prediction because of the high mismatch rate between the user queries and our expected search words.
Query expansion is a common approach for resolving this discrepancy. A recent study  showed that the initial set of seed words could be effectively expanded through semantic and temporal correlations. Thus, for each seed word, we used Google Correlate [ ] to retrieve the top 100 correlated query terms. Then, we used the pretrained word2vec model [ ] to retrieve the vector representation of each query; phrases were mapped to the centroid of the constituent terms. A utility score was calculated for each candidate query by measuring the maximum cosine similarity between the query and seed words. Queries with a high utility score were retained, and the remaining queries were eliminated, and we empirically set the utility cutoff to 0.55. This method expanded the set of search terms for the 152 search terms to track (Table S2 in ).
Modeling and Evaluation
Given sequences of physical sensor data P = [pt-L,..., pt-1]T with the dimension of L times dp, and search interest data S = [st-L+2,..., st+1]T with the dimension of L times ds, the task is to classify day t as polluted or not, where a positive class label indicates that the air pollution was above a predefined threshold. L denotes the sequence length, and dp and ds are the number of physical sensor features and the number of search-related terms, respectively.
Autoregressive and RF Classification Models
Previous work has shown that simple autoregressive models using web search data can generate nowcast estimates for influenza-like illnesses at the US national level . We adapted autoregressive models with a logistic regression (LR) classifier for classification purposes. Furthermore, we applied elastic net regularization, which is a linear combination of l1 and l2 regularization, as proposed in previous studies [ , ]. LR+Elastic Net was implemented using the Python scikit-learn package, using cross-validation to set the model’s hyperparameters to maximize the F1-score on the validation set, with class_weight set to “balanced.”
RF is an ensemble learning model that is robust against overfitting and provides a strong baseline for the development of nonlinear predictive models . We used the scikit-learn implementation of RFs. The number of trees and maximum depth of individual trees were selected to maximize the F1-score on the validation set, with balanced class_weight for positive and negative samples.
LSTM and Its Variants
LSTM units  are RNN models designed for sequence modeling, which can learn nonlinear relationships in time series data [ ]. First, we describe a baseline LSTM model with 2 subnetworks to separate the search data and meteorological data. As shown in , there are 4 layers in the model, that is, the sequence embedding layer, LSTM layer, fully connected hidden layer, and output layer [ ].
In the left subnetwork of the LSTM model with search data as input, we propose 2 methods for capturing semantic information in search terms. The first is the LSTM semantic model (GloVe [Global Vectors for Word Representation]; LSTM-GloVe). As a variant of the vanilla LSTM model, for the sequence embedding layer of the right subnetwork in, we introduce the matrix multiplication operation to project the search values of search terms to their semantic embedding space (GloVe embeddings), as shown in equation 1.
Given the search interest data S = [s1,..., s7]T with the dimension of 7 times ds, and their GloVe embedding G = [g1,..., gdg] with the dimension of ds times dg, where dg = 50 (GloVe 50-dimensional word vectors trained on tweets ). The matrix multiplication operation is defined as
Specifically, the tensor generated by the matrix multiplication operation was then fed into the LSTM layer for further calculations. This matrix multiplication is designed specifically for the model consistency problem when introducing collinear predictors after search term expansion (STE).
The second variation of the LSTM model is the DL-LSTM model, which is theoretically based on the idea of matrix multiplication, as shown in LSTM-GloVe. However, instead of directly applying the GloVe embedding for matrix multiplication, it introduces the fine-tuning of the word embeddings via a dg by de rectified linear unit–activated fully connected layer. As shown in, the rectified linear unit–activated fully connected layer was applied to the initial GloVe embedding, where de=100 is the size of the new embedding. In this architecture, the GloVe 50-dimensional word vectors are used to initialize the search term embedding dictionary, and the matrix multiplication operation is used to transform the input embedding of search terms into the semantic embedding space [ ].
In summary, we evaluate the following models in this paper:
- LR: LR is LR classifier with elastic net regularization.
- RF: RF is RF classifier with the number of trees and maximum depth tuned for prediction.
- LSTM: The baseline LSTM model, as shown in , combines physical sensor features, if available, with the search interest volume data directly, providing a direct adaptation of RNNs to this problem without any problem-specific extensions.
- LSTM-GloVe: LSTM semantic model is a variant of the LSTM model as described in equation 1, where we control the input of search interest data (ie, 51 seed search terms vs 152 terms after STE) in this model. We refer to the variants as LSTM-GloVe and LSTM-GloVe with [w/] STE, respectively.
- DL-LSTM: The DL-LSTM model is shown in . We control the input of the search interest data (ie, 51 seed search terms vs 152 terms after STE) in this model and refer to the variants as DL-LSTM and DL-LSTM w/STE, respectively.
To tune the model parameters and validate the model performance, we split the available data into training (from January 2007 to December 2014), validation (from January 2015 to December 2016), and testing (from January 2017 to December 2018) sets. This 8-year training period provides a broad history for learning the relationship between input and output variables, and the predictive models are evaluated based on their ability to make predictions for completely unseen periods. For evaluating our model, we made predictions for each day from January 2017 to December 2018 in the test data set. The distribution of the classes in the training, validation, and test data sets is presented in. Note that the positive and negative classes are heavily imbalanced, with positive classes comprising, for instance, only 16% of the training samples when PM2.5 is the target pollutant.
|Pollutant||Negative samples||Positive samples|
bNO2: nitrogen dioxide.
cPM2.5: fine particulate matter.
As we defined this task as a classification problem, we used the standard classification evaluation metrics. We report the accuracy and F1-score of the positive class (the harmonic mean of precision and recall) of the predictions as evaluation metrics for all models. Although accuracy measures the total fraction of correct predictions and could misrepresent model performance in the presence of heavily imbalanced classes, the F1-score considers class imbalance and is, therefore, a more appropriate metric for our problem.
Where TP, TN, FP, and FN are the number of true positive samples, true negative samples, false positive samples, and false negative samples, respectively.
In this section, we first present the findings of the data exploration. Next, we present the principal findings of this study.
Insights From Collected Data
In this section, we describe the thresholds of abnormal air pollutant concentrations and present the lag between the search anomalies and air pollution.
Thresholds of Abnormal Air Pollutant Concentrations
The major MSAs chosen for this study have different distributions of pollutant concentrations over time and almost always fall below the Environmental Protection Agency standard 24-hour threshold (). However, multiple studies have shown that even at low concentrations, chronic exposure to air pollution negatively affects human health [ , ]. Therefore, calibrating a meaningful threshold for each city, especially those with generally lower levels of air pollution (eg, Miami), may be critical for adequately protecting population health. A natural way to do this may be to set the threshold to 1 SD above the mean daily pollutant concentration within each city, which was adopted in this study. The input predictors were also normalized within each city to reflect the city-level dynamics. The resulting thresholds for the 3 pollutants and cities under investigation are reported in .
|Pollutant||Los Angeles||District of Columbia||Philadelphia||Dallas||Atlanta||Boston||New York||Miami||Chicago||Houston|
bppb: parts per billion.
cNO2: nitrogen dioxide.
dPM2.5: fine particulate matter.
Lag Between Search Anomalies and Air Pollution
A previous study showed that there could be a lag between incident occurrence and Google search activity . As shown in , the normalized search frequency of the term “cough” is correlated with the concentration of NO2 in Atlanta with a certain lag of time. To determine the lag between elevated pollution levels and consequent pollution-related searches, the mean absolute Spearman correlation between pollutant concentrations and search interest data was calculated and shifted forward in time for 0, 1, 2, and 3 days. As shown in , for O3 and PM2.5, the mean absolute Spearman correlation increased with an increase in the shifted days. Considering that the task aimed to detect elevated pollution levels as soon as possible, a lag of 1 day was applied to search data. In other words, the search interest data from the current day were used to estimate whether air pollution was elevated on the previous day.
|Pollutant||Lag=0; search term (Spearman correlation)||P value||Lag=1; search term (Spearman correlation)||P value||Lag=2; search term (Spearman correlation)||P value||Lag=3; search term (Spearman correlation)||P value|
|Cough (−0.34)||<.001||Cough (−0.38)||<.001||Cough (−0.41)||<.001||Cough (−0.41)||<.001|
|Bronchitis (−0.31)||<.001||Bronchitis (−0.32)||<.001||Bronchitis (−0.33)||<.001||Bronchitis (−0.35)||<.001|
|Traffic (0.26)||<.001||Traffic (0.27)||<.001||Traffic (0.26)||<.001||Smoke (0.24)||<.001|
|Smoke (0.23)||<.001||Chest pain (−0.23)||<.001||Chest pain (−0.23)||<.001||Traffic (0.23)||<.001|
|Snoring (0.22)||<.001||Snoring (0.22)||<.001||Smoke (0.22)||<.001||Chest pain (−0.22)||<.001|
|Asthma (0.20)||<.001||Sulfate (0.20)||<.001||Sulfate (0.16)||.002||Cough (0.16)||.002|
|Sulfate (0.19)||<.001||Bronchitis (0.16)||.002||Bronchitis (0.15)||.005||COPDc (−0.16)||.003|
|Cough (0.17)||<.001||Inhaler (0.15)||.005||Cough (0.14)||.008||Bronchitis (0.14)||.008|
|Bronchitis (0.17)||.001||Cough (0.14)||.006||Inhaler (0.11)||.03||Wheezing (−0.12)||.02|
|Inhaler (0.16)||.002||Difficulty breathing (−0.12)||.02||Headache (−0.11)||.03||Headache (−0.10)||.04|
|Wildfires (0.14)||.009||COPD (−0.15)||.005||Air pollution (0.19)||<.001||Air pollution (0.18)||<.001|
|COPD (−0.11)||.03||Wildfires (0.14)||.007||COPD (−0.17)||.001||COPD (−0.18)||<.001|
|Snoring (0.11)||.03||Air pollution (0.14)||.008||Wildfires (0.14)||.009||Wildfires (0.15)||.004|
|Inhaler (0.10)||.06||Asthma attack (0.11)||.04||Respiratory illness (0.10)||.05||Sulfate (−0.11)||.03|
|Difficulty breathing (−0.09)||.08||Respiratory illness (0.10)||.05||Traffic (0.10)||.06||Traffic (0.11)||.04|
bNO2: nitrogen dioxide.
cCOPD: chronic obstructive pulmonary disease.
dPM2.5: fine particulate matter.
In this section, we consider 3 conditions to evaluate the performance of using web search data to detect elevated pollution, that is, using only search data, using search data as auxiliary data for meteorological data, and using search data as auxiliary data for meteorological data and historical pollutant concentrations.
Using Only Search Data
For areas where ambient pollution monitoring is unavailable, investigating whether web search data can be used as the only signal for nowcasting elevated air pollution is a vital question. When relying only on search data for air pollution prediction, both the proposed DL-LSTM architecture and STE contribute to the improvement of prediction accuracy. As shown in the “Search” section of, the LSTM-based models exhibited superior accuracy over the baseline LR and RF models for O3 and NO2. For PM2.5, the proposed models did not perform better than the baseline LR or LSTM model because the validation and test data sets were heavily imbalanced ( ). The proposed DL-LSTM w/STE model achieved the highest F1-score (32.44% for O3 and 27.70% for NO2) for detecting O3 and NO2 pollution.
|Features and model||O3a, accuracy (F1-score; %)||NO2b, accuracy (F1-score; %)||PM2.5c, accuracy (F1-score; %)|
|No prior knowledge|
|All positives||13.46 (23.73)||13.18 (23.28)||7.35 (13.69)|
|All negatives||86.54 (0.0)||86.82 (0.0)||92.65 (0.0)|
|Random (prob of positive=0.5)||50.29 (20.63)||50.56 (20.68)||50.65 (12.67)|
|LRd||36.93 (17.77)||53.97 (24.17)||78.29 (10.72)|
|RFe||33.53 (23.36)||55.22 (18.1)||92.65f(0.0)|
|LSTMg||46.73 (23.63)||69.68 (21.62)||89.96 (7.58)|
|LSTM-GloVeh||53.23 (28.45)||63.44 (27.4)||90.09 (3.73)|
|LSTM-GloVe w/STEi||69.17 (28.04)||46.85 (26.51)||91.73 (1.31)|
|DL-LSTMj||62.46 (30.4)||65.99 (26.19)||88.61 (7.97)|
|DL-LSTM w/STE||69.61 (32.44)||56.84 (27.7)||87.59 (6.99)|
|LR||62.57 (39.81)||63.64 (37.25)||58.58 (22)|
|RF||78.76 (50.59)||71.77 (39.88)||73.78 (24.67)|
|LSTM||76.54 (48.29)||72.52 (41.27)||67.89 (24.69)|
|LR||55.99 (36.56)||62 (36.25)||61.25 (21.5)|
|RF||81.39 (45.35)||73.77 (38.71)||87.96 (23.78)|
|LSTM||78.18 (47.65)||77.75 (40.31)||88.14 (21.29)|
|LSTM-GloVe||80.04 (49.37)||72.75 (40.35)||85.38 (26.99)|
|LSTM-GloVe w/STE||81.85 (50.71)||74.21 (41.49)||85.42 (26.13)|
|DL-LSTM||77.97 (48.94)||74.81 (40.53)||84.94 (24.07)|
|DL-LSTM w/STE||80.16 (49.32)||72.99 (40.34)||87.04 (21.32)|
|LR||67.38 (44.61)||70.05 (44.09)||74.45 (32.82)|
|RF||82.81 (57.23)||80.35 (51.24)||86.45 (40.63)|
|LSTM||86.97 (63.01)||84.64 (55.59)||85.25 (43.19)|
|LR||66.91 (43.71)||69.13 (43.6)||74.45 (32.82)|
|RF||82.76 (55.91)||78.91 (47.72)||89.43 (37.57)|
|LSTM||87.11 (61.54)||84.71 (54.02)||90.74 (44.81)|
|LSTM-GloVe||87.94 (63.81)||82.98 (53.78)||88.19 (46.55)|
|LSTM-GloVe w/STE||87.63 (63.83)||83.81 (54.59)||88.24 (46.51)|
|DL-LSTM||87.30 (63.02)||82.65 (53.65)||89.66 (47.35)|
|DL-LSTM w/STE||87.60 (63.61)||83.40 (53.58)||89.25 (46.59)|
bNO2: nitrogen dioxide.
cPM2.5: fine particulate matter.
dLR: logistic regression.
eRF: random forest.
fThis high accuracy is simply due to class imbalance; this model always predicts negative class, and the corresponding F1-score is 0.
gLSTM: long short-term memory.
hGloVe: Global Vectors for Word Representation.
iSTE: search term expansion.
jDL-LSTM: dictionary learner-long short-term memory.
Using Search Data and Meteorological Data
When meteorological data were available, we investigated the feasibility of using meteorological data with or without search activity data to nowcast air pollution under this condition. As shown in the “Met” and “Met+Search” sections of, the inclusion of web search data improves the nowcasting accuracy for all 3 pollutants. In addition, the LSTM-GloVe w/STE model achieved the highest F1-score (50.71% for O3 and 41.49% for NO2) for the detection of O3 and NO2 pollution. The LSTM-GloVe without STE model achieved the highest F1-score (26.99%) for detecting PM2.5 pollution.
Using Search Data, Meteorological Data, and Historical Pollutant Concentration
When historical pollution concentration is available, search activity data are added as auxiliary data to both meteorological data and historical pollution data. As shown in the “Met+Pol” and “Met+Pol+Search” sections of, the inclusion of web search data improves the nowcasting accuracy for O3 and PM2.5. However, for NO2, the inclusion of web search data does not improve the nowcasting accuracy, which indicates that increases in NO2 concentrations may not be directly noticeable by people sufficiently to increase their search interest. This difference in the performance for different pollutants and locations merits further investigation.
City-Level Analysis of O3 Pollution Prediction
We investigated the potential of using search interest and meteorological data to replace ground-based O3 sensor data for predicting O3 pollution in individual cities. As shown in, including search interest data (Met+Search) to augment purely meteorological data (Met) increases both the accuracy and F1-score metrics for most cities. Although these metrics do not reach performance when ground-level pollution sensors are available (Met+Pol), at least for two of the major MSAs (Philadelphia and Houston), search volume data indeed provides a useful alternative to pollution monitors, with only 1.6% and 0.14% degradation in accuracy, respectively. In addition, the differences in model performance across different cities indicate that web-based search patterns could vary from city to city. As shown in , the top 5 correlated terms differ across US cities over 10 years. The variation in search patterns could lead to degraded prediction performance in certain areas, leaving promising directions for improvement.
|Features||Los Angeles||District of Columbia||Philadelphia||Dallas||Atlanta||Boston||New York||Miami||Chicago||Houston|
|F1- score, %|
aMet: meteorological data.
bPol: pollution data.
|City and search term||Spearman correlation (lag=1)|
|District of Columbia|
|Shortness of breath||0.04|
Sensitivity Analysis of Air Pollution Thresholds
Classification thresholds play an important role in our model. In this study, an SD threshold from the mean of the corresponding pollutants was used as a “probability threshold” to detect air pollution at a spatial-temporal resolution. However, the proposed method is sensitive to this threshold. We further investigated the performance of the proposed method using a variety of fixed classification thresholds. As shown in- , we fixed the classification thresholds for all 10 cities to detect O3, NO2, and PM2.5 pollutions. The results show that the meteorological and search data are complementary, and combining the search and meteorological data leads to better prediction performance for all classification thresholds.
In this study, we explored various existing air pollution prediction models and found that the use of a time series neural network approach achieved the highest predictive accuracy in most of our experiments. The results showed that the LSTM-based models achieved superior accuracy for the 3 air pollutants when both meteorological data and web search data were available. Furthermore, our results on the inclusion of web search data with meteorological data indicate that under short reporting delays, the LSTM models could provide highly accurate predictions compared with baseline models using meteorological and historical pollution concentration data.
Compared with existing studies that predict urban air pollution concentrations using linear and nonlinear machine learning models [, - ], our proposed method can predict air pollution when source emissions and remotely sensed satellite data are infeasible (eg, sensed satellite data often suffer from a high missing rate owing to frequent cloud cover [ ]). Previous studies using web-based search behavior have emphasized the use of Google Trends [ , ] and applied regularized linear regression to collinear web search queries to estimate disease rates from social media or web-based search data [ , , - ]. Our research further explored the possibility of using LSTM models with semantic embeddings of search queries to predict air pollution. As shown in and , the semantic embeddings of search terms fine-tuned by the DL-LSTM model are less correlated compared with their initial GloVe embeddings, which shows that the collinearity between search terms is reduced during the training process.
We also explored various combinations of search terms and found that a comprehensive set of user queries was critical for accurately capturing people’s responses to urban air pollution. In this study, we expanded the initial set of seed terms using semantic and temporal correlations with search queries from Google Correlate. We investigated the contribution of different search term groups by manually classifying the search terms into 4 categories, where the unclassified category includes terms with ambiguous meanings.shows the accuracy and F1-score when we removed search terms by categories for predicting O3, NO2, and PM2.5 pollution. Removing the search terms in the symptom, observation, and source categories led to a decrease in the accuracy score for detecting at least two pollutants. At the same time, removing the search terms with ambiguous meaning only led to a slightly higher accuracy score for all 3 pollutants.
|Pollutant and terms||Accuracy (change; %)||F1-score (change; %)|
|All wob symptom||0.647 (−7.1)||0.3024 (−6.8)|
|All wo observation||0.622 (−10.6)||0.3264 (+0.6)|
|All wo source||0.6712 (−3.6)||0.3033 (−6.5)|
|All wo unclassified||0.7057 (+1.4)||0.3273 (+0.9)|
|All wo symptom||0.4452 (−22.0)||0.2418 (−12.7)|
|All wo observation||0.6125 (+7.8)||0.2480 (−10.5)|
|All wo source||0.5452 (−4.1)||0.2647 (−4.4)|
|All wo unclassified||0.6534 (+15.0)||0.2134 (−23.0)|
|All wo symptom||0.7897 (−9.8)||0.1029 (+47.2)|
|All wo observation||0.7496 (−14.4)||0.1049 (+50.1)|
|All wo source||0.8994 (+2.7)||0.0393 (−43.8)|
|All wo unclassified||0.8991 (+2.6)||0.0264 (−62.2)|
cNO2: nitrogen dioxide.
dPM2.5: fine particulate matter.
By analyzing the coefficients of each search term, the results show that several search terms contribute more than other search terms. The average feature importance of the seed search terms was calculated using the RF model. As shown in Figure S1, Figure S2, and Figure S3 in, search terms including “particular matter,” “rapid breathing,” and “throat irritation” have relatively high feature importance for detecting O3, NO2, and PM2.5 pollution, respectively. The results also indicated that no search terms worked best for all 3 pollutants.
A key limitation of this study is the tuning of the neural network model. First, the performance of neural network models is sensitive to several hyperparameters, including optimization choices, depth, width, and regularization. Owing to computational limitations, we adopted a simple LSTM architecture with a single 128-unit hidden layer and tuned the model using validation data sets for other hyperparameters. In addition, we noticed that stochastic components such as the random seed for the RF model and the randomness in the optimization process of LSTM models influenced the interpretation of the results. Therefore, we repeated the experiments 10 times with different random seeds for the RF and LSTM models. As the time cost of repeating LSTM models is high, we only repeated the RF, LSTM, and DL-LSTM models 10 times to predict O3 pollution with all input features. The accuracy of the DL-LSTM model is mean 0.8744 (SD 0.0046). Compared with the LSTM model (mean 0.8714, SD 0.0036), the improvement was not significant (P=.11). Compared with the RF model (mean 0.8273, SD 0.0017), the improvement was significant (P<.001). The F1-score for the DL-LSTM model is mean 0.6314 (SD 0.0058). Compared with both the LSTM (mean 0.6019, SD 0.0096) and RF models (mean 0.5588, SD 0.0024), the improvements are significant (P<.001), which shows that the results of the LSTM models are stable. There is room for further exploration of more sophisticated neural network model architectures for noninfectious disease prediction [- ]. We leave the exploration of deeper and wider architectures to future work.
Another limitation relates to the biases introduced by relying on search data, which may not reflect the underlying population demographics or experiences. Although some of these issues are alleviated automatically by training a model against ground sensor pollution levels, understanding and correcting these data biases requires further study. In the future, we plan to investigate other sources of crowd-based surveillance data, such as self-reports on social media, to augment traditional physical sensor methods, thus providing a more direct, human-centered measure of how people experience elevated air pollution levels.
In this study, we posit that although web search data cannot yet completely replace ground-based pollution monitors, it may already serve as a valuable additional signal to augment ground-based pollution data, providing significant accuracy improvements for detecting unusual spikes in air pollution. We also found that the correlation between search terms and pollution concentration varies at the city level. Therefore, the model must be fine-tuned when applied to specific cities. For model and search term selection, we used the simplest LSTM architecture with a dictionary learner module and found that no search terms worked best for all the 3 pollutants. We propose the use of our model to learn the semantic correlations between available search terms to obtain better prediction results.
This work was supported by grants from the National Institutes of Health National Library of Medicine (R21LM013014). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Conflicts of Interest
The descriptions of search terms, data source, and model hyperparameters.PDF File (Adobe PDF File), 107 KB
Average feature importance for detecting ozone, nitrogen dioxide, and fine particulate matter pollution using random forest models.PDF File (Adobe PDF File), 385 KB
- Brynjolfsson E, Geva T, Reichman S. Crowd-squared: amplifying the predictive power of search trend data. MIS Q 2016 Apr 4;40(4):941-961. [CrossRef]
- Fung IC, Tse ZT, Fu KW. The use of social media in public health surveillance. Western Pac Surveill Response J 2015 Jun 26;6(2):3-6 [FREE Full text] [CrossRef] [Medline]
- Hill S, Merchant R, Ungar L. Lessons learned about public health from online crowd surveillance. Big Data 2013 Oct 10;1(3):160-167 [FREE Full text] [CrossRef] [Medline]
- Broniatowski DA, Paul MJ, Dredze M. National and local influenza surveillance through Twitter: an analysis of the 2012-2013 influenza epidemic. PLoS One 2013 Dec 9;8(12):e83672 [FREE Full text] [CrossRef] [Medline]
- Santillana M, Nguyen AT, Dredze M, Paul MJ, Nsoesie EO, Brownstein JS. Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS Comput Biol 2015 Oct;11(10):e1004513 [FREE Full text] [CrossRef] [Medline]
- Kandula S, Hsu D, Shaman J. Subregional nowcasts of seasonal influenza using search trends. J Med Internet Res 2017 Nov 06;19(11):e370 [FREE Full text] [CrossRef] [Medline]
- Ning S, Yang S, Kou SC. Accurate regional influenza epidemics tracking using Internet search data. Sci Rep 2019 Mar 27;9(1):5238 [FREE Full text] [CrossRef] [Medline]
- Fung IC, Tse ZT, Cheung CN, Miu AS, Fu KW. Ebola and the social media. Lancet 2014 Dec 20;384(9961):2207. [CrossRef] [Medline]
- Chan EH, Sahai V, Conrad C, Brownstein JS. Using web search query data to monitor dengue epidemics: a new model for neglected tropical disease surveillance. PLoS Negl Trop Dis 2011 May;5(5):e1206 [FREE Full text] [CrossRef] [Medline]
- Ayyoubzadeh SM, Ayyoubzadeh SM, Zahedi H, Ahmadi M, Niakan Kalhori SR. Predicting COVID-19 incidence through analysis of Google trends data in Iran: data mining and deep learning pilot study. JMIR Public Health Surveill 2020 Apr 14;6(2):e18828 [FREE Full text] [CrossRef] [Medline]
- de Nazelle A, Seto E, Donaire-Gonzalez D, Mendez M, Matamala J, Nieuwenhuijsen MJ, et al. Improving estimates of air pollution exposure through ubiquitous sensing technologies. Environ Pollut 2013 May;176:92-99 [FREE Full text] [CrossRef] [Medline]
- Devarakonda S, Sevusu P, Liu H, Iftode L, Nath B. Real-time air quality monitoring through mobile sensing in metropolitan areas. In: Proceedings of the 2nd ACM SIGKDD International Workshop on Urban Computing. 2013 Aug Presented at: UrbComp '13; August 11, 2013; Chicago, IL, USA p. 1-8. [CrossRef]
- Snik F, Rietjens JH, Apituley A, Volten H, Mijling B, Di Noia A, et al. Mapping atmospheric aerosols with a citizen science network of smartphone spectropolarimeters. Geophys Res Lett 2014 Oct 27;41(20):7351-7358. [CrossRef]
- Cohen AJ, Brauer M, Burnett R, Anderson HR, Frostad J, Estep K, et al. Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: an analysis of data from the Global Burden of Diseases Study 2015. Lancet 2017 May 13;389(10082):1907-1918 [FREE Full text] [CrossRef] [Medline]
- Zeger SL, Thomas D, Dominici F, Samet JM, Schwartz J, Dockery D, et al. Exposure measurement error in time-series studies of air pollution: concepts and consequences. Environ Health Perspect 2000 May;108(5):419-426 [FREE Full text] [CrossRef] [Medline]
- Sarnat SE, Sarnat JA, Mulholland J, Isakov V, Özkaynak H, Chang HH, et al. Application of alternative spatiotemporal metrics of ambient air pollution exposure in a time-series epidemiological study in Atlanta. J Expo Sci Environ Epidemiol 2013;23(6):593-605. [CrossRef] [Medline]
- Liang D, Golan R, Moutinho JL, Chang HH, Greenwald R, Sarnat SE, et al. Errors associated with the use of roadside monitoring in the estimation of acute traffic pollutant-related health effects. Environ Res 2018 Aug;165:210-219. [CrossRef] [Medline]
- Zou B, Lampos V, Cox I. Transfer learning for unsupervised influenza-like illness models from online search data. In: Proceedings of the 2019 World Wide Web Conference. 2019 Presented at: WWW '19; May 13-17, 2019; San Francisco, CA, USA p. 2505-2516. [CrossRef]
- Lampos V, Miller AC, Crossan S, Stefansen C. Advances in nowcasting influenza-like illness rates using search query logs. Sci Rep 2015 Aug 03;5:12760 [FREE Full text] [CrossRef] [Medline]
- Graves A, Jaitly N. Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on International Conference on Machine Learning. 2014 Presented at: ICML '14; June 21-26, 2014; Beijing, China p. II-1764-II-1772.
- Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013 Presented at: NeurIPS '13; December 5-10, 2013; Lake Tahoe, NV, USA p. 3111-3119. [CrossRef]
- Pennington J, Socher R, Manning C. Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014 Presented at: EMNLP '14; October 25–29, 2014; Doha, Qatar p. 1532-1543. [CrossRef]
- Pilotto LS, Douglas RM, Attewell RG, Wilson SR. Respiratory effects associated with indoor nitrogen dioxide exposure in children. Int J Epidemiol 1997 Aug;26(4):788-796. [CrossRef] [Medline]
- Chauhan AJ, Inskip HM, Linaker CH, Smith S, Schreiber J, Johnston SL, et al. Personal exposure to nitrogen dioxide (NO2) and the severity of virus-induced asthma in children. Lancet 2003 Jul 07;361(9373):1939-1944 [FREE Full text] [CrossRef] [Medline]
- Rybarczyk Y, Zalakeviciute R. Machine learning approaches for outdoor air quality modelling: a systematic review. Appl Sci 2018 Dec 11;8(12):2570. [CrossRef]
- Di Q, Wang Y, Zanobetti A, Wang Y, Koutrakis P, Choirat C, et al. Air pollution and mortality in the Medicare population. N Engl J Med 2017 Jun 29;376(26):2513-2522 [FREE Full text] [CrossRef] [Medline]
- Vedal S, Brauer M, White R, Petkau J. Air pollution and daily mortality in a city with low levels of pollution. Environ Health Perspect 2003 Jan;111(1):45-52 [FREE Full text] [CrossRef] [Medline]
- Sarnat JA, Sarnat SE, Flanders WD, Chang HH, Mulholland J, Baxter L, et al. Spatiotemporally resolved air exchange rate as a modifier of acute air pollution-related morbidity in Atlanta. J Expo Sci Environ Epidemiol 2013;23(6):606-615. [CrossRef] [Medline]
- Kelly FJ, Fussell JC. Air pollution and public health: emerging hazards and improved understanding of risk. Environ Geochem Health 2015 Aug;37(4):631-649 [FREE Full text] [CrossRef] [Medline]
- Sarnat JA, Russell A, Liang D, Moutinho JL, Golan R, Weber RJ, et al. Developing multipollutant exposure indicators of traffic pollution: the dorm room inhalation to vehicle emissions (DRIVE) study. Res Rep Health Eff Inst 2018 Apr(196):3-75. [Medline]
- Google Trends. Google. URL: https://support.google.com/trends/answer/4365533?hl=en [accessed 2019-08-31]
- Challet D, Bel Hadj Ayed A. Do Google Trend Data Contain More Predictability than Price Returns? SSRN J 2014 Mar 7. [CrossRef]
- Kreindler DM, Lumsden CJ. The effects of the irregular sample and missing data in time series analysis. In: Guastello SJ, Gregson RA, editors. Nonlinear Dynamical Systems Analysis for the Behavioral Sciences Using Real Data. Boca Raton, FL, USA: CRC Press; 2006.
- Google correlate. Google. URL: https://searchengineland.com/google-correlate-more-search-data-to-mine-78560 [accessed 2019-08-31]
- Kane MJ, Price N, Scotch M, Rabinowitz P. Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks. BMC Bioinformatics 2014 Aug 13;15(1):276 [FREE Full text] [CrossRef] [Medline]
- Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997 Dec 15;9(8):1735-1780. [CrossRef] [Medline]
- Elman JL. Distributed representations, simple recurrent networks, and grammatical structure. Mach Learn 1991 Sep;7(2-3):195-225. [CrossRef]
- He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the 2015 IEEE International Conference on Computer Vision. 2015 Presented at: ICCV '15; December 7-13, 2015; Santiago, Chile p. 1026-1034. [CrossRef]
- CGAP Project-Nowcasting Air Pollution. URL: https://github.com/emory-irlab/airpollutionnowcast [accessed 2021-12-10]
- Carrière-Swallow Y, Labbé F. Nowcasting with Google trends in an emerging market. J Forecast 2013 Jul;32(4):289-298. [CrossRef]
- Chen S, Kan G, Li J, Liang K, Hong Y. Investigating China’s urban air quality using big data, information theory, and machine learning. Pol J Environ Stud 2018;27(2):565-578. [CrossRef]
- Zhao X, Zhang R, Wu JL, Chang PC. A deep recurrent neural network for air quality classification. J Inf Hiding Multimed Signal Process 2018 Mar;9(2):346-354.
- Lin Y, Mago N, Gao Y, Li Y, Chiang YY, Shahabi C, et al. Exploiting spatiotemporal patterns for accurate air quality forecasting using deep learning. In: Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. 2018 Presented at: SIGSPATIAL '18; November 6-9, 2018; Seattle, WA, USA p. 359-368. [CrossRef]
- Zhu W, Wang J, Zhang W, Sun D. Short-term effects of air pollution on lower respiratory diseases and forecasting by the group method of data handling. Atmos Environ 2012 May;51:29-38. [CrossRef]
- Zhang Y, Bocquet M, Mallet V, Seigneur C, Baklanov A. Real-time air quality forecasting, part I: history, techniques, and current status. Atmos Environ 2012 Dec;60:632-655. [CrossRef]
- Brokamp C, Jandarov R, Rao MB, LeMasters G, Ryan P. Exposure assessment models for elemental components of particulate matter in an urban environment: a comparison of regression and random forest approaches. Atmos Environ (1994) 2017 Mar;151:1-11 [FREE Full text] [CrossRef] [Medline]
- Cabaneros SM, Calautit JK, Hughes BR. A review of artificial neural network models for ambient air pollution prediction. Environ Model Soft 2019 Sep;119:285-304. [CrossRef]
- Misra P, Takeuchi W. Assessing population sensitivity to urban air pollution using google trends and remote sensing datasets. Int Arch Photogramm Remote Sens Spatial Inf Sci 2020 Feb 14;XLII-3/W11:93-100. [CrossRef]
- Jun SP, Yoo HS, Choi S. Ten years of research change using Google Trends: from the perspective of big data utilizations and applications. Technol Forecast Soc Change 2018 May;130:69-87. [CrossRef]
- Yang S, Santillana M, Kou SC. Accurate estimation of influenza epidemics using Google search data via ARGO. Proc Natl Acad Sci U S A 2015 Dec 24;112(47):14473-14478 [FREE Full text] [CrossRef] [Medline]
- Lampos V, Cristianini N. Tracking the flu pandemic by monitoring the social web. In: Proceedings of the 2nd International Workshop on Cognitive Information Processing. 2010 Presented at: CIP '10; June 14-16, 2010; Elba, Italy p. 411-416. [CrossRef]
- Lampos V, De Bie T, Cristianini N. Flu detector - tracking epidemics on Twitter. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases. 2010 Presented at: ECML PKDD '10; September 20-24, 2010; Barcelona, Spain p. 599-602. [CrossRef]
- Lampos V, Preoţiuc-Pietro D, Cohn T. A user-centric model of voting intention from Social Media. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. 2013 Presented at: ACL '13; August 4-9, 2013; Sofia, Bulgaria p. 993-1003.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. J Royal Statistical Soc B 2005 Apr;67(2):301-320. [CrossRef]
- Deng S, Wang S, Rangwala H, Wang L, Ning Y. Cola-GNN: cross-location attention based graph neural networks for long-term ILI prediction. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2020 Oct 19 Presented at: CIKM '20; October 19-23, 2020; Virtual p. 245-254.
- Zou B, Lampos V, Cox I. Multi-task learning improves disease models from web search. In: Proceedings of the 2018 World Wide Web Conference. 2018 Presented at: WWW '18; April 23-27, 2018; Lyon, France p. 87-96. [CrossRef]
- Zhang Y, Yakob L, Bonsall MB, Hu W. Predicting seasonal influenza epidemics using cross-hemisphere influenza surveillance data and local internet query data. Sci Rep 2019 Mar 01;9(1):3262 [FREE Full text] [CrossRef] [Medline]
|DL-LSTM: dictionary learner-long short-term memory|
|GloVe: Global Vectors for Word Representation|
|LR: logistic regression|
|LSTM: long short-term memory|
|MSA: metropolitan statistical area|
|NO2: nitrogen dioxide|
|PM2.5: fine particulate matter|
|RF: random forest|
|RNN: recurrent neural network|
|STE: search term expansion|
Edited by A Mavragani; submitted 30.03.22; peer-reviewed by C Zhao, A Staffini, W Ceron; comments to author 21.06.22; revised version received 06.10.22; accepted 25.10.22; published 19.12.22Copyright
©Chen Lin, Safoora Yousefi, Elvis Kahoro, Payam Karisani, Donghai Liang, Jeremy Sarnat, Eugene Agichtein. Originally published in JMIR Formative Research (https://formative.jmir.org), 19.12.2022.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.