This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on http://formative.jmir.org, as well as this copyright and license information must be included.
Isotretinoin, for treating cystic acne, increases the risk of miscarriage and fetal abnormalities when taken during pregnancy. The Health Canada–approved product monograph for isotretinoin includes pregnancy prevention guidelines. A recent study by the Canadian Network for Observational Drug Effect Studies (CNODES) on the occurrence of pregnancy and pregnancy outcomes during isotretinoin therapy estimated poor adherence to these guidelines. Media uptake of this study was unknown; awareness of this uptake could help improve drug safety communication.
The aim of this study was to understand how the media present pharmacoepidemiological research using the CNODES isotretinoin study as a case study.
Google News was searched (April 25-May 6, 2016), using a predefined set of terms, for mention of the CNODES study. In total, 26 articles and 3 CNODES publications (original article, press release, and podcast) were identified. The article texts were cleaned (eg, advertisements and links removed), and the podcast was transcribed. A dictionary of 1295 unique words was created using natural language processing (NLP) techniques (term frequency-inverse document frequency, Porter stemming, and stop-word filtering) to identify common words and phrases. Similarity between the articles and reference publications was calculated using Euclidean distance; articles were grouped using hierarchical agglomerative clustering. Nine readability scales were applied to measure text readability based on factors such as number of words, difficult words, syllables, sentence counts, and other textual metrics.
The top 5 dictionary words were
Media interpretation of the CNODES study varied, with differences in synonym usage and areas of focus. All articles were written above the recommended health information reading level. Analyzing media using NLP techniques can help determine drug safety communication effectiveness. This project is important for understanding how drug safety studies are taken up and redistributed in the media.
Easy access to health-related information has rapidly transformed the traditional health care delivery paradigm. Patients increasingly use the internet to seek health information and learn more about symptoms, diseases, treatments, self-management, risk mitigation strategies, and shared decision-making with their health care providers [
News media can have a significant impact on people’s perception and interpretation of scientific research. Journalists and science writers present the results from scientific publications in news articles for the public, health care providers, and policymakers, but also may influence attitudes and health behaviors [
The use of natural language processing (NLP) techniques and readability assessments can help us better understand how the media are reporting on the medical research we conduct. We used a study conducted by the Canadian Network for Observational Drug Effect Studies (CNODES) evaluating the effectiveness of one aspect of the isotretinoin Pregnancy Prevention Program in Canada [
CNODES is a network of Canadian pharmacoepidemiologists—distributed across 7 provincial sites and supported by 4 collaborative teams working across all sites—funded by the Canadian Institutes of Health Research (CIHR) to study the risks and benefits of postmarketed drugs [
Isotretinoin, a known and potent teratogen, is widely used to treat cystic acne. Fetal exposure may result in a range of severe congenital anomalies and may increase the risk of spontaneous and induced abortion [
In total, 59,271 female patients received 102,308 courses of isotretinoin therapy. Oral contraceptive use during treatment ranged from 24.3% to 32.9%. Overall, there were between 186 and 367 pregnancies during isotretinoin treatment (3.1-6.2 per 1000 isotretinoin users), depending on the method used to define pregnancy. When follow-up was extended to include the full gestational period (up to 42 weeks), there were 1473 pregnancies (24.9/1000 users) using the high specificity definition. Most of these (1331 pregnancies, or 90.4%) were lost spontaneously or terminated by medical intervention. A total of 118 live births were identified and 11 (9.3%) had a diagnosis of congenital malformation. Annual rates of pregnancy during isotretinoin therapy did not change between 1996 and 2011. The CNODES study concluded that adherence to the isotretinoin pregnancy prevention program was poor during the 15-year period [
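The rates and percentages reported above follow directly from the stated counts. The short Python sketch below reproduces that arithmetic for illustration only; the variable and function names are ours and are not part of the CNODES analysis.

```python
# Counts as reported in the summary of the CNODES study above.
female_patients = 59_271
pregnancies_during_treatment = (186, 367)   # range across pregnancy definitions
pregnancies_full_gestation = 1473           # follow-up extended to 42 weeks
lost_or_terminated = 1331
live_births = 118
live_births_with_malformation = 11


def per_1000(events, population):
    """Events per 1000 people, rounded to 1 decimal place."""
    return round(1000 * events / population, 1)


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 1 decimal place."""
    return round(100 * part / whole, 1)


low, high = (per_1000(n, female_patients) for n in pregnancies_during_treatment)
full = per_1000(pregnancies_full_gestation, female_patients)
lost = percent(lost_or_terminated, pregnancies_full_gestation)
malformed = percent(live_births_with_malformation, live_births)

print(low, high, full, lost, malformed)
```

The printed values match the figures quoted in the text (3.1 to 6.2 and 24.9 per 1000 users, 90.4%, and 9.3%).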
This study examined media representation and uptake of the CNODES study on the occurrence of pregnancy and pregnancy outcomes during isotretinoin therapy. The specific objectives of this study were to use NLP and other text-analytic methods to: (1) summarize and comprehend the content of the media coverage; (2) identify relationships between the media articles; and (3) analyze the reading levels of the media articles. By achieving these preliminary objectives, we aimed to explore potential improvements in the way we present future research.
Our overall study methodology is depicted in
Methodology schematic for our study. CMAJ:
NLP is, generally, the ability of computers to analyze and manipulate natural language text or speech to provide an understanding of the text and answer questions about its contents. Different studies have demonstrated the application of NLP to information retrieval in a variety of areas such as question answering, social media text mining, and decision support systems [
In a different study, Wang et al combined text mining techniques with statistical analysis and patient electronic health records to detect adverse drug events. They applied NLP techniques to narrative discharge summaries to identify the safety of drugs throughout their entire lifecycle [
These studies, and many more, show that NLP is an interdisciplinary area encompassing a variety of computational techniques that, alone or in combination with other approaches, can perform a diverse set of tasks. In keeping with the main purpose of this study, we used several text mining techniques to analyze the media articles (each technique is explored in detail below):
Frequent words analysis to study the occurrence of words in each article and cluster, recognize the pattern of the most frequently used words, and investigate how the articles and clusters differ.
Term frequency-inverse document frequency (TF-IDF) weighting to calculate the closeness and/or separation between the articles through cosine similarity and Euclidean distance.
Hierarchical agglomerative clustering (HAC) to group (ie, cluster) similar articles together and to compare them with the original CNODES study.
Readability scales to calculate readability and analyze how easily the articles can be read and understood by an average reader.
NLP consists of 3 general steps: (1) text collection; (2) preprocessing; and (3) text analysis. Preprocessing is a crucial yet often undervalued part of the process and is key to the performance and accuracy of any text analysis [
The next step in preprocessing was the removal of stop words. Stop words (eg, conjunctions, prepositions, and articles) are uninformative, frequently occurring words that do not carry much meaning and do not contribute to the differentiation between documents [
The final preprocessing step was to perform stemming. Stemming is the process of connecting different words that are derivatives of the same root (eg,
There are different algorithms for stemming. In this study, we used Porter stemming [
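The preprocessing steps described above (tokenization, stop-word removal, and stemming) can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in the study: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a simple suffix-stripping stand-in that is explicitly not the Porter algorithm, so its stems will differ from those reported in this paper.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are",
              "for", "on", "with", "during"}

# Suffixes tried in order. This is NOT Porter stemming, only a rough stand-in.
SUFFIXES = ("ies", "ing", "ed", "es", "s", "y")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [crude_stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS]


print(preprocess("The studies of pregnancies during treatment"))
```

A real analysis would more likely rely on an established Porter implementation and a published stop-word list rather than the toy versions above.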
The purpose of the frequent words analysis was to provide an overall summary of the content of the media articles and to compare the content of the different articles—and the clusters identified later in the analysis—to learn more about the texts and the areas of their focus. These findings will help to identify how and why the clusters are different and refine further analyses [
Although frequent words analysis can provide a valuable broad overview of the content of the documents, this approach does not provide much insight into the differences between documents, as common words tend to be common across all media outlets. To provide deeper insight into the relationships between media articles, we looked at how the articles might cluster together based on the content of their coverage.
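As an illustration of the frequent words analysis, the Python sketch below counts 1-grams and 2-grams in a list of preprocessed tokens. It is not the code used in this study. The ratio is computed here as frequency divided by the total number of tokens, an assumption that appears consistent with the reported values (eg, 344/11,263 is approximately 0.031); the function names are ours.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n sequential tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_terms(tokens, n=1, k=10):
    """Top k n-grams as (term, frequency, ratio) tuples.

    Assumption: ratio = frequency / total number of tokens.
    """
    counts = Counter(ngrams(tokens, n))
    total = len(tokens)
    return [(term, freq, round(freq / total, 3))
            for term, freq in counts.most_common(k)]


# Toy token list standing in for a stemmed, stop-word-filtered article.
tokens = ["pregnanc", "prevent", "program", "pregnanc", "test",
          "pregnanc", "prevent"]
print(top_terms(tokens, n=1, k=2))
print(top_terms(tokens, n=2, k=1))
```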
The objective of article clustering was to identify patterns in coverage of the CNODES study. Using a 3-step process of TF-IDF weighting, similarity calculation, and HAC, we identified 3 potential clusters of similar media coverage and used the frequent words analysis to provide insight into how these clusters might have differed in their language and coverage choices.
We used TF-IDF weighting in our analysis to gain insight into what makes individual articles unique. TF-IDF values represent the frequency of the words in a specific document relative to the frequency of that word over the entire corpus of documents [
TF-IDF values were calculated for all unique terms (1-grams) and the combinations of 2 sequential terms (2-grams) from the corpus using the above weighting equation and stored in an
Like most information retrieval systems, we considered multiword phrases (ie, 2-grams) as some phrases can be more meaningful and informative than individual terms. For example, in our study, the phrase
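To make the TF-IDF weighting of 1-grams and 2-grams concrete, the following Python sketch builds a small document-by-term matrix. Because the weighting equation is not reproduced in this text, the sketch assumes one common variant (relative term frequency multiplied by the natural logarithm of the number of documents divided by the document frequency); the exact variant used in the study may differ, and all names are illustrative.

```python
import math
from collections import Counter


def terms(tokens):
    """1-grams plus 2-grams built from sequential tokens."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf_matrix(docs):
    """Return (vocabulary, rows), with one row of weights per document.

    Assumed weighting: (count / terms in document) * ln(N / document frequency).
    Documents are assumed to be non-empty token lists.
    """
    term_lists = [terms(doc) for doc in docs]
    vocab = sorted({t for tl in term_lists for t in tl})
    n_docs = len(docs)
    doc_freq = Counter(t for tl in term_lists for t in set(tl))
    rows = []
    for tl in term_lists:
        counts = Counter(tl)
        rows.append([(counts[t] / len(tl)) * math.log(n_docs / doc_freq[t])
                     for t in vocab])
    return vocab, rows


docs = [["birth", "defect"], ["birth", "control"]]
vocab, rows = tfidf_matrix(docs)
print(vocab)
print(rows[0])
```

In this toy corpus, "birth" appears in both documents and therefore receives a weight of 0, whereas "birth defect" is unique to the first document and receives a positive weight, which is the behavior that lets TF-IDF highlight what makes an article distinctive.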
A similarity measure reflects the degree of closeness between 2 articles using a single numeric value [
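The 2 measures named in this study, cosine similarity and Euclidean distance, can be sketched in Python as below. The functions operate on equal-length numeric vectors such as rows of a TF-IDF matrix; the function names are ours, and returning 0 for a zero-length vector is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0 to 1 for non-negative weights)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v; 0 means identical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


print(cosine_similarity([1, 0], [0, 1]))
print(euclidean_distance([0, 0], [3, 4]))
```

Note the opposite directions of the 2 measures: a higher cosine similarity means closer articles, whereas a higher Euclidean distance means articles that are further apart.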
In this study, we chose HAC to group the similarity matrix into groups of similar documents because of the flexibility of hierarchical approaches in the desired number of clusters, its efficiency for small datasets, and the feasibility of graphical representation of the results through a tree-like structure called a dendrogram [
In agglomerative clustering, cutting branches of the dendrogram at a selected height (cut-off point) defines the resulting clusters. Selecting the best cut-off point depends on a variety of parameters such as the desired number of clusters, the granularity of the categories, or the acceptable distance between the entities within the clusters [
We used Euclidean distance to construct the HAC clusters because it is more appropriate in this setting than cosine similarity; however, all the similarity values presented in this study are cosine similarities.
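The clustering procedure can be illustrated with a small pure-Python sketch of agglomerative clustering on Euclidean distance, in which merging stops once the closest pair of clusters is further apart than a cut-off. The linkage criterion is not stated in this text, so average linkage is assumed here; the toy vectors and cut-off are hypothetical, and this is not the code used in the study.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between members of 2 clusters (assumed linkage)."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac(vectors, cutoff):
    """Merge the 2 closest clusters repeatedly until the closest pair exceeds cutoff."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = average_linkage(clusters[a], clusters[b], vectors)
            if best is None or d < best[0]:
                best = (d, a, b)
        dist, a, b = best
        if dist > cutoff:
            break  # equivalent to cutting the dendrogram at this height
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


vectors = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
print(hac(vectors, cutoff=2.0))
```

With these toy vectors, the result is 2 clusters of 2 documents each plus 1 document that joins no cluster, analogous to the singletons reported in this study.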
The final objective of our analysis was to measure the readability [
There are a variety of ways to measure the readability of a text. Friedman and Hoffman-Goetz [
We used 9 well-formalized readability formulas (
Readability formulas. C: number of characters; D: number of complex words; E: number of easy (not complex) words; P: number of polysyllables; S: number of sentences; W: number of words; Y: number of syllables; AC: average number of characters per 100 words; AS: average number of sentences per 100 words.
Readability score | Score type | Key statistical features | Formula |
Flesch Reading Ease (FRES) | Numeric score (0-100) | Word length and sentence length | FRES = 206.835 - 1.015 x (W/S) - 84.6 x (Y/W) |
Flesch-Kincaid Grade (FKRA) | US grade level | Word length and sentence length | FKRA = 0.39 x (W/S) + 11.8 x (Y/W) - 15.59 |
Gunning Fog Index (FOG) | US grade level | Number of complex words | FOG = 0.4 x [(W/S) + 100 x (D/W)] |
Simple Measure of Gobbledygook (SMOG) Index | US grade level | Number of complex words | SMOG = 1.0430 x √(P x 30/S) + 3.1291 |
Automated Readability Index (ARI) | US grade level | Number of characters | ARI = 4.71 x (C/W) + 0.5 x (W/S) - 21.43 |
Coleman Liau Index (CLI) | US grade levelᵃ | Number of characters | CLI = 0.0588 x AC - 0.296 x AS - 15.8 |
Linsear Write Index (LWI) | US grade level | Sentence length, number of polysyllables | (1) Take a 100-word sample of the text; (2) calculate Val = [E + (3 x D)]/S; (3) if Val > 20, then LWI = Val/2; (4) if Val ≤ 20, then LWI = (Val - 2)/2 |
Dale-Chall Readability Score (DCRS) | Numeric score (0-9.9) | Number of difficult words | DCRS = 0.1579 x (100 x D/W) + 0.0496 x (W/S), plus 3.6365 if difficult words exceed 5% of all words |
Text Standard | US grade level | | A voting system among the other metrics: the reading level that is most prevalent (the mode) among the other metrics calculated. |
ᵃGrade level may also be understood as the number of years of formal education needed to understand a given text, particularly when the level exceeds the typical range of US grades (eg, 1-12). For example, grades 13-16 suggest undergraduate training, 17-18 graduate training, and 19+ professional qualification.[
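As a worked illustration of how such formulas are applied, the Python sketch below computes 3 of the scores (Flesch Reading Ease, Flesch-Kincaid Grade, and Automated Readability Index) in their commonly published forms, along with a mode-based text standard. It is a sketch under stated assumptions rather than the tooling used in this study: syllables are approximated by counting runs of vowels, sentences are split on terminal punctuation, and all function names are ours.

```python
import re
from statistics import mode


def count_syllables(word):
    # Assumption: approximate syllables as runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text):
    """Counts named as in the table legend: S, W, Y, and C."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    return {
        "S": len(sentences),
        "W": len(words),
        "Y": sum(count_syllables(w) for w in words),
        "C": sum(len(w) for w in words),
    }


def flesch_reading_ease(s):
    return 206.835 - 1.015 * (s["W"] / s["S"]) - 84.6 * (s["Y"] / s["W"])


def flesch_kincaid_grade(s):
    return 0.39 * (s["W"] / s["S"]) + 11.8 * (s["Y"] / s["W"]) - 15.59


def automated_readability_index(s):
    return 4.71 * (s["C"] / s["W"]) + 0.5 * (s["W"] / s["S"]) - 21.43


def text_standard(grades):
    """Most prevalent grade level after rounding each score to a whole grade."""
    return mode([round(g) for g in grades])


stats = text_stats("The cat sat. The dog ran.")
print(stats)
print(round(flesch_reading_ease(stats), 2))
print(text_standard([13.2, 12.8, 15.1, 13.4]))
```

Very simple text, such as the sample sentence pair, yields a reading ease above 100 and negative grade levels, which shows that these formulas are linear extrapolations rather than bounded scales.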
In total, 29 articles, including 26 media articles and 3 CNODES reference articles, comprised the corpus of documents for this study, and were represented in a VSM. The articles were of varying length: from 13 to 51 sentences, or 227 to 1011 words. The combined vocabulary of all articles contained 7745 unique terms (out of 11,263 total terms that appeared in the entire dataset). There was an average of 35 sentences, 740 words, and 1380 syllables per article, with an average of 30.9% (229/740) of the words being complex—words with 3 or more syllables that do not belong to a list of 3000 familiar words [
Excluding those published by CNODES, only 2 articles (8% of the sources) mentioned or acknowledged CIHR, the study’s funder.
Top 10 most frequent vocabulary terms (1-grams and 2-grams).
1-grams | Frequency | Ratio | 2-grams | Frequency | Ratio |
pregnanc | 344 | 0.031 | pregnanc prevent | 74 | 0.007 |
isotretinoin | 306 | 0.027 | birth defect | 63 | 0.006 |
studi | 245 | 0.022 | birth control | 48 | 0.004 |
drug | 226 | 0.020 | pregnanc test | 40 | 0.004 |
women | 188 | 0.017 | women take | 39 | 0.003 |
us | 165 | 0.015 | prevent program | 37 | 0.003 |
birth | 163 | 0.014 | British Columbia | 35 | 0.003 |
research | 135 | 0.012 | live birth | 33 | 0.003 |
treatment | 123 | 0.011 | pregnanc rate | 33 | 0.003 |
acn | 118 | 0.010 | isotretinoin user | 31 | 0.003 |
The resulting values of cosine similarity calculations and HAC are presented in
Further examination of the nature of the articles in each cluster showed that Cluster 1, in addition to the 3 CNODES publications, included national and international news websites such as Reuters, CBC, The Globe and Mail, National Post, and CTV. Cluster 1 also included health-specific websites such as Medical Daily, Medical News Today, MD Magazine, and Medscape Medical News. The articles composing Cluster 3 were from regional news websites including CBC British Columbia and The Globe and Mail British Columbia. Articles in Cluster 2 did not include traditional news media outlets, but rather health-related and general interest websites (Science Daily, Parent Herald, and Science 2.0).
Cosine similarity values (between 0 and 1) between the media articles and CNODES publications, including CMAJ article, podcast, and press release article using TF-IDF calculations. Resulting dendrogram of hierarchical agglomerative clustering. Three clusters and 2 singletons, resulting from a cutoff point of 0.5. CMAJ:
Trend of similarity (cosine similarity) between the media articles and the CNODES publications: CMAJ article, podcast, and press release.
In addition to studying the nature of the websites that published the media articles, we found that analysis of the frequent words within the clusters provides insight into how and to what extent the clusters are different.
The clinically important terms overlap with the most frequent terms of each cluster. To avoid repetition, the top 5 most frequent terms listed for each cluster exclude the 5 clinically most important terms. For example, because the 1st, 2nd, 3rd, and 7th most frequent terms of Cluster 1 are among the clinically important terms, the top 5 terms listed for Cluster 1 are the 4th, 5th, 6th, 8th, and 9th most frequent terms of that cluster.
Most common terms, both overall and within each cluster.
Clusterᵃ | Cluster 1 | Cluster 2 | Cluster 3 | Singleton 28 | Singleton 29 |
Clinically important terms |
isotretinoin | 240 (2)ᵇ | 49 (2) | 6 (48) | 7 (2) | 4 (7) |
Accutanᶜ | 46 (32) | 12 (20) | 17 (11) | —ᵈ | 3 (12) |
pregnanc | 263 (1) | 57 (1) | 8 (33) | 12 (1) | 5 (3) |
drug | 166 (3) | 28 (6) | 23 (7) | 1 (50) | 8 (1) |
birth | 127 (7) | 22 (9) | 8 (33) | 1 (50) | 5 (3) |
Top 5 most frequent terms in Cluster 1 |
studi | 160 (4) | 31 (3) | 42 (2) | 6 (3) | 6 (2) |
us | 142 (5) | 18 (12) | 3 (107) | — | 2 (17) |
women | 139 (6) | 30 (4) | 14 (14) | — | 5 (3) |
treatment | 101 (8) | 16 (14) | 3 (107) | 2 (16) | 1 (39) |
patient | 94 (9) | 11 (24) | 9 (27) | 2 (16) | 1 (39) |
Top 5 most frequent terms in Cluster 2 |
studi | 160 (4) | 31 (3) | 42 (2) | 6 (3) | 6 (2) |
women | 139 (6) | 30 (4) | 14 (14) | — | 5 (3) |
prevent | 72 (16) | 30 (4) | 4 (74) | 5 (4) | 2 (17) |
canadian | 62 (22) | 27 (7) | 6 (48) | 2 (16) | 1 (39) |
take | 55 (28) | 24 (8) | 9 (27) | — | 3 (12) |
Top 5 most frequent terms in Cluster 3 |
research | 82 (13) | 9 (27) | 43 (1) | — | 1 (39) |
studi | 160 (4) | 31 (3) | 42 (2) | 6 (3) | 6 (2) |
health | 66 (19) | 1 (411) | 38 (3) | — | — |
data | 35 (44) | — | 38 (3) | — | — |
said | 56 (27) | 7 (41) | 33 (5) | — | 4 (7) |
ᵃThe terms in the table are stemmed versions of the actual terms (for example, us represents various forms of the verb use, and pregnanc stands for pregnancy).
ᵇTop 5 most frequent terms of each cluster exclude the 5 clinically important terms.
ᶜThe first number in each cell is the frequency of occurrence of the term, and the second number, in parentheses, is the ranking of the term among all the terms in that cluster.
ᵈEmpty cells (represented with a —) are the terms that do not appear in the respective cluster/singleton.
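The ranking and exclusion rule used in the table can be sketched in Python as follows. The frequencies are the 9 highest Cluster 2 counts shown in the table; lower-ranked terms are omitted, so only these ranks are reproduced. The rankings in the table appear to assign tied terms the same rank and then skip the next rank (eg, 4, 4, 6), which is the behavior assumed here. The function names are ours, and this is not the code used in the study.

```python
# The 5 clinically important terms (stems); "accutan" is our assumed stem for Accutane.
CLINICAL_TERMS = {"isotretinoin", "accutan", "pregnanc", "drug", "birth"}

# The 9 most frequent Cluster 2 terms, with frequencies copied from the table.
cluster2_counts = {
    "pregnanc": 57, "isotretinoin": 49, "studi": 31, "women": 30,
    "prevent": 30, "drug": 28, "canadian": 27, "take": 24, "birth": 22,
}


def competition_ranks(counts):
    """Rank by descending frequency; tied terms share a rank (1, 2, 2, 4)."""
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    ranks = {}
    for position, (term, freq) in enumerate(ordered, start=1):
        prev_term, prev_freq = ordered[position - 2] if position > 1 else (None, None)
        ranks[term] = ranks[prev_term] if freq == prev_freq else position
    return ranks


def top_k_excluding(counts, excluded, k=5):
    """Most frequent k terms after dropping the excluded terms."""
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return [term for term, _ in ordered if term not in excluded][:k]


ranks = competition_ranks(cluster2_counts)
print(ranks)
print(top_k_excluding(cluster2_counts, CLINICAL_TERMS))
```

The second printed list corresponds to the 5 terms listed for Cluster 2 in the table.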
Overall, 9 readability formulas were calculated for each article in the corpus. Different readability formulas consider different variables in the calculations and measure readability from distinct perspectives (see
All calculated readability scores are above United States grade 10. Text standard scores, which represent the most prevalent reading level among all the formulas, ranged between 12 and 18, except for one article with a readability level of 9.
Distribution of readability levels of articles based on text-standard measure.
Average readability level of each cluster.
Cluster | Flesch Reading Ease | Flesch-Kincaid Grade | Gunning Fog Index | SMOG Index | Automated Readability Index | Coleman Liau Index | Linsear Write Index | Dale-Chall Readability Score | Text Standard |
Cluster 1 | 40.78 | 13.02 | 15.19 | 15.58 | 15.21 | 14.47 | 13.27 | 9.87 | 16th grade |
Cluster 2 | 29.89 | 14.74 | 16.59 | 16.92 | 16.76 | 15.97 | 10.62 | 10.67 | 17th grade |
Cluster 3 | 49.19 | 11.35 | 13.75 | 14.33 | 13.35 | 13.32 | 8.85 | 9.39 | 14th grade |
Singleton 28 | 36.79 | 12.50 | 12.99 | 15.90 | 15.20 | 16.82 | 13.75 | 8.82 | 12th grade |
Singleton 29 | 49.55 | 11.70 | 15.11 | 15.00 | 15.40 | 14.74 | 8.08 | 9.87 | 14th grade |
Our NLP analysis of media coverage showed that the interpretation of the CNODES isotretinoin study [
Regardless of the method used to calculate reading level, the overall reading levels were too high for the average North American reader, where the target reading level should be grades 6-8 [
Our results were similar to other studies which documented high reading levels for plain language communications of scientific advice. For example, in a study of 53 qualified health claims on food and dietary supplement labels, which are regulated by the United States Food and Drug Administration, the Flesch-Kincaid grade level ranged from 5.37 to 30.30, with 77% above a grade 9 reading level [
Overall disclosure of funders was low, with only 2 media articles naming CIHR as the funding organization. Financial disclosure is especially important in journalism covering pharmaceuticals where various conflicts of interest may exist [
The CNODES study was covered by the Canadian newspaper, The Globe and Mail, which averages 3.1 million print and digital readers on a typical weekday. It received coverage from both national television (CBC and CTV) and more specialized media with niche audiences such as iPolitics, which covers federal, provincial, and international politics and policies. The study also received international coverage from Thomson Reuters (www.thomsonreuters.com), which covers a broad range of topics in media markets around the world. The articles varied in length, ranging from approximately 200 to 1000 words, in large part due to standard word limits set by each media outlet [
We had expected a significant overlap between some of the articles, with the potential for articles to be reprinted in different venues; overall, the words used in each media report were less similar than expected. Although there were commonalities between the articles, there was little evidence of republication or wholesale duplication of articles. We were not able to easily discern if certain articles were informed by others. Although the original CNODES source material did seem to influence the content of each article, each article author (or set of authors) clearly applied their own spin to the content. It is possible that if there had been more media coverage, patterns of duplication might have emerged. However, we have no evidence to suggest there were any patterns of reprinting in this corpus.
The clusters varied in the extent of overlap with our original press release and the top words used. Documents 28 and 29 had less similarity to the other articles in the corpus. It is interesting that document 28 was the American Pharmacist, which would likely employ science writers, and document 29 was the CBC in Saskatchewan, where they had direct access to one of the authors of the study who resided in Saskatchewan and was able to provide additional information from the Saskatchewan perspective. It has been noted that several publications reprint the press releases they receive without additional comment or contextualization and many media outlets are vertically integrated, although these integrations were not reflected in our analysis. It is interesting that the 10th most frequent 2-gram was
The omission of specific parts of the media release was surprising, such as the lack of disclosure around study funding (CIHR) and potential conflicts of interest. Although many of the articles did mention Health Canada, better reporting about the study team would have provided better context for the research and information on potential competing interests.
Our study takes a novel approach to tracking the media coverage of academic research after it has been published and is an important part of growing the knowledge translation component of the CNODES project, but it has its shortcomings. Our search, although comprehensive from a keyword perspective, was limited to media outlets that published on the internet. We did not search the websites of individual newspapers, with the assumption that our general Google News search would capture all relevant mentions. We did not evaluate pictures that were associated with the media articles, the way in which numbers were reported, or links to other resources. We did not consider the expertise of the journalists, specifically, whether there was a difference in the reporting between health journalists and general assignment reporters. We did not examine the length of the media article beyond its influences on reading level, so there may be further insights to be gleaned from comparing article length with specific aspects such as funding source and article positioning (eg, front page). Finally, although we believe we have captured all meaningful media coverage of our study, our data capture window was relatively short, we did not use a commercial news aggregator, and we did not specifically examine gray literature, so there is always the potential that we have missed some media articles.
We are currently not able to speak to who the articles may have deemed responsible for the original study results (ie, poor pregnancy prevention guideline adherence) or to determine the quality of the media report [
There are many known limitations to using reading-level metrics [
Placement within the media content is an important determinant of consumption and could provide an indication of an article’s perceived value. In a digital age, these factors can change significantly over time and between users. We were unable to process this information. We did not specifically examine if independent sources (such as other researchers) were used by journalists to inform context and study validity, or whether patients, voluntary health organizations, or drug regulatory agencies provided their perspectives. We were unable to identify if the journalist was an employee of the news organization or if the article came from a news wire service or syndicated service. We also did not examine if a link to the original CMAJ article was provided.
We did not consider the quality of the coverage in terms of source. Although we subjectively evaluated the coverage to deem it as relevant or not, an objective measure of quality (such as the DISCERN tool [
It is important for researchers to understand how their research is presented by the media. Our analysis demonstrates that there is little consistency in how this is done using a peer-reviewed research article, even when accompanied by a crafted press release and outreach by the primary authors. If there are potentially controversial or sensitive issues arising from the research that need to be presented carefully, then the narrative around these issues should be appropriately constructed in the wording of the press releases and an effort needs to be made to monitor how the information is being translated in real time as it is disseminated. The reading levels of the media covering research can be quite high; more efforts should be made to simplify the press releases and other knowledge translation materials generated from the research so that journalists can more easily present the research in an accessible manner. Researchers can assist journalists by identifying other aspects of their research such as broader context and limitations [
Improving the reading levels of CNODES’ dissemination efforts, particularly outside of academic literature, could improve the ability of CNODES to reach key target audiences (eg, health care providers, decision makers). Further work is needed to develop automated media coverage analysis so that researchers can quickly and efficiently identify how their research is being covered and what is and is not being consumed, with the potential to react to it in real time and correct any potential misinterpretations by media outlets. Future research will need to augment readability approaches with other approaches, such as the use of mental model research [
Although this study focused solely on the content of the words presented in the articles, future research should incorporate the use of photos, captions, hyperlinks, and multimedia to form a more complete picture of how a study was presented. Due to the changing and various ways of presenting information on the Web, this kind of project would require careful and deliberate planning and would be difficult to do on a retrospective basis.
Extending this study to social media coverage would be a valuable addition; there are large and meaningful discussion sections accompanying some of the articles in this study (eg, doc09). Our research group has studied the altmetrics of our research on social media [
This study has demonstrated that NLP can be a valuable tool in understanding how research is conveyed to the public through digital media. Through NLP, we identified significant variations in the coverage of our research and what parts of our publications journalists focused on. We demonstrated how readability calculations can be applied to media coverage. Our future work will look at expanding our methods to better understand how our research is consumed by the media.
List of articles (26 media articles, 3 Canadian Network for Observational Drug Effect Studies reference publications).
Readability scales.
BC: British Columbia
CIHR: Canadian Institutes of Health Research
CMAJ: Canadian Medical Association Journal
CNODES: Canadian Network for Observational Drug Effect Studies
HAC: hierarchical agglomerative clustering
NLP: natural language processing
TF-IDF: term frequency-inverse document frequency
VSM: vector space model
CNODES, a collaborating center of the Drug Safety and Effectiveness Network, is funded by the Canadian Institutes of Health Research (Grant Numbers DSE-111845 and DSE-146021). HM, RT, SA, and IS have received salary support, in part, from CIHR for the CNODES project. The authors would like to acknowledge Kim Kelly for her literature search support.
None declared.