This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on http://formative.jmir.org, as well as this copyright and license information must be included.
The ability to objectively measure the severity of depression and anxiety disorders in a passive manner could have a profound impact on the way in which these disorders are diagnosed, assessed, and treated. Existing studies have demonstrated links between both depression and anxiety and the linguistic properties of the words that people use to communicate. Smartphones can passively and continuously detect spoken words, making it possible to monitor and analyze the linguistic properties of speech produced by the smartphone's owner and by other sources of ambient speech in their environment. The linguistic properties of automatically detected and recognized speech may be used to build objective severity measures of depression and anxiety.
The aim of this study was to determine if the linguistic properties of words passively detected from environmental audio recorded using a participant’s smartphone can be used to find correlates of symptom severity of social anxiety disorder, generalized anxiety disorder, depression, and general impairment.
An Android app was designed to collect periodic audiorecordings of participants’ environments and to detect English words using automatic speech recognition. Participants were recruited into a 2-week observational study. The app was installed on the participants’ personal smartphones to record and analyze audio. The participants also completed self-report severity measures of social anxiety disorder, generalized anxiety disorder, depression, and functional impairment. Words detected from audiorecordings were categorized, and correlations were measured between word counts in each category and the 4 self-report measures to determine if any categories could serve as correlates of social anxiety disorder, generalized anxiety disorder, depression, or general impairment.
The participants were 112 adults who resided in Canada from a nonclinical population; 86 participants yielded sufficient data for analysis. Correlations between word counts in 67 word categories and each of the 4 self-report measures revealed a strong relationship between the usage rates of death-related words and depressive symptoms (
In this study, words automatically recognized from environmental audio were shown to contain a number of potential associations with severity of depression and anxiety. This work suggests that sparsely sampled audio could provide relevant insight into individuals’ mental health.
Depression and anxiety disorders are mental health conditions that can, and do, impact people across all geographic regions and socioeconomic strata. Those who suffer from these disorders experience a lower quality of life [
Modern smartphones are ubiquitous devices that are equipped with a number of sensors that can sense physical activity, geolocation, communication patterns, and the speech of their owners as they go about their day-to-day lives. This sensing capability offers a potential new paradigm for diagnosis and assessment, where instead of asking patients to report their feelings and behaviors relevant to their mental health, it might be possible to infer this information passively and objectively from smartphone-collected data [
Our prior efforts explored nonlinguistic audio features and their correlates with mental health scales [
The link between the words spoken by an individual and anxiety or depression has been investigated in 2 major subdomains. The first is the acoustic features of words, that is, the qualities and characteristics of the sounds produced independent of the meaning of the words spoken. While not the focus of this work, prior work has demonstrated numerous quantifiable differences in the acoustic properties of speech in depressed individuals [
The second subdomain upon which this work focused, linguistic analysis, encompasses how an individual’s choice of words may relate to symptoms of depression and anxiety. Given this focus, the analysis of the written word and its relationship to anxiety and depression is just as relevant as the spoken word, as the methods employed in this study ignore the additional acoustic information present in the spoken word.
The analysis of speech content and word selection, sometimes referred to as content analysis in the literature, has been studied extensively in psychotherapy contexts [
In the linguistic analysis of depression, it has been widely reported that first-person singular pronoun use is correlated with depression severity. A meta-analysis of 21 studies of these correlations confirmed this relationship, where the studies performed analyses of multiple media, including writing, speech, and Facebook status updates [
While studies [
This study used data collected in a previous study [
Participants entered a 2-week observational study in which a custom app was installed onto their personal Android phone. Self-report measures of anxiety, depression, and general quality of life were collected at the beginning and end of the study. Throughout the duration of the study, the smartphone app passively collected audiorecordings of the environment (15-second recordings approximately every 5 minutes). The study was approved by the University of Toronto Health Sciences Research Ethics Board (protocol 36687).
Participants completed 4 self-report measures, in digital form within the study app, at the beginning and end of the 14-day study. A review [
The self-report scores collected at the end of the study were used for analysis because the self-report measures ask respondents to evaluate symptoms over the past 2 weeks; therefore, the window of symptom assessment would coincide with the window of electronic data collection.
To contextualize the severity of the exit scores, we also used the LSAS, GAD-7, and PHQ-8 scores to screen participants for social anxiety, generalized anxiety, and depression, respectively, using diagnostic thresholds found in the literature. A cutpoint of 60 [
Spoken words detected in the participants’ environments were collected by the smartphone app. To do so, audiorecordings were collected every 5 minutes for a duration of 15 seconds by the app. These audiorecordings were captured consistently throughout the study at all hours of the day. Transcripts of the audiorecordings were generated using automatic speech recognition software (Google Speech-to-Text [
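As a rough consistency check on these figures, the nominal schedule (one 15-second recording every 5 minutes over the 14-day study) bounds both the recording count and the fraction of ambient audio sampled. This sketch assumes the idealized schedule rather than the app's actual timer behavior:

```python
# Idealized sampling schedule (assumed exact; the app's actual interval
# was only approximately 5 minutes).
RECORDING_S = 15             # duration of each audiorecording, in seconds
INTERVAL_S = 5 * 60          # one recording every 5 minutes
STUDY_DAYS = 14

recordings_per_day = (24 * 3600) // INTERVAL_S    # 288 recordings per day
max_recordings = recordings_per_day * STUDY_DAYS  # 4032 over the study
sampled_fraction = RECORDING_S / INTERVAL_S       # 0.05, ie, 5% of ambient audio
```

Because the interval was only approximately 5 minutes, observed per-participant counts can deviate somewhat from this idealized bound.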
A software tool, Linguistic Inquiry and Word Count (LIWC; version 2015; Text Analysis Portal for Research, University of Alberta) was used to analyze participants’ words along a number of linguistic and psychological dimensions [
Participants’ environmental words were analyzed using all possible LIWC categories except summary dimensions, punctuation marks, and informal language. This resulted in 67 total categories that were tested, including the top-level categories of function words (ie, parts of speech), other grammar (ie, more parts of speech), affect, social, cognitive processes, perceptual processes, biological processes, drives, time orientation, relativity, and personal concerns.
Participants who completed all study tasks were included in the analysis if the total number of words detected in their ambient audiorecordings was greater than 769. This minimum threshold was determined by noting that LIWC was built from a corpus of words, and the least frequently observed word category in the corpus (the sexual words category) had a mean frequency of 0.13% [
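The 769-word floor is consistent with requiring at least one expected occurrence of the rarest category in each participant's transcript; a minimal check of the arithmetic:

```python
import math

# Rarest LIWC category ("sexual" words) occurs at a mean rate of 0.13%
# of total words in the LIWC reference corpus.
RAREST_CATEGORY_RATE = 0.13 / 100  # as a fraction of total words

# For the expected count in the rarest category to reach one word,
# n * rate >= 1, ie, n >= 1 / rate; the study required word totals
# strictly greater than this floor.
min_words = math.floor(1 / RAREST_CATEGORY_RATE)
print(min_words)  # 769
```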
Sample of Linguistic Inquiry and Word Count word categories.
Category | Example words |
Personal pronouns | I, them, her |
Common verbs | eat, come, carry |
Positive emotion | love, nice, sweet |
Social processes | mate, talk, they |
Death | bury, coffin, kill |
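A minimal sketch of this kind of dictionary-based categorization, using the sample categories above; the tiny word sets are illustrative stand-ins for the real LIWC 2015 dictionaries, which are far larger and also match word stems:

```python
# Tiny stand-in dictionaries modeled on the sample categories above.
CATEGORIES = {
    "personal pronouns": {"i", "them", "her"},
    "death": {"bury", "coffin", "kill"},
}

def category_percentages(words):
    """Percentage of total words that fall within each category."""
    counts = {cat: 0 for cat in CATEGORIES}
    for word in words:
        for cat, vocab in CATEGORIES.items():
            if word.lower() in vocab:
                counts[cat] += 1
    return {cat: 100.0 * n / len(words) for cat, n in counts.items()}

pcts = category_percentages(["I", "saw", "her", "coffin"])
# pcts["personal pronouns"] -> 50.0, pcts["death"] -> 25.0
```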
The resulting 67 category counts (expressed as the percentage of total words counted that fell within each category) were then tested as correlates of the 4 self-report measures by computing the Pearson correlation coefficient between each category and each measure. Significance of the correlations was tested by computing 2-sided
Of the 112 participants who completed the study, 86 participants yielded sufficient data for analysis. The study sample consisted of 43% females (37/86) and 57% males (49/86), and the average participant age was 30.1 years (SD 8.5). Participant employment status was as follows: 63% (54/86) were employed in full-time work, 16% (14/86) were employed part-time, 12% (10/86) were unemployed and job seeking, 3% (3/86) were not engaged in paying work (eg, retired or homemaker), and 6% (5/86) reported some other employment status. The 86 participants included in analysis and 26 participants excluded from analysis did not differ in mean age, gender distribution, or mean score of any of the 4 self-report measures.
Results of screening the study sample for depression and anxiety disorders.
Measure | Score, mean (SD) | Diagnostic threshold | Participants over diagnostic threshold (n=86), n (%) |
Liebowitz Social Anxiety Scale | 53.5 (25.3) | 60 | 32 (37) |
Generalized Anxiety Disorder–7 | 6.5 (4.6) | 10 | 21 (24) |
Patient Health Questionnaire–8 | 8.5 (5.5) | 10 | 30 (35) |
Sheehan Disability Scale | 10.9 (7.8) | N/Aa | N/A |
aN/A: not applicable.
Within the 86-participant sample, the mean number of audiorecordings captured was 3647 (SD 802), and the mean number of recordings that contained speech was 579 (SD 257). On average, 16% of recorded ambient audio contained intelligible speech; this low percentage is reasonable given that recordings were made throughout all hours of the day. The mean number of detected environmental words per participant was 4379 (SD 2625). While the original transcripts were destroyed after generation, the total number of recordings that contained detected speech was recorded for each participant. The mean number of words per speech-containing recording was 7.4, which seems reasonable given that the audiorecordings were 15 seconds long. Summary statistics for the total number of recordings captured, the number of recordings found to contain speech, the total detected words, and the average number of words per transcript are presented in
Summary statistics for word counts of the transcripts of environmental audiorecordings (n=86).
Statistic | Mean (SD) | Minimum | First quartile | Second quartile | Third quartile | Maximum |
Total recordings captured | 3646 (802) | 330 | 3764 | 3908 | 4001 | 4271 |
Recordings containing speech | 579 (257) | 91 | 390 | 574 | 725 | 1288 |
Total detected words | 4379 (2625) | 841 | 2470 | 3842 | 5720 | 14882 |
Average number of words in recordings with speech detected | 7.4 (2.0) | 3.7 | 6.2 | 6.8 | 8.0 | 15.5 |
Of the correlations presented in
Interestingly, the rates of words detected in the
Top correlations between Linguistic Inquiry and Word Count categories and Liebowitz Social Anxiety Scale, Generalized Anxiety Disorder–7, Patient Health Questionnaire–8, and Sheehan Disability Scale scores.
Word category | Percentage of total words, mean (SD) | Correlation, r | P value |
Liebowitz Social Anxiety Scale | | | |
death | 0.16 (0.10) | 0.32 | .002 |
home | 0.45 (0.14) | –0.31 | .003 |
see | 1.26 (0.28) | 0.31 | .003 |
sexual | 0.22 (0.29) | –0.24 | .02 |
Generalized Anxiety Disorder–7 | | | |
reward | 1.61 (0.30) | –0.29 | .007 |
death | 0.16 (0.10) | 0.27 | .01 |
friend | 0.35 (0.15) | 0.26 | .02 |
prep | 11.75 (1.10) | 0.24 | .03 |
bio | 2.07 (0.59) | –0.23 | .04 |
relativ | 13.57 (1.10) | –0.22 | .04 |
Patient Health Questionnaire–8 | | | |
death | 0.16 (0.10) | 0.41 | <.001 |
function | 55.31 (3.13) | 0.24 | .02 |
home | 0.45 (0.14) | –0.24 | .03 |
reward | 1.61 (0.30) | –0.22 | .04 |
Sheehan Disability Scale | | | |
death | 0.16 (0.10) | 0.28 | .009 |
friend | 0.35 (0.15) | 0.24 | .03 |
negate | 2.29 (0.52) | 0.23 | .03 |
A key finding is the correlation between the proportion of detected words within the concept of death and all self-reported measures. This correlation was positive in all cases, meaning individuals who had more death-related words detected in their ambient audio displayed worse self-reported symptoms of social anxiety, generalized anxiety, depression, and mental health-related functional impairment. The association between the use of death-related words and depression is in line with previous studies [
Given that only the correlation between rates of death-related words and the PHQ-8 remained statistically significant, it is important to note that the Bonferroni correction is known to be conservative and can cause important relationships to be deemed nonsignificant [
The first was the positive correlation between vision-related words (the
Another interesting relationship was the negative correlation between the rates of the reward-related words in the environment and self-reported symptoms of generalized anxiety (
A key feature of the methodology employed in our study is that the environmental audio recorded for each participant contained speech from any speaker in the environment—the participants themselves but also other people and recorded media (eg, television, radio, and music). To the best of our knowledge, no other studies have performed linguistic analysis of audio transcripts containing speech from all ambient sources. This is important to keep in mind when we discuss previous studies that focus only upon speech or writing produced by the participant.
To provide some insight into the impact of other voices in the ambient audio and this study, it is useful to first have an estimate of how much ambient speech is typically produced by the participant and how much comes from other sources. One study [
The most reported association between participant-only word categories and mental health in the literature is the association between the use of first-person personal pronouns and depression. A meta-analysis [
Several studies [
One technical limitation of this study was the sampling technique used to capture ambient audio. Ambient audiorecordings were produced quite frequently, once every 5 minutes, but for a short duration (only 15 seconds). The short duration of recording helps to preserve smartphone battery life, but it is likely that some conversations or utterances were not captured in full. A more sophisticated sampling technique would record for a variable duration, extending the recording window until silence was detected, so that complete conversations or utterances were captured.
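A sketch of such a scheme, where `read_frame` and `is_speech` are hypothetical callbacks (a real implementation would use the platform's audio APIs and a proper voice-activity detector):

```python
def record_until_silence(read_frame, is_speech, max_frames=2000, tail_frames=67):
    """Collect audio frames until `tail_frames` consecutive silent frames
    are seen (or `max_frames` is reached), so that an utterance in progress
    is captured in full rather than cut off at a fixed 15 seconds.
    With 30-ms frames, 67 silent frames is roughly a 2-second pause."""
    frames = []
    silent_run = 0
    while len(frames) < max_frames:
        frame = read_frame()  # hypothetical: fetch the next audio frame
        frames.append(frame)
        silent_run = 0 if is_speech(frame) else silent_run + 1
        if silent_run >= tail_frames:
            break
    return frames
```

For example, fed a stream of 10 speech frames followed by silence, the recorder stops after the trailing 67 silent frames rather than at a fixed cutoff.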
A fundamental limitation is due to the manner in which the environmental audio is used to generate transcripts. Automatic speech recognition software does not perform as well as human transcribers for audio recorded in noisy environments or for audio containing multiple speakers who may be interrupting one another. Furthermore, this software is often being updated and improved; therefore, reproducibility and the ability to do direct comparisons is a key concern for future studies. While this limitation is significant, it is important to also note that the accuracy of Google’s Speech-to-Text API (which was used in this study) has been evaluated in clinical talk-therapy settings, demonstrating 83% sensitivity and 83% positive predictive value in detecting death-related words [
A final limitation is related to the use of LIWC to perform the linguistic analysis of the transcripts of environmental audio. LIWC is a dictionary-based tool, and as such, categorizes words without looking at contextual information that is key to human language, ignoring sarcasm, metaphor, and analogy.
This study has explored how the proportions of detected words in ambient speech audio across different grammatical and psychological categories may be associated with self-reported symptoms of social anxiety, generalized anxiety, depression, and general psychiatric impairment. We have highlighted several potential relationships, including associations of death-related, reward-related, and vision-related words with these self-reported measures.
Anonymized study data set including scale scores, audiorecording metadata, and LIWC word category percentages for study participants.
All tested correlations (Pearson
GAD: Generalized Anxiety Disorder
LIWC: Linguistic Inquiry and Word Count
LSAS: Liebowitz Social Anxiety Scale
PHQ: Patient Health Questionnaire
SDS: Sheehan Disability Scale
This research was funded in part by the Natural Sciences and Engineering Research Council of Canada Discovery Grants program (grant number RGPIN-2019-04395).
MK has been a consultant or advisory board member for GlaxoSmithKline, Lundbeck, Eli Lilly, Boehringer Ingelheim, Organon, AstraZeneca, Janssen, Janssen-Ortho, Solvay, Bristol-Myers Squibb, Shire, Sunovion, Pfizer, Purdue, Merck, Astellas, Tilray, Bedrocan, Takeda, Eisai, and Otsuka. MK has undertaken research for GlaxoSmithKline, Lundbeck, Eli Lilly, Organon, AstraZeneca, Janssen-Ortho, Solvay, Genuine Health, Shire, Bristol-Myers Squibb, Takeda, Pfizer, Hoffmann-La Roche, Biotics, Purdue, Astellas, and Forest. MK has received honoraria from GlaxoSmithKline, Lundbeck, Eli Lilly, Boehringer Ingelheim, Organon, AstraZeneca, Janssen, Janssen-Ortho, Solvay, Bristol-Myers Squibb, Shire, Sunovion, Pfizer, Purdue, Merck, Astellas, Bedrocan, Tilray, Allergan, and Otsuka. MK has received research grants from the Canadian Institutes of Health Research, Sick Kids Foundation, Centre for Addiction and Mental Health Foundation, Canadian Psychiatric Research Foundation, Canadian Foundation for Innovation, and the Lotte and John Hecht Memorial Foundation.