This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.
Recommender systems have great potential in mental health care to personalize self-guided content for patients, allowing them to supplement their mental health treatment in a scalable way.
In this paper, we describe and evaluate 2 knowledge-based content recommendation systems as parts of Ginger, an on-demand mental health platform, to bolster engagement in self-guided mental health content.
We developed two algorithms to provide content recommendations in the Ginger mental health smartphone app: (1) one that uses users' responses to app onboarding questions to recommend content cards and (2) one that uses the semantic similarity between the transcript of a coaching conversation and the description of content cards to make recommendations after every session. As a measure of success for these recommendation algorithms, we examined the relevance of content cards to users’ conversations with their coach and completion rates of selected content within the app measured over 14,018 users.
In a real-world setting, content consumed in the recommendations section (or “Explore” in the app) had the highest completion rates (3353/7871, 42.6%) compared to other sections of the app, which had an average completion rate of 37.35% (21,982/58,614;
Recommender systems can help scale and supplement digital mental health care with personalized content and self-care recommendations. Onboarding-based recommendations are ideal for “cold starting” the process of recommending content for new users and users that tend to use the app just for content but not for therapy or coaching. The conversation-based recommendation algorithm allows for dynamic recommendations based on information gathered during coaching sessions, which is a critical capability, given the changing nature of mental health needs during treatment. The proposed algorithms are just one step toward the direction of outcome-driven personalization in mental health. Our future work will involve a robust causal evaluation of these algorithms using randomized controlled trials, along with consumer feedback–driven improvement of these algorithms, to drive better clinical outcomes.
A recommender system (RS) is an algorithm that filters information, content, or decisions into a relevant subset of choices for an individual, using factors such as the user’s usage history and preferences [
Digital mental health platforms, ranging from direct-to-consumer apps to telemental health platforms, can increase access to care and help meet the massive demand for mental health services [
While engagement with self-guided content is beneficial, there is a need to personalize the content given the multifaceted nature of mental health (eg, patient condition, environment, and psychosocial stressors) [
In this study, we present 2 modalities for delivering these recommendations, onboarding-based and coaching conversation–based content recommendation algorithms, which were deployed in a real-world setting and evaluated with 14,018 users of the Ginger behavioral health coaching and therapy app. These two algorithms deliver recommendations within the same section of the app depending on different user states. Onboarding-based recommendations are used to initiate the care process of consuming content for new users and those that tend to use the app just for content but not for coaching. Conversation-based recommendations update to match the semantic content that users discuss with their coaches. As a formative evaluation, we measure and report the content completion rates of both approaches in a real-world setting. In addition, for the conversation-based content recommendation algorithm, we measure the relevance of recommended content cards to a user’s conversations with their coach through offline expert annotations. Our evaluation supports product and design decisions for content placements but does not allow for causal inference due to a few potential confounders. To the best of our knowledge, this is the first large-scale study to evaluate the effectiveness of mental health content recommendation systems in a real-world setting where patients are being supported with this content. Consequently, we hope that this study will help inform the burgeoning implementation of future digital health RSs across industry and academia.
This is a retrospective observational study of 14,018 individuals aged 18 years or older who use Ginger, an on-demand mental health app. The users had access to the Ginger app through their employer or health plan benefits. We only included users who used the self-guided content library in the app. Data presented here were collected from the usage patterns of these Ginger users between June and September 2021. We chose this period because it reflects the approximate timing of when all 3 recommendation algorithms under consideration (explained in detail in the subsequent section) were serving content recommendations to Ginger users.
Age and gender demographic data were unreported for 24.8% (n=3476) and 34.5% (n=4836) of users, respectively. Of the individuals that reported age information, 7.19% (n=758) were aged 18 to 24 years, 45.25% (n=4770) were aged 25 to 34 years, 26.07% (n=2748) were aged 35 to 44 years, 20.16% (n=2125) were aged 45 to 64 years, and 1.33% (n=140) were 65 years or older. For users that reported gender, 28.08% (n=2578) were male, 61.63% (n=5659) were female, and 10.2% (n=936) were nonbinary.
This study represents a secondary analysis of preexisting deidentified data. The study team does not have access to the participants or their identifying information and does not intend to recontact participants. Ginger’s research protocols and supporting policies were reviewed and approved by Advarra’s institutional review board (Number Pro00046797) in accordance with the US Department of Health and Human Services regulations at Title 45 Code of Federal Regulations Part 46. This study protocol was reviewed by the Advarra institutional review board (IRB) and determined to be exempt from IRB oversight, as deidentified secondary data analysis is generally not regarded as research with human participants.
Ginger provides web-based on-demand mental health services, primarily through employee or health plan benefits. Using a mobile app platform, Ginger users can access text-based behavioral health coaching, teletherapy, and telepsychiatry, as well as self-guided content and assessments. For self-guided content, users have access to more than 200 clinically validated content cards. These content cards contain curated activities ranging from mindfulness exercises to psychotherapeutic education. The content is presented in a variety of formats, including meditations, breathing exercises, videos, podcasts, surveys, and readings that typically take between 2 and 10 minutes to complete. The Ginger app uses the Amplitude analytics platform to record content-related events emitted by users while using the app [
Ginger users generally access content via 1 of several pathways in the mobile app, as depicted in
First, coaches supplement their text-based coaching sessions by assigning and sending links to content cards as homework.
Second, users access content cards by searching on the self-care tab (
Third, Ginger’s content recommendation system surfaces recommendations under the recommendations (called “Explore” in the app) heading on the self-care tab.
Finally, users can browse through the content library by traversing through different categories (eg, Job Anxiety, Habit Formation, and Behavior Change) and browsing through various activities within them.
Content search in the Ginger self-care library. Members can access content from several sources in the app, including the Explore section, the content library, and the content search bar. This figure shows how users can access content via the search bar.
Onboarding response-based recommendations. This figure shows how answering the two onboarding questions can recommend content in the Explore section of the app.
Within the aforementioned Explore section, three algorithms serve recommendations: (1) onboarding-based recommendations, whereby content card suggestions are guided using the user’s onboarding responses, which are provided by all users when signing up for the service (
For example, a combination of responses by a user could be Anxious, Depressed, Family, Career, and Something else.
Anxious
Depressed
Grieving
Not motivated
Overwhelmed
Stressed
Something else
Career
Dating
Family
Hobbies
Personal finance
Personal growth
Physical health
Social life
Something else
Users who have had a conversation with a coach of more than 15 messages within the past 60 days receive conversation-based recommendations. Otherwise, the app defaults to onboarding-based recommendations. If the user selected only “Something else” for both onboarding questions, the algorithm falls back to random recommendations.
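The fallback order described above can be written as a short decision function. This is a minimal sketch, not the production logic: the `Conversation` type, the function name, and the returned labels are hypothetical, and treating a user with no informative answers at all as the random case is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

MIN_MESSAGES = 15      # conversations must have more than 15 messages
LOOKBACK_DAYS = 60     # and must fall within the past 60 days
CATCH_ALL = "Something else"


@dataclass
class Conversation:
    day: date          # date of the coaching conversation
    n_messages: int    # number of messages exchanged


def choose_algorithm(conversations: List[Conversation],
                     feelings: List[str],
                     topics: List[str],
                     today: date) -> str:
    """Return which recommender should serve the Explore section."""
    cutoff = today - timedelta(days=LOOKBACK_DAYS)
    # 1. A recent, sufficiently long conversation triggers conversation-based recommendations.
    for conv in conversations:
        if conv.day >= cutoff and conv.n_messages > MIN_MESSAGES:
            return "conversation"
    # 2. Only the catch-all answer on both questions leaves nothing to personalize on.
    informative = [a for a in list(feelings) + list(topics) if a != CATCH_ALL]
    if not informative:
        return "random"
    # 3. Otherwise fall back to the onboarding responses.
    return "onboarding"
```

For example, a user whose only conversation has exactly 15 messages, and who answered Anxious and Career at onboarding, would be routed to the onboarding-based recommender under this sketch.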
Onboarding responses are provided by all users upon signing up for the service (
The ground truth relevance of the content cards to onboarding response labels was gathered through expert annotations. Six certified mental health coaches annotated 170 Ginger content cards [
When matching these cards for a user, we constructed a similar vector for the user’s response to onboarding answers (eg, user response U: [Anxious: 1, Depressed: 1, Grieving: 0, Social life: 0]).
As we previously mentioned, we obtained the set of derived card mappings and user responses as vectors. Using these vectors, for each user, we computed the cosine similarity of their user onboarding response vector U with each of the content card vectors (C1…Ci…Cn) [
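The matching step can be illustrated with a small example. This is a minimal sketch under stated assumptions: the four-label vocabulary mirrors the example vector above, while the card names, the card vectors, and the top-k cutoff are hypothetical and do not come from the annotated card set.

```python
import math

# Label order shared by user and card vectors (shortened for illustration).
LABELS = ["Anxious", "Depressed", "Grieving", "Social life"]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_cards(user_vec, card_vecs, k=5):
    """Return the names of the k cards most similar to the user's response vector."""
    scored = [(cosine(user_vec, vec), name) for name, vec in card_vecs.items()]
    # Highest similarity first; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical cards annotated against the same labels.
cards = {
    "card_a": [1, 1, 0, 0],   # relevant to Anxious and Depressed
    "card_b": [1, 0, 0, 0],   # relevant to Anxious only
    "card_c": [0, 0, 1, 1],   # relevant to Grieving and Social life
}
user = [1, 1, 0, 0]           # user selected Anxious and Depressed
top_two = rank_cards(user, cards, k=2)   # card_a first, then card_b
```

Because cosine similarity normalizes by vector length, a card tagged with exactly the user's labels scores higher than a card that covers only one of them.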
The conversation-based recommendation algorithm works by matching the semantic similarity between the content of a conversation to the text description of content cards to make recommendations suitable for a conversational snippet. An example of a recommendation made by this algorithm for a coach-user conversation is shown in
Example output of conversation-based recommendations: User mentions having anxiety due to communication at the workplace. The conversation snippet and corresponding activity card suggestions by the algorithm are shown.
A subset of important messages was extracted from a coach-user conversation and used by the conversation-based recommendation algorithm (
The conversation-based recommendation algorithm provided an ordered list of content cards from highest to lowest similarity to the conversational text using an unsupervised method that requires no training data. To do this, both the conversational text and content card descriptions were mathematically represented as embedding vectors [
Conversation-based content recommendations. This algorithm provides inference over 3 stages. Stage 1: We algorithmically identify the most important messages in a conversation (text in blue). Stage 2: We mathematically represent the text for sessions and content card description using a natural language model as document embeddings or vectors. Stage 3: For each conversational snippet, we find content cards that are most similar to important messages in the conversation and retrieve the cards with the highest similarity to the text.
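The 3 stages can be sketched end to end with toy components. Everything model-specific here is a stand-in: the message filter is a hypothetical length heuristic rather than the message importance algorithm, the bag-of-words `embed` function replaces the natural language model, and the card names and descriptions are invented for illustration.

```python
import math
import re
from collections import Counter


def important_messages(messages, min_words=4):
    """Stage 1 placeholder: keep longer messages (hypothetical heuristic)."""
    return [m for m in messages if len(m.split()) >= min_words]


def embed(text):
    """Stage 2 placeholder: bag-of-words counts instead of a learned embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(messages, card_descriptions, k=5):
    """Stage 3: rank cards by similarity to the important part of the conversation."""
    query = embed(" ".join(important_messages(messages)))
    scored = [(cosine(query, embed(desc)), name)
              for name, desc in card_descriptions.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


messages = ["ok", "I feel anxious about speaking up in meetings at work", "thanks"]
cards = {
    "workplace_anxiety": "Managing anxious feelings about meetings at work",
    "sleep": "A breathing exercise to help you fall asleep",
}
top = recommend(messages, cards, k=1)
```

No labeled training data is needed in this sketch, which matches the unsupervised design described above; swapping `embed` for a sentence-level language model would change the quality of the ranking but not the structure of the pipeline.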
We evaluated performance both offline and in the app. The offline evaluation informed algorithm design decisions, and the in-app evaluation measured the algorithm’s performance in the real-world setting. For the offline evaluation, we compared the relevance of the conversation-based recommendations to random recommendations. For the in-app evaluation, we compared the completion rates of cards recommended by the conversation-based recommendation algorithm with both the onboarding-based and random recommendations.
For the offline evaluation, we computed the probability of a recommended card being relevant (also defined as relevance rate) for the top 5 conversation-based recommendations per conversation and compared it to random recommendations to assess the relative performance of the two algorithms.
To do this, we bucketed conversation sessions by the number of text messages and reported relevance rates per bucket to understand how relevance varies with the number of messages in a session. The data set for these conversational sessions and recommendation pairs was created by generating batch predictions for 110 randomly selected text sessions between a Ginger coach and a user. The bucketed distributions by the number of messages in the session are shown in
The 110 coach-user conversational sessions were annotated using the open-source Doccano annotation tool [
We used the majority agreement rate (MAR) as our metric for interannotator agreement. In a nutshell, MAR calculates how often each annotator agrees with the majority vote from all annotators according to a classification metric such as accuracy, precision, or
Distribution of messages in conversational sessions.
Bucket (messages in session), n | Sessions, % |
0-5 | 28.26 |
5-10 | 24.76 |
10-20 | 22.07 |
20-40 | 14.13 |
40 and above | 10.76 |
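The per-bucket relevance rate used in the offline evaluation can be computed as follows. This is a minimal sketch: the input format is hypothetical, and treating each bucket as a half-open interval (lower bound included, upper bound excluded) is an assumption, since the bucket labels share their boundaries.

```python
# Buckets as (lower, upper) message counts; None marks the open-ended bucket.
BUCKETS = [(0, 5), (5, 10), (10, 20), (20, 40), (40, None)]


def bucket_label(n_messages):
    """Map a session's message count to its bucket label."""
    if n_messages < 0:
        raise ValueError("message count cannot be negative")
    for lower, upper in BUCKETS:
        if upper is None:
            return f"{lower} and above"
        if n_messages < upper:
            return f"{lower}-{upper}"


def relevance_by_bucket(sessions):
    """sessions: iterable of (n_messages, labels), where labels holds one 0/1
    relevance annotation per recommended card (eg, the top 5 cards)."""
    totals = {}
    for n_messages, labels in sessions:
        label = bucket_label(n_messages)
        relevant, count = totals.get(label, (0, 0))
        totals[label] = (relevant + sum(labels), count + len(labels))
    # Relevance rate = relevant recommendations / all recommendations in the bucket.
    return {label: rel / cnt for label, (rel, cnt) in totals.items() if cnt}


sessions = [
    (3, [0, 0, 0, 0, 1]),
    (4, [0, 0, 0, 0, 0]),
    (12, [1, 1, 0, 0, 0]),
]
rates = relevance_by_bucket(sessions)
```

Running the same function on annotations of randomly drawn cards gives the control rate for each bucket, so the two methods can be compared bucket by bucket.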
Interannotator agreement.
Answers | Annotator 1 | Annotator 2 | Annotator 3 | All |
“Not relevant” | 0.607143 | 1.000000 | 1.000000 | 0.869048 |
“Relevant” | 1.000000 | 0.621622 | 0.864865 | 0.828829 |
Macroaverage | 0.803571 | 0.810811 | 0.932432 | 0.848938 |
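To make the agreement metric concrete, the sketch below computes a majority agreement rate using accuracy as the classification metric. It assumes binary labels and an odd number of annotators so that the majority vote is never tied; the per-class values in the table above would require a per-class metric instead, which this example does not attempt to reproduce.

```python
from collections import Counter


def majority_vote(labels_per_item):
    """Most common label for each item across annotators."""
    return [Counter(item).most_common(1)[0][0] for item in labels_per_item]


def majority_agreement_rate(labels_per_item):
    """labels_per_item[i][j] is annotator j's label for item i.
    Returns, per annotator, the share of items on which they match the majority."""
    majority = majority_vote(labels_per_item)
    n_items = len(labels_per_item)
    n_annotators = len(labels_per_item[0])
    rates = []
    for j in range(n_annotators):
        agreed = sum(1 for item, vote in zip(labels_per_item, majority)
                     if item[j] == vote)
        rates.append(agreed / n_items)
    return rates


# Four recommended cards, three annotators (1 = relevant, 0 = not relevant).
items = [(1, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 1)]
per_annotator = majority_agreement_rate(items)   # [0.75, 1.0, 0.75]
```

In this toy data, the second annotator always sides with the majority, while the first and third each dissent on one item.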
We measured content card completion rates in the app’s Explore section for all 3 served algorithms (ie, random recommendations, onboarding-based recommendations, and conversation-based recommendations) over 68 days between June 2021 and September 2021, during which recommendations were served to 14,018 users. We also measured the completion rate of cards via the recommendations section compared to other sections of the app where cards are not recommended by these algorithms. These sections included the Content Library, Home Screen, Search, and coaching recommendations through chat. Additionally, to understand the effects of age and gender on content completion, we measured and compared completion rates of content consumed via different algorithms in the recommendation section by stratifying users by age and gender. The card completion rate is the ratio of the number of times content cards were completed in a section to the number of times cards were viewed in that respective section of the app in the same time frame. We chose to use the completion rate as a proxy for engagement compared to metrics such as click-through rate since the completion of a content card is more closely tied to a user finishing the desired activity.
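The completion rate defined above is a simple ratio per section. The following sketch computes it from a flat event log; the event tuple format and the event type names `view` and `complete` are hypothetical and are not the names recorded by the app's analytics platform.

```python
from collections import Counter


def completion_rates(events):
    """events: iterable of (section, event_type) pairs, where event_type is
    'view' or 'complete'. Returns completions / views for each viewed section."""
    views, completions = Counter(), Counter()
    for section, event_type in events:
        if event_type == "view":
            views[section] += 1
        elif event_type == "complete":
            completions[section] += 1
    # Sections with no views are skipped to avoid division by zero.
    return {section: completions[section] / views[section] for section in views}


events = (
    [("Explore", "view")] * 5 + [("Explore", "complete")] * 2
    + [("Library", "view")] * 4 + [("Library", "complete")] * 1
)
rates = completion_rates(events)   # Explore: 2/5, Library: 1/4
```

Both counts must cover the same time frame and the same section, otherwise the ratio mixes completions of cards that were opened elsewhere.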
We compared the relevance rates between the conversation-based recommendations and the random control recommendations using a paired
To test whether session length categories are jointly significant in predicting a differential impact of the random and conversation-based recommendations on relevance rates, we conducted an omnibus test for a model relying on the interaction of session length and recommendation method in predicting relevance rates. The resulting statistic was
Performance of the conversation-based algorithm in the offline analysis.
Messages in session, n | Relevance rate for conversation-based recommendation algorithm | Relevance rate for random control | Difference: (algorithm−control) | Sidak-adjusted | |
0-5 | 0.099 | 0.028 | 0.071 | .23 | .98 |
5-10 | 0.433 | 0.111 | 0.322 | .001 | .46 |
10-20 | 0.393 | 0.175 | 0.218 | .01 | .91 |
20-40 | 0.47 | 0.375 | 0.095 | .20 | .95
40 and above | 0.753 | 0.561 | 0.192 | .01 | .96 |
Probability of a recommended card being relevant (relevance rate): conversation-based recommendation measured against the random recommendation control.
As shown in
Click and completion rates across content sources.
Content source | Clicks, n | Completions, n | Completion rate, % | Precision rate (K=5), % |
Home page | 21,863 | 8679 | 39.7 | N/A^a
Library | 20,291 | 6939 | 34.2 | N/A
Recommendations | 7871 | 3353 | 42.6 | N/A
Conversations | 2364 | 1108 | 46.9 | 16.9
Onboarding responses | 4067 | 1712 | 42.1 | 14.9
Random | 1440 | 534 | 37.1 | 16.5
Coach chat | 7313 | 2698 | 36.9 | N/A
Other | 7707 | 2974 | 38.6 | N/A
^a N/A: not applicable.
These
To observe if certain groups of age or gender demographics were less or more receptive to personalized recommendations, we created point plots splitting the completion rates of content delivered via the three different algorithms by age and gender categories (
Completion rates of content delivered via the three different algorithms across different age and gender categories. Note that this point plot was plotted with the Seaborn Python library using a bootstrapped sampling of data points to generate confidence intervals.
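The bootstrapped confidence intervals mentioned in the caption can be approximated with a percentile bootstrap over per-view outcomes. This is a minimal sketch using only the standard library, not the Seaborn implementation; the function name, the 1000 resamples, the 95% level, and the fixed seed are assumptions chosen for illustration.

```python
import random


def bootstrap_ci(outcomes, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes
    (1 = the viewed card was completed, 0 = it was not)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the completion rate of each resample.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    # Drop the same number of resamples from each tail.
    tail = int(round((1 - level) / 2 * n_boot))
    return means[tail], means[n_boot - 1 - tail]


# Hypothetical stratum: 100 card views, 40 of them completed.
outcomes = [1] * 40 + [0] * 60
low, high = bootstrap_ci(outcomes)
```

Strata with few views produce wide intervals under this procedure, which is worth keeping in mind when comparing small demographic groups.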
To test whether demographics jointly predict differential completion rates by recommendation type, we conducted
In this study, we presented 2 personalized methods for delivering content recommendations, namely the onboarding-based and conversation-based content recommendation algorithms. As a measure of the impact of recommendations, we observed that the recommendations section had overall higher completion rates compared to the content in other sections of the app. For the different algorithms used in this study, we noticed that the conversation-based content recommendations had the highest completion rates in the Explore section of the app over onboarding-based recommendations and random recommendations. Finally, we saw that completion rates differed across age and gender groups depending on the recommendation method, with responsiveness to conversation-based recommendations being higher among users who were 35 years or older or who identified as male.
Content activity cards in the Explore (recommendations) section had a higher completion rate (3353/7871, 42.6%) than cards in other sections of the app, including the content library and content embedded in chat conversations. This points to higher user engagement in this section. One possible confounding factor for this observation could be that the recommendations shelf lives on top of the self-care tab (
All 3 recommendation algorithms live in the same section of the app, so they could be compared without the effect of placement in the app. Conversation-based recommendations had the highest completion rate compared to onboarding-based recommendations and random recommendations. The increased relevance of content cards is associated with increased user engagement and content card completion. We posit that onboarding-based recommendations outperformed random recommendations because they were personalized to the user’s onboarding answers. Similarly, conversation-based recommendations had higher engagement rates than onboarding-based and random recommendations. We hypothesize that this was because conversation-based recommendations dynamically update as a user chats with their coach, facilitating a better care experience across the app.
During the offline analysis, we observed a trend of increasing relevance as the number of messages increased in a session. This is primarily an artifact of the algorithm design since there is a higher chance that a longer conversational session will recommend more relevant content when more topics are discussed. However, this result motivated our decision to establish a threshold of 15 messages (or an average relevance score of ~0.4 for the 10 to 20 message bucket) as the minimum number of messages required for a session to trigger conversation-based recommendations in the Ginger app.
One limitation of this work is that we cannot derive causal inferences from the results of this study, as content card recommendation completion could be driven by numerous factors besides the recommendation algorithm itself. The three different algorithms were not served randomly across the user population; rather, a user’s baseline level of engagement determined which recommendation system they were served. There might be other confounding factors associated with users attending their coaching sessions (which means they have sessions to use for recommendations) and being more motivated to complete and update their onboarding responses. Additionally, engagement can vary with confounders such as time of day and year, baseline Patient Health Questionnaire and Generalized Anxiety Disorder Assessment scores, and user resilience [
Another limitation of this work is in the choice of our user engagement metric, the card completion rate. While the completion rate is a good proxy for understanding if users are engaging with content that they click on, it does not indicate the attractiveness of a content item. Attractiveness is better captured by the click-through rate, which is the probability that a user will click on an item after viewing it. Unfortunately, it is difficult to estimate click-through rates in the current version of the Ginger app across devices of different sizes. For this reason, we chose to use only the completion rate as our main engagement metric.
Finally, our results indicate that content completion differs across demographic groups by recommendation algorithm type; however, the reasons for this are not known to us at present. This will be the focus of a future qualitative study.
Our findings suggest that recommended content has better engagement than other sections of the Ginger app. Thus, it will be beneficial for the app design to have minimum friction to access recommended content, preferably on the home page of the app. Further, since longer conversational sessions drive more relevant content recommendations, we want to ensure that we trigger conversation-based recommendations only for sessions with more than a threshold number of messages. As previously discussed, we have already incorporated this design decision into our recommendation infrastructure. While conversation-based recommendations may provide better engagement, the most suitable algorithm will depend on the context of usage. Onboarding-based recommendations are ideal for “cold starting” the process of recommending content for new users and users that tend to use the app just for content but not for therapy or coaching [
Recommendation systems can help scale and supplement digital mental health care with personalized content and self-care recommendations. We present and evaluate 2 knowledge-based recommenders in this study: 1 static algorithm utilizing user onboarding responses and 1 adaptive algorithm utilizing user conversations with their coach. Onboarding-based recommendations are best suited for delivering personalized recommendations to users when there are sparse or skewed content usage data sets on a platform. On the other hand, the conversation-based recommendation algorithm allows for dynamic recommendations based on additional information gathered during text-based coaching sessions spanning months, which is essential given the changing nature of mental health needs throughout treatment. The conversation-based algorithm had the highest completion rates across all recommendation methods and other sections of the Ginger app that deliver content. This algorithm also had a higher completion rate among users aged 35 years and up and male-identifying users. The proposed algorithms are but a step toward outcome-driven personalization in mental health. Future work will involve a robust causal evaluation of these algorithms using randomized controlled trials and consumer feedback–driven improvement of these algorithms to drive better clinical outcomes.
Message importance algorithm: development and evaluation.
IRB: institutional review board
MAR: majority agreement rate
RS: recommender system
This work was funded through Headspace Health’s internal research and development budget. We thank Alex Boisvert, Nix Barnett, Marcelo Manjon, and William Kearns, all of whom are current employees of Headspace Health, for their feedback and contributions to this paper.
For this study, we used a data set of coaching conversations between mental health care patients and their care providers, which constitutes personal health information, and app analytics data from the mobile app of Ginger, a subsidiary of Headspace Health. These data are not publicly available due to privacy reasons and to safeguard Headspace Health’s in-house analytics data.
All authors are current or past paid employees of Headspace Health.