<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "journalpublishing.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" article-type="letter"><front><journal-meta><journal-id journal-id-type="nlm-ta">JMIR Form Res</journal-id><journal-id journal-id-type="publisher-id">formative</journal-id><journal-id journal-id-type="index">27</journal-id><journal-title>JMIR Formative Research</journal-title><abbrev-journal-title>JMIR Form Res</abbrev-journal-title><issn pub-type="epub">2561-326X</issn><publisher><publisher-name>JMIR Publications</publisher-name><publisher-loc>Toronto, Canada</publisher-loc></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">v10i1e85906</article-id><article-id pub-id-type="doi">10.2196/85906</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Letter</subject></subj-group></article-categories><title-group><article-title>Machine Learning&#x2013;Based Audiovisual Phenotyping for Measuring Communication, Shared Decision-Making, and Trust</article-title></title-group><contrib-group><contrib contrib-type="author"><name name-style="western"><surname>Khaikin</surname><given-names>Shely</given-names></name><degrees>BA</degrees><xref ref-type="aff" rid="aff1">1</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Tiruvadi</surname><given-names>Vineet</given-names></name><degrees>MD, PhD</degrees><xref ref-type="aff" rid="aff2">2</xref><xref ref-type="aff" rid="aff3">3</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Brooks</surname><given-names>Jeffrey</given-names></name><degrees>PhD</degrees><xref ref-type="aff" rid="aff3">3</xref></contrib><contrib contrib-type="author"><name 
name-style="western"><surname>Baird</surname><given-names>Alice</given-names></name><degrees>PhD</degrees><xref ref-type="aff" rid="aff3">3</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Grela-Mpoko</surname><given-names>Anne-Catherine</given-names></name><degrees>BS, MPH</degrees><xref ref-type="aff" rid="aff1">1</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Hoffman</surname><given-names>Lindsey</given-names></name><degrees>BS</degrees><xref ref-type="aff" rid="aff1">1</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Crossley</surname><given-names>Jadyn</given-names></name><xref ref-type="aff" rid="aff1">1</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Leasy</surname><given-names>Menachem</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff4">4</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Fineman</surname><given-names>Jaime</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff5">5</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Savoy</surname><given-names>Margot</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff6">6</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Igarabuza</surname><given-names>Laura</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff4">4</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Paranjape</surname><given-names>Anuradha</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff7">7</xref></contrib><contrib contrib-type="author"><name name-style="western"><surname>Foo</surname><given-names>Cheryl YS</given-names></name><degrees>PhD</degrees><xref ref-type="aff" rid="aff8">8</xref></contrib><contrib contrib-type="author"><name 
name-style="western"><surname>Birnbaum</surname><given-names>Michael L</given-names></name><degrees>MD</degrees><xref ref-type="aff" rid="aff9">9</xref><xref ref-type="aff" rid="aff10">10</xref></contrib><contrib contrib-type="author" corresp="yes"><name name-style="western"><surname>Zisman-Ilani</surname><given-names>Yaara</given-names></name><degrees>MA, PhD</degrees><xref ref-type="aff" rid="aff11">11</xref><xref ref-type="aff" rid="aff12">12</xref><xref ref-type="aff" rid="aff13">13</xref></contrib></contrib-group><aff id="aff1"><institution>Shared Decision Making Laboratory, Temple University</institution><addr-line>Philadelphia</addr-line><addr-line>PA</addr-line><country>United States</country></aff><aff id="aff2"><institution>Harvard University</institution><addr-line>Cambridge</addr-line><addr-line>MA</addr-line><country>United States</country></aff><aff id="aff3"><institution>Hume AI</institution><addr-line>New York</addr-line><addr-line>NY</addr-line><country>United States</country></aff><aff id="aff4"><institution>Department of Clinical Family and Community Medicine, Lewis Katz School of Medicine, Temple University</institution><addr-line>Philadelphia</addr-line><addr-line>PA</addr-line><country>United States</country></aff><aff id="aff5"><institution>Department of Clinical Medicine, Lewis Katz School of Medicine, Temple University</institution><addr-line>Philadelphia</addr-line><addr-line>PA</addr-line><country>United States</country></aff><aff id="aff6"><institution>American Academy of Family Physicians</institution><addr-line>Washington</addr-line><addr-line>DC</addr-line><country>United States</country></aff><aff id="aff7"><institution>University of Colorado School of Medicine</institution><addr-line>Aurora</addr-line><addr-line>CO</addr-line><country>United States</country></aff><aff id="aff8"><institution>Center of Excellence for Psychosocial and Systemic Research, Department of Psychiatry, Massachusetts General 
Hospital</institution><addr-line>Boston</addr-line><addr-line>MA</addr-line><country>United States</country></aff><aff id="aff9"><institution>New York State Psychiatric Institute</institution><addr-line>New York</addr-line><addr-line>NY</addr-line><country>United States</country></aff><aff id="aff10"><institution>Department of Psychiatry, Vagelos College of Physicians and Surgeons, Columbia University</institution><addr-line>New York</addr-line><addr-line>NY</addr-line><country>United States</country></aff><aff id="aff11"><institution>Department of Clinical, Educational and Health Psychology, Division of Psychology and Language Sciences, University College London</institution><addr-line>1-19 Torrington Place</addr-line><addr-line>London</addr-line><country>United Kingdom</country></aff><aff id="aff12"><institution>Department of Social and Behavioral Sciences, Barnett College of Public Health, Temple University</institution><addr-line>Philadelphia</addr-line><addr-line>PA</addr-line><country>United States</country></aff><aff id="aff13"><institution>Department of Psychiatry and Behavioral Sciences, Lewis Katz School of Medicine, Temple University</institution><addr-line>Philadelphia</addr-line><addr-line>PA</addr-line><country>United States</country></aff><contrib-group><contrib contrib-type="editor"><name name-style="western"><surname>Schwartz</surname><given-names>Amy</given-names></name></contrib><contrib contrib-type="editor"><name name-style="western"><surname>Balcarras</surname><given-names>Matthew</given-names></name></contrib></contrib-group><contrib-group><contrib contrib-type="reviewer"><name name-style="western"><surname>Fukunaga</surname><given-names>Mayuko Ito</given-names></name></contrib><contrib contrib-type="reviewer"><name name-style="western"><surname>Gong</surname><given-names>Ziyang</given-names></name></contrib></contrib-group><author-notes><corresp>Correspondence to Yaara Zisman-Ilani, MA, PhD, Department of Clinical, Educational and Health 
Psychology, Division of Psychology and Language Sciences, University College London, 1-19 Torrington Place, London, WC1E 7HB, United Kingdom; <email>y.zisman-ilani@ucl.ac.uk</email></corresp></author-notes><pub-date pub-type="collection"><year>2026</year></pub-date><pub-date pub-type="epub"><day>3</day><month>3</month><year>2026</year></pub-date><volume>10</volume><elocation-id>e85906</elocation-id><history><date date-type="received"><day>15</day><month>10</month><year>2025</year></date><date date-type="accepted"><day>28</day><month>01</month><year>2026</year></date></history><copyright-statement>&#x00A9; Shely Khaikin, Vineet Tiruvadi, Jeffrey Brooks, Alice Baird, Anne-Catherine Grela-Mpoko, Lindsey Hoffman, Jadyn Crossley, Menachem Leasy, Jaime Fineman, Margot Savoy, Laura Igarabuza, Anuradha Paranjape, Cheryl YS Foo, Michael L Birnbaum, Yaara Zisman-Ilani. Originally published in JMIR Formative Research (<ext-link ext-link-type="uri" xlink:href="https://formative.jmir.org">https://formative.jmir.org</ext-link>), 3.3.2026. </copyright-statement><copyright-year>2026</copyright-year><license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. 
The complete bibliographic information, a link to the original publication on <ext-link ext-link-type="uri" xlink:href="https://formative.jmir.org">https://formative.jmir.org</ext-link>, as well as this copyright and license information must be included.</p></license><self-uri xlink:type="simple" xlink:href="https://formative.jmir.org/2026/1/e85906"/><abstract><p>Machine learning&#x2013;based audiovisual phenotyping can reveal hidden discrepancies between patients&#x2019; self-reported experiences and nonverbal expressions, offering a promising tool for objectively assessing communication quality and advancing health equity.</p></abstract><kwd-group><kwd>audiovisual digital phenotyping</kwd><kwd>shared decision-making</kwd><kwd>AI</kwd><kwd>artificial intelligence</kwd><kwd>depression</kwd><kwd>primary care</kwd><kwd>natural language processing</kwd></kwd-group></article-meta></front><body><sec id="s1" sec-type="intro"><title>Introduction</title><p>Although depression is highly prevalent, many patients do not engage with prescribed treatments, particularly racial and ethnic minority individuals in primary care settings where clinicians lack time and infrastructure for effective communication [<xref ref-type="bibr" rid="ref1">1</xref>]. Shared decision-making (SDM) can enhance engagement [<xref ref-type="bibr" rid="ref2">2</xref>], but SDM is not yet the norm [<xref ref-type="bibr" rid="ref3">3</xref>]. Social desirability bias, power dynamics, and cultural norms may lead patients to report high SDM and trust [<xref ref-type="bibr" rid="ref4">4</xref>] despite feeling otherwise.</p><p>Objective measurements may better capture true experiences. Current SDM assessments rely on subjective self-reports or observer ratings with no objective alternatives. Audiovisual digital phenotyping (ADP) is useful in monitoring depression [<xref ref-type="bibr" rid="ref5">5</xref>,<xref ref-type="bibr" rid="ref6">6</xref>] and could assess communication quality. 
This study evaluated multimodal ADP&#x2019;s usability for assessing health communication, SDM, and trust in depression care. We compared ADP outputs from audio, visual, and language modalities with validated patient-reported measures to identify patterns of discrepancies and alignments.</p></sec><sec id="s2" sec-type="methods"><title>Methods</title><sec id="s2-1"><title>Study Design</title><p>Twenty-four participants were recruited from primary care practices. Eligible adults had a depressive disorder diagnosis (ICD-10-CM F33.xx) and a recent primary care visit. Participants completed recorded video interviews about patient-provider communication, decision-making experiences, and self-report measures: SDM-Q-9-Psy [<xref ref-type="bibr" rid="ref7">7</xref>] (low to high SDM), CollaboRATE [<xref ref-type="bibr" rid="ref8">8</xref>] (low to high provider engagement effort), and Trust scale [<xref ref-type="bibr" rid="ref9">9</xref>] (low to high trust). Mean and sum scores were calculated; participants were categorized as having negative or positive communication experiences based on lower or higher scores, respectively.</p><p>From the interviews, two short video clips per participant were extracted: one reflecting positive communication experiences and one reflecting negative experiences. Verbal and nonverbal responses were analyzed with three on-premise Hume AI expression models capturing ADP [<xref ref-type="bibr" rid="ref10">10</xref>,<xref ref-type="bibr" rid="ref11">11</xref>]: (1) a facial expression model (FaceNet Inception-ResNet V1) capturing movement and nonverbal expression of the face; (2) a speech prosody model (Whisper-Small), assessing tone and vocal dynamics from audio; and (3) a natural language processing (NLP) model (BERT), identifying the emotionality of the spoken transcript. 
For each participant, the top three emotions per modality were extracted.</p><p>Alignment or discrepancy between self-report and ADP was assessed using face validity by comparing self-report scores with emotional outputs. Participants with positive experiences were matched to positive clips; those with negative experiences were matched to negative clips. Alignment required concordance between reported experience and extracted emotions (eg, negative emotions in negative clips for low scorers). Exploratory analyses examined clips with opposite experience types.</p></sec><sec id="s2-2"><title>Ethical Considerations</title><p>This study was approved by the Temple University Institutional Review Board (Protocol #29435). Patients provided informed consent for participation. Participants received a US $20 gift card. Self-reported and ADP data were deidentified.</p></sec></sec><sec id="s3" sec-type="results"><title>Results</title><p>Of the 24 participants who completed the study, data from 6 were analyzable after excluding cases with simultaneous on-screen appearances of participant and interviewer or poor video quality. The final sample included 3 women (50%), 5 Black participants (83%), and 4 unemployed participants (67%, <xref ref-type="table" rid="table1">Table 1</xref>). Because the six interviews lasted 48 (SD 13.1) minutes on average, we selected shorter clips for ADP analysis. Selected clip lengths were 14&#x2010;58 seconds (mean 29.4, SD 12.7 seconds), each containing approximately 30 analyzable frames per second (about 840&#x2010;3480 frames per participant for two clips). 
Categorization into low and high communication experience was conducted based on an SDM-Q-9-Psy score higher than 2.5 and a Trust score higher than 27.5, as most CollaboRATE scores were above average.</p><table-wrap id="t1" position="float"><label>Table 1.</label><caption><p>Demographic characteristics and communication experiences.</p></caption><table id="table1" frame="hsides" rules="groups"><thead><tr><td align="left" valign="bottom">Participant characteristics</td><td align="left" valign="bottom">P1</td><td align="left" valign="bottom">P2</td><td align="left" valign="bottom">P3</td><td align="left" valign="bottom">P4</td><td align="left" valign="bottom">P5</td><td align="left" valign="bottom">P6</td></tr></thead><tbody><tr><td align="left" valign="top">Age (y)</td><td align="char" char="." valign="top">24</td><td align="char" char="." valign="top">56</td><td align="char" char="." valign="top">58</td><td align="char" char="." valign="top">58</td><td align="char" char="." valign="top">68</td><td align="char" char="." 
valign="top">39</td></tr><tr><td align="left" valign="top">Sex</td><td align="left" valign="top">Male</td><td align="left" valign="top">Female</td><td align="left" valign="top">Female</td><td align="left" valign="top">Male</td><td align="left" valign="top">Female</td><td align="left" valign="top">Male</td></tr><tr><td align="left" valign="top">Hispanic or Latino</td><td align="left" valign="top">No</td><td align="left" valign="top">No</td><td align="left" valign="top">No</td><td align="left" valign="top">No</td><td align="left" valign="top">Yes</td><td align="left" valign="top">No</td></tr><tr><td align="left" valign="top">Race</td><td align="left" valign="top">Black or African American</td><td align="left" valign="top">White or Caucasian</td><td align="left" valign="top">Black or African American and American Indian/Native American or Alaska Native</td><td align="left" valign="top">Black or African American</td><td align="left" valign="top">Black or African American and Other</td><td align="left" valign="top">Black or African American</td></tr><tr><td align="left" valign="top">Employment status</td><td align="left" valign="top">Employed</td><td align="left" valign="top">Unemployed</td><td align="left" valign="top">Unemployed</td><td align="left" valign="top">Unemployed</td><td align="left" valign="top">Unemployed</td><td align="left" valign="top">Unemployed</td></tr><tr><td align="left" valign="top">Decision made at the consultation</td><td align="left" valign="top">Referral to outpatient center</td><td align="left" valign="top">Refills, no new decisions were made</td><td align="left" valign="top">Referrals</td><td align="left" valign="top">Keep current medication</td><td align="left" valign="top">Stop therapy</td><td align="left" valign="top">Change in medications</td></tr><tr><td align="left" valign="top">SDM-Q-9-Psy, mean score (range 0&#x2010;5)</td><td align="char" char="." valign="top">1.11</td><td align="char" char="." 
valign="top">1.89</td><td align="char" char="." valign="top">3.00</td><td align="char" char="." valign="top">4.56</td><td align="char" char="." valign="top">5.00</td><td align="char" char="." valign="top">5.00</td></tr><tr><td align="left" valign="top">CollaboRATE, mean score (range 0&#x2010;9)</td><td align="char" char="." valign="top">5.67</td><td align="char" char="." valign="top">4.00</td><td align="char" char="." valign="top">6.67</td><td align="char" char="." valign="top">9.00</td><td align="char" char="." valign="top">9.00</td><td align="char" char="." valign="top">9.00</td></tr><tr><td align="left" valign="top">Trust in provider, sum score (range 0&#x2010;55)</td><td align="char" char="." valign="top">22</td><td align="char" char="." valign="top">24</td><td align="char" char="." valign="top">37</td><td align="char" char="." valign="top">43</td><td align="char" char="." valign="top">54</td><td align="char" char="." valign="top">55</td></tr></tbody></table></table-wrap><p>Four participants (P3-P6) reported positive communication experiences. However, for 3 (75%) participants, ADP analysis revealed discrepancies between self-reported positive experiences and the presence of negative (eg, distress or disappointment) or neutral (eg, confusion) emotion outputs in positive clips (<xref ref-type="table" rid="table2">Table 2</xref>). Disappointment, awkwardness, and annoyance were common negative emotions in negative clips by participants who reported positive overall experiences. These relationship-related emotions may reflect disappointment with specific aspects of the patient-provider communication. 
Among the ADP modalities, the greatest discrepancies between verbal content and ADP were observed in facial expression and NLP (in positive clips), whereas speech prosody aligned more closely with survey results in 2 participants (P4 and P6), with emotional outputs such as excitement and amusement (<xref ref-type="table" rid="table2">Table 2</xref>).</p><table-wrap id="t2" position="float"><label>Table 2.</label><caption><p>Audiovisual digital phenotyping of emotional outputs.</p></caption><table id="table2" frame="hsides" rules="groups"><thead><tr><td align="left" valign="bottom">Participant</td><td align="left" valign="bottom" colspan="6">Clip type and modality</td></tr><tr><td align="left" valign="bottom"/><td align="left" valign="bottom" colspan="3">Negative, mean (SD)</td><td align="left" valign="bottom" colspan="3">Positive, mean (SD)</td></tr><tr><td align="left" valign="top"/><td align="left" valign="top">FE<sup><xref ref-type="table-fn" rid="table2fn1">a</xref></sup></td><td align="left" valign="top">SP<sup><xref ref-type="table-fn" rid="table2fn2">b</xref></sup></td><td align="left" valign="top">NL<sup><xref ref-type="table-fn" rid="table2fn3">c</xref></sup></td><td align="left" valign="top">FE</td><td align="left" valign="top">SP</td><td align="left" valign="top">NL</td></tr></thead><tbody><tr><td align="left" valign="top">P1<sup><xref ref-type="table-fn" rid="table2fn4">d</xref></sup></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Amusement: 0.41 (0.13)</p></list-item><list-item><p>Joy: 0.40 (0.16)</p></list-item><list-item><p>Satisfaction: 0.35 (0.11)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Anxiety: 0.18 (0.19)</p></list-item><list-item><p>Confusion: 0.16 (0.17)</p></list-item><list-item><p>Calmness: 0.15 (0.14)</p></list-item></list></td><td align="left" valign="top"><list 
list-type="bullet"><list-item><p>Confusion: 0.32 (0.22)</p></list-item><list-item><p>Anxiety: 0.22 (0.15)</p></list-item><list-item><p>Contemplation: 0.18 (0.16)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Amusement: 0.50 (0.20)</p></list-item><list-item><p>Joy: 0.50 (0.23)</p></list-item><list-item><p>Satisfaction: 0.40 (0.11)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Realization: 0.16 (0.15)</p></list-item><list-item><p>Amusement: 0.12 (0.10)</p></list-item><list-item><p>Disgust: 0.12 (0.12)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Excitement: 0.27 (0.23)</p></list-item><list-item><p>Enthusiasm: 0.24 (0.07)</p></list-item><list-item><p>Interest: 0.21 (0.13)</p></list-item></list></td></tr><tr><td align="left" valign="top">P2<sup><xref ref-type="table-fn" rid="table2fn4">d</xref></sup></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Calmness: 0.41 (0.17)</p></list-item><list-item><p>Tiredness: 0.36 (0.16)</p></list-item><list-item><p>Boredom: 0.32 (0.08)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Awkwardness: 0.20 (0.14)</p></list-item><list-item><p>Sadness: 0.17 (0.19)</p></list-item><list-item><p>Realization: 0.14 (0.11)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Annoyance: 0.31 (0.18)</p></list-item><list-item><p>Disappointment: 0.24 (0.14)</p></list-item><list-item><p>Pain: 0.16 (0.18)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Confusion: 0.35 (0.13)</p></list-item><list-item><p>Concentration: 0.33 (0.15)</p></list-item><list-item><p>Calmness: 0.31 (0.15)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Disappointment: 0.29 (0.30)</p></list-item><list-item><p>Confusion: 0.25 
(0.14)</p></list-item><list-item><p>Realization: 0.21 (0.13)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Disapproval: 0.31 (0.25)</p></list-item><list-item><p>Disgust: 0.24 (0.34)</p></list-item><list-item><p>Annoyance: 0.20 (0.13)</p></list-item></list></td></tr><tr><td align="left" valign="top">P3<sup><xref ref-type="table-fn" rid="table2fn5">e</xref></sup></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Confusion: 0.49 (0.14)</p></list-item><list-item><p>Doubt: 0.33 (0.09)</p></list-item><list-item><p>Distress: 0.28 (0.08)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Realization: 0.27 (0.17)</p></list-item><list-item><p>Distress: 0.19 (0.15)</p></list-item><list-item><p>Awkwardness: 0.17 (0.05)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Annoyance: 0.36 (0.18)</p></list-item><list-item><p>Disappointment: 0.32 (0.17)</p></list-item><list-item><p>Sadness: 0.22 (0.19)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Confusion: 0.43 (0.11)</p></list-item><list-item><p>Concentration: 0.34 (0.12)</p></list-item><list-item><p>Calmness: 0.32 (0.13)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Distress: 0.21 (0.28)</p></list-item><list-item><p>Disappointment: 0.19 (0.16)</p></list-item><list-item><p>Realization: 0.16 (0.09)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Annoyance: 0.16 (0.06)</p></list-item><list-item><p>Anxiety: 0.14 (0.13)</p></list-item><list-item><p>Disappointment: 0.13 (0.12)</p></list-item></list></td></tr><tr><td align="left" valign="top">P4<sup><xref ref-type="table-fn" rid="table2fn5">e</xref></sup></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Confusion: 0.40 
(0.15)</p></list-item><list-item><p>Disappointment: 0.33 (0.05)</p></list-item><list-item><p>Sadness: 0.33 (0.08)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Awkwardness: 0.18 (0.11)</p></list-item><list-item><p>Realization: 0.17 (0.11)</p></list-item><list-item><p>Calmness: 0.12 (0.16)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Awkwardness: 0.50 (0.10)</p></list-item><list-item><p>Anxiety: 0.23 (0.17)</p></list-item><list-item><p>Annoyance: 0.19 (0.11)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Pain: 0.56 (0.19)</p></list-item><list-item><p>Sadness: 0.50 (0.10)</p></list-item><list-item><p>Distress: 0.46 (0.09)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Determination: 0.19 (0.19)</p></list-item><list-item><p>Excitement: 0.16 (0.25)</p></list-item><list-item><p>Amusement: 0.15 (0.19)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Awkwardness: 0.31 (0.07)</p></list-item><list-item><p>Realization: 0.20 (0.03)</p></list-item><list-item><p>Doubt: 0.18 (0.12)</p></list-item></list></td></tr><tr><td align="left" valign="top">P5<sup><xref ref-type="table-fn" rid="table2fn5">e</xref></sup></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Confusion: 0.35 (0.11)</p></list-item><list-item><p>Distress: 0.29 (0.08)</p></list-item><list-item><p>Pain: 0.27 (0.18)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Realization: 0.12 (0.13)</p></list-item><list-item><p>Amusement: 0.12 (0.14)</p></list-item><list-item><p>Sadness: 0.11 (0.22)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Sadness: 0.43 (0.31)</p></list-item><list-item><p>Disappointment: 0.23 (0.14)</p></list-item><list-item><p>Annoyance: 0.20 
(0.21)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Confusion: 0.39 (0.11)</p></list-item><list-item><p>Distress: 0.30 (0.06)</p></list-item><list-item><p>Disappointment: 0.30 (0.08)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Realization: 0.14 (0.06)</p></list-item><list-item><p>Contemplation: 0.12 (0.09)</p></list-item><list-item><p>Awkwardness: 0.12 (0.07)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Disappointment: 0.16 (0.16)</p></list-item><list-item><p>Realization: 0.16 (0.06)</p></list-item><list-item><p>Contemplation: 0.14 (0.04)</p></list-item></list></td></tr><tr><td align="left" valign="top">P6<sup><xref ref-type="table-fn" rid="table2fn5">e</xref></sup></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Interest: 0.42 (0.05)</p></list-item><list-item><p>Amusement: 0.39 (0.13)</p></list-item><list-item><p>Concentration: 0.33 (0.09)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Anger: 0.19 (0.23)</p></list-item><list-item><p>Contemplation: 0.17 (0.11)</p></list-item><list-item><p>Disgust: 0.13 (0.12)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Contemplation: 0.18 (0.14)</p></list-item><list-item><p>Empathic pain: 0.15 (0.18)</p></list-item><list-item><p>Sympathy: 0.12 (0.13)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Confusion: 0.48 (0.04)</p></list-item><list-item><p>Concentration: 0.44 (0.05)</p></list-item><list-item><p>Doubt: 0.37 (0.04)</p></list-item></list></td><td align="left" valign="top"><list list-type="bullet"><list-item><p>Realization: 0.21 (0.11)</p></list-item><list-item><p>Amusement: 0.13 (0.07)</p></list-item><list-item><p>Positive surprise: 0.13 (0.13)</p></list-item></list></td><td align="left" 
valign="top"><list list-type="bullet"><list-item><p>Gratitude: 0.43 (0.28)</p></list-item><list-item><p>Relief: 0.23 (0.14)</p></list-item><list-item><p>Satisfaction: 0.20 (0.10)</p></list-item></list></td></tr></tbody></table><table-wrap-foot><fn id="table2fn1"><p><sup>a</sup>FE: facial expression (FaceNet Inception-ResNet V1).</p></fn><fn id="table2fn2"><p><sup>b</sup>SP: speech prosody (Whisper-Small).</p></fn><fn id="table2fn3"><p><sup>c</sup>NL: natural language (BERT).</p></fn><fn id="table2fn4"><p><sup>d</sup>Low SDM-Q-9-Psy score.</p></fn><fn id="table2fn5"><p><sup>e</sup>High SDM-Q-9-Psy score.</p></fn></table-wrap-foot></table-wrap><p>Two participants (P1 and P2) reported negative communication experiences on surveys. In their negative clips, NLP and prosody reflected these experiences (eg, anxiety), while facial expressions showed mixed patterns: P1 displayed positive emotions (eg, amusement) and P2 displayed neutral emotions (eg, calmness). In positive clips, P1 showed predominantly positive emotions across all modalities, whereas P2 displayed a mix of neutral and negative emotions (eg, confusion) across all modalities, indicating a discrepancy with the positive clip classification but alignment with P2&#x2019;s overall negative self-reported communication experience. Notably, P1 exhibited similar facial expressions across positive and negative clips.</p></sec><sec id="s4" sec-type="discussion"><title>Discussion</title><p>This pilot study demonstrated the usability of multimodal ADP for evaluating patient-provider communication, SDM, trust, and engagement, with prosody showing the strongest alignment with self-reported experiences and facial expression showing the weakest alignment. 
Discrepancies between self-reports and nonverbal expressions may help explain high rates of service disengagement and treatment nonadherence among patients, whose nonverbal communication cues may be clinically overlooked despite reported trust and engagement [<xref ref-type="bibr" rid="ref12">12</xref>]. Nonverbal expressions aligned with self-reports for negative experiences but contradicted self-reports for positive experiences, highlighting the need for providers to be mindful of social desirability bias and patient-provider power imbalances.</p><p>To protect privacy, analyses used on-premises technology, which offers fewer advantages than cloud-based artificial intelligence (AI) models. This created challenges with simultaneous on-screen appearances, poor lighting, and nonstandard camera angles, resulting in a reduced sample size for comparing ADP with SDM and trust measures. Despite constraints, ADP provided hundreds of thousands of analyzable frames per clip, offering extensive repeated measurements. Postappointment data collection was another limitation.</p><p>Technologically, facial expression sensitivity in depression requires optimization, as limited facial expression may affect provider responses and ADP emotion extraction. Future research should address how to implement commercial AI tools while respecting ethical requirements when handling protected health information [<xref ref-type="bibr" rid="ref13">13</xref>]. Additional considerations for on-premises AI studies should ensure sufficient computing capacity to support analyses.</p><p>Given our predominantly Black patient sample, findings highlight providers&#x2019; need to recognize how social desirability bias, power dynamics, and cultural norms may lead patients to report positive experiences despite feeling disengaged. 
This demonstrates multimodal ADP&#x2019;s promise for objectively assessing communication quality and advancing health equity.</p></sec></body><back><ack><p>The authors thank Macie Sullivan, BA, a research assistant at the Shared Decision Making Laboratory, for her help with data collection.</p><p>Artificial intelligence (AI) tools were used solely for copy editing, grammar checking, and spelling corrections during manuscript preparation. No generative content was created by AI.</p></ack><notes><sec><title>Funding</title><p>This study was partly supported by the Temple University Grant-in-Aid Program.</p></sec><sec><title>Data Availability</title><p>Deidentified data supporting the findings of this study are available from the corresponding author upon reasonable request, subject to approval by the Temple University Institutional Review Board.</p></sec></notes><fn-group><fn fn-type="con"><p>Conceptualization: YZ-I</p><p>Analysis: SK</p><p>Funding acquisition: YZ-I</p><p>Methodology &#x0026; Resources: SK, VT, JB, AB, YZ-I</p><p>Project administration: SK, ACG-M, LH, JC, AP</p><p>Supervision: YZ-I</p><p>Writing &#x2013; original draft: SK, YZ-I</p><p>Writing &#x2013; review &#x0026; editing: all authors</p></fn><fn fn-type="conflict"><p>VT, JB, and AB have worked for Hume AI. 
The remaining authors declare no competing interests.</p></fn></fn-group><glossary><title>Abbreviations</title><def-list><def-item><term id="abb1">ADP</term><def><p>audiovisual digital phenotyping</p></def></def-item><def-item><term id="abb2">AI</term><def><p>artificial intelligence</p></def></def-item><def-item><term id="abb3">NLP</term><def><p>natural language processing</p></def></def-item><def-item><term id="abb4">SDM</term><def><p>shared decision-making</p></def></def-item></def-list></glossary><ref-list><title>References</title><ref id="ref1"><label>1</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Schillok</surname><given-names>H</given-names> </name><name name-style="western"><surname>Gensichen</surname><given-names>J</given-names> </name><name name-style="western"><surname>Panagioti</surname><given-names>M</given-names> </name><etal/></person-group><article-title>Effective components of collaborative care for depression in primary care: an individual participant data meta-analysis</article-title><source>JAMA Psychiatry</source><year>2025</year><month>09</month><day>1</day><volume>82</volume><issue>9</issue><fpage>868</fpage><lpage>876</lpage><pub-id pub-id-type="doi">10.1001/jamapsychiatry.2025.0183</pub-id><pub-id pub-id-type="medline">40136273</pub-id></nlm-citation></ref><ref id="ref2"><label>2</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Zisman-Ilani</surname><given-names>Y</given-names> </name><name name-style="western"><surname>Roth</surname><given-names>RM</given-names> </name><name name-style="western"><surname>Mistler</surname><given-names>LA</given-names> </name></person-group><article-title>Time to support extensive implementation of shared decision making in psychiatry</article-title><source>JAMA 
Psychiatry</source><year>2021</year><month>11</month><day>1</day><volume>78</volume><issue>11</issue><fpage>1183</fpage><lpage>1184</lpage><pub-id pub-id-type="doi">10.1001/jamapsychiatry.2021.2247</pub-id><pub-id pub-id-type="medline">34406346</pub-id></nlm-citation></ref><ref id="ref3"><label>3</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Matthews</surname><given-names>EB</given-names> </name><name name-style="western"><surname>Savoy</surname><given-names>M</given-names> </name><name name-style="western"><surname>Paranjape</surname><given-names>A</given-names> </name><etal/></person-group><article-title>Shared decision making in primary care based depression treatment: communication and decision-making preferences among an underserved patient population</article-title><source>Front Psychiatry</source><year>2021</year><volume>12</volume><fpage>681165</fpage><pub-id pub-id-type="doi">10.3389/fpsyt.2021.681165</pub-id><pub-id pub-id-type="medline">34322040</pub-id></nlm-citation></ref><ref id="ref4"><label>4</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Zisman-Ilani</surname><given-names>Y</given-names> </name><name name-style="western"><surname>Peek</surname><given-names>ME</given-names> </name></person-group><article-title>Improving equity in shared decision-making</article-title><source>JAMA Intern Med</source><year>2024</year><month>09</month><day>1</day><volume>184</volume><issue>9</issue><fpage>1130</fpage><lpage>1131</lpage><pub-id pub-id-type="doi">10.1001/jamainternmed.2024.2993</pub-id><pub-id pub-id-type="medline">39008309</pub-id></nlm-citation></ref><ref id="ref5"><label>5</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Birnbaum</surname><given-names>ML</given-names> </name><name 
name-style="western"><surname>Abrami</surname><given-names>A</given-names> </name><name name-style="western"><surname>Heisig</surname><given-names>S</given-names> </name><etal/></person-group><article-title>Acoustic and facial features from clinical interviews for machine learning-based psychiatric diagnosis: algorithm development</article-title><source>JMIR Ment Health</source><year>2022</year><month>01</month><day>24</day><volume>9</volume><issue>1</issue><fpage>e24699</fpage><pub-id pub-id-type="doi">10.2196/24699</pub-id><pub-id pub-id-type="medline">35072648</pub-id></nlm-citation></ref><ref id="ref6"><label>6</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Abbas</surname><given-names>A</given-names> </name><name name-style="western"><surname>Sauder</surname><given-names>C</given-names> </name><name name-style="western"><surname>Yadav</surname><given-names>V</given-names> </name><etal/></person-group><article-title>Remote digital measurement of facial and vocal markers of major depressive disorder severity and treatment response: a pilot study</article-title><source>Front Digit Health</source><year>2021</year><volume>3</volume><fpage>610006</fpage><pub-id pub-id-type="doi">10.3389/fdgth.2021.610006</pub-id><pub-id pub-id-type="medline">34713091</pub-id></nlm-citation></ref><ref id="ref7"><label>7</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Zisman-Ilani</surname><given-names>Y</given-names> </name><name name-style="western"><surname>Roe</surname><given-names>D</given-names> </name><name name-style="western"><surname>Scholl</surname><given-names>I</given-names> </name><name name-style="western"><surname>H&#x00E4;rter</surname><given-names>M</given-names> </name><name name-style="western"><surname>Karnieli-Miller</surname><given-names>O</given-names> </name></person-group><article-title>Shared decision making 
during active psychiatric hospitalization: assessment and psychometric properties</article-title><source>Health Commun</source><year>2017</year><month>01</month><volume>32</volume><issue>1</issue><fpage>126</fpage><lpage>130</lpage><pub-id pub-id-type="doi">10.1080/10410236.2015.1099504</pub-id><pub-id pub-id-type="medline">27168160</pub-id></nlm-citation></ref><ref id="ref8"><label>8</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Elwyn</surname><given-names>G</given-names> </name><name name-style="western"><surname>Barr</surname><given-names>PJ</given-names> </name><name name-style="western"><surname>Grande</surname><given-names>SW</given-names> </name><name name-style="western"><surname>Thompson</surname><given-names>R</given-names> </name><name name-style="western"><surname>Walsh</surname><given-names>T</given-names> </name><name name-style="western"><surname>Ozanne</surname><given-names>EM</given-names> </name></person-group><article-title>Developing CollaboRATE: a fast and frugal patient-reported measure of shared decision making in clinical encounters</article-title><source>Patient Educ Couns</source><year>2013</year><month>10</month><volume>93</volume><issue>1</issue><fpage>102</fpage><lpage>107</lpage><pub-id pub-id-type="doi">10.1016/j.pec.2013.05.009</pub-id><pub-id pub-id-type="medline">23768763</pub-id></nlm-citation></ref><ref id="ref9"><label>9</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Hall</surname><given-names>MA</given-names> </name><name name-style="western"><surname>Camacho</surname><given-names>F</given-names> </name><name name-style="western"><surname>Dugan</surname><given-names>E</given-names> </name><name name-style="western"><surname>Balkrishnan</surname><given-names>R</given-names> </name></person-group><article-title>Trust in the medical profession: conceptual and measurement 
issues</article-title><source>Health Serv Res</source><year>2002</year><month>10</month><volume>37</volume><issue>5</issue><fpage>1419</fpage><lpage>1439</lpage><pub-id pub-id-type="doi">10.1111/1475-6773.01070</pub-id><pub-id pub-id-type="medline">12479504</pub-id></nlm-citation></ref><ref id="ref10"><label>10</label><nlm-citation citation-type="confproc"><person-group person-group-type="author"><name name-style="western"><surname>Baird</surname><given-names>A</given-names> </name><name name-style="western"><surname>Tzirakis</surname><given-names>P</given-names> </name><name name-style="western"><surname>Brooks</surname><given-names>JA</given-names> </name><etal/></person-group><article-title>The ACII 2022 affective vocal bursts workshop &#x0026; competition</article-title><conf-name>2022 10th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)</conf-name><conf-date>Oct 17-21, 2022</conf-date><conf-loc>Nara, Japan</conf-loc><fpage>1</fpage><lpage>5</lpage><pub-id pub-id-type="doi">10.1109/ACIIW57231.2022.10086002</pub-id></nlm-citation></ref><ref id="ref11"><label>11</label><nlm-citation citation-type="confproc"><person-group person-group-type="author"><name name-style="western"><surname>Demszky</surname><given-names>D</given-names> </name><name name-style="western"><surname>Movshovitz-Attias</surname><given-names>D</given-names> </name><name name-style="western"><surname>Ko</surname><given-names>J</given-names> </name><name name-style="western"><surname>Cowen</surname><given-names>A</given-names> </name><name name-style="western"><surname>Nemade</surname><given-names>G</given-names> </name><name name-style="western"><surname>Ravi</surname><given-names>S</given-names> </name></person-group><article-title>GoEmotions: a dataset of fine-grained emotions</article-title><conf-name>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</conf-name><conf-date>Jul 5-10, 
2020</conf-date><comment><ext-link ext-link-type="uri" xlink:href="https://www.aclweb.org/anthology/2020.acl-main">https://www.aclweb.org/anthology/2020.acl-main</ext-link></comment><pub-id pub-id-type="doi">10.18653/v1/2020.acl-main.372</pub-id></nlm-citation></ref><ref id="ref12"><label>12</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Richardson</surname><given-names>BT</given-names> </name><name name-style="western"><surname>Jackson</surname><given-names>J</given-names> </name><name name-style="western"><surname>Marable</surname><given-names>G</given-names> </name><etal/></person-group><article-title>The role of Black churches in promoting mental health for communities of socioeconomically disadvantaged Black Americans</article-title><source>Psychiatr Serv</source><year>2024</year><month>08</month><day>1</day><volume>75</volume><issue>8</issue><fpage>740</fpage><lpage>747</lpage><pub-id pub-id-type="doi">10.1176/appi.ps.20230263</pub-id><pub-id pub-id-type="medline">38595118</pub-id></nlm-citation></ref><ref id="ref13"><label>13</label><nlm-citation citation-type="journal"><person-group person-group-type="author"><name name-style="western"><surname>Galatzer-Levy</surname><given-names>IR</given-names> </name><name name-style="western"><surname>Tomasev</surname><given-names>N</given-names> </name><name name-style="western"><surname>Chung</surname><given-names>S</given-names> </name><name name-style="western"><surname>Williams</surname><given-names>G</given-names> </name></person-group><article-title>Generative psychometrics-an emerging frontier in mental health measurement</article-title><source>JAMA Psychiatry</source><year>2026</year><month>01</month><day>1</day><volume>83</volume><issue>1</issue><fpage>5</fpage><lpage>6</lpage><pub-id pub-id-type="doi">10.1001/jamapsychiatry.2025.3258</pub-id><pub-id pub-id-type="medline">41259050</pub-id></nlm-citation></ref></ref-list></back></article>