This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.
The rising number of patients with dementia has become a serious social problem worldwide. To help detect dementia at an early stage, many studies have attempted to detect signs of cognitive decline using prosodic and acoustic features. However, many of these methods are unsuitable for everyday use because they rely on cognitive function tests or on conversational speech recorded during examinations. In contrast, conversational humanoid robots are expected to be used in the care of older people, helping to reduce the burden of care and monitoring through interaction.
This study focuses on early detection of mild cognitive impairment (MCI) through conversations between patients and humanoid robots without a specific examination, such as neuropsychological examination.
This was an exploratory study involving patients with MCI and cognitively normal (CN) older people. We collected conversation data from 94 participants (n=47, 50%, patients with MCI; n=47, 50%, CN older people) during a neuropsychological examination (the Mini-Mental State Examination [MMSE]) and during everyday conversation with a humanoid robot. We extracted 17 types of prosodic and acoustic features, such as the duration of response time and jitter, from these conversations. We conducted a statistical significance test for each feature to clarify the speech features useful for classifying people as CN or as having MCI. Furthermore, we conducted an automatic classification experiment using a support vector machine (SVM) to verify whether these 2 groups can be classified automatically using the features identified in the statistical significance test.
We obtained significant differences in 5 (29%) of 17 types of features obtained from the MMSE conversational speech. The duration of reaction time, the duration of silent periods, and the proportion of silent periods showed a significant difference (
This study shows the possibility of early and simple screening for patients with MCI using prosodic and acoustic features from everyday conversations with a humanoid robot with the same level of accuracy as the MMSE.
The rising number of patients with dementia has become a serious social problem worldwide. According to the World Health Organization (WHO), the number of patients with dementia will reach 75 million by 2030 [
Some types of dementia can be cured if they are diagnosed correctly and treatment is started. However, no breakthrough treatments have been found to date for Alzheimer disease and Lewy body dementia, which account for most cases; medication can only delay their progression and improve their symptoms.
Dementia is diagnosed comprehensively using cognitive function tests, such as the revised Hasegawa method simple intelligence assessment scale (HDS-R) [
We need a simple dementia-screening method that is less burdensome for the patient and routinely available for early detection of dementia, especially for patients with mild cognitive impairment (MCI) [
This study focuses on the early detection of MCI through conversations between patients and humanoid robots without a specific examination, such as neuropsychological examination. We analyzed speech features from the recorded everyday conversations between participants (cognitively normal [CN] people and patients with MCI) and humanoid robots to identify the effective features for detecting the signs of cognitive decline. We also conducted an automatic classification experiment with patients with MCI and CN older people using the identified features and examined the possibility of a simple dementia-screening test using their everyday conversations with humanoid robots.
Kato et al [
However, we know that major depressive disorder can also reduce cognitive function, and depression can occur as a peripheral symptom of dementia. It is necessary to identify whether a patient's cognitive decline is due to dementia or major depressive disorder. Thus, Sumali et al [
These studies showed the effectiveness of using linguistic and acoustic features obtained from conversational speech. However, many of them analyzed speech from cognitive function tests, which are not easy to conduct routinely because they require specialized doctors and psychologists. In addition, such tests are not suitable for everyday use: patients may remember the questions and their replies if the tests are conducted multiple times, which leads to inaccurate results. Therefore, Ali et al [
There have been attempts to collect conversational speeches using a dialogue system instead of doing so manually using human labor to help doctors detect dementia early. Tanaka et al [
We hope that the use of robots will save labor and help detect early signs of dementia in medical and nursing care facilities where the problem of labor shortages is severe. Jonell et al [
This research was an exploratory study involving patients with MCI and CN older people.
We recorded everyday conversations between the participants (CN older people, patients with MCI) and humanoid robots, in addition to conversations during the MMSE. We conducted statistical analyses of speech features from these conversations to clarify the speech features of patients with MCI based on the doctors' diagnostic classification into CN and MCI from various tests, such as cognitive function tests, blood tests, and MRI.
We conducted an automatic classification experiment using these features and examined the possibility of a simple and easy screening test for dementia using people's everyday conversations with humanoid robots.
We collected the conversation data of a total of 94 participants. Of these, 47 (50%) patients had MCI and 47 (50%) were CN older people. We used the same collection method as that in the previous study [
Two psychiatrists (authors TA and KN), who are experts in dementia, confirmed the patients' diagnoses from their clinical records, cognitive function tests, and MRI tests regardless of the diagnoses they had received before participating in this study.
As mentioned earlier, there are various types of dementia, such as Alzheimer disease and frontotemporal lobar degeneration. Busse et al [
We excluded people with severe mental illnesses (major depression, bipolar disorder, schizophrenia) or difficulty speaking Japanese.
The study was conducted with the approval of the Ethics Committee, University of Tsukuba Hospital (H29-065), and it followed the ethical code for research with humans, as stated in the Declaration of Helsinki. All participants provided written informed consent to participate in the study. The collected data were anonymized so that individuals could not be identified and used for analysis.
The reward for participating in the experiment was JPY 3000 (US $22.01) for patients with MCI, while CN older people participated as volunteers.
A specialized psychologist recorded Japanese conversational speech when conducting cognitive function tests on the participants, such as the MMSE [
This study used the temporal orientation and spatial orientation tasks for the analysis. To test temporal orientation, we asked for the date, the day of the week, and the season (of 4 seasons) on the day of the experiment; to test spatial orientation, we asked for the name of the prefecture, city, and region, the name (or type) of the building, and the floor where the experiment took place. The participants answered orally in both tasks, and we assessed each task on a 5-point scale.
Demographic and clinical characteristics of study participants (N=94).
Characteristics | CNa (n=47, 50%) | MCIb (n=47, 50%)
Sex (female), n (%) | 27 (57) | 20 (43)
Age (years), mean (SD) | 70.6 (4.9) | 74.0 (5.4)
Age (years), range | 61-80 | 61-87
MMSEc temporal orientation score, mean (SD) | 4.6 (0.6) | 4.5 (0.7)
MMSEc temporal orientation score, range | 3-5 | 2-5
MMSEc spatial orientation score, mean (SD) | 4.7 (0.5) | 4.7 (0.5)
MMSEc spatial orientation score, range | 3-5 | 4-5
aCN: cognitively normal.
bMCI: mild cognitive impairment.
cMMSE: Mini-Mental State Examination.
We recorded the participants' everyday conversational speech with the humanoid robot Pepper [
Although we recorded the conversational data in Wizard of Oz (WoZ) format in this study, we designed the scenario as a 1-question-1-answer dialogue, which we considered feasible with current automatic dialogue systems, rather than as free-form dialogue of the kind that occurs between humans. In addition, we developed the scenario after conducting several trial experiments with CN older people. The scenario is shown in
Participant in conversation with a humanoid robot.
A total of 3 questions, including Pepper's self-introduction, whether they had known Pepper before, and their first impression of Pepper
A total of 4 questions, including their physical condition on the present day, whether they were able to sleep the previous night, and things they are mindful about to stay healthy
A total of 5 questions, including the meals they had the previous day, how to prepare those meals, and what they like about their favorite food
Thanking them for the conversation and some utterances to signal the end of the experiment
The scenario consists of a question-reply section and a chat section. In the question-reply section, we inserted questions that test memory and the ability to think logically that have the potential to identify a decline in cognitive function (eg, details of the meals the participants had the previous night and how to prepare them). Meanwhile, the system delivers appropriate replies and self-disclosure of information according to the participants' responses to simulate everyday conversations in the chat section. The WoZ operator set Pepper's replies by judging whether the content of the participants' responses was positive, negative, or neither. For example, Pepper's reply when participants say they know about the robot is “Wow! I'm glad” and when they say they do not know about the robot is “Mmm, that’s a shame.”
To create an environment close to everyday conversation, we did not give the participants instructions in advance that would narrow down their choice of comments, such as “Do not ask the robot questions.” When the participants asked the robot questions, the operator avoided them by replying, “I can't answer that question,” or moved on to the next topic.
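The operator's judgement-based reply selection described above can be sketched as a simple lookup. The neutral fallback reply and the label names here are illustrative assumptions, not the study's actual WoZ interface:

```python
# Minimal sketch of the WoZ operator's reply selection: the operator judges
# a participant's response as positive, negative, or neither, and Pepper
# delivers a canned reply. The neutral reply text is an assumption.

REPLIES = {
    "positive": "Wow! I'm glad.",
    "negative": "Mmm, that's a shame.",
    "neutral": "I see. Let's move on to the next topic.",
}

def select_reply(judgement: str) -> str:
    """Return Pepper's canned reply for the operator's judgement."""
    return REPLIES.get(judgement, REPLIES["neutral"])
```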
To record the conversational data, we asked the psychologist to wear a neckband-type throat microphone and a sound-collecting microphone (
Throat and lavalier microphone.
We extracted the speech features from the throat microphones and the sound-collecting microphones. Rather than comprehensively extracting many features using software, such as openSMILE [
Prosodic and acoustic features.
Extracted feature | Description |
Duration of response time (seconds) | Length of time between the beginning and the end of the participant's speech during a turn |
Duration of reaction time (seconds) | Length of time between the end of the speech by the psychologist/the humanoid robot and the beginning of the participant's speech |
Duration of speech periods (seconds) | Total value of speech periods in the duration of the response time |
Duration of silent periods (seconds) | Total value of silent periods (300 ms or more) in the duration of the response time |
Proportion of silent periods (%) | Percentage of the duration of silent periods in the duration of the speech period
Filler periods (seconds) | The total value of speech periods for fillers, such as “er,” in the duration of the speech period |
Proportion of fillers (%) | The proportion of filler periods in the duration of the speech period |
F0 coefficient of variation | Coefficient of variation of the fundamental frequency
Jitter (local, rapa, ppq5b, ddpc) | Fluctuations in pitch |
Shimmer (local, apq3d, apq5, apq11, ddae) | Fluctuations in volume |
arap: relative average perturbation.
bppq: period perturbation quotient.
cddp: difference of difference of periods.
dapq: amplitude perturbation quotient.
edda: average absolute differences between the amplitudes of consecutive periods.
We manually labeled speech intervals and fillers using the conversational data recorded with the throat microphone. Fillers are so-called pausing and conjunction words represented by speech forms, such as “er,” “um,” and “uh.”
The fillers in the conversational speech recorded in this study included not only “er” and “um” but also “well...” and “you see...,” as well as much rarer cases where participants dragged out the ending of a word, such as “Asawaaaa” (“in the morning”) and “XYZ wo tabete” (“I ate XYZ”), to buy time to think before continuing the rest of the sentence. However, labeling this part of the speech as a filler would risk changing the meaning of the utterance, as the semantic content of the word “morning” would be omitted. Therefore, we labeled it as an utterance. We conducted labeling on the principle that a part of an utterance is labeled as a filler only if the meaning of the utterance does not change when it is removed.
We extracted 7 types of features concerning the speech intervals, such as the duration of the speech period and the proportion of fillers, using the speech interval and filler labels. The duration of the speech period, the response time, and the proportion of the silence time are features that previous studies have shown to be effective [
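The interval features defined in the table above can be derived from labeled speech segments. This is a sketch under assumptions: the segment format (start, end, in seconds), the function name, and the handling of a zero-length speech period are all illustrative, not the study's code:

```python
# Sketch: derive interval features for one turn from manually labeled
# speech segments, following the definitions in the feature table:
# silent periods are within-response gaps of 300 ms or more, and the
# silent proportion is silence relative to the speech period.

def interval_features(segments):
    """segments: list of (start_sec, end_sec) speech intervals, sorted by start."""
    response_time = segments[-1][1] - segments[0][0]  # first onset to last offset
    speech = sum(e - s for s, e in segments)
    # gaps of 300 ms or more between consecutive speech segments
    silences = [b[0] - a[1] for a, b in zip(segments, segments[1:])
                if b[0] - a[1] >= 0.3]
    silent = sum(silences)
    return {
        "response_time": response_time,
        "speech_periods": speech,
        "silent_periods": silent,
        "silent_proportion": silent / speech if speech else 0.0,
    }
```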
We extracted 10 types of acoustic features from the conversations recorded with the sound-collecting microphones. Although throat microphones help to filter out environmental noises, they are significantly different from usual acoustic microphones as they are skin conducting. Therefore, we decided to refer to the speech intervals obtained with the throat microphones and estimated the acoustic features using Praat [
Apart from local, jitter includes relative average perturbation (rap) and 5-point period perturbation quotient (ppq5). Similarly, shimmer also includes amplitude perturbation quotient (apq)3 and average absolute differences between the amplitudes of consecutive periods (dda), apart from local, that require different methods of calculation [
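The "local" variants of jitter and shimmer follow Praat's definitions: the mean absolute difference between consecutive glottal periods (or peak amplitudes), divided by the mean period (or amplitude). A minimal sketch with synthetic inputs; in the study these values were estimated with Praat from the recordings:

```python
# "Local" jitter and shimmer, per Praat's definitions, computed from
# synthetic period durations (seconds) and peak amplitudes. Illustrative
# only; the study extracted these with Praat from real speech.

def jitter_local(periods):
    """Mean absolute difference of consecutive periods / mean period."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def shimmer_local(amplitudes):
    """Same ratio, applied to peak amplitudes instead of periods."""
    diffs = [abs(b - a) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))
```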
Since the fundamental frequency (
We conducted a statistical significance test for each feature to clarify the speech features that are useful when classifying people into CN and MCI groups. We checked for normality in advance with the Shapiro-Wilk test and found that none of the features were normally distributed. Moreover, when we checked the homoscedasticity of each feature value using the
We defined a significant difference to be when
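The name of the significance test is truncated in this copy; given that normality was rejected, a nonparametric two-sample test such as the Mann-Whitney U test is the usual choice, and the effect size r reported in the tables is commonly computed as |Z|/sqrt(N). A dependency-free sketch under those assumptions:

```python
import math

# Mann-Whitney U with the normal approximation (no tie correction) and
# effect size r = |Z| / sqrt(N). This is an assumption about the study's
# procedure, since the test name is truncated in this copy.

def mann_whitney_u(x, y):
    """Return (U, Z, r) for two independent samples x and y."""
    n1, n2 = len(x), len(y)
    # U counts pairs where x exceeds y; ties count as half
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return u, z, abs(z) / math.sqrt(n1 + n2)
```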
Sample size of each experiment.
Experiment | CNa, n (%) | MCIb, n (%) |
MMSEc conversational speech | 454 (49) | 475 (51) |
Speech in the everyday conversations with the humanoid robot | 1070 (51) | 1034 (49) |
aCN: cognitively normal.
bMCI: mild cognitive impairment.
cMMSE: Mini-Mental State Examination.
We conducted an experiment on the automatic classification of patients with MCI and CN older people using the features clarified in the statistical significance test to examine the effectiveness of simple screening tests for dementia through everyday conversations with humanoid robots. We used the following 4 types of features: significant feature plus age (at the time of the experiment), significant feature, “significant and effect size over 0.1” feature plus age, and “significant and effect size over 0.1” feature.
Since we extracted each feature at every turn (from the time the psychologist or Pepper asked a question until the participant finished their reply), we obtained as many features as the number of questions per participant. Therefore, we calculated arithmetic values (mean, variance, median, range) to convert these features into 1 feature value per participant. After that, we normalized the feature value to fall within the range of 0-1.
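The per-participant aggregation and normalization described above can be sketched as follows. Whether the study used population or sample variance is not specified, so population variance here is an assumption, as are the function names:

```python
import statistics

# Collapse per-turn feature values into one value per participant
# (mean, variance, median, range), then min-max normalize across
# participants to [0, 1], as described in the text.

def summarize(turn_values):
    """Per-turn values for one participant -> the four summary statistics."""
    return {
        "mean": statistics.mean(turn_values),
        "variance": statistics.pvariance(turn_values),  # population variance: an assumption
        "median": statistics.median(turn_values),
        "range": max(turn_values) - min(turn_values),
    }

def min_max_normalize(values):
    """Scale one feature's per-participant values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
```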
We used Waikato Environment for Knowledge Analysis (WEKA, University of Waikato, New Zealand) [
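The study ran its SVM in WEKA, whose configuration is truncated in this copy. The sketch below shows only the shape of such an evaluation: leave-one-out cross-validation with a nearest-centroid stand-in for the SVM. Both the validation scheme and the classifier are assumptions made to keep the example dependency-free, not the study's setup:

```python
# Leave-one-out evaluation loop with a nearest-centroid classifier as a
# dependency-free stand-in for the SVM used in WEKA. Illustrative only.

def nearest_centroid_predict(train, test_x):
    """train: list of (feature_list, label); classify test_x by closer centroid."""
    centroids = {}
    for label in {lab for _, lab in train}:
        rows = [x for x, lab in train if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], test_x))

def leave_one_out_accuracy(data):
    """data: list of (feature_list, label); returns accuracy in [0, 1]."""
    hits = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # hold out one participant
        hits += nearest_centroid_predict(train, x) == y
    return hits / len(data)
```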
We used accuracy, sensitivity, and specificity as evaluation indices for the results. Accuracy refers to the percentage of correct classifications; because plain accuracy is misleading when the number of correctly predicted positive or negative examples is extremely large or small, we used balanced accuracy for performance evaluation. Sensitivity indicates how correctly the experiment classified patients with MCI, that is, the rate of not missing the disease. Specificity indicates how correctly the experiment classified CN participants, that is, the rate of not falsely suspecting that CN older people have MCI. A result was a true positive (TP) when a patient with MCI was classified correctly and a false negative (FN) when such a patient was classified incorrectly. A result was a false positive (FP) when a CN older person was mistaken for a patient with MCI and a true negative (TN) when a CN older person was correctly classified as CN. The following equations show the calculation of balanced accuracy, sensitivity, and specificity:
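Written out from the TP/FN/FP/TN definitions above (the original equations are missing from this copy), the three metrics follow directly:

```python
# Metric definitions from the TP/FN/FP/TN counts described in the text.

def sensitivity(tp, fn):
    """Rate of not missing patients with MCI: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Rate of correctly identifying CN participants: TN / (TN + FP)."""
    return tn / (tn + fp)

def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2
```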
We administered questionnaires (
Questionnaire before recording (yes/no/don't remember):
Have you ever had a conversation with a robot?
Questionnaire after recording (rating method: on a scale of 1-5, the closer to 5, the better):
Did you find the conversation with the robot natural?
Do you want to talk to a robot again?
We obtained significant differences in 5 (29%) of 17 types of features obtained from the MMSE conversational speech, of which 3 (60%) met the reference value of the effect sizes. In contrast, we obtained significant differences in 16 (94%) of 17 types of features obtained from the speech in the everyday conversations with the humanoid robot, of which 1 (6%) met the reference value of the effect sizes. That is, the speech in the everyday conversations with the humanoid robot yielded significant differences in 11 more types of features but 2 fewer types of features meeting the reference value of the effect sizes compared to the MMSE conversational speech.
Result of the statistical significance test of the features extracted from the MMSEa conversational speech.
Features | CNb, mean (SD) | MCIc, mean (SD) | P value | Effect size r
Duration of response time (seconds) | 2.19 (2.80) | 3.08 (5.78) | .11 | 0.05 |
Duration of reaction time (seconds) | 0.53 (1.20) | 0.69 (1.05) | <.001 | 0.13 |
Duration of speech periods (seconds) | 1.64 (1.55) | 1.91 (2.58) | .74 | 0.01 |
Duration of silent periods (seconds) | 0.42 (1.38) | 0.95 (3.13) | <.001 | 0.13 |
Proportion of silent period (%) | 0.06 (0.14) | 0.11 (0.19) | <.001 | 0.13 |
Filler periods (seconds) | 0.11 (0.36) | 0.19 (0.50) | .01 | 0.09 |
Proportion of fillers (%) | 0.04 (0.11) | 0.05 (0.12) | .02 | 0.08 |
F0 coefficient of variation | 0.25 (0.18) | 0.24 (0.18) | .42 | 0.03
Jitter (local) | 0.04 (0.01) | 0.04 (0.01) | .49 | 0.02 |
Jitter (rapd) | 0.02 (0.01) | 0.02 (0.01) | .54 | 0.02 |
Jitter (ppq5e) | 0.02 (0.01) | 0.02 (0.01) | .42 | 0.03 |
Jitter (ddpf) | 0.05 (0.02) | 0.05 (0.02) | .54 | 0.02 |
Shimmer (local) | 0.14 (0.04) | 0.14 (0.04) | .30 | 0.03 |
Shimmer (apq3g) | 0.06 (0.02) | 0.06 (0.02) | .72 | 0.01 |
Shimmer (apq5) | 0.09 (0.03) | 0.09 (0.04) | .36 | 0.03 |
Shimmer (apq11) | 0.15 (0.06) | 0.15 (0.06) | .65 | 0.01 |
Shimmer (ddah) | 0.19 (0.07) | 0.19 (0.07) | .72 | 0.01 |
aMMSE: Mini-Mental State Examination.
bCN: cognitively normal.
cMCI: mild cognitive impairment.
drap: relative average perturbation.
eppq: period perturbation quotient.
fddp: difference of difference of periods.
gapq: amplitude perturbation quotient.
hdda: average absolute differences between the amplitudes of consecutive periods.
Result of the statistical significance test of the features extracted from the participants' speech in their everyday conversations with the humanoid robot.
Features | CNa, mean (SD) | MCIb, mean (SD) | P value | Effect size r
Duration of response time (seconds) | 4.62 (7.33) | 3.83 (5.05) | .02 | 0.05 |
Duration of reaction time (seconds) | 0.90 (1.02) | 1.15 (1.33) | <.001 | 0.08 |
Duration of speech periods (seconds) | 3.39 (5.09) | 2.62 (3.01) | <.001 | 0.09 |
Duration of silent periods (seconds) | 1.29 (2.46) | 1.12 (2.12) | .02 | 0.05 |
Proportion of silent period (%) | 0.15 (0.21) | 0.14 (0.22) | .13 | 0.03 |
Filler periods (seconds) | 0.19 (0.60) | 0.15 (0.54) | .01 | 0.06 |
Proportion of fillers (%) | 0.03 (0.08) | 0.02 (0.07) | .01 | 0.05 |
F0 coefficient of variation | 0.21 (0.12) | 0.20 (0.14) | <.001 | 0.09
Jitter (local) | 0.029 (0.010) | 0.031 (0.013) | <.001 | 0.10 |
Jitter (rapc) | 0.013 (0.005) | 0.014 (0.007) | <.001 | 0.09 |
Jitter (ppq5d) | 0.014 (0.005) | 0.016 (0.006) | <.001 | 0.09 |
Jitter (ddpe) | 0.039 (0.01) | 0.043 (0.02) | <.001 | 0.09 |
Shimmer (local) | 0.12 (0.026) | 0.13 (0.031) | <.001 | 0.09 |
Shimmer (apq3f) | 0.05 (0.02) | 0.06 (0.02) | <.001 | 0.08 |
Shimmer (apq5) | 0.075 (0.019) | 0.078 (0.024) | <.001 | 0.07 |
Shimmer (apq11) | 0.119 (0.037) | 0.123 (0.041) | <.001 | 0.06 |
Shimmer (ddag) | 0.16 (0.05) | 0.18 (0.06) | <.001 | 0.08 |
aCN: cognitively normal.
bMCI: mild cognitive impairment.
crap: relative average perturbation.
dppq: period perturbation quotient.
eddp: difference of difference of periods.
fapq: amplitude perturbation quotient.
gdda: average absolute differences between the amplitudes of consecutive periods.
A cut-off value of the MMSE score is used as a criterion for judging whether a person is suspected of having dementia when conducting the MMSE, and we set the classification result based on this cut-off value as the baseline. An MMSE score of 27 points or less out of 30 indicates suspected MCI; 23 points or less indicates suspected dementia [
The classification using this threshold showed that it was possible to classify people into patients with MCI and CN older people with an accuracy of 53.2%. Sensitivity and specificity were generally consistent with the findings in the MMSE: a score of 27/28 proved to be most promising (sensitivity of 66.34% and specificity of 72.94%) in differentiating MCI and CN [
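The baseline described above reduces to a single threshold on the MMSE score; a minimal sketch:

```python
# MMSE cut-off baseline from the text: a score of 27 or less out of 30
# flags suspected MCI (23 or less would flag suspected dementia, which
# this MCI-vs-CN baseline does not distinguish).

def mmse_baseline(score: int) -> str:
    """Classify one participant from the MMSE score alone."""
    return "MCI" if score <= 27 else "CN"
```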
Result of the classification using the score threshold.
Classification | Accuracy (%) | Sensitivity (%) | Specificity (%) | Features, n
MCIa-CNb | 53.2 | 40.4 | 66.0 | 1 |
aMCI: mild cognitive impairment.
bCN: cognitively normal.
Result of the experiment on automatic classification of patients with MCIa and CNb older people using the MMSEc conversational speech.
Type of feature | Accuracy (%) | Sensitivity (%) | Specificity (%) | Features, n (%) |
Significant feature plus age | 62.8 | 55.3 | 70.2 | 21 (30) |
Significant feature | 61.7 | 57.4 | 66.0 | 20 (29) |
“Significant and effect size over 0.1” feature plus age | 63.8 | 66.0 | 61.7 | 13 (19) |
“Significant and effect size over 0.1” feature | 66.0 | 59.6 | 72.3 | 12 (18) |
aMCI: mild cognitive impairment.
bCN: cognitively normal.
cMMSE: Mini-Mental State Examination.
Result using everyday conversations with a humanoid robot.
Type of feature | Accuracy (%) | Sensitivity (%) | Specificity (%) | Features, n (%) |
Significant feature plus age | 66.0 | 61.7 | 70.2 | 65 (94) |
Significant feature | 57.4 | 48.9 | 66.0 | 64 (94) |
“Significant and effect size over 0.1” feature plus age | 68.1 | 70.2 | 66.0 | 5 (7) |
“Significant and effect size over 0.1” feature | 56.4 | 34.0 | 78.7 | 4 (6) |
For the MMSE conversational speech, the result showed that it is possible to classify people into patients with MCI and CN older people with an accuracy of 66.0% (“significant and effect size over 0.1” feature).
Compared to the classification based solely on MMSE scores, the result with the speech in everyday conversations with the humanoid robot showed that it is possible to classify people with an accuracy of 68.1% (14.9 percentage points higher; “significant and effect size over 0.1” feature plus age). In addition, the result showed that it is possible to identify patients with MCI from speech in everyday conversations with a humanoid robot with the same degree of accuracy as automatic classification from the MMSE conversational speech. These results suggest that a simple dementia-screening test can be conducted using everyday conversations with a humanoid robot.
We received responses from 43 (91%) of the 47 CN older participants and 42 (89%) of the 47 patients with MCI. Of the 43 CN older participants, 6 (14%) and of the 42 patients with MCI, 4 (10%) replied that they had had a conversation with a robot before. Of the 85 participants who responded to both questionnaires, 75 (88%) had no previous experience of having a conversation with a robot and did it for the first time in this experiment.
The questionnaire administered after the conversation with the humanoid robot used a 1-5 rating scale; the closer the rating to 5, the more natural the participants felt the conversations with the robot were and the more they wanted to talk with it again in the future.
Result of questionnaires on the speech dialogues with the humanoid robot. CN: cognitively normal; MCI: mild cognitive impairment.
We analyzed speech features from the MMSE conversational speech and from everyday conversations with a humanoid robot to identify salient features for detecting signs of cognitive decline. Furthermore, we conducted an automatic classification experiment with patients with MCI and CN older people using the features identified by the statistical significance test and examined the possibility of a simple dementia-screening test using everyday conversations with humanoid robots. We found significant differences in 11 more features in the everyday conversations with the humanoid robot than in the MMSE conversational speech. We also identified significant differences in acoustic features, such as jitter and shimmer, which were not present in the MMSE conversational speech. The results of automatic classification showed 66.0% accuracy for the MMSE conversational speech and 68.1% accuracy for everyday conversations with a humanoid robot.
This result showed that it is possible to identify patients with MCI from the speech in everyday conversations with a humanoid robot with the same degree of accuracy as that of the automatic classification from the MMSE conversational speech. Furthermore, on the question of whether the participants want to talk with a robot again, both CN older participants and patients with MCI scored the responses higher than 4 points. We concluded that the participants are positive about having future everyday conversations with robots again.
In the MMSE conversational speech, the response and speech times of patients with MCI tended to be longer than those of CN older people. Conversely, in everyday conversations with the humanoid robot, we noted the opposite trend: their response and speech times tended to be shorter. Language and communication difficulties are known to be common symptoms in people with dementia [
In the automatic classification, specificity (the index of avoiding unnecessary suspicion of dementia) tended to be higher than sensitivity (the index of not missing patients with MCI), except for the “significant and effect size over 0.1” feature plus age. Conversations with a humanoid robot simulate those between people and are not designed specifically to identify patients with MCI, which may explain why specificity tended to be higher. For not missing patients with MCI, the relevant index is sensitivity, which was higher than for the other feature sets, for both the MMSE conversational speech and the conversations with the humanoid robot, when we used the “significant and effect size over 0.1” feature plus age. This suggests that performance in identifying patients with MCI can be improved by narrowing the input down to features that meet the reference value of the effect size. However, sensitivity decreased markedly to 34.0% in the automatic classification experiment using the “significant and effect size over 0.1” feature without age, which indicates that age may have a large influence. Although age is considered a significant factor in cognitive decline, it is unlikely to reflect the changes caused by cognitive decline itself, as it increases every year regardless. Thus, although it may be effective to use features that meet the reference value of the effect size, further examination is needed to determine whether to add age to the features.
In addition, on the question of whether the conversation with the robot was natural, the rating of the patients with MCI was 0.2 points higher than that of the CN older participants. Patients with MCI may not feel awkward or uncomfortable talking with a robot, since they have reduced cognitive function compared to CN older people. This may be why the patients with MCI found the conversation with the robot more natural than the CN older people did.
A problem with the everyday conversations with the humanoid robot was that the classification accuracy reached only 68.1%, which is an issue for the early detection of dementia. The accuracy was likely limited because we used only prosodic and acoustic features for which statistically significant differences had been confirmed, and further improvement is necessary for practical use. However, accuracy was 14.9 percentage points higher than the baseline classification using only the MMSE score. We may therefore reduce the risk of missing patients with MCI by also using speech from everyday conversations with a humanoid robot, rather than judging solely from MMSE test scores. In the future, we will examine how to improve the accuracy of automatic classification from everyday conversations with a humanoid robot, for example by analyzing speech content and new features, such as linguistic features, toward the practical use of robots.
This study demonstrated that it is possible to classify people into patients with MCI and CN older people even from somewhat unnatural conversations with a humanoid robot, with the same level of accuracy as for human conversations in MMSE tests. This suggests that humanoid robots can identify signs of cognitive decline while having everyday conversations with people in nursing homes and similar facilities, where labor shortages are an acute problem.
This study examined a simple screening test for dementia that uses everyday conversations with a humanoid robot installed at nursing homes and similar facilities. Existing studies have not clarified what the effective features are for identifying patients with MCI, a precursor stage of dementia, from the speech in conversations with a humanoid robot. Therefore, we recorded speech in conversations between the participants of the experiment (CN older people and patients with MCI) and a humanoid robot, extracted 17 types of features from the recordings, and conducted a statistical significance test. From the result, we obtained significant differences between the CN older participants and the patients with MCI in 16 types of features, such as response time, speech time, jitter, and shimmer, and clarified effective features for detecting signs of cognitive decline through conversations with a humanoid robot.
We conducted an automatic classification experiment using an SVM to verify whether patients with MCI and CN older people can be automatically distinguished from their speech in everyday conversations with a humanoid robot using the features identified in this study. The results showed a classification accuracy of 68.1%, 14.9 percentage points higher than the baseline classification conducted solely with the MMSE score (a cognitive test).
These results suggest that we can identify patients with MCI from their everyday conversations with a humanoid robot and improve the efficacy of a simple screening test for dementia. However, the classification accuracy of the method is 68.1%, which is insufficient when we consider its prospective potential for practical use. Our future task is to improve its accuracy further by examining additional features, such as those of a linguistic nature.
amplitude perturbation quotient
cognitively normal
average absolute differences between the amplitudes of consecutive periods
difference of difference of periods
false negative
false positive
Hasegawa method simple intelligence assessment scale
mild cognitive impairment
Mini-Mental State Examination
period perturbation quotient
relative average perturbation
support vector machine
true negative
true positive
Wizard of Oz
This work was partially supported by the Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research (JSPS KAKENHI; grant no. JP19H01084).
The data sets presented in this paper are not readily available but are derived, and supporting data may be available from the corresponding authors on reasonable request and with permission from the Ethics Committee, University of Tsukuba Hospital.
DK, AK, KS, TT, MK, and YY are employees of IBM. The other authors report no conflicts of interest regarding this study.