Original Paper
Abstract
Background: Online medical and health communities provide a platform for internet users to share experiences and ask questions about medical and health issues. However, there are problems in these communities, such as the low accuracy of the classification of users’ questions and the uneven health literacy of users, which affect the accuracy of user retrieval and the professionalism of the medical personnel answering the question. In this context, it is essential to study more effective classification methods of users’ information needs.
Objective: Most online medical and health communities tend to provide only disease-type labels, which do not give a comprehensive summary of users’ needs. The study aims to construct a multilevel classification framework based on the graph convolutional network (GCN) model for users’ needs in online medical and health communities so that users can perform more targeted information retrieval.
Methods: Using the Chinese online medical and health community “Qiuyi” as an example, we crawled questions posted by users in the “Cardiovascular Disease” section as the data source. First, the disease types involved in the problem data were segmented by manual coding to generate the first-level label. Second, the needs were identified by K-means clustering to generate the users’ information needs label as the second-level label. Finally, by constructing a GCN model, users’ questions were automatically classified, thus realizing the multilevel classification of users’ needs.
Results: Based on the empirical research of questions posted by users in the “Cardiovascular Disease” section of Qiuyi, the hierarchical classification of users’ questions (data) was realized. The classification models designed in the study achieved accuracy, precision, recall, and F1-score of 0.6265, 0.6328, 0.5788, and 0.5912, respectively. Compared with the traditional machine learning method naïve Bayes and the deep learning method hierarchical text classification convolutional neural network, our classification model showed better performance. We also performed a single-level classification experiment on users’ needs; compared with this single-level model, the multilevel classification model exhibited a great improvement.
Conclusions: A multilevel classification framework has been designed based on the GCN model. The results demonstrated that the method is effective in classifying users’ information needs in online medical and health communities. At the same time, users with different diseases have different directions for information needs, which plays an important role in providing diversified and targeted services to the online medical and health community. Our method is also applicable to other similar disease classifications.
doi:10.2196/42297
Keywords
Introduction
The online medical and health community is the main platform for providing online consultation services, where users disseminate, obtain, use, and evaluate health information [ - ]. Such a powerful interactive platform has a positive impact on patients’ daily disease control, self-health management, and emotional support [ - ]. However, the convenience of internet services and trust in online health information may affect users’ search behavior [ ]. China’s internet user group is huge, and the level of medical knowledge of online medical users represented by middle-aged people is uneven. Groups with lower health literacy will face higher risks [ ]. By observing the way users ask questions in many online medical and health communities, we found that they often ask questions by independently selecting departments and disease types. Their choices are often inaccurate, which not only hinders medical professionals in answering questions but may also cause other patients to encounter mixed, irrelevant data when collecting information. These drawbacks will further reduce users’ trust in online health information and increase the risk of information acquisition. Users are the main body of the online medical and health community, and whether the community can accurately understand the health needs of users is the key to providing users with accurate health information [ ].
Recent studies on the needs of users in online medical and health communities mainly focused on the classification of information needs. Empirical studies are generally conducted on specific diseases (eg, chronic diseases such as osteoarthritis [ ], diabetes mellitus [ ], cardiovascular disease [ ]) and certain diseases with high mortality rates (eg, breast cancer [ , ] and lung cancer [ ]). The main methods for data collection are questionnaire survey [ ] and machine learning–based methods [ , , - ]. Questionnaire surveys assess patients’ information needs through interviews and other means [ , ], and their results are often unbalanced because the research samples are small and uneven. Therefore, machine learning–based methods such as text mining and topic recognition have gradually become the most common methods to study users’ information needs. Qian and Gui [ ] applied word frequency analysis to identify the information needs of users in the elderly health community, which are mainly divided into 4 themes: coping with aging, diet and nutrition, physical exercise, and mental health. Based on the coword analysis and the latent Dirichlet allocation topic model, Wang et al [ ] found that during the COVID-19 pandemic, Chinese online medical and health community users had the highest demand for information about disease symptoms [ ]. Pérez-Pérez et al [ ] explored the most frequently mentioned symptoms of the disease and the distribution of patient emotions through semantic analysis of patients’ discussions about the disease in the Twitter intestinal disease community [ ]. Luo et al [ ] realized automatic topic identification of user requirements through a text clustering algorithm. To identify the topic of information needs, McRoy et al [ ] applied data coding and random forest–based methods for supervised text classification of data postings by breast cancer survivors to assess their unmet information needs from their own perspectives and to identify gaps between information needs and current education materials. Guo et al [ ] established a topic-based classification model using manual annotation to classify users’ information needs into 6 categories and 4 levels and used machine learning to achieve automatic classification.
However, all the aforesaid studies only focused on identifying the types of users’ information needs, ignoring the importance of text classification from the perspective of disease. Chen [ ] used clustering algorithms to analyze communities with 3 diseases (namely, breast cancer, diabetes mellitus, and fibromyalgia) and explored the changing nature of patients’ information needs in different communities as the disease develops. Hong et al [ ] used a content analysis approach to study questions from patients with diabetes and hepatitis. It was found that the main concerns of patients with diabetes were related to treatment, whereas the more frequently asked questions by patients with hepatitis focused on diagnosis [ ].
With the development of artificial intelligence, deep neural networks have been widely used in various natural language processing tasks. Graph convolutional network (GCN) is a generalization of convolutional operations from structured grid data to unstructured graph data [ ], which has been extensively studied in many fields, such as traffic network prediction [ ], simulation of multidrug side effects [ ], and person reidentification [ ]. In recent years, it has also been applied in text classification [ , ], and its effectiveness in Chinese text classification has been proved by some studies [ ]. Yao et al [ ] constructed text graphs based on word co-occurrence, using documents and words as nodes and GCN for semisupervised text classification, and obtained more advanced classification results in most types of texts. Gao and Huang [ ], however, considered text contextual information and improved the general GCN text classification by combining graph embedding with bidirectional encoder representations from transformers (BERT) embedding, using GCN with a gating mechanism to achieve contextual encoding acquisition [ ]. To solve the problems of sparse and insufficiently labeled data in short texts, Zhao et al [ ] proposed a multihead pooling-based GCN for semisupervised short-text classification, focusing on the structural information and important nodes of the graph to achieve powerful classification performance with low computational cost [ ]. Ye et al [ ] applied the GCN model to short-text classification by inputting both word and document nodes trained by the GCN into bidirectional long short-term memory or other classification models for further classification of short texts.
To solve the aforesaid problems, this paper constructed a multilevel classification system for users’ needs from multiple perspectives. Taking users’ questions on China’s online medical and health community as the data source, the semiautomated method of “manual coding + text clustering” was used to extract the first-level disease label and the second-level information demand label. The GCN model was used to classify users’ data and evaluate the model proposed. In this way, users’ data can be observed from both subjective and objective perspectives so that medical professionals can be more targeted when answering questions, while other users can quickly identify the required information. In turn, the quality of online medical and health information is improved so that the online medical and health community can better provide medical and health information services [ ].
The major contributions of this paper are summarized as follows:
- This study proposes a multilevel classification method. At present, research tends to target the needs of users in a certain aspect, such as emotional needs or information needs. Different physical conditions and treatment plans caused by different diseases lead to different information needs of different users. Through the multilevel classification structure of this paper, disease-type classification and information demand classification were performed so that users can conduct more targeted information retrieval.
- This study conducted semiautomated label identification. We used a manually coded approach to segment user diseases. Combined with other related research methods, K-means text clustering was used to generate different theme text sets to identify users’ information needs. This method of labeling reduces the subjective influence of manual coding and improves the speed of label recognition.
The rest of this paper is organized as follows: The “Methods” section introduces the multilevel classification system and the implementation of a multilevel classification algorithm for users’ requirements. In the “Results” section, the experimental results are analyzed in-depth. Finally, we discuss the conclusions of the study and the suggestions for future development.
Methods
Overview
This section introduces the process of constructing a multilevel classification system for users’ needs and the implementation of a multilevel classification algorithm for users’ needs.
Data Collection
Our data were collected from “Qiuyi” [ ]. Founded in 2010, Qiuyi is a medical website that combines practical tools (eg, online registration, appointments, drug inquiry) and personalized medical services and suggestions (eg, disease encyclopedia, emotional communication, answering questions) in China. As of 2019, cardiovascular disease was the leading cause of death among urban and rural residents, and the burden of this disease was on the rise. Therefore, we utilized users’ questions (a total of 25,999 data points) in the “Cardiovascular Disease” section from November 3, 2011, to December 28, 2020, as the data source for this study. For each question, the data contain fields such as the title of the user’s question, question time, detailed description of the question, treatment received, and the help the user hoped to obtain.
This article used Python (Python Software Foundation) with the Beautiful Soup library to crawl web pages, parse the useful information on each page, and store it in a database. To ensure the anonymity of users, we crawled only public questions that can be viewed without logging in to the website.
The following table shows the distribution of users’ questions about cardiovascular diseases from 2011 to 2020. The number of questions grew from 2011 to 2015, which we attribute to the gradual increase in internet users; in 2016, as a large number of online medical and health communities emerged, the number of questions posted decreased sharply due to user diversion.
Year | Count, n |
2011 | 31 |
2012 | 621 |
2013 | 935 |
2014 | 948 |
2015 | 8539 |
2016 | 61 |
2017 | 403 |
2018 | 337 |
2019 | 7195 |
2020 | 6929 |
Ethics Approval
In line with the regional requirements [
], the study did not require ethical approval because the data extracted from the users’ questions are public content, freely available on the internet, and the study was conducted using anonymized information data. However, even without the need to provide an ethical statement, we still complied with the relevant privacy and data protection legislation to ensure the security and legality of data. In our view, these measures are sufficient to protect the privacy of individuals, while avoiding the abuse or invasion of privacy caused by the failure to provide an ethical statement.
Preprocessing
From the 25,999 crawled data, we randomly selected 10,000 data points as the experimental data and used Python’s Jieba library in accurate mode for word segmentation. We downloaded a thesaurus of medical terms and used it as a custom dictionary during segmentation to preserve proper nouns in the data. We then applied a stop-word list to remove stop words, filtering out numbers, punctuation marks, modal particles, and similar tokens.
Feature extraction is performed by calculating the term frequency-inverse document frequency (TF-IDF) values of words. TF-IDF is a common weighting technique used in information retrieval and data mining. In this article, it was primarily used to assess the importance of words in the data set of users’ questions. The importance of a word increases with the number of times it appears in a document but decreases with the number of documents in the corpus that contain it. TF indicates the frequency with which a word appears in a given document. IDF is obtained by dividing the total number of documents by the number of documents containing the word and then taking the base-10 logarithm of the resulting quotient.
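The weighting described above can be illustrated with a minimal sketch. It assumes the documents are already segmented into token lists; the function name `tf_idf` and the toy documents are hypothetical and are not taken from the study, which does not state which TF-IDF implementation was used.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    TF is the relative frequency of a term inside one document;
    IDF is log10(total documents / documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy, already-segmented questions (hypothetical data).
docs = [
    ["hypertension", "diet", "diet"],
    ["hypertension", "surgery"],
    ["arrhythmia", "surgery", "recovery"],
]
weights = tf_idf(docs)
```

In this toy corpus, "diet" receives a higher weight in the first document than "hypertension" because it is frequent there and appears in no other document.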
Multilevel Classification System Construction
Annotation Process
The “Cardiovascular Diseases” section involves many types of diseases, such as heart diseases and vascular diseases. The subdivision of disease types helps to enhance the relevance of information. The Chinese National Standard “GB_T14396-2016 Classification Codes of Diseases” [ ] and MeSH (medical subject headings) were combined as annotation rules to manually annotate the disease categories of the data set. MeSH, an authoritative subject heading list compiled by the US National Library of Medicine, provides a hierarchy of cardiovascular disease names. To reduce the labeling dimension, disease types in the second level of cardiovascular disease were identified in this paper. Diseases such as “congenital heart disease” and “vascular malformations” were classified as “cardiovascular abnormalities.” “Arrhythmia,” “cardiomyopathy,” “endocarditis,” and other diseases were classified as “heart diseases,” while “hypertension,” “cerebrovascular diseases,” etc were classified as “vascular diseases.” Other data that only describe the symptoms and do not include the name of the disease, those related to surgery or medication, and those not related to cardiovascular diseases were classified as “other cardiovascular diseases.”
Clustering and Topic Identification
Automatic identification of second-level labels for user requirements is performed by text clustering. The main text clustering methods that have been proposed thus far are K-means clustering, hierarchical clustering methods, latent Dirichlet allocation topic clustering models, and flat partitioning methods. In this study, K-means clustering was selected as the main method for label identification at the second level of users’ needs. The K-means algorithm is an unsupervised iterative clustering algorithm. Taking document clustering as an example, the implementation idea of this algorithm is as follows:
- Set the number of clusters to k. From the document set D={D1, D2, D3,…, Dn}, randomly select k documents as the initial cluster centers {A1, A2, A3,…, Ak};
- Calculate the distance between each remaining document in the document set D and each of the cluster centers {A1, A2, A3,…, Ak} and assign the document to the cluster whose center is nearest;
- Repeat the assignment step until every document in the document set D has been assigned to one of the k text clusters;
- Recalculate the centroid of each text cluster and update the cluster centers {A′1, A′2, A′3,…, A′k};
- Repeat the assignment and update steps until convergence and output the cluster division C={C1, C2, C3,…,Ck}.
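The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: it assumes each document is already a numeric vector (eg, TF-IDF weights), uses squared Euclidean distance, and treats an unchanged assignment as convergence. The names `k_means`, `squared_distance`, and `mean`, and the toy vectors, are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Randomly pick k documents as the initial cluster centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each document joins its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # no document changed cluster: converged
        assignment = new_assignment
        # Update step: recompute the centroid of every non-empty cluster.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return assignment, centers


# Two well-separated toy groups of 2-dimensional "document vectors".
vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = k_means(vectors, k=2)
```

Because the initial centers are random, the cluster indices themselves are arbitrary; only the grouping of documents is meaningful.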
Classification Algorithm
The GCN model is chosen as the classification model. GCN applies convolutional computation to graph data and is a generalization of convolutional neural networks (CNNs) to the graph domain; it generates new node representations by aggregating the information of neighboring nodes along the edges of the graph. For a graph G=(V, E), where V denotes the set of nodes and E represents the set of edges, the layer-wise propagation rule is given as:
$L^{(j+1)} = \rho\left(\tilde{A} L^{(j)} W_j\right)$ (1)
Further, we consider the following:
- A is the n×n adjacency matrix of the graph: if an edge exists between the nodes $v_i$ and $v_j$, then $A_{ij}=1$. So that a node’s own features are included in the update, self-loops are introduced, that is, every node has an edge to itself: $A = A + I$;
- $\tilde{A}$ is the normalized adjacency matrix. Normalizing the adjacency matrix eliminates the problem of large differences in features after aggregation due to differences in the number of neighboring nodes: $\tilde{A} = D^{-1/2} A D^{-1/2}$, where D is the degree matrix of the graph;
- $X \in \mathbb{R}^{n \times m}$ is a matrix containing the n nodes and their initial feature vectors, with m being the dimensionality of the feature vectors and $L^{(0)} = X$;
- $W \in \mathbb{R}^{m \times k}$ is the weight matrix, with $W_j$ being the weight matrix of the jth graph convolution layer; and
- ρ is the activation function, commonly the sigmoid, tanh, or rectified linear unit (ReLU) function.
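To make equation (1) concrete, the following sketch computes one graph convolution layer on a 3-node toy graph in plain Python. It assumes, as is common for GCNs, that the degree matrix D is computed after the self-loops are added and that ρ is ReLU; the helper names, the toy adjacency matrix, and the weight values are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Add self-loops and return D^{-1/2} (A + I) D^{-1/2}."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Degree of each node, counted on the graph with self-loops (assumption).
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def relu(matrix):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in matrix]


def gcn_layer(a_norm, features, weights):
    """One propagation step: relu(A_norm @ L_j @ W_j)."""
    return relu(matmul(matmul(a_norm, features), weights))


# Path graph 0 - 1 - 2 with one-hot initial features (X = I).
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
features = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
weights = [[1.0, -1.0],
           [0.5, 0.5],
           [-1.0, 1.0]]  # hypothetical 3x2 weight matrix
a_norm = normalize_adjacency(adj)
hidden = gcn_layer(a_norm, features, weights)
```

Each row of `hidden` is the new 2-dimensional representation of one node, built from the node itself and its direct neighbors.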
TextGCN is an application of a graph neural network in text classification that involves constructing a heterogeneous graph for all texts, which contains all document nodes and word nodes, and the labels of document nodes can be passed to other nodes through neighboring nodes.
Constructing the Heterogeneous Graph
After data preprocessing, the data are used to build a large heterogeneous graph that includes document nodes and word nodes. The weight $A_{ij}$ of an edge between a document node and a word node is equal to the TF-IDF value of the word in the document. The weight of an edge between 2 word nodes is the pointwise mutual information (PMI) of the 2 words, $A_{ij} = \mathrm{PMI}(i,j)$:
$\mathrm{PMI}(i,j) = \log \frac{p(i,j)}{p(i)\,p(j)}$ (2)
$p(i,j) = \frac{\#W(i,j)}{\#W}$ (3)
$p(i) = \frac{\#W(i)}{\#W}$ (4)
To utilize global co-occurrence information, a sliding window of fixed size is moved over all documents. #W denotes the total number of sliding windows in the corpus, #W(i) denotes the number of sliding windows containing the word i, and #W(i,j) denotes the number of sliding windows containing both words i and j. PMI(i,j)>0 indicates a high semantic correlation between the 2 words, whereas PMI(i,j)≤0 indicates little or no semantic correlation. Therefore, when constructing the heterogeneous graph, only edges between word nodes with PMI(i,j)>0 are added, and the sliding window size is set to 20. The resulting sparse adjacency matrix was saved as a graph, and a heterogeneous graph with 25,590 nodes and 1,806,090 edges was obtained.
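The word-word edge weights of equations (2)-(4) can be sketched as follows. This is an illustrative sketch only: it assumes the natural logarithm, treats a document shorter than the window as a single window, and uses hypothetical names (`sliding_windows`, `pmi_edges`) and toy documents.

```python
import math
from collections import Counter
from itertools import combinations


def sliding_windows(tokens, size):
    """All fixed-size windows over one document (one window if it is short)."""
    if not tokens:
        return []
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]


def pmi_edges(docs, window_size=20):
    """Return {(word_i, word_j): PMI} for word pairs with PMI > 0."""
    windows = [w for doc in docs for w in sliding_windows(doc, window_size)]
    n_windows = len(windows)            # #W
    word_count = Counter()              # #W(i)
    pair_count = Counter()              # #W(i, j)
    for window in windows:
        vocab = sorted(set(window))
        word_count.update(vocab)
        pair_count.update(combinations(vocab, 2))
    edges = {}
    for (i, j), n_ij in pair_count.items():
        p_ij = n_ij / n_windows
        p_i = word_count[i] / n_windows
        p_j = word_count[j] / n_windows
        pmi = math.log(p_ij / (p_i * p_j))
        if pmi > 0:                     # keep only positively associated pairs
            edges[(i, j)] = pmi
    return edges


# Toy, already-segmented questions (hypothetical data).
docs = [
    ["chest", "pain", "heart"],
    ["chest", "pain"],
    ["diet", "salt"],
    ["heart", "surgery"],
]
edges = pmi_edges(docs, window_size=20)
```

In this toy corpus, "chest" and "pain" always co-occur and obtain a positive PMI, whereas "chest" and "heart" co-occur exactly as often as chance predicts (PMI = 0), so no edge is added between them.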
Building a GCN Model
In this paper, we constructed a 2-layer GCN model, fed the constructed heterogeneous graph into the model, and used the softmax function for classification. The embedding size of the first convolutional layer was set to 300, and ReLU was used as the activation function, thereby increasing the nonlinearity of the model and strengthening the learning ability of the network. Dropout is applied during training to temporarily drop neural network units from the network with a certain probability, thus improving the generalization ability of the model and preventing overfitting. The output of the first-level label prediction is concatenated with the input of the second-level label prediction, and finally, the outputs for both levels of labels are returned simultaneously, thus realizing multilevel classification.
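One possible reading of the concatenation step is sketched below. The paper does not give the exact wiring, so everything here is an assumption made for illustration: a document embedding is first mapped to level-1 probabilities, and those probabilities are appended to the same embedding before the level-2 prediction. The function names, dimensions, and weight values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vector):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def two_level_predict(embedding, w_level1, w_level2):
    # Level 1: disease-type probabilities from the document embedding.
    p1 = softmax(linear(w_level1, embedding))
    # Level 2: the level-1 output is concatenated with the level-2 input.
    p2 = softmax(linear(w_level2, embedding + p1))
    return p1, p2


embedding = [0.2, -0.1, 0.4]                 # toy 3-dimensional embedding
w_level1 = [[0.5, 0.1, -0.3],                # 2 level-1 classes
            [-0.2, 0.4, 0.6]]
w_level2 = [[0.1, 0.2, 0.3, 0.4, 0.5],       # 3 level-2 classes, input size 3 + 2
            [0.5, 0.4, 0.3, 0.2, 0.1],
            [-0.1, 0.0, 0.1, 0.0, -0.1]]
p1, p2 = two_level_predict(embedding, w_level1, w_level2)
```

Both outputs are returned together, which mirrors the statement that the labels of both levels are produced simultaneously.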
Evaluation Metrics
In general, the performance evaluation metrics for classifiers include both efficiency and effectiveness. Efficiency refers to the time used by the classifier during the process of training the model and the time needed to make predictions on the test data, while there are various metrics to evaluate the effectiveness. In this paper, from the perspective of effect, the results of multilevel classification are evaluated, mainly including accuracy, F1-score, precision, and recall. The evaluation dimensions were (1) evaluation of the effectiveness between the GCN and combined classifiers and (2) effectiveness assessment between multilevel classification and single-level classification.
Accuracy measures the percentage of the total quantity that is correctly predicted. The more quantities that are correctly predicted, the higher the accuracy.
Accuracy = (TP + TN)/(TP + TN + FP + FN) (5)
where TP is the number of true-positive cases (predicted positive and actual positive); TN is the number of true-negative cases (predicted negative and actual negative); FP is the number of false-positive cases (predicted positive and actual negative); and FN is the number of false-negative cases (predicted negative and actual positive).
Recall is defined as the proportion of positive cases (TP) correctly determined by the model to all positive (TP + FN) cases in the data set.
Recall = TP/(TP + FN) (6)
Precision is the proportion of true-positive cases (TP) among all cases judged positive by the model (TP + FP).
Precision = TP/(TP + FP) (7)
The F1-score is the harmonic mean of precision and recall, that is, 2 × Precision × Recall/(Precision + Recall), so recall is considered to be as important as precision. In this paper, we used the macroaverage algorithm to calculate the F1-score.
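As an illustration of these metrics in a multiclass setting, the sketch below computes accuracy and macroaveraged precision, recall, and F1-score in plain Python. It assumes that the macro F1-score is the unweighted mean of the per-class F1-scores, which is one common definition; the paper does not spell out its exact variant. The function name and toy labels are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macroaveraged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


# Toy first-level labels (hypothetical data).
y_true = ["heart", "heart", "vascular", "vascular", "other"]
y_pred = ["heart", "vascular", "vascular", "vascular", "other"]
scores = macro_metrics(y_true, y_pred)
```

With macroaveraging, every class contributes equally to the final score regardless of how many samples it has, which matters when categories are unevenly populated.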
Results
Multilevel Classification System for Users’ Information Needs
In the first level of labels obtained by manual annotation, 26.64% (2664/10,000) were identified as asking questions related to other cardiovascular diseases, 16.45% (1645/10,000) were identified as asking questions about cardiovascular abnormalities, 28.24% (2824/10,000) were identified as asking questions related to heart disease, and 28.67% (2867/10,000) were identified as asking questions related to vascular disease.
The table below presents the keywords and sample data for each category of users’ needs, after applying K-means for text clustering, in the second level of the cardiovascular disease section. The 8 topics of users’ information needs were “Disease causes” (588/10,000, 5.88%), “Self-reported symptoms” (1589/10,000, 15.89%), “Clinical signs and symptoms” (943/10,000, 9.43%), “Clinical examination” (1041/10,000, 10.41%), “Prevention and recuperation” (1928/10,000, 19.28%), “Operative treatment” (1188/10,000, 11.88%), “Drug treatment” (1351/10,000, 13.51%), and “Therapeutic method” (1372/10,000, 13.72%). The 5 terms with the highest TF-IDF values were used as keywords for each information demand category. Users’ information demand for cardiovascular disease is concentrated mainly in the categories “Prevention and recuperation” and “Self-reported symptoms.” This may be because cardiovascular disease is a chronic disease, and people are more focused on the impact of their behavioral habits in daily life on the disease.
The final multilevel classification system of users’ needs for the online medical and health community has 4 types of disease topics in level 1 and 8 types of requirement topics in level 2. Vascular diseases mainly include hypertension and arterial occlusive diseases, and users are more interested in the information on “Prevention and recuperation.” Cardiovascular abnormalities mainly include congenital heart disease, and users are more interested in the information on “Clinical examination.” Users thus have different information needs for different diseases, and practical applications should take this into account by differentiating the information services provided for different diseases. For users who want accurate and fast answers, knowing more granular, specific needs can help the site assign more experienced medical professionals accordingly.
Category | Meaning | Keywords | Sample excerpts |
Operative treatment | Users inquire about surgical procedures, postoperative recovery, surgical costs, success rates, etc | Surgery, careful examination, treatment, causes, surgery | What to look for after valve replacement surgery for valvular disease |
Drug treatment | Users ask questions about medications for diseases or side effects of medications | Examinations, medications, electrocardiogram, routine, treatment | What are the side effects of bei jing ling hao and how to relieve them? |
Self-reported symptoms | Users consult whether they are ill by describing their physical condition and symptoms | Feelings, body, discomfort, causes, dizziness | What’s wrong with a red face and dizziness? The heart is weak, the blood pressure is slightly higher, the high pressure is about 150 (the patient is 64 years old), there are symptoms of panic and dizziness, and the face is often red with headaches. |
Clinical examination | Users consult about their condition by describing their clinical examination results or ask which tests are required for the disease they describe | Body, health check, high, impact, danger | Classmates were found to have sinus irregularities. After careful examination, the teacher took the electrocardiogram and said that the PR interval prolongation was diagnosed as sinus arrhythmia. Hello, will the sinus arrhythmia PR interval prolong? |
Therapeutic method | Users consult treatments such as surgery, drugs, or folk remedies or the effectiveness of a certain treatment | Treatment, condition, effect, method, time | What should I do to treat my hypertensive nephropathy? I have hypertensive nephropathy, how should I implement the treatment to save my life? |
Prevention and recuperation | Users consult on how to prevent the occurrence of diseases or inhibit the development of diseases, such as eating habits, daily routines | Attention, diet, food, usual, spicy | I just have low blood pressure now and then I want to do a diet regimen, How to do low blood pressure diet conditioning? |
Disease causes | Users consult about the cause or type of disease | What’s going on, take medication, feel, uncomfortable, effect | What is the cause of swollen legs and feet in hemiplegia? |
Clinical signs and symptoms | Users self-diagnose by asking about the clinical symptoms of a certain disease | Symptoms, breathing, presentation, findings, sensations | I am male and 46 years old. In the first half of the year found sweating, dry heat feeling, and shortness of breath. What are the possible symptoms of coronary heart disease? |
Empirical Research of Multilevel Classification of Users’ Needs Based on GCN
We selected 6 sets of parameter combinations commonly used in studies for training and comparing the experimental results. During the training process, we trained Text-GCN for a maximum of 200 epochs and introduced an early stopping mechanism to monitor the validation loss. The training process is stopped if the validation loss does not decrease for more than 10 consecutive epochs. The comparison experiment shows that the combination of dropout=0.3 and learning rate=0.01 gives the best classification results. The model classification effect obtained under different hyperparameters is shown in the first table below. The multilevel GCN model was trained according to the optimal hyperparameters, and the results of the multilevel classification of users’ needs obtained by the experiment are shown in the second table below. For the lowest level of labels, the model proposed in this study achieved accuracy, precision, recall, and F1-score of 0.6265, 0.6328, 0.5788, and 0.5912, respectively.
The learning curve of the multilevel GCN shows that after 100 iterations (epoch=100), the model tends to be stable; the loss at the second level on the test set trends upward in the later stage, which indicates slight overfitting.
Parameters | Combination 1 | Combination 2 | Combination 3 | Combination 4 | Combination 5a | Combination 6 |
Learning rate | 0.05 | 0.01 | 0.005 | 0.05 | 0.01 | 0.005 |
Dropout | 0.5 | 0.5 | 0.5 | 0.3 | 0.3 | 0.3 |
Accuracy | 0.6085 | 0.6220 | 0.6145 | 0.5965 | 0.6265 | 0.6130 |
Precision | 0.6057 | 0.6283 | 0.6213 | 0.5672 | 0.6328 | 0.6186 |
Recall | 0.5696 | 0.5711 | 0.5514 | 0.5278 | 0.5788 | 0.5465 |
F1-score | 0.5769 | 0.5847 | 0.5640 | 0.5306 | 0.5912 | 0.5601 |
aThis parameter combination obtained the optimal model effect and is used for model training in subsequent demonstrations.
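The early stopping rule used during training (at most 200 epochs, stopping once the validation loss has not decreased for more than 10 consecutive epochs) can be sketched as follows. The sketch assumes that "not decreased" is measured against the best validation loss seen so far; the function name and the loss sequence are hypothetical, and a real training loop would compute each loss instead of reading it from a list.

```python
def early_stopping_epoch(val_losses, max_epochs=200, patience=10):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            # Stop once the loss has failed to improve for MORE than
            # `patience` consecutive epochs.
            if epochs_without_improvement > patience:
                return epoch, best_epoch
    return min(len(val_losses), max_epochs), best_epoch


# Validation loss improves for 5 epochs and then plateaus (toy values).
val_losses = [1.0, 0.8, 0.7, 0.65, 0.6] + [0.61] * 20
stopped_at, best_epoch = early_stopping_epoch(val_losses)
```

In this toy run, the best loss occurs at epoch 5 and training stops at epoch 16, the 11th consecutive epoch without improvement.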
Level | Accuracy | Precision | Recall | F1-score |
Level_1a | 0.8215 | 0.8323 | 0.8264 | 0.8288 |
Level_2b | 0.7535 | 0.7613 | 0.7538 | 0.7566 |
Final_levelc | 0.6265 | 0.6328 | 0.5788 | 0.5912 |
aIndicates the prediction of the first-level label.
bIndicates the prediction of the second-level label.
cFinal_level indicates the final prediction result combining the first- and second-level labels.
Comparison Verification
To verify the effectiveness of the GCN in multilevel classification, the bottom prediction result (Final_level) is taken for multilevel comparison validation of different classification models and single-level comparison validation of the same classification model.
Multilevel Classification Comparison Validation
To verify the effectiveness of the multilevel GCN classification model used in this paper, 2 hierarchical classification models were utilized for multilevel classification and testing on the same data set. One is a homogeneous combined classifier that uses naïve Bayes (NB) as the base classifier, and the other is a multilevel CNN model named hierarchical text classification CNN (HCCNN). Consistent classification evaluation metrics were used to compare the bottom-level label prediction performance, as shown in the table below.
According to the comparison results, the multilevel GCN model used in this paper outperforms the NB hierarchical classifier and the HCCNN in training and testing, with accuracy improvements of 2.4 and 9.8 percentage points, respectively. The accuracy and loss curves of the final label prediction of the multilevel GCN and HCCNN models show that training of the HCCNN model was stopped early at the 56th epoch, where it reached its optimum, and that its accuracy differs considerably from that of the multilevel GCN.
Models | Accuracy | Precision | Recall | F1-score |
TF-IDFb + NBc | 0.6025 | 0.6361 | 0.4853 | 0.4994 |
Word2vec + HCCNNd | 0.5285 | 0.3817 | 0.4966 | 0.3840 |
Multilevel GCNe | 0.6265 | 0.6328 | 0.5788 | 0.5912 |
aBest results (proposed model) are italicized.
bTF-IDF: term frequency-inverse document frequency.
cNB: naïve Bayes.
dHCCNN: hierarchical text classification convolutional neural network.
eGCN: graph convolutional network.
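The gaps between the models can be recomputed directly from the table above. The following is a minimal sketch that only restates the reported values; the dictionary layout and the helper name `gap_in_points` are illustrative choices, not part of the original study.

```python
# Metric values copied from the comparison table above.
RESULTS = {
    "TF-IDF + NB": {"accuracy": 0.6025, "precision": 0.6361, "recall": 0.4853, "f1": 0.4994},
    "Word2vec + HCCNN": {"accuracy": 0.5285, "precision": 0.3817, "recall": 0.4966, "f1": 0.3840},
    "Multilevel GCN": {"accuracy": 0.6265, "precision": 0.6328, "recall": 0.5788, "f1": 0.5912},
}


def gap_in_points(proposed, baseline):
    """Per-metric difference (proposed minus baseline) in percentage points."""
    return {m: round((proposed[m] - baseline[m]) * 100, 2) for m in proposed}


gcn = RESULTS["Multilevel GCN"]
gaps = {
    name: gap_in_points(gcn, scores)
    for name, scores in RESULTS.items()
    if name != "Multilevel GCN"
}

for name, gap in gaps.items():
    print(name, gap)
```

The accuracy gaps come out as 2.4 and 9.8 percentage points, matching the text. The same calculation shows that the precision gap against TF-IDF + NB is slightly negative, so the advantage of the multilevel GCN over NB lies mainly in recall and F1-score.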
Single-Level Classification Comparison Validation
To verify whether multilevel classification has an advantage over single-level classification of the bottom labels, the multilevel GCN classification results were compared with the single-level GCN classification results. For the single-level model, the 2 layers of labels were combined to generate the bottom labels, giving a total of 32 categories (4 disease types × 8 information need types). The table below shows the performance evaluation and classification results obtained.
As seen from the table, the multilevel GCN model improved accuracy, precision, recall, and F1-score by 11.40, 14.02, 14.94, and 14.94 percentage points, respectively, compared with the single-level GCN model. The loss and accuracy curves of the final label prediction for the single-level GCN model show that its training was stopped early at the 106th iteration. When the same classification method is used, multilevel classification therefore yields better results.
Model | Accuracy | Precision | Recall | F1-score |
Single-level GCNb | 0.5125 | 0.4926 | 0.4294 | 0.4418 |
Multilevel GCN | 0.6265 | 0.6328 | 0.5788 | 0.5912 |
aItalicized entries are the optimal results of the comparison experiment.
bGCN: graph convolutional network.
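The single-level baseline relies on fusing the 2 label layers into one flat label set. The sketch below shows one way to do this, assuming a simple index encoding; the paper does not state how the fused labels were encoded, and the names `fuse` and `split` are hypothetical.

```python
from itertools import product

N_DISEASE_TYPES = 4  # first-level labels, as reported in the paper
N_NEED_TYPES = 8     # second-level labels, as reported in the paper


def fuse(level1, level2, n_level2=N_NEED_TYPES):
    """Map a (first-level, second-level) index pair to one flat label index."""
    return level1 * n_level2 + level2


def split(fused, n_level2=N_NEED_TYPES):
    """Recover the (first-level, second-level) pair from a flat label index."""
    return divmod(fused, n_level2)


fused_labels = [
    fuse(a, b) for a, b in product(range(N_DISEASE_TYPES), range(N_NEED_TYPES))
]
print(len(fused_labels))
```

With 4 disease types and 8 information need types, the flat label space has 32 classes, so a single-level classifier has to separate many more categories per decision than either level of the multilevel model.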
Discussion
Principal Findings
In the era of big data and artificial intelligence, internet-based health and medical services have become a necessary supplement to the traditional model of health management, disease diagnosis, and treatment, but they also face a series of major challenges. Among them, rapidly classifying users’ information needs and efficiently matching them with expert diagnosis, treatment, and health guidance information are important for realizing online health guidance, diagnosis, and treatment. In this study, a semiautomated label recognition method based on “text clustering + manual recognition” was proposed to explore the multidimensional health needs classification system of users in an online medical and health community. At the same time, a double-layer GCN classification model was constructed to meet the needs of various users in the “Cardiovascular Disease” section of the online medical and health community. The first level covered 4 disease types, while the second level covered 8 health information demand types. To the best of our knowledge, this study is the first to propose a classification system that combines disease needs and medical information needs. By analyzing users’ information needs, we found that patients with different diseases differ markedly in their information needs. For example, users with vascular diseases are more interested in information about “Daily prevention and care,” whereas users with cardiovascular abnormalities are more interested in “Clinical examination” information. In addition, according to a previous study [
], users of the “Tumor” section of Qiuyi pay more attention to information about “treatment,” whereas users of the “Cardiovascular Disease” section in this study paid more attention to information about “daily prevention and nursing.” As a result, there is an increased need for differentiated information services in online medical and health communities.
The GCN model constructed in this study was used for multilevel classification, and the data set of Qiuyi, an online medical and health community in China, was used for empirical research. First, the traditional machine learning algorithm NB and the hierarchical text classification model HCCNN were introduced to perform the multilevel classification comparison experiment. With TF-IDF used to extract text features, NB served as the base classifier of a combined classifier, which obtained accuracy, precision, recall, and F1-score of 0.6025, 0.6361, 0.4853, and 0.4994, respectively. Combining Word2Vec with HCCNN yielded accuracy, precision, recall, and F1-score of 0.5285, 0.3817, 0.4966, and 0.3840, respectively. By comparison, the multilevel GCN classification model constructed in this study achieved 0.6265, 0.6328, 0.5788, and 0.5912 for accuracy, precision, recall, and F1-score, respectively. Compared with the TF-IDF + NB method, it shows a small improvement in accuracy and a marginally lower precision (0.6328 vs 0.6361), but a large improvement in recall and F1-score. Compared with Word2Vec + HCCNN, the GCN model presents obvious advantages on all 4 metrics. The experimental results show that the GCN-based model performs better overall than the other models in the multilevel classification of the data set in this paper. Second, to verify the effectiveness of multilevel classification, the 2 layers of user need labels were fused to conduct a single-level classification. The multilevel classification outperformed the single-level classification in accuracy (11 percentage points), precision (14 percentage points), recall (15 percentage points), and F1-score (15 percentage points), which indicates the necessity of multilevel classification. In general, the model constructed in this study shows good performance in classifying users’ needs in online medical and health communities. The advantage of this study is that it presents the first classification model that includes both disease types and medical information needs, and it is based on an online medical and health community containing a large amount of information, which supports the generality of the study.
Limitations and Future Research
This article also has some limitations that must be considered. First, the adaptability and robustness of the multilevel classification model can be further improved. In this study, only the questions asked by users in the “Cardiovascular Disease” section of the online medical and health community were selected for analysis. The sample covers a single type, the number of samples is limited, and the architecture of health information requirements is relatively simple, which limits the adaptability and robustness of the multilevel health needs classification algorithm and framework. For multiple disease types and large sample data, the efficiency and effectiveness of the hierarchical classification model proposed in this study need to be further verified.
Second, in terms of sample data selection, this study focuses on the data (questions) of a single disease section on a single platform, without considering data fusion across multiple sections, platforms, and disease types. While analyzing the information needs in users’ questions, we only focused on the identification of a single disease type, which artificially reduces the complexity of natural language processing at the cost of not capturing all the semantic features in users’ health questions. For example, by observing the data, it can be found that some users asked about information that does not belong to the category of cardiovascular diseases at all, whereas the data extracted in this paper only relate to the category of cardiovascular diseases. Some questions involved both the cardiovascular and tumor sections, but the latter could not be identified during classification. In subsequent studies, the data volume can be increased to support larger-scale empirical studies, and data fused across disease types can be considered to conduct classification studies with a higher degree of complexity.
Third, the number of levels in the disease classification system can be further extended. In this study, for the category of cardiovascular diseases, we established a 2-level manual labeling and category system according to the Chinese national standard “GB/T 14396-2016 Classification and Codes of Diseases” and MeSH. Therefore, the multilevel classification model based on GCN constructed in this study is only applicable to the classification problem of a 2-layer category system. Thus, we should aim to build a health information classification model that supports more levels. This will enable users’ needs to be mined at a finer level of granularity and make the identification of users’ demands more refined and precise. Finally, the experimental process of this study also has some areas that can be further improved. In the future, we can study more appropriate clustering methods and compare their performance on our data set, and try more classification methods for comparative analysis, so as to further improve the accuracy of classification.
Conclusions
This study focused on the multilevel classification of users’ needs in an online medical and health community and constructed a multilevel classification system of users’ needs using a semiautomatic label recognition method. When applied to the data extracted from Qiuyi, an online medical and health community in China, the multilevel GCN classification model outperformed the other models tested; it can better learn the category features of users’ question text and thus make more accurate predictions. Compared with the existing research methods, the classification system we constructed not only locates the needs of users more accurately but also provides direction for users’ information collection, which in turn helps medical professionals to provide higher quality information. To sum up, this study provides a possible approach for improving users’ experience and for other types of multilevel classification tasks.
In the future, the multilayer classification framework of this study will be evaluated in more online medical and health communities and disease sections, and more advanced algorithms will be considered to further improve the performance of the classification method.
Acknowledgments
This work was partly supported by the Social Science Fund of China (Grant Number 19BTQ072).
Data Availability
The data are available at [
].
Conflicts of Interest
None declared.
References
- Yang H, Guo X, Wu T, Ju X. Exploring the effects of patient-generated and system-generated information on patients’ online search, evaluation and decision. Electronic Commerce Research and Applications 2015 May;14(3):192-203.
- Wicks P, Massagli M, Frost J, Brownstein C, Okun S, Vaughan T, et al. Sharing health data for better outcomes on PatientsLikeMe. J Med Internet Res 2010 Jun 14;12(2):e19.
- Lin T, Hsu JS, Cheng H, Chiu C. Exploring the relationship between receiving and offering online social support: A dual social support model. Information & Management 2015 Apr;52(3):371-383.
- Liu J, Wang J. Users' Intention to Continue Using Online Mental Health Communities: Empowerment Theory Perspective. Int J Environ Res Public Health 2021 Sep 07;18(18):9427.
- Liu W, Fan X, Ji R, Jiang Y. Perceived Community Support, Users' Interactions, and Value Co-Creation in Online Health Community: The Moderating Effect of Social Exclusion. Int J Environ Res Public Health 2019 Dec 27;17(1):204.
- Zhang X, Liu S. Understanding relationship commitment and continuous knowledge sharing in online health communities: a social exchange perspective. JKM 2021 Jun 25;26(3):592-614.
- Xiao N, Sharman R, Rao H, Upadhyaya S. Factors influencing online health information search: An empirical analysis of a national cancer-related survey. Decision Support Systems 2014 Jan;57:417-427.
- Viswanath K, Kreuter MW. Health disparities, communication inequalities, and eHealth. Am J Prev Med 2007 May;32(5 Suppl):S131-S133.
- Hong Z, Deng Z, Evans R, Wu H. J Med Internet Res 2020 Apr 16;22(4):e13071.
- Plinsinga ML, Besomi M, Maclachlan L, Melo L, Robbins S, Lawford BJ, et al. Exploring the Characteristics and Preferences for Online Support Groups: Mixed Method Study. J Med Internet Res 2019 Dec 03;21(12):e15987.
- Luo A, Xin Z, Yuan Y, Wen T, Xie W, Zhong Z, et al. Multidimensional Feature Classification of the Health Information Needs of Patients With Hypertension in an Online Health Community Through Analysis of 1000 Patient Question Records: Observational Study. J Med Internet Res 2020 May 29;22(5):e17349.
- McRoy S, Rastegar-Mojarad M, Wang Y, Ruddy KJ, Haddad TC, Liu H. Assessing Unmet Information Needs of Breast Cancer Survivors: Exploratory Study of Online Health Forums Using Text Classification and Retrieval. JMIR Cancer 2018 May 15;4(1):e10.
- Kuruppu DC, Wijeyaratne CN, Gunawardane N, Amarasinghe I. Assessment of Information Needs of Patients with Breast Cancer: A Hospital-Based Study in Sri Lanka. Asia Pac J Oncol Nurs 2020;7(4):375-381.
- Basch EM, Thaler HT, Shi W, Yakren S, Schrag D. Use of information resources by patients with cancer and their companions. Cancer 2004 Jun 01;100(11):2476-2483.
- Wang J, Wang L, Xu J, Peng Y. Information Needs Mining of COVID-19 in Chinese Online Health Communities. Big Data Research 2021 May;24:100193.
- Pérez-Pérez M, Pérez-Rodríguez G, Fdez-Riverola F, Lourenço A. Using Twitter to Understand the Human Bowel Disease Community: Exploratory Analysis of Key Topics. J Med Internet Res 2019 Aug 15;21(8):e12610.
- Chen AT. Exploring online support spaces: using cluster analysis to examine breast cancer, diabetes and fibromyalgia support groups. Patient Educ Couns 2012 May;87(2):250-257.
- Beaver K, Booth K. Information needs and decision-making preferences: comparing findings for gynaecological, breast and colorectal cancer. Eur J Oncol Nurs 2007 Dec;11(5):409-416.
- Qian Y, Gui W. Identifying health information needs of senior online communities users: a text mining approach. AJIM 2020 Dec 08;73(1):5-24.
- Guo H, Na X, Hou L, Li J. Classifying Chinese Questions Related to Health Care Posted by Consumers Via the Internet. J Med Internet Res 2017 Jun 20;19(6):e220.
- Ye Z, Jiang G, Liu Y, Li Z, Yuan J. Document and Word Representations Generated by Graph Convolutional Network and BERT for Short Text Classification. 2020 Presented at: 24th European Conference on Artificial Intelligence - ECAI 2020; August 29-September 8, 2020; Santiago de Compostela, Spain p. 1-7 URL: https://ecai2020.eu/papers/1567_paper.pdf
- Cui Z, Henrickson K, Ke R, Wang Y. Traffic Graph Convolutional Recurrent Neural Network: A Deep Learning Framework for Network-Scale Traffic Learning and Forecasting. IEEE Trans. Intell. Transport. Syst 2020 Nov;21(11):4883-4894.
- Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 2018 Jul 01;34(13):i457-i466.
- Chen X, Zheng L, Zhao C, Wang Q, Li M. RRGCCAN: Re-Ranking via Graph Convolution Channel Attention Network for Person Re-Identification. IEEE Access 2020;8:131352-131360.
- Wei Z, Gui Z, Zhang M, Yang Z, Mei Y, Wu H, et al. Text GCN-SW-KNN: a novel collaborative training multi-label classification method for WMS application themes by considering geographic semantics. Big Earth Data 2021 Feb 24;5(1):66-89.
- Yoon H, Gounley J, Young M, Tourassi G. Information Extraction from Cancer Pathology Reports with Graph Convolution Networks for Natural Language Texts. New York, NY: IEEE; 2019 Presented at: IEEE International Conference on Big Data (Big Data); December 9, 2019; Los Angeles, CA p. 4561-4564.
- Liu B, Liu X, Ren H, Qian J, Wang Y. Text multi-label learning method based on label-aware attention and semantic dependency. Multimed Tools Appl 2022 Jan 25;81(5):7219-7237.
- Yao L, Mao C, Luo Y. Graph Convolutional Networks for Text Classification. Palo Alto, CA: AAAI Press; 2019 Jul 17 Presented at: Proceedings of the AAAI Conference on Artificial Intelligence; July 17, 2019; Honolulu, HI p. 7370-7377.
- Gao W, Huang H. A gating context-aware text classification model with BERT and graph convolutional networks. IFS 2021 Mar 02;40(3):4331-4343.
- Zhao H, Xie J, Wang H. Graph Convolutional Network Based on Multi-Head Pooling for Short Text Classification. IEEE Access 2022;10:11947-11956.
- Wu B. Patient Continued Use of Online Health Care Communities: Web Mining of Patient-Doctor Communication. J Med Internet Res 2018 Apr 16;20(4):e126.
- Qiuyi. URL: http://ask.qiuyi.cn/ [accessed 2023-04-12]
- National Health Commission of the People's Republic of China. Ethical Guidelines on Bioscience and Medical Research Involving Human Subjects. The Central People's Government of the People's Republic of China. 2023. URL: http://www.gov.cn/zhengce/zhengceku/2023-02/28/content_5743658.htm [accessed 2023-03-06]
- National Health Commission of the People's Republic of China. GB/T 14396-2016 Classification and codes of diseases. Standardization Administration. 2016. URL: https://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno=8127A7785CA677952F9DA062463CBC41 [accessed 2023-04-13]
- Lu Q, Zhu A, Zhang J, Chen J. Research on User Information Requirement in Chinese Network Health Community: Taking Tumor-forum Data of Qiuyi as an Example. Data Anal Knowl Discov 2019;3(4):22-32.
- Lin Y. Users data. GitHub. 2023. URL: https://github.com/llinyingru/multi-level/tree/users_data [accessed 2023-04-03]
Abbreviations
BERT: bidirectional encoder representations from transformers |
CNN: convolutional neural network |
GCN: graph convolutional network |
HCCNN: hierarchical text classification convolutional neural network |
MeSH: medical subject headings |
NB: naïve Bayes |
TF-IDF: term frequency-inverse document frequency |
PMI: pointwise mutual information |
ReLu: rectified linear unit |
Edited by A Mavragani; submitted 30.08.22; peer-reviewed by YJ Lee; comments to author 01.03.23; revised version received 20.03.23; accepted 22.03.23; published 20.04.23
Copyright©Quan Cheng, Yingru Lin. Originally published in JMIR Formative Research (https://formative.jmir.org), 20.04.2023.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.