This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.
Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. By contrast, synthetic data considerably reduce the restrictions on the use and sharing of real data, allowing researchers to access them more rapidly and with far fewer privacy constraints. Therefore, there has been growing interest in establishing methods to generate synthetic data that protect patients' privacy while properly reflecting the original data.
This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected.
We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of GCP decomposition, which contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of GCP decomposition associated with patients.
The findings show that our model is a promising method for generating synthetic longitudinal health data that are similar enough to the real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data.
We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a non-longitudinal and significantly smaller data set.
Electronic health records (EHRs) are becoming an increasingly important source of detailed information about patients because the successful integration and efficient analysis of EHRs could help solve many health care problems, such as expediting clinical decisions and enhancing patient safety. However, researchers often encounter challenges when trying to obtain high-quality health data for their research, and EHRs need to be appropriately deidentified before being shared with researchers. This process requires both skill and effort.
A method for deidentifying data, including health data, is anonymization. However, recent reports have demonstrated successful reidentification attacks on anonymized data [
Therefore, there has been growing interest in establishing a method to simulate synthetic data that protects patients’ privacy while properly reflecting the data. One method for generating synthetic data is to choose an appropriate model, fit it within privacy constraints, and then simulate new data from the fitted model.
Furthermore, dimensionality reduction converts the highdimensional description of the data into a low dimension without losing crucial information and phenotypes [
Classical phenotyping methods, which require medical field experts, can be time-consuming and expensive [
The goal of this study is to develop a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected. We propose a model that relies on generalized canonical polyadic (GCP) tensor decomposition, which has demonstrated substantial outcomes in health data analysis [
There is a widespread belief that synthetic data present an insignificant privacy risk because there is no unique link or mapping between the records in the synthetic data and the records in the original data [
Tensor decomposition is an active area of research that has been widely applied to health care data [
The CP decomposition approximates a tensor by the sum of rank-1 tensors using squared errors (L2 loss) [
CP factorization [
In this section, we present the methods that we used to generate synthetic longitudinal health data using generalized CP decomposition and various sampling and simulation techniques. We also describe the data, the evaluation method, and the experimental details of our study.
We first describe the preliminaries and notations used in this paper. Before we begin,
The GCP decomposition approximates a
Where
In addition, it is often convenient to express the decomposition with a positive weights vector of
The GCP decomposition was carried out by minimizing the loss between
Where
Therefore, for a 3-way tensor
Where
For simplicity, we discussed a 3-way tensor scenario; however, this approach generalizes to
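To make the decomposition concrete, a rank-R CP/GCP model reconstructs a 3-way tensor as a weighted sum of outer products of factor-matrix columns. The following is a minimal numpy sketch of that reconstruction; the function name and argument layout are ours for illustration, not the authors' code:

```python
import numpy as np

def reconstruct_cp(weights, A, B, C):
    """Rebuild a 3-way tensor from CP/GCP factors.

    weights: length-R vector of positive component weights.
    A, B, C: factor matrices of shapes (I, R), (J, R), (K, R),
             one per tensor mode (e.g., patients, tests, visits).
    """
    I, R = A.shape
    X = np.zeros((I, B.shape[0], C.shape[0]))
    for r in range(R):
        # each component is a weighted outer product of three factor columns
        X += weights[r] * np.einsum("i,j,k->ijk", A[:, r], B[:, r], C[:, r])
    return X
```

Fitting the factors themselves under a chosen loss is what the GCP algorithm does; this sketch only shows the model form being fitted.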
The notations used in this paper.
Notations  Descriptions 
◦  Outer product 

Number of dimensions (modes) of a tensor 

Number of ranks 
The generalized canonical polyadic decomposition.
Our goal was to generate synthetic longitudinal health data, which we will refer to as
In addition, the patient factor matrix
To address missing observations in the GCP model, Hong et al [
Let
We proposed combining
Furthermore, the EHR data might form an irregular tensor, with patients having varying numbers of clinical visits. However, the input for CP and generalized CP decompositions must be a regular tensor. Therefore, we proposed converting the irregular observed tensor into a regular one by adding extra missing visits. We then performed the GCP decomposition and added the number of clinical visits as a new variable to the patient factor matrix
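The irregular-to-regular conversion described above can be sketched as follows: pad each patient's visit sequence with missing values up to the maximum visit count, and keep the true visit counts so they can be appended to the patient factor matrix as an extra variable. This is an illustrative helper (our naming, numpy assumed), not the authors' implementation:

```python
import numpy as np

def pad_to_regular(patient_visits, n_features):
    """Turn an irregular collection of per-patient visit matrices into a
    regular patients x features x visits tensor, padding with NaN.

    patient_visits: list of arrays, each of shape (n_visits_i, n_features).
    Returns the padded tensor and the true visit counts, which can be
    appended to the patient factor matrix as an additional variable.
    """
    max_visits = max(p.shape[0] for p in patient_visits)
    X = np.full((len(patient_visits), n_features, max_visits), np.nan)
    n_visits = np.empty(len(patient_visits), dtype=int)
    for i, p in enumerate(patient_visits):
        n_visits[i] = p.shape[0]
        X[i, :, : p.shape[0]] = p.T  # slots beyond n_visits[i] stay NaN
    return X, n_visits
```

The NaN entries are then treated as missing observations by the GCP fitting step, exactly like genuinely missing laboratory values.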
The generative model in terms of the generalized canonical polyadic decomposition.
This section summarizes the copula principles that we used in this study. Synthetic data generation using copula has recently attracted considerable attention because some deep learning generative models, such as generative adversarial networks (GANs), require a very large data set for the learning stage and are therefore unproductive for small data sets. Copula models might be the most effective method to describe dependencies and marginal distributions. Furthermore, these models appear to be among the best options for data synthesis based on complex and small actual data sets [
We assumed that
Generated variables
Generated the uniform variables such that
Generated samples
However, the sampling will never be exact, and because we were sampling in the latent space, the correlation structure of the synthetic tensor might not match that of the original. Therefore, we suggest selecting a sample such that the Frobenius norm of the difference between the correlation matrices of the original and synthetic data is less than a threshold ε, a selection that can be framed as an optimization problem.
Finally, we produced the synthetic patient factor matrix from the obtained samples, such that
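The copula steps above can be sketched as follows. This is a minimal illustration assuming a Gaussian copula with empirical-CDF marginals (one of the marginal choices discussed later); the function name and the rejection loop on the Frobenius-norm criterion are our rendering of the described procedure, not the paper's code:

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(U_obs, n_samples, eps=0.1, max_tries=50, seed=None):
    """Draw synthetic rows of a factor matrix via a Gaussian copula with
    empirical-CDF marginals, keeping a draw only when the Frobenius norm of
    the correlation-matrix difference is below the threshold eps."""
    rng = np.random.default_rng(seed)
    n, R = U_obs.shape
    # 1. normal scores from the empirical CDF of each column
    Z = stats.norm.ppf(stats.rankdata(U_obs, axis=0) / (n + 1))
    corr = np.corrcoef(Z, rowvar=False)
    target = np.corrcoef(U_obs, rowvar=False)
    synth = None
    for _ in range(max_tries):
        # 2. correlated Gaussians -> uniforms via the normal CDF
        G = rng.multivariate_normal(np.zeros(R), corr, size=n_samples)
        Uu = stats.norm.cdf(G)
        # 3. invert each empirical marginal (quantile transform)
        synth = np.column_stack(
            [np.quantile(U_obs[:, j], Uu[:, j]) for j in range(R)]
        )
        gap = np.linalg.norm(np.corrcoef(synth, rowvar=False) - target, "fro")
        if gap < eps:
            break
    return synth
```

Parametric marginals (gamma, beta, truncated Gaussian) would replace the quantile transform in step 3 with fitted inverse CDFs.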
Another technique that we proposed is to synthesize the patient factor matrix
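A simplified stand-in for the sequential trees idea is sketched below: each column of the factor matrix is modeled by a regression tree on the previously synthesized columns, and synthetic values are drawn from the observed values in the matching leaf. This uses scikit-learn's DecisionTreeRegressor and is only an illustration of the sequential principle, not the authors' implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sequential_tree_synthesize(U, n_samples, seed=None):
    """Synthesize a factor matrix column by column: column j is modeled by a
    regression tree on columns 0..j-1, and synthetic values are sampled from
    the observed values ("donors") falling in the same leaf."""
    rng = np.random.default_rng(seed)
    n, R = U.shape
    synth = np.empty((n_samples, R))
    # first column: bootstrap from its marginal distribution
    synth[:, 0] = rng.choice(U[:, 0], size=n_samples, replace=True)
    for j in range(1, R):
        tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
        tree.fit(U[:, :j], U[:, j])
        obs_leaf = tree.apply(U[:, :j])      # leaf of each observed row
        new_leaf = tree.apply(synth[:, :j])  # leaf of each synthetic row
        for i, leaf in enumerate(new_leaf):
            donors = U[obs_leaf == leaf, j]  # observed values in that leaf
            synth[i, j] = rng.choice(donors)
    return synth
```

Because `n_samples` is free, this is also one way the model can generate a different number of synthetic patients than the original data contain.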
In this section, we sampled from the patient factor matrix
We began by assuming that the patient factor matrix variables
The model block of STAN is provided in section S2 in
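To illustrate the HMC approach, the sketch below hand-rolls a leapfrog HMC sampler targeting a multivariate Gaussian fitted to the observed factor matrix. The paper fits its model in Stan; this minimal numpy version only shows the mechanics under the multivariate Gaussian assumption, and all names are ours:

```python
import numpy as np

def hmc_gaussian_factors(U, n_samples, step=0.1, n_leapfrog=20, seed=None):
    """Sample synthetic factor rows with HMC, targeting a multivariate
    Gaussian fitted to the observed factor matrix U (mean and covariance)."""
    rng = np.random.default_rng(seed)
    mu = U.mean(axis=0)
    prec = np.linalg.inv(np.cov(U, rowvar=False))  # precision of fitted MVN

    def logp(x):                     # log density up to a constant
        d = x - mu
        return -0.5 * d @ prec @ d

    def grad(x):                     # gradient of logp
        return -prec @ (x - mu)

    x, out = mu.copy(), []
    for _ in range(n_samples):
        p0 = rng.standard_normal(mu.size)
        # leapfrog integration: half kick, (L-1) x (drift, kick), drift, half kick
        xq, pq = x.copy(), p0 + 0.5 * step * grad(x)
        for _ in range(n_leapfrog - 1):
            xq = xq + step * pq
            pq = pq + step * grad(xq)
        xq = xq + step * pq
        pq = pq + 0.5 * step * grad(xq)
        # Metropolis correction on the change in total energy
        dH = (logp(xq) - 0.5 * pq @ pq) - (logp(x) - 0.5 * p0 @ p0)
        if np.log(rng.uniform()) < dH:
            x = xq
        out.append(x.copy())
    return np.array(out)
```

As the Results note, the quality of this route hinges on how well the assumed distribution matches the latent space; with a misspecified target, the chain samples the wrong geometry no matter how well it mixes.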
This study used the MIMIC-III data set [
The continuous data set that we derived from the MIMIC-III data and used to evaluate the performance of our proposed model is a 3-way tensor of laboratory measurements for patients within the hospital who had 36 clinical visits. This is derived from the “LABEVENT” table in MIMIC-III. The resulting tensor consists of 226 patients, 4 laboratory tests (creatinine, potassium, sodium, and hematocrit), and 36 clinical visits, with 21% of the data missing. The categorical data set used in our analysis is a tensor consisting of 246 patients, 2 categorical features (admission type and admission location), and 5 clinical visits, with no missing entries. This was obtained from the “ADMISSIONS” table of MIMIC-III. The detailed descriptions of the tables can be found in the study by Johnson et al [
We describe how we evaluated the utility and privacy risks of the synthetic data set.
The analysis of synthetic data should provide similar statistical inferences and conclusions to those obtained from actual data analysis. Therefore, we evaluated the proposed method on the utility aspect of the generated data [
We assessed the ability of the developed generalized CP framework to create synthetic data in terms of dependency structure and marginal fitting (univariate distribution similarity) using the following:
The absolute difference in correlations between variables in the original and synthetic data
The Hellinger distance between the synthetic and original variables indicates whether they are drawn from the same distribution. The Hellinger distance is a metric in the range of 0 to 1, where 0 indicates no difference between the distributions.
The root mean square difference between the correlations of the original variables and those of the corresponding synthetic variables (RMSDC) measures how well the dependency structure is captured; a lower RMSDC indicates better capture of the dependency structure.
Descriptive statistics
The statistical characteristics of a synthetic data set need to match those of the original data. However, a single record in the synthetic data does not relate to or correspond to a single record in the original data set [
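The two distributional metrics above can be sketched in a few lines of numpy. The histogram binning for the Hellinger distance and the use of upper-triangular correlations for the RMSDC are our implementation choices for illustration:

```python
import numpy as np

def hellinger(x, y, bins=20):
    """Hellinger distance between the empirical distributions of x and y,
    estimated on a shared histogram; 0 = identical, 1 = disjoint."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def rmsdc(orig, synth):
    """Root mean square difference between the upper-triangular pairwise
    correlations of the original and synthetic data matrices."""
    co = np.corrcoef(orig, rowvar=False)
    cs = np.corrcoef(synth, rowvar=False)
    iu = np.triu_indices_from(co, k=1)
    return np.sqrt(np.mean((co[iu] - cs[iu]) ** 2))
```

The absolute difference in correlations reported in the box plots is simply `np.abs(co[iu] - cs[iu])` before the root-mean-square aggregation.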
We evaluated 3 different simulation and sampling techniques on the generalized CP's patient factor matrix, which contains patient phenotypes. Initially, we arranged the study into trials on dense and sparse continuous data sets and experiments on dense categorical data. Therefore, we started by imputing the continuous data set with 21% missing observations using the GCP decomposition. We then used all 3 previously mentioned techniques to synthesize the imputed version of the original tensor, as well as the imputed data containing just the first 5 or 10 clinical visits.
The GCP decomposition was conducted using various loss functions depending on the kind of variables in our trials. These included gamma, β-divergence (similar to gamma), and Gaussian with nonnegativity constraints. Although the continuous data set is nonnegative, the Gaussian loss (L2 loss) outperformed the others for all 3 approaches because the dependency structure and univariate distributions of the original variables were significantly better preserved in the generated synthetic data. Refer to section S1 in
Finding the rank of a tensor is necessary for GCP decomposition. Following the selection of the loss function, we attempted to find the rank
The GCP tensor decomposition was conducted using the algorithm of Hong et al [
We applied the copula method using parametric and nonparametric marginals. For the nonparametric marginals, we used the empirical cumulative distribution function (CDF) and kernel smoothing. In addition, we performed the Gaussian copula based on parametric marginals using gamma, beta, and truncated Gaussian distributions, as suggested in the study by Benali et al [
We successfully synthesized the patient factor matrix
Finally, we applied sequential trees to validate the generative model on the data set with missing and irregular clinical visits. The categorical data trial was accomplished by GCP decomposition using Poisson with log link and Gaussian losses. Following that, the patient factor matrix was sampled using all 3 approaches.
This study was approved by the Children’s Hospital of Eastern Ontario (CHEO) Research Institute Research Ethics Board (protocol number 24/18X). The MIMICIII database is a thirdparty anonymous public database approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, Massachusetts) and the Massachusetts Institute of Technology (Cambridge, Massachusetts).
In this section, we report and analyze the results of our experiments on generating synthetic longitudinal health data using the proposed model. We demonstrate that our method is capable of handling different data structures and scenarios. We conducted numerous experiments and considered the following structures for both the original data and the synthetic data to validate our model:
Synthetic data for dense original data with continuous variables
Synthetic data with a varying number of patients compared to the original data
Synthetic data for original data with missing observations
Synthetic data for original data with irregular clinical visits
Synthetic data for dense original data with categorical variables
According to the evaluations, β-divergence is not a suitable objective function for synthesizing continuous EHR data when the patient factor matrix is simulated using sequential trees or HMC with the multivariate Gaussian distribution model. Refer to section S3 in
We learned through several analyses that standardizing data and using Gaussian loss improves the results significantly. In the following, we present the outcomes of synthesizing a dense continuous data set that contains 226 patients, 4 laboratory test variables, and 5 clinical visits. In the preceding sections, we described how we obtained the data set for our study. The model used GCP decomposition with Gaussian loss and
According to section S4 in
According to the summaries presented in
The following are the outcomes of sampling the patient factor matrix of the previously mentioned GCP decomposition using the sequential trees approach. According to section S5 in
The summary presented in
The following are the results of the MCMC method using the HMC algorithm. If the distribution assumed by the HMC model is well specified, the resulting outcome will be more effective.
Section S6 in
On the basis of
Finally, we present
The findings and figures demonstrate that all 3 synthetic data sets have similar statistical properties in terms of dependency and univariate distributions. However, the copula and sequential trees performed slightly better than the MCMC technique when the HMC algorithm was used. In brief, we need to define a proper distribution that corresponds to the latent space to obtain decent results through HMC sampling. The Gaussian loss is the ideal loss function for simulating from the patient factor matrix using sequential trees, but the observed data must be standardized beforehand. Further analysis, which we have not included in this paper, has shown that it is preferable to use empirical CDF marginals instead of parametric ones when sampling the patient factor matrix by copula. The outcomes of generating synthetic data using β-loss in GCP decomposition can be found in section S3 in
In the following section, we will provide the results of generating synthetic data with an additional number of patients compared to the original data set.
The box plots display the variation of the Hellinger distance and the Pearson correlation between the original variables and the synthetic variables generated by copula. (A) This is the box plot of the absolute differences in bivariate correlations between the real and synthetic data. Smaller values indicate that the bivariate relationships in the data have been greatly preserved during the generation of synthetic data. (B) This is the box plot of the Hellinger distance between the original variables and the synthetic variables. This shows the similarity of the univariate distributions between the real and synthetic data. This is a value between 0 and 1, with lower values indicating similarity between the univariate distributions of the real and synthetic variables.
A summary of the copula’s synthetic variables.
Metric  Variables  

Creatinine  Potassium  Sodium  Hematocrit 
Minimum^{a}  0  2.23  110.6  13.64 
Median (IQR)  1.08 (0.52-2.21)  4.17 (3.77-4.62)  138 (134.5-141.9)  31.64 (27.97-35.75) 
Mean (SD)  1.67 (1.58)  4.22 (0.59)  138 (6.2)  31.9 (4.1) 
Maximum^{b}  14.1  7.32  157.5  48.49 
^{a}Minimum: minimum of data.
^{b}Maximum: maximum of data.
A summary of the original variables.
Metric  Variables  

Creatinine  Potassium  Sodium  Hematocrit 
Minimum^{a}  0.2  2.5  109  9.2 
Median (IQR)  1 (0.7-1.6)  4.1 (3.8-4.51)  138 (135-141.6)  31.6 (28.1-35.7) 
Mean (SD)  1.64 (1.6)  4.23 (0.62)  138.4 (6.5)  32.08 (5.82) 
Maximum^{b}  16.2  10  170  52.6 
^{a}Minimum: minimum of data.
^{b}Maximum: maximum of data.
The box plots display the variation of the Hellinger distance and the Pearson correlation between the original variables and the synthetic variables generated by sequential decision trees. (A) This is the box plot of the absolute differences in bivariate correlations between the real and synthetic data. Smaller values indicate that the bivariate relationships in the data have been greatly preserved during the generation of synthetic data. (B) This is the box plot of the Hellinger distance between the original variables and the synthetic variables. This shows the similarity of the univariate distributions between the real and synthetic data. This is a value between 0 and 1, with lower values indicating similarity between the univariate distributions of the real and synthetic variables.
A summary of the sequential decision trees’ synthetic variables.
Metric  Variables  

Creatinine  Potassium  Sodium  Hematocrit 
Minimum^{a}  0  1.88  116.4  10.09 
Median (IQR)  1.25 (0.67-1.86)  4.18 (3.7-4.71)  138.8 (134.2-143.3)  31.68 (27.01-36.38) 
Mean (SD)  1.53 (1.62)  4.23 (0.58)  138.7 (6.6)  31.87 (5.8) 
Maximum^{b}  14.84  8.16  177.4  54.8 
^{a}Minimum: minimum of data.
^{b}Maximum: maximum of data.
The box plots display the variation of the Hellinger distance and Pearson correlation between the original variables and the Hamiltonian Monte Carlo synthetic variables. (A) This is the box plot of the absolute differences in bivariate correlations between the real and synthetic data. Smaller values indicate that the bivariate relationships in the data have been greatly preserved during the generation of synthetic data. (B) This is the box plot of the Hellinger distance between the original variables and the synthetic variables. This shows the similarity of the univariate distributions between the real and synthetic data. This is a value between 0 and 1, with lower values indicating similarity between the univariate distributions of the real and synthetic variables.
A summary of the Hamiltonian Monte Carlo’s synthetic variables.
Metric  Variables  

Creatinine  Potassium  Sodium  Hematocrit 
Minimum^{a}  0  0.66  117.8  7.22 
Median (IQR)  1.85 (0.6-3.1)  4.19 (3.74-4.7)  138.6 (134.5-142.8)  32.18 (27.85-36.47) 
Mean (SD)  2.01 (1.4)  4.23 (0.54)  138.5 (5.02)  31.93 (4.8) 
Maximum^{b}  7.24  6.78  156.9  51.63 
^{a}Minimum: minimum of data.
^{b}Maximum: maximum of data.
The plots show the correlation and distribution of the original and synthetic variables. A correlation matrix displays bivariate scatter plots of the adjacent variables below the diagonal, histograms of the data distribution of the respective variables on the diagonal, and the Pearson correlation above the diagonal. Ellipses specify the direction of the correlation. The 2 panels showing information about the same pair of variables are always placed symmetrically about the diagonal. (A) This is the plot of the original variables. (B) This is the plot of synthetic variables generated by copula. (C) This is the plot of synthetic variables generated by sequential decision trees. (D) This is the plot of synthetic variables generated by Hamiltonian Monte Carlo.
We can generate different numbers of patients in the synthetic data using all 3 patient factor matrix simulations. However, we applied only the sequential trees approach here. The findings indicate that the dependency and univariate structure of the original variables are well maintained in the synthetic data.
The following are the outcomes of generating 250 synthetic patients from the patient factor matrix using sequential trees. The patient factor matrix is derived from the GCP decomposition in the previous experiment, and the original data set is the same dense data set of 226 patients used in that experiment.
In this scenario, the summary presented in
On the basis of the results of this section, we are optimistic that our model may perform even better when generating larger data sets.
The plots show the correlation and distribution of sequential decision trees' synthetic variables as well as the original variables. A correlation matrix displays bivariate scatter plots of the adjacent variables below the diagonal, histograms of the data distribution of the respective variables on the diagonal, and the Pearson correlation above the diagonal. Ellipses specify the direction of the correlation. The 2 panels showing information about the same pair of variables are always placed symmetrically about the diagonal. (A) This is the plot of the original variables. (B) This is the plot of the synthetic variables.
The box plots display the variation of the Hellinger distance and the Pearson correlation between the original variables and the sequential decision trees’ synthetic variables. (A) This is the box plot of the absolute differences in bivariate correlations between the real and synthetic data. Smaller values indicate that the bivariate relationships in the data have been greatly preserved during the generation of synthetic data. (B) This is the box plot of the Hellinger distance for all variables between the original and synthetic data sets. This shows the similarity of the univariate distributions between the real and synthetic data. This is a value between 0 and 1, with lower values indicating similarity between the univariate distributions of the real and synthetic variables.
A summary of sequential decision trees’ synthetic variables.
Metric  Variables  

Creatinine  Potassium  Sodium  Hematocrit 
Minimum^{a}  0  2.16  108.3  8.97 
Median (IQR)  1.22 (0.76-1.88)  4.18 (3.71-4.72)  138.4 (134.4-142.2)  31.44 (27.37-35.87) 
Mean (SD)  1.67 (1.6)  4.26 (0.61)  138.4 (5.6)  31.56 (5.83) 
Maximum^{b}  16.42  7.94  163.1  53.04 
^{a}Minimum: minimum of data.
^{b}Maximum: maximum of data.
The following is a trial on the continuous data derived from the MIMIC-III data set without imputation. As mentioned earlier, the data set consists of 226 patients, 4 laboratory tests, and 36 clinical visits, with 21% of the observations missing. We have previously described the approach for synthesizing this sort of data. In this experiment, we attempted to sample the patient factor matrix using sequential trees and generate the same sample size of 226 patients as in the original data. We performed GCP factorization with Gaussian loss and
The structure of the synthetic and original data sets in different modes is similar, as shown in section S9 in
According to the summary presented in
The plots show the correlation and distribution of sequential decision trees' synthetic variables as well as the original variables. A correlation matrix displays bivariate scatter plots of the adjacent variables below the diagonal, histograms of the data distribution of the respective variables on the diagonal, and the Pearson correlation above the diagonal. Ellipses specify the direction of the correlation. The 2 panels showing information about the same pair of variables are always placed symmetrically about the diagonal. (A) This is the plot of the original variables. (B) This is the plot of the synthetic variables.
The box plots display the variation of the Hellinger distance and the Pearson correlation between the original variables and the sequential decision trees’ synthetic variables. (A) This is the box plot of the absolute differences in bivariate correlations between the real and synthetic data. Smaller values indicate that the bivariate relationships in the data have been greatly preserved during the generation of synthetic data. (B) This is the box plot of the Hellinger distance for all variables between the original and synthetic data sets. This shows the similarity of the univariate distributions between the real and synthetic data. This is a value between 0 and 1, with lower values indicating similarity between the univariate distributions of the real and synthetic variables.
A summary of the sequential decision trees’ synthetic variables.
Metric  Variables  

Creatinine  Potassium  Sodium  Hematocrit 
Minimum^{a}  0  0.71  117.5  7.07 
Median (IQR)  1.3 (0.85-1.87)  4.12 (3.64-4.62)  138.6 (135.3-142)  29.64 (26.58-32.77) 
Mean (SD)  1.47 (0.97)  4.13 (0.76)  138.6 (5.02)  29.82 (4.8) 
Maximum^{b}  7.97  7.94  156.8  57.9 
#NA’s^{c}  3893  3636  3733  3661 
^{a}Minimum: minimum of data.
^{b}Maximum: maximum of data.
^{c}#NA’s: the number of missing values.
A summary of the original variables.
Metric  Variables  

Creatinine  Potassium  Sodium  Hematocrit 
Minimum^{a}  0  2.1  103  2 
Median (IQR)  1 (0.6-1.6)  4 (3.7-4.4)  139 (135.8-142)  29.6 (26.9-32.5) 
Mean (SD)  1.52 (1.67)  4.1 (0.58)  138.7 (5.67)  29.86 (4.62) 
Maximum^{b}  16.2  10  170  52.6 
#NA’s^{c}  2050  1532  1752  1515 
^{a}Minimum: minimum of data.
^{b}Maximum: maximum of data.
^{c}#NA’s: the number of missing values.
To create irregularity in clinical visits, a subset of the continuous data was created by choosing the first 10 observations. Those outcomes are presented in this section, and we have previously elaborated on the method used for addressing this particular scenario. The GCP decomposition was performed with Gaussian loss and
Upon analyzing the results in
When analyzing
The plots show the correlation and distribution of variables generated by sequential trees and the original ones. A correlation matrix displays bivariate scatter plots of the adjacent variables below the diagonal, histograms of the data distribution of the respective variables on the diagonal, and the Kendall correlation above the diagonal. Ellipses specify the direction of the correlation. The 2 panels showing information about the same pair of variables are always placed symmetrically about the diagonal. (A) This is the plot of the original variables. (B) This is the plot of the synthetic variables.
The box plots display the variation of the Hellinger distance and the Kendall correlation between the original variables and the sequential decision trees’ synthetic variables. (A) This is the box plot of the absolute differences in bivariate correlations between the real and synthetic data. Smaller values indicate that the bivariate relationships in the data have been greatly preserved during the generation of synthetic data. (B) This is the box plot of the Hellinger distance for all variables between the original and synthetic data sets. This shows the similarity of the univariate distributions between the real and synthetic data. This is a value between 0 and 1, with lower values indicating similarity between the univariate distributions of the real and synthetic variables.
A summary of the sequential decision trees’ synthetic variables.
Metric  Variables  

Creatinine  Potassium  Sodium  Hematocrit 
Minimum^{a}  0  0.87  100.3  10.62 
Median (IQR)  1.31 (0.76-2.82)  4.31 (3.73-4.88)  139.8 (134.9-144.6)  31.43 (27.02-35.21) 
Mean (SD)  1.9 (1.18)  4.34 (0.56)  139.5 (5.11)  31.36 (5.7) 
Maximum^{b}  9.35  8.59  178.9  58.15 
^{a}Minimum: minimum of data.
^{b}Maximum: maximum of data.
A summary of the original variables.
Metric  Variables  

Creatinine  Potassium  Sodium  Hematocrit 
Minimum^{a}  0.2  2.6  111.2  9.2 
Median (IQR)  1.07 (0.76-1.7)  4.15 (3.8-4.5)  139 (135.6-142)  31 (27.83-35.1) 
Mean (SD)  1.6 (1.64)  4.21 (0.62)  139 (6.53)  31.61 (5.82) 
Maximum^{b}  13.6  7.9  170  52.6 
^{a}Minimum: minimum of data.
^{b}Maximum: maximum of data.
The categorical data contain 2 variables: “admission type” and “admission location.” The GCP decomposition was implemented using 2 different loss functions: the Poisson log link, the results of which we discuss in section S10 in
The structure of the generated data in different modes is included in section S11 in
The plots show the Kendall correlation and distribution of the Hamiltonian Monte Carlo’s synthetic variables as well as the original variables. A correlation matrix displays bivariate scatter plots of the adjacent variables below the diagonal, a bar chart of the data distribution of the respective variables on the diagonal, and the Kendall correlation above the diagonal. Ellipses specify the direction of the correlation. (A) This is the plot of the original variables. (B) This is the plot of the synthetic variables.
The box plot shows the variation of Hellinger distance for all variables between the original and the Hamiltonian Monte Carlo synthetic data sets. This shows the similarity of the univariate distributions between the real and synthetic data. This is a value between 0 and 1, with lower values indicating similarity between the univariate distributions of the real and synthetic variables.
Our objective was to develop and validate a generative model that produces synthetic longitudinal health data. We constructed a model by using a GCP tensor decomposition and sampling from its latent factor matrix, which contains factors related to patients.
We applied the GCP decomposition because tensor decompositions offer interpretability and flexibility in handling high-dimensional data, including massive and heterogeneous EHR data sets. However, the most sensible and acceptable privacy concepts were undermined because of the one-to-one mapping and direct correspondence between the entries of the GCP model and the entries of the original data. Thus, by simulating and modeling the latent factor matrix of GCP decomposition associated with patients, we could address privacy concerns.
We proposed 3 methods for synthesizing and simulating the patient’s factor matrix: sequential trees, Gaussian copula, and HMC. These techniques appear to be the best options for data synthesis and simulation, particularly when working with complex and small data sets, such as the patient factor matrix in our model.
The model was validated through several experiments conducted on various data structures. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. The assessments involved evaluating the structure and general patterns present in the data, such as the dependency structure, the descriptive statistics, and the marginal distributions.
In this study, we were not able to use a very large data set. In addition, we could not investigate further simulation and sampling techniques for the patient factor matrix because of time constraints. We focused on longitudinal health data in our model; however, there are other types of longitudinal data, such as transactions in financial data sets, that occur over time.
Therefore, a future study could apply this model to a larger data set and explore other techniques for synthesizing the patient factor matrix, such as GANs and recurrent neural network models. Another avenue for future work is a more rigorous comparison between the original and synthetic data sets to evaluate both the generative model and the best-performing sampling approach. Future work could also attempt to improve the results of the HMC sampling approach by defining more appropriate distributions. We believe that our model could perform well on various types of longitudinal data sets; because the current model was developed and validated using health data, it would be valuable to assess its effectiveness and feasibility on other types of longitudinal data, such as financial data.
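For reference, the core of an HMC sampler is compact, and the place where a "more appropriate distribution" would plug in is explicit: the `logp`/`grad_logp` pair. The toy sketch below is ours, not the study's code, and targets a standard bivariate normal as a stand-in for a factor-row distribution:

```python
import numpy as np

def hmc(logp, grad_logp, x0, n_samples, step=0.2, n_leapfrog=10, seed=0):
    """Minimal Hamiltonian Monte Carlo: leapfrog integration of the
    Hamiltonian dynamics plus a Metropolis accept/reject correction."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.normal(size=x.shape)          # resample momentum each step
        x_new, p_new = x.copy(), p.copy()
        # Leapfrog integration: half step in momentum, alternating full
        # steps, closing with another half momentum step.
        p_new += 0.5 * step * grad_logp(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += step * p_new
            p_new += step * grad_logp(x_new)
        x_new += step * p_new
        p_new += 0.5 * step * grad_logp(x_new)
        # Metropolis correction for discretization error.
        h_old = -logp(x) + 0.5 * (p @ p)
        h_new = -logp(x_new) + 0.5 * (p_new @ p_new)
        if rng.random() < np.exp(h_old - h_new):
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

# Toy target: standard bivariate normal (a stand-in for the distribution
# of a patient factor row; a real use would supply a fitted density here).
logp = lambda x: -0.5 * (x @ x)
grad_logp = lambda x: -x
draws = hmc(logp, grad_logp, np.zeros(2), n_samples=3000)
```

Swapping in a better-matched `logp` for the latent factors, as suggested above, changes only these two callables, not the sampler itself.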
There is an increasing demand to access EHRs for secondary analysis. Data synthesis is one method that can meet this demand while satisfying privacy concerns. The objective of this study was to develop and validate a generative model for producing synthetic longitudinal health data. This was achieved using GCP tensor decomposition and sampling its latent factor matrix, which contains patient factors. All the simulation methods used in the generative model provided the same high level of performance in certain experiments. However, the sequential decision trees performed better when the data were standardized and the Gaussian loss was used in the GCP decomposition, whereas the copula was preferred when applied to a non-Gaussian latent space. Our approach could also solve the problem of sampling patients from EHRs, meaning that we could simulate different numbers of patients in the synthetic data set. On the basis of our findings, we recommend standardizing EHR data and decomposing it with the Gaussian loss to ensure that the synthetic data are a faithful reflection of the original data set.
We successfully addressed the challenge of synthesizing massive longitudinal health data by instead synthesizing a significantly smaller non-longitudinal data set. It is therefore encouraging that our generative model could be applied to produce valuable synthetic data in various fields and areas of research.
Tensor decompositions have drawn growing attention because of their interpretability and flexibility with high-dimensional and heterogeneous data sets, and they can easily be privatized. The GCP decomposition is a widely used tensor decomposition technique that is well suited to large-scale and heterogeneous data sets, with applications beyond health data analysis, such as predicting financial markets. Generating and using synthetic data offers significant benefits to banks and financial institutions: they gain the ability to analyze and test data without cybersecurity or privacy concerns, which is crucial and saves a tremendous amount of time.
Therefore, we believe that our model can be applied efficiently to various types of longitudinal data, including the generation of synthetic longitudinal financial data, such as synthetic transactions.
Detailed analysis results, loss functions, and algorithm descriptions.
CDF: cumulative distribution function
CP: canonical polyadic
EHR: electronic health record
GAN: generative adversarial network
GCP: generalized canonical polyadic
HMC: Hamiltonian Monte Carlo
MCMC: Markov chain Monte Carlo
root mean square difference between the actual and synthetic correlations
AS is supported by the NSERC Discovery Grant Program.
KEE is the cofounder and SVP of Replica Analytics Ltd and has equity in this company. LM is a data scientist employed by Replica Analytics Ltd.