Original Paper
Fraunhofer Institute for Software and Systems Engineering, Dortmund, Germany
Corresponding Author:
Simon Scheider, PhD
Fraunhofer Institute for Software and Systems Engineering
Speicherstraße 6
Dortmund, 44147
Germany
Phone: 49 231976774
Email: simon.scheider@isst.fraunhofer.de
Abstract
Background: In the European health care industry, recent years have seen increasing investments in data ecosystems to “FAIRify” and capitalize on the ever-rising amount of health data. Within such networks, health metadata catalogs (HMDCs) assume a key function as they enable data allocation, sharing, and use practices. By design, HMDCs orchestrate health information for the purpose of findability, accessibility, interoperability, and reusability (FAIR). However, despite various European initiatives pushing health care data ecosystems forward, actionable design knowledge about HMDCs is scarce. This impedes both their effective development in practice and their scientific exploration, leaving much of the innovation potential of health data unused.
Objective: This study aims to explore the structural design elements of HMDCs, classifying them along empirically reasonable dimensions and characteristics. In doing so, it facilitates the development of HMDCs in practice while closing a crucial gap in theory (ie, in the literature on actionable HMDC design knowledge).
Methods: We applied a rigorous methodology for taxonomy building following well-known and established guidelines from the domain of information systems. Within this methodological framework, inductive and deductive research methods were applied to iteratively design and evaluate the evolving set of HMDC dimensions and characteristics. Specifically, a systematic literature review was conducted to identify and analyze 38 articles, while a multicase study was conducted to examine 17 HMDCs from practice. These findings were evaluated and refined in 2 extensive focus group sessions by 7 interdisciplinary experts with deep knowledge about HMDCs.
Results: The artifact generated by the study is an iteratively conceptualized and empirically grounded taxonomy with elaborate explanations. It proposes 20 dimensions encompassing 101 characteristics alongside which FAIR HMDCs can be structured and classified. The taxonomy describes basic design characteristics that need to be considered to implement FAIR HMDCs effectively. A major finding was that a particular focus in developing HMDCs is on the design of their published dataset offerings (ie, their metadata assets) as well as on data security and governance. The taxonomy is evaluated against the background of 4 use cases, which were cocreated with experts. These illustrative scenarios add depth and context to the taxonomy as they underline its relevance and applicability in real-world settings.
Conclusions: The findings contribute fundamental, yet actionable, design knowledge for building HMDCs in European health care data ecosystems. They provide guidance for health care practitioners, while allowing both scientists and policy makers to navigate through this evolving research field and anchor their work. Therefore, this study closes the research gap outlined earlier, which has prevailed in theory and practice.
doi:10.2196/63396
Introduction
Challenges of Health Care Systems
In the 21st century, health care systems worldwide are experiencing a tremendous increase in data, driven by advances in medical technology, digital health records, and wearable devices [1]. This flood of data holds immense potential for data-driven health innovations, building upon large-scale real-world data (RWD) and real-world evidence (RWE) [2]. However, health care systems face multiple challenges that hinder the effective use of RWD to generate RWE and thus data-driven health innovations. One primary issue is the integration of heterogeneous datasets [3]. RWD stem from diverse sources, such as electronic health records, imaging, and genomic data, frequently exhibiting incompatible or unknown data formats, which complicates harmonization, particularly across different entities [4,5]. Furthermore, finding and accessing suitable RWD represents a hurdle for medical research due to their origins from disparate patient populations, health care systems, and data collection methodologies [6]. This impairs the effective discovery of and access to a sufficient number of both available and adequate datasets. Moreover, even if enough RWD are discovered and access is established, another challenge lies in ensuring the scientific rigor and reproducibility of generated RWE (ie, medical studies), which becomes increasingly difficult in today’s data-intensive health research [7]. Medical studies constantly require larger, high-quality datasets to generate meaningful and reproducible RWE [5,6,8]. However, unknown RWD management practices threaten study reliability, while making the validation of results (ie, RWE) across studies difficult [2,7]. Besides that, the diversity of national health care systems increases the prevailing differences in health care data infrastructures across countries, which, in turn, lead to additional barriers for organizations to share and use RWD at a large scale [9]. Moreover, health care systems must navigate complex legal requirements with regard to sharing and processing RWD [8,9]. For instance, the European jurisdiction mandates strict data governance and security standards that, while essential, impair data-driven health innovations in the absence of adequate data-sharing infrastructures [9]. As a result, RWD are fragmented and isolated within single organizations, whereby data sharing and use are limited. Because of all these challenges, the rapidly increasing amount of RWD cannot be harnessed to its full potential for producing health care innovations (ie, RWE).
Metadata Catalogs as a Promising Solution
Against this background, data ecosystems as technical and organizational infrastructures within the health care sector represent auspicious solutions, that is, health care data ecosystems (elaborated in, eg, the studies by Lovestone and EMIF Consortium [10], Manogaran et al [11], and Sharon and Lucivero [12]). These evolving networks enable legally compliant use of RWD [13]. Their key function lies in market mechanisms instantiated by metadata catalogs [5,14] that define and describe the intricate web of RWD circulating between a potentially arbitrary number of actors in the ecosystem [15]. Hence, such health metadata catalogs (HMDCs) are crucial components of modern health care data ecosystems, for example, EHDEN (European Health Data and Evidence Network), EHDS2 (European Health Data Space 2), Elixir, EUCAIM (European Federation for Cancer Images), IDERHA (Integration of Heterogeneous Data and Evidence towards Regulatory and Health Technology Assessment Acceptance), and Gaia-X (see Multimedia Appendix 1, “Leading European initiatives toward health care data ecosystems”).
Since HMDCs provide an effective method for systematically sharing and using RWD within data ecosystems [3], they potentially allow harnessing crucial benefits corresponding to the challenges outlined earlier. First, HMDCs facilitate integrating heterogeneous datasets across health care systems. They help to transcend diverse data types, which eases the integration, standardization, and harmonization of data within data ecosystems [3,5]. This is essential for medical research, which requires huge pools of accessible RWD appropriate for its investigations [7]. Second, as finding and accessing adequate RWD effectively is vital for medical research [6], HMDCs entail added value by offering a governed data search and access framework embedded into the technical infrastructures of the underlying ecosystems [17]. They provide a tool for data discovery to precisely characterize, locate, and filter RWD on the basis of a myriad of factors [5,17]. Third, HMDCs support transparent and reproducible research processes by helping scientists replicate studies and validate their results [18]. Such transparency is fundamental for building trust in the reliability and validity of RWE [2]. Finally, HMDCs facilitate data-intensive research in general, as they establish unified health care data infrastructures for allocating, accessing, and using RWD of connected data providers [19]. In doing so, they reduce barriers for organizations to integrate their otherwise isolated RWD within data ecosystems. At the same time, HMDCs bridge prevailing differences between national health care systems while data providers retain full control over their data [5,19]. To this end, they establish robust data security and governance frameworks that are aligned to the applicable jurisdictions [5,15].
As a result, HMDCs represent an auspicious medium against the fragmentation and isolation of RWD [5]. However, since HMDCs are novel constructs, typically in early phases of development [20], their ascribed benefits are primarily backed by the literature rather than evidence from practice. Nevertheless, HMDCs are likely to provide means for using RWD systematically within and across health care data ecosystems, potentially resulting in more efficient RWE generation.
In Europe, HMDCs are of particular importance because the EU health care sector exhibits a broad diversity across member states, all with their own health care systems and policies. Consequently, there is a need to focus heavily on standardizing and harmonizing both data and metadata across different countries to facilitate data sharing and legally compliant data use [21]. More specifically, the diversity of national health care systems [22], the restrictiveness of data protection regulations [8,9], and the fragmentation and isolation of health data [23] make operative health care data ecosystems and HMDCs a paramount concern for the European health care industry. Therefore, this study adopts a European focus.
Theoretical Background
Originally, data catalogs are organized collections of datasets that provide descriptive information within an organization [24,25]. They act as centralized repositories, making it easier for data consumers to discover, understand, and access the information they need [26]. Enterprise data management platforms often comprise such centralized data catalogs, implying storage of data within their boundaries [25,27]. If data are not encapsulated within the organization but integrated into decentralized or federated networks [28], the literature commonly refers to such environments as data ecosystems with metadata catalogs as a key function [5,14]. This study considers metadata catalogs as decentralized or federated constructs that are mutually exclusive with centralized ones. For simplification, only the term decentralized is used.
Metadata describe dataset attributes, such as source, format, structure, provenance, owner, access, or governance modalities [29]. Metadata catalogs act as “catalogues of data catalogues,” dedicated to enhancing discoverability, usability, and management of distributed datasets [30]. Within data ecosystems, metadata catalogs are a mechanism that provides a standardized way for recording, disclosing, and making available information about all relevant kinds of phenotypes describing datasets, while ensuring legally compliant access and sharing practices [3,14]. If these datasets are health data, such constructs are henceforth referred to as HMDCs. Consequently, HMDCs manage heterogeneous health information integrated into health care data ecosystems [5]. They ensure that these diverse and highly sensitive datasets are effectively organized and understood [4], while facilitating their systematic use [3,5,14]. This requires dedicated, as yet unknown, design elements, which this study sets out to unveil.
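To make the notions of a metadata record and a decentralized catalog more tangible, the following minimal Python sketch may help. It is an illustration under stated assumptions and not a design taken from this study: the class names `DatasetMetadata`, `CatalogNode`, and `FederatedCatalog`, the field names (which loosely mirror the attributes listed above), and all example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetMetadata:
    """Describes a dataset; the health data themselves stay with the provider."""
    identifier: str
    source: str
    data_format: str
    provenance: str
    owner: str
    access: str  # hypothetical values, e.g. "open" or "on request"
    keywords: tuple = ()


class CatalogNode:
    """A provider-side catalog that holds metadata records only."""

    def __init__(self, name):
        self.name = name
        self._records = []

    def publish(self, record):
        self._records.append(record)

    def search(self, keyword):
        return [r for r in self._records if keyword in r.keywords]


class FederatedCatalog:
    """A "catalogue of data catalogues": queries every node and merges the hits."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def search(self, keyword):
        return [(node.name, record)
                for node in self.nodes
                for record in node.search(keyword)]


# Hypothetical example data, invented purely for illustration.
hospital = CatalogNode("hospital-a")
hospital.publish(DatasetMetadata(
    identifier="ds-001", source="electronic health records", data_format="CSV",
    provenance="routine care", owner="Hospital A", access="on request",
    keywords=("cardiology", "ehr")))

registry = CatalogNode("registry-b")
registry.publish(DatasetMetadata(
    identifier="ds-002", source="imaging", data_format="DICOM",
    provenance="screening programme", owner="Registry B", access="on request",
    keywords=("cardiology", "imaging")))

catalog = FederatedCatalog([hospital, registry])
hits = catalog.search("cardiology")
for node_name, record in hits:
    print(node_name, record.identifier, record.access)
```

In this sketch, only descriptive records travel to the catalog, while each node keeps its own data. This is one simple way to read the idea that data providers remain in control. Real HMDCs would additionally need authentication, governance rules, and standardized metadata schemas.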
Research Gap, Objective, and Questions
Having clarified the added value of HMDCs, we demarcate the research gap by reviewing related work. From this gap, the research problem is identified, which leads to the research objectives. These objectives, in turn, allow the research questions to be derived, which are required to define a meaningful research methodology.
To begin with, Labadie et al [24] foster the understanding of data catalogs by classifying corresponding initiatives. The authors propose a taxonomy for data catalogs and present 3 case studies. However, similar to Ehrlinger et al [26] and Jahnke and Otto [25], Labadie et al [24] focus on intraorganizational data sharing using centralized catalogs. Moreover, they neglect health use cases. Remy et al [3] conducted a design science study to build an integrated catalog for health research metadata. The artifact enables medical scientists to analyze phenomena that require a view across several domains. The authors are among the first to provide design knowledge usable in HMDC contexts. However, similar to the previously presented literature sources, Remy et al [3] accentuate centralized catalogs. Almeida et al [19] present a platform that provides a set of tools, compliant with the findability, accessibility, interoperability, and reusability (FAIR) principles, to help data holders share biomedical databases while allowing data consumers to discover and apply for them. However, the authors only consider a narrow use case instead of generating universally applicable design knowledge. Similarly, Oliveira et al [15] developed a holistic, stakeholder-agnostic catalog framework for biomedical datasets. Researchers can explore metadata held in a decentralized manner at federated nodes, with distinct levels of granularity being conceivable. Extending this initial design knowledge specific to biomedical data, Swertz et al [5] proposed a unified framework for sharing health data across catalogs. It encompasses multiple centralized and decentralized catalogs. The authors offer recommendations to establish an integrated community as an open catalog ecosystem. This theoretical basis for HMDCs builds upon and is enriched by similar research. Specifically, Bergeron et al [17] developed a catalog toolkit to support creating comprehensive as well as user- and study-friendly HMDCs. Almeida and Oliveira [30] produced a framework to simplify the process of building an HMDC for exposing metadata, while providing analysis capabilities. Evidently, there is a trend from centralized to decentralized data catalogs in health care. However, for HMDCs, a research gap prevails concerning (1) empirically grounded and actionable design knowledge that is (2) universally applicable to (3) the broad array of use cases and EU initiatives associated with health care data ecosystems.
In general, the generation of design knowledge about an artifact is crucial as it provides the intellectual foundation to advance the respective body of scientific knowledge, while facilitating development efforts in practice [31]. In particular, HMDC design knowledge can harmonize and sustain the multitude of different EU initiatives by following a systematic approach to problem-solving [31,32]. Therefore, its generation must adhere to a rigorous design process [33,34]. This process must ensure empirical grounding for the sake of efficiency, effectiveness, and quality assurance, which inevitably favors quality, adaptiveness, and impact of the generated results [35]. Conversely, the prevailing lack of design knowledge causes difficulties concerning the adoption and use of HMDCs in practice and theory, revealing the research problem. 
To remedy this problem, the research objective is to provide actionable design knowledge that is universally applicable in real-world HMDC use cases, which leads to the following research questions (RQs):
- RQ1: What are taxonomy elements (ie, dimensions and characteristics) to structure HMDCs from a design science perspective?
- RQ2: How does the proposed taxonomy affect real-world use cases?
According to Hevner et al [Hevner AR, March ST, Park J, Ram S. Design science in information systems research. MIS Q. 2004;28(1):75-105. [CrossRef]33], a design science perspective means to examine and create information system (IS) artifacts to solve practical problems. A taxonomy is a suitable approach to address RQ1 because it provides a set of elementary building blocks and prescriptions for effectively designing such artifacts [Kundisch D, Muntermann J, Oberländer AM, Rau D, Röglinger M, Schoormann T, et al. An update for taxonomy designers. Bus Inf Syst Eng. 2021;64(4):421-439. [CrossRef]36,Nickerson RC, Varshney U, Muntermann J. A method for taxonomy development and its application in information systems. Eur J Inf Syst. 2017;22(3):336-359. [CrossRef]37]. It targets a broad and diverse audience, including health care IS engineers and architects, health data holders and scientists, health care economists and researchers, as well as legal and ethical regulatory bodies, while accentuating the European health care sector.
Methods
Overview
Taxonomies are common approaches in IS research to classify, understand, and examine complex issues [Gregor S. The nature of theory in information systems. MIS Q. 2006;30(3):611-642. [CrossRef]38]. For their development, the method of Nickerson et al [Nickerson RC, Varshney U, Muntermann J. A method for taxonomy development and its application in information systems. Eur J Inf Syst. 2017;22(3):336-359. [CrossRef]37] is applied to identify dimensions and characteristics of HMDCs. The authors propose generating knowledge conceptually (eg, from the literature) and empirically (eg, analyzing objects of interest). This approach is referred to as the gold standard to build taxonomies in IS research [Kundisch D, Muntermann J, Oberländer AM, Rau D, Röglinger M, Schoormann T, et al. An update for taxonomy designers. Bus Inf Syst Eng. 2021;64(4):421-439. [CrossRef]36]. As refinement, the methodological update of Kundisch et al [Kundisch D, Muntermann J, Oberländer AM, Rau D, Röglinger M, Schoormann T, et al. An update for taxonomy designers. Bus Inf Syst Eng. 2021;64(4):421-439. [CrossRef]36] is incorporated, adding an evaluation process by means of focus groups. The authors’ refinement enhances the assessment of value created by the taxonomy [Szopinski D, Schoormann T, Kundisch D. Because your taxonomy is worth IT: towards a framework for taxonomy evaluation. In: Proceedings of the 27th European Conference on Information Systems. 2019. Presented at: ECIS '19; June 8-14, 2019:25-44; Stockholm-Uppsala, Sweden. URL: https://www.researchgate.net/publication/332711034_Because_your_taxonomy_is_worth_it_Towards_a_framework_for_taxonomy_evaluation39]. Corresponding to these 2 methods, the research design is divided into the 7 steps shown in Figure 1 based on the studies by Nickerson et al [Nickerson RC, Varshney U, Muntermann J. A method for taxonomy development and its application in information systems. Eur J Inf Syst. 2017;22(3):336-359. [CrossRef]37] and Kundisch et al [Kundisch D, Muntermann J, Oberländer AM, Rau D, Röglinger M, Schoormann T, et al. An update for taxonomy designers. Bus Inf Syst Eng. 2021;64(4):421-439. [CrossRef]36]. The numbers 1 to 7 represent methodological steps explained in the following sections.

In step 1 in Figure 1, a meta-characteristic is specified in orientation towards the taxonomy’s purpose so that each subordinated characteristic and dimension follows from it. On the basis of RQ1, the meta-characteristic was defined as “distinguishing key design elements of HMDCs.” It facilitates selecting meta-dimensions as well as inferring characteristics and classifying them to dimensions. To define the meta-dimensions, the FAIR framework was used [Kush RD, Warzel D, Kush MA, Sherman A, Navarro EA, Fitzmartin R, et al. FAIR data sharing: the roles of common data elements and harmonization. J Biomed Inform. 2020;107:103421. [FREE Full text] [CrossRef] [Medline]7]. These well-known data principles postulate an accepted approach to the discoverability and usability of RWD [Almeida JR, Silva JM, Oliveira JL. A FAIR approach to real-world health data management and analysis. In: Proceedings of the 36th International Symposium on Computer-Based Medical Systems. 2023. Presented at: CBMS '23; June 22-24, 2023:892-897; L'Aquila, Italy. URL: https://ieeexplore.ieee.org/document/10178764 [CrossRef]19]. While FAIR emphasizes making data interoperable and reusable, it inherently involves considerations related to data governance and harmonization [Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):1-9. [FREE Full text] [CrossRef] [Medline]40].
In step 2 in Figure 1, ending conditions for the iterative part of the process are defined, determining its termination criteria. The ending conditions were chosen on the basis of Nickerson et al [Nickerson RC, Varshney U, Muntermann J. A method for taxonomy development and its application in information systems. Eur J Inf Syst. 2017;22(3):336-359. [CrossRef]37] and Scheider et al [Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41] in terms of subjective and objective criteria. Ultimately, 6 design iterations were required until all conditions listed in Table 1 were fulfilled.
Ending conditions | Design iterations |
1 | 2 | 3 | 4 | 5 | 6 |
Objective |
All papers were examined. | ✓ | ✓ | ✓ | ✓ |
No object was merged with another or split. | ✓ | ✓ | ✓ |
Each characteristic is classified by one object. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
No new dimensions or characteristics were added. | ✓ | ✓ | ✓ |
Dimensions or characteristics were neither merged nor split. | ✓ | ✓ | ✓ |
Each dimension is unique and not duplicated. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Every characteristic is unique within its dimension. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Each cell is unique and not repeated. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Subjective |
Conciseness: no unnecessary dimensions and characteristics | ✓ | ✓ | ✓ | ✓ |
Robust: dimensions and characteristics differentiate objects | ✓ | ✓ | ✓ | ✓ |
Comprehensiveness: all objects can be classified | ✓ | ✓ | ✓ | ✓ |
Extension: dimensions and characteristics can be added easily | ✓ | ✓ | ✓ | ✓ | ✓ |
Explanatory: dimensions and characteristics describe all objects | ✓ | ✓ | ✓ |
In steps 3 to 5 in Figure 1, we repeatedly chose between an inductive and a deductive path. The former is a conceptual-to-empirical approach (step 4a in Figure 1) to infer dimensions and characteristics from theory. The latter reflects an empirical-to-conceptual procedure (E2C; step 4b in Figure 1) to derive characteristics from real-world analysis objects and to classify them in dimensions. After each iteration (ie, steps 3 to 5 in Figure 1), the ending conditions are checked (ie, step 5 in Figure 1). If all ending conditions are fulfilled, an evaluation step (ie, step 6 in Figure 1) follows, conducted by means of focus groups [Kundisch D, Muntermann J, Oberländer AM, Rau D, Röglinger M, Schoormann T, et al. An update for taxonomy designers. Bus Inf Syst Eng. 2021;64(4):421-439. [CrossRef]36]. If the focus group iteration does not imply changes to the taxonomy (ie, step 7 in Figure 1), the artifact is finished and the methodological process terminates. After 5 design iterations (ie, executing step 4a or 4b 4 times and step 6 once in Figure 1), all ending conditions were fulfilled (ie, step 5 in Figure 1) and the subsequent focus group did not result in any major changes (ie, step 7 in Figure 1). Thus, the taxonomy was completed [Kundisch D, Muntermann J, Oberländer AM, Rau D, Röglinger M, Schoormann T, et al. An update for taxonomy designers. Bus Inf Syst Eng. 2021;64(4):421-439. [CrossRef]36]. Because 6 design iterations were traversed and it was ensured that the focus group experts covered all dimensions relevant for HMDCs, the taxonomy achieved result saturation.
To ensure transparency of the taxonomy development process, the intermediary results of the iterative research process and the key references are provided in Multimedia Appendix 2.
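The control flow of steps 3 to 7 can be illustrated with a minimal Python sketch. It is not the authors' tooling: the names `Taxonomy`, `develop_taxonomy`, `characteristics_unique`, and `nothing_changed`, the label "C2E" for the conceptual-to-empirical path, and the toy revisions are all assumptions made for illustration, and only 2 of the ending conditions from Table 1 are modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Taxonomy:
    # Maps each dimension name to its list of characteristics.
    dimensions: Dict[str, List[str]] = field(default_factory=dict)


def characteristics_unique(taxonomy, changed):
    """Ending condition: no characteristic is repeated within a dimension."""
    return all(len(c) == len(set(c)) for c in taxonomy.dimensions.values())


def nothing_changed(taxonomy, changed):
    """Ending condition: the last design iteration changed nothing."""
    return not changed


def develop_taxonomy(design_steps, evaluate, ending_conditions):
    taxonomy, trace = Taxonomy(), []
    steps = iter(design_steps)
    while True:
        # Steps 3 to 5: iterate until every ending condition holds.
        for path, revise in steps:
            changed = revise(taxonomy)
            trace.append(path)
            if all(cond(taxonomy, changed) for cond in ending_conditions):
                break
        else:
            raise RuntimeError("design steps exhausted before ending conditions held")
        # Step 6: focus group evaluation; step 7: stop if no changes result.
        trace.append("evaluation")
        if not evaluate(taxonomy):
            return taxonomy, trace


# Hypothetical revisions; each returns True if it changed the taxonomy.
def add_data_source(t):
    t.dimensions["data source"] = ["patients' health"]
    return True


def add_access_control(t):
    t.dimensions["access control"] = ["catalog operator"]
    return True


def confirm(t):
    return False


# Hypothetical focus group feedback: first session adds one characteristic.
feedback = iter([["internal DAC"], []])


def focus_group(t):
    additions = next(feedback)
    t.dimensions["access control"].extend(additions)
    return bool(additions)


taxonomy, trace = develop_taxonomy(
    [("C2E", add_data_source), ("C2E", add_access_control),
     ("E2C", confirm), ("E2C", confirm)],
    focus_group,
    [characteristics_unique, nothing_changed],
)
print(trace)
```

In this toy run, the trace contains 6 entries (2 conceptual-to-empirical iterations, 1 E2C iteration, a focus group that triggers changes, another E2C iteration, and a final focus group without changes), which mirrors the sequence of iterations described above.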
Inductive Design Iterations for Taxonomy Development
In the first iteration, an initial set of dimensions and characteristics was derived from former research (ie, step 4a in Figure 1), consolidating the related work addressed in the Introduction section.
In the second iteration, a structured literature review (SLR) was carried out (ie, step 4a in Figure 1) [Kundisch D, Muntermann J, Oberländer AM, Rau D, Röglinger M, Schoormann T, et al. An update for taxonomy designers. Bus Inf Syst Eng. 2021;64(4):421-439. [CrossRef]36]. The method of Kitchenham et al [Kitchenham B, Pearl Brereton O, Budgen D, Turner M, Bailey J, Linkman S. Systematic literature reviews in software engineering – a systematic literature review. Inf Softw Technol. 2009;51(1):7-15. [CrossRef]79] was applied (ie, steps 1 to 6), following its application in the study by Scheider et al [Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41]. At the outset, RQ1 was adopted as the (1) research question guiding the SLR. The (2) search process comprised HMDC-related conference and journal papers. The search string was defined as (ALL (health AND data AND catalog) OR ALL (health AND metadata AND catalog) AND ALL (data AND catalog AND technologies)). Primarily, Scopus and IEEE Xplore were used, and the search terms were applied to documents’ titles, abstracts, and authors’ keywords. The 2 databases were chosen due to their multidisciplinary nature covering research in all fields relevant for HMDCs. Following Scheider et al [Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41], (3) inclusion and exclusion criteria were created to identify and filter papers. First, literature not available in English was excluded. Second, inaccessible papers were removed. Third, each paper retrieved was reviewed by 2 researchers for whether it covers HMDCs in the broader sense. This means that papers had to address a design perspective, as defined by Hevner et al [Hevner AR, March ST, Park J, Ram S. Design science in information systems research. MIS Q. 2004;28(1):75-105. [CrossRef]33]. Articles dealing with “patient-related” data were prioritized, whereas those about aggregated health data (eg, regions and countries) were disregarded. The same holds true for catalogs about health-oriented surveys and analysis results (eg, studies). Due to broadly formulated keywords in the search string, the initially retrieved literature contained many papers outside the thematic scope. Therefore, the third inclusion or exclusion criterion was examined by screening titles and abstracts before reviewing the entire content of the papers. Since 2 researchers consistently worked together in (3), the paper selection can be considered reasonably objective.
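The inclusion and exclusion criteria of step (3) can be expressed as a small screening function. The following Python sketch is only an illustration under stated assumptions: the `Paper` fields, the rule that both reviewers must agree, the treatment of non-patient-related scopes as exclusions, and the sample records are hypothetical and not taken from the study.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Paper:
    title: str
    language: str                    # ISO code, eg, "en"
    accessible: bool                 # full text could be obtained
    design_votes: Tuple[bool, bool]  # verdicts of the 2 reviewers
    data_scope: str                  # "patient-related", "aggregated", or "survey"


def exclusion_reason(paper: Paper) -> Optional[str]:
    """Return why a paper is excluded, or None if it is included."""
    if paper.language != "en":
        return "not available in English"
    if not paper.accessible:
        return "inaccessible"
    # Assumption: a paper passes only if both reviewers see a design perspective.
    if not all(paper.design_votes):
        return "no HMDC design perspective"
    # Assumption: aggregated and survey-oriented catalogs are treated as excluded.
    if paper.data_scope != "patient-related":
        return "outside patient-related scope"
    return None


# Hypothetical candidate records for illustration only.
candidates = [
    Paper("Paper A", "en", True, (True, True), "patient-related"),
    Paper("Paper B", "de", True, (True, True), "patient-related"),
    Paper("Paper C", "en", False, (True, True), "patient-related"),
    Paper("Paper D", "en", True, (True, False), "patient-related"),
    Paper("Paper E", "en", True, (True, True), "aggregated"),
]

included = [p.title for p in candidates if exclusion_reason(p) is None]
print(included)
```

Returning a reason instead of a plain Boolean keeps the screening traceable, which is useful when exclusion counts have to be reported in a flowchart.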
Building upon the inclusion and exclusion criteria, the initial (4) data collection resulted in 18 papers in the Scopus and IEEE Xplore search (iteration 4 in Figure 2). Subsequently, backward (ie, referenced articles) and forward (ie, citing articles) stepping was conducted [Webster J, Watson RT. Analyzing the past to prepare for the future: writing a literature review. MIS Q. 2002;26(2):13-23. [FREE Full text]80], which added 12 articles.
Figure 2 shows the SLR statistics expressed by a PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flowchart.

The SLR was expanded by a search via Google Search and the AISeL database for extension and verification. The Google search engine served to identify whitepapers using a consolidated version of the search string from (2). For AISeL, the same steps executed in Scopus and IEEE Xplore were applied (ie, 2 and 3), except that the research team looked for the search terms in titles only to keep the number of results manageable. Once duplicates were removed, Google and AISeL added 8 papers to the literature collection. To test theoretical saturation [Mwita K. Factors influencing data saturation in qualitative studies. Int J Bus Soc. Jun 05, 2022;11(4):414-420. [CrossRef]81], “quick searches” were carried out in other databases (eg, ACM) checking whether the top results, first, match the inclusion and exclusion criteria and, second, are not already in the collection. Since these quick searches did not add new papers, the literature collection was considered representative [Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41]. Excluding related work of the first iteration, the collection comprised 38 publications, of which the most important items are listed in Table 2.
Study | Year | Title | |||
Top 10 academic papers | |||||
Alvarellos et al [Alvarellos M, Sheppard HE, Knarston I, Davison C, Raine N, Seeger T, et al. Democratizing clinical-genomic data: how federated platforms can promote benefits sharing in genomics. Front Genet. Jan 10, 2022;13:1045450. [FREE Full text] [CrossRef] [Medline]42] | 2023 | Democratizing clinical-genomic data: How federated platforms can promote benefits sharing in genomics | |||
Almeida and Oliveira [Almeida JR, Oliveira JL. MONTRA2: a web platform for profiling distributed databases in the health domain. Inform Med Unlocked. 2024;45:101447. [CrossRef]30] | 2024 | MONTRA2b: A web platform for profiling distributed databases in the health domain | |||
Almeida et al [Almeida JR, Silva JM, Oliveira JL. A FAIR approach to real-world health data management and analysis. In: Proceedings of the 36th International Symposium on Computer-Based Medical Systems. 2023. Presented at: CBMS '23; June 22-24, 2023:892-897; L'Aquila, Italy. URL: https://ieeexplore.ieee.org/document/10178764 [CrossRef]19] | 2023 | A FAIRc approach to real-world health data management and analysis | |||
Bergeron et al [Bergeron J, Doiron D, Marcon Y, Ferretti V, Fortier I. Fostering population-based cohort data discovery: the Maelstrom Research cataloguing toolkit. PLoS One. 2018;13(7):e0200926. [FREE Full text] [CrossRef] [Medline]17] | 2018 | Fostering population-based cohort data discovery: The Maelstrom Research cataloguing toolkit | |||
Scheider et al [Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41] | 2023 | Exploring design elements of personal data markets | |||
Ehrlinger et al [Ehrlinger L, Schrott J, Melichar M, Kirchmayr N, Wöß W. Data catalogs: a systematic literature review and guidelines to implementation. In: Proceedings of the 2021 International Conference on Database and Expert Systems Applications. 2021. Presented at: DEXA '21; September 27-30, 2021; Virtual Event. URL: https://link.springer.com/chapter/10.1007/978-3-030-87101-7_15 [CrossRef]26] | 2021 | Data catalogs: a systematic literature review and guidelines to implementation | |||
Labadie et al [Labadie C, Legner C, Eurich M, Fadler M. FAIR enough? Enhancing the usage of enterprise data with data catalogs. In: Proceedings of the 2020 IEEE 22nd Conference on Business Informatics. 2020. Presented at: CBI '20; June 22-24, 2020; Antwerp, Belgium. URL: https://ieeexplore.ieee.org/document/9140254 [CrossRef]24] | 2020 | Fair enough? enhancing the usage of enterprise data with data catalogs | |||
Lovestone and EMIF Consortium [Lovestone S, EMIF Consortium. The European medical information framework: a novel ecosystem for sharing healthcare data across Europe. Learn Health Syst. 2020;4(2):e10214. [FREE Full text] [CrossRef] [Medline]10] | 2020 | The European medical information framework: a novel ecosystem for sharing health care data across Europe | |||
Oliveira et al [Oliveira JL, Trifan A, Bastião Silva LA. EMIF catalogue: a collaborative platform for sharing and reusing biomedical data. Int J Med Inform. 2019;126:35-45. [CrossRef] [Medline]15] | 2019 | EMIFd Catalogue: a collaborative platform for sharing and reusing biomedical data | |||
Swertz et al [Swertz M, van Enckevort E, Oliveira JL, Fortier I, Bergeron J, Thurin NH, et al. Towards an interoperable ecosystem of research cohort and real-world data catalogues enabling multi-center studies. Yearb Med Inform. 2022;31(1):262-272. [FREE Full text] [CrossRef] [Medline]5] | 2022 | Towards an Interoperable Ecosystem of Research Cohort and Real-world Data Catalogues Enabling Multi-center Studies | |||
Top 5 nonacademic papers | |||||
European Medicines Agency [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43] | 2022 | Good Practice Guide for the use of the Metadata Catalogue of Real-World Sources | |||
European Medicines Agency [List of metadata for real world data catalogues. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/other/list-metadata-real-world-data-catalogues_en.pdf [accessed 2024-04-29] 44] | 2022 | List of metadata for Real World Data catalogues | |||
Directorate-General for Health and Food Safety [Proposal for a regulation - the European Health Data Space. Directorate-General for Health and Food Safety. 2022. URL: https://health.ec.europa.eu/publications/proposal-regulation-european-health-data-space_en#details [accessed 2024-11-07] 82] | 2022 | The European Health Data Space | |||
Jahnke and Otto [Jahnke N, Otto B. Data catalogs in the enterprise: applications and integration. Datenbank Spektrum. Jun 21, 2023;23(2):89-96. [CrossRef]25] | 2022 | Data Catalogs - Implementing Capabilities for Data Curation, Data Enablement and Regulatory Compliance | |||
TEHDASe [EHDS semantic interoperability framework 2022. TEHDAS. URL: https://tehdas.eu/app/uploads/2023/10/tehdas-recommendations-to-enhance-interoperability.pdf [accessed 2024-04-29] 16] | 2022 | EHDSf Semantic interoperability framework |
aSLR: structured literature review.
bMONTRA2: Modular Next-generation Research Analysis.
cFAIR: findability, accessibility, interoperability, and reusability.
dEMIF: European Medical Information Framework Catalogue.
eTEHDAS: Towards European Health Data Space.
fEHDS: European Health Data Space.
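The size of the literature collection follows from the counts reported for the SLR. As a quick consistency check, the short Python sketch below tallies them; the dictionary keys are descriptive labels chosen for illustration.

```python
# Counts as reported in the SLR description.
slr_sources = {
    "Scopus and IEEE Xplore search": 18,
    "backward and forward search": 12,
    "Google and AISeL search (after removing duplicates)": 8,
}

total_publications = sum(slr_sources.values())
print(total_publications)  # 38, the reported size of the collection
```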
Throughout the steps (2) to (4), a (5) quality assessment step was integrated on the basis of the criteria suggested by Kitchenham et al [Kitchenham B, Pearl Brereton O, Budgen D, Turner M, Bailey J, Linkman S. Systematic literature reviews in software engineering – a systematic literature review. Inf Softw Technol. 2009;51(1):7-15. [CrossRef]79], that is, inclusion or exclusion criteria, relevant article coverage, literature corpus assessment, and study descriptions.
For (6) data analysis, phrases (“quotes”) from articles with useful content for HMDC designs were extracted. Following the approaches of Saldana [Saldana J. The Coding Manual for Qualitative Researchers. Thousand Oaks, CA. SAGE Publications; 2021. 83] and Pratt [Pratt MG. Fitting oval pegs into round holes. Organ Res Methods. 2007;11(3):481-509. [CrossRef]84], those phrases were coded, inserted in a tabular structure, and iteratively generalized. As in step (3), 2 researchers analyzed the literature to reduce subjectivity biases. Figure 3 shows how quote extractions relating to the dimension of data linking are coded and design implications are derived. In particular, whenever there was a direct connection to an HMDC context, quotes became design implications immediately, for example, linkage strategy (first quote in Figure 3). If a direct connection was missing (eg, linkage variable as a new characteristic; third quote in Figure 3), more evidence was required to transform codes into design implications (ie, second quote in Figure 3). Finally, considering the influential factors proposed by Mwita [Mwita K. Factors influencing data saturation in qualitative studies. Int J Bus Soc. Jun 05, 2022;11(4):414-420. [CrossRef]81] (eg, study purpose, research design, sample size variability, and analysis approach), data saturation in the SLR is likely.
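The rule for turning coded quotes into design implications can be sketched in Python as follows. This is a simplified illustration rather than the authors' procedure: the `Quote` structure, the threshold `MIN_SUPPORT` for "more evidence," the code "linkage cost," and the placeholder quotes are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Quote:
    source: str             # article the quote was extracted from
    text: str               # extracted phrase (placeholder text here)
    code: str               # generalized label assigned during coding
    direct_hmdc_link: bool  # True if the quote refers to an HMDC context directly


# Assumption: an indirect code needs at least 2 supporting quotes.
MIN_SUPPORT = 2


def derive_design_implications(quotes: List[Quote]) -> List[str]:
    by_code = defaultdict(list)
    for quote in quotes:
        by_code[quote.code].append(quote)
    implications = []
    for code, group in by_code.items():
        has_direct_link = any(q.direct_hmdc_link for q in group)
        # Direct links qualify immediately; otherwise more evidence is required.
        if has_direct_link or len(group) >= MIN_SUPPORT:
            implications.append(code)
    return sorted(implications)


# Hypothetical coded quotes for illustration only.
quotes = [
    Quote("Article A", "placeholder quote 1", "linkage strategy", True),
    Quote("Article B", "placeholder quote 2", "linkage variable", False),
    Quote("Article C", "placeholder quote 3", "linkage variable", False),
    Quote("Article D", "placeholder quote 4", "linkage cost", False),
]

implications = derive_design_implications(quotes)
print(implications)
```

Here, "linkage strategy" qualifies through its direct HMDC connection, "linkage variable" through 2 supporting quotes, and the hypothetical "linkage cost" is held back until more evidence is found.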

Deductive Design Iterations for Taxonomy Development
Applying the E2C approach (ie, step 4b in Figure 1) in the third and fifth iterations, health data catalogs from practice were initially listed, as identified in the first 2 iterations. This set was extended by a Google search to identify analysis objects not encountered in inductive iterations. The research team searched for analysis objects using the browser’s incognito mode to circumvent carryover effects from previous searches [Fruhwirth M, Rachinger M, Prlja E. Discovering business models of data marketplaces. In: Proceedings of the 53rd Hawaii International Conference on System Sciences. 2020. Presented at: HICSS '20; January 7-10, 2020:5738-5747; Maui, HI. URL: https://scholarspace.manoa.hawaii.edu/server/api/core/bitstreams/cf7bab54-478b-412a-8742-6b02f10dd7ca/content [CrossRef]85]. The keywords from the SLR were used as an orientation to avoid limiting the results unconsciously [Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41]. Analysis objects were excluded if meaningful information could not be obtained. This criterion was defined as access to analyzable information describing the analysis object, that is, data retrievable either from websites or demo applications [Fruhwirth M, Rachinger M, Prlja E. Discovering business models of data marketplaces. In: Proceedings of the 53rd Hawaii International Conference on System Sciences. 2020. Presented at: HICSS '20; January 7-10, 2020:5738-5747; Maui, HI. URL: https://scholarspace.manoa.hawaii.edu/server/api/core/bitstreams/cf7bab54-478b-412a-8742-6b02f10dd7ca/content [CrossRef]85]. Analysis objects were also excluded if information was meaningful but unavailable in German or English. However, metadata catalogs under construction were not excluded per se [Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41]. The set of analysis objects was created in the first quarter of 2024. The final analysis objects are listed in Table 3.
HMDCsa | Classification | Status |
BBMRI-ERICb Data Directory [Data directory. BBMRI-ERIC. URL: https://directory.bbmri-eric.eu/ERIC/directory/#/catalogue [accessed 2024-11-02] 45] | Decentral | Operative |
Catalogue of Mental Health Measures [An interactive catalogue of mental health and wellbeing measures in British cohort and longitudinal studies. Catalogue of Mental Health Measures. URL: https://www.cataloguementalhealth.ac.uk/ [accessed 2024-10-27] 46] | Central | Operative |
Compendium Data Catalog for Healthcare [Data catalogue for healthcare. Compendium. URL: https://compendiumdatacatalog.com/data-catalog/ [accessed 2024-10-28] 47] | Central | Operative |
EHDENc Portal [Findable, standardised data at scale through the EHDEN database catalogue. EHDEN Portal. URL: https://www.ehden.eu/ehden-portal/ [accessed 2024-10-22] 48] | Decentral | Operative |
Elixir (BioSamples [Database (part of Elixir infrastructure). Elixir BioSamples. URL: https://www.ebi.ac.uk/biosamples/ [accessed 2024-12-03] 86] and FAIRsharing [A registry of knowledgebases and repositories of data. Elixir FAIRsharing. URL: https://fairsharing.org/search?fairsharingRegistry=Database [accessed 2024-12-04] 87]) | Decentral | Operative |
EMIFd Data Catalogue [EMIF catalogue. EMIF. URL: https://www.emif.eu/emif-catalogue/ [accessed 2024-10-19] 49] | Decentral | In progress |
EUCAIMe Cancer Image Europe [Cancer image Europe. EUCAIM Catalogue. URL: https://catalogue.eucaim.cancerimage.eu/#/ [accessed 2024-10-25] 50] | Decentral | Operative |
European Health Information Portal [Welcome to the one-stop shop that facilitates access to population health and health care data, information and expertise across Europe. European Health Information Portal. URL: https://www.healthinformationportal.eu/ [accessed 2024-10-18] 51] | Central | Operative |
Fjelltopp Data Catalogues for Health [Data catalogues for health. Fjelltopp. URL: https://www.fjelltopp.org/service/data-catalogues-for-health/ [accessed 2024-10-28] 52] | Central | Operative |
HealthRIf Data Catalogues [Data catalogues. Health RI. URL: https://catalogus.healthdata.nl/datasets [accessed 2024-10-29] 53] | Decentral | Operative |
Helsedata Explore Data Sources [Explore data sources. Helsedata. URL: https://helsedata.no/en/ [accessed 2024-10-29] 54] | Decentral | Operative |
IDERHAg Metadata Catalogue (no public access) | Decentral | In progress |
IHMEh Global Health Data Exchange [Global health data exchange. Institute for Health Metrics and Evaluation. URL: https://ghdx.healthdata.org/ [accessed 2024-10-29] 88] | Central | Operative |
IQVIAi Health Data Catalogue [Health data catalog. IQVIA. URL: https://www.iqvia.com/library/fact-sheets/iqvia-health-data-catalog [accessed 2024-09-25] 55] | Central | Operative |
Kraken Health Data Pilot [Health pilot. Kraken. URL: https://www.krakenh2020.eu/pilots/health [accessed 2024-11-01] 56] | Decentral | In progress |
Lifebit Precision Medicine Data Catalogue [Precision medicine data catalog. Lifebit. URL: https://www.lifebit.ai/federated-data-catalogue [accessed 2024-11-02] 57] | Decentral | Operative |
MACHj Clinical and Research Data Catalogue [MACH clinical and research dataset catalogue. Melbourne Academic Centre for Health. URL: https://figshare.unimelb.edu.au/MACH-catalogue?searchMode=1 [accessed 2024-11-24] 89] | Decentral | Operative |
Maelstrom Research Data Catalogue [Maelstrom catalogue. Maelstrom Research. URL: https://www.maelstrom-research.org/page/catalogue [accessed 2024-11-26] 58] | Decentral | Operative |
Yoda Trials Data Catalogue [Trials data catalogue. Yoda. URL: https://yoda.yale.edu/trials-search/ [accessed 2024-10-19] 59] | Decentral | In progress |
aHMDC: health metadata catalog.
bBBMRI-ERIC: Biobanking and Biomolecular Resources Research Infrastructure – European Research Infrastructure Consortium.
cEHDEN: European Health Data and Evidence Network.
dEMIF: European Medical Information Framework.
eEUCAIM: European Federation for Cancer Images.
fHealthRI: Health Research Infrastructure.
gIDERHA: Integration of Heterogeneous Data and Evidence towards Regulatory and Health Technology Assessment Acceptance.
hIHME: Institute for Health Metrics and Evaluation.
iIQVIA: Information, Quality, Value, Innovation, and Access.
jMACH: Melbourne Academic Centre for Health.
The catalogs were examined by classifying them alongside the design elements of the preliminary taxonomy (Multimedia Appendix 2; Table 3).
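The exclusion rules for analysis objects described in this section can be summarized in a short Python sketch. It is a hedged illustration: the `AnalysisObject` fields, the constants, and the 4 sample catalogs are hypothetical, and the operational status is deliberately ignored because catalogs under construction were not excluded per se.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Sources counted as analyzable information (websites or demo applications).
ANALYZABLE_SOURCES = frozenset({"website", "demo application"})
# Languages in which the information had to be available.
ACCEPTED_LANGUAGES = frozenset({"de", "en"})


@dataclass(frozen=True)
class AnalysisObject:
    name: str
    information_sources: FrozenSet[str]
    languages: FrozenSet[str]
    status: str  # "Operative" or "In progress"; not an exclusion criterion


def is_eligible(obj: AnalysisObject) -> bool:
    has_meaningful_information = bool(obj.information_sources & ANALYZABLE_SOURCES)
    is_readable = bool(obj.languages & ACCEPTED_LANGUAGES)
    return has_meaningful_information and is_readable


# Hypothetical catalogs for illustration only.
catalogs = [
    AnalysisObject("Catalog A", frozenset({"website"}), frozenset({"en"}), "Operative"),
    AnalysisObject("Catalog B", frozenset({"demo application"}), frozenset({"de"}), "In progress"),
    AnalysisObject("Catalog C", frozenset(), frozenset({"en"}), "Operative"),
    AnalysisObject("Catalog D", frozenset({"website"}), frozenset({"fr"}), "Operative"),
]

eligible = [c.name for c in catalogs if is_eligible(c)]
print(eligible)
```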
Evaluative Design Iterations for Taxonomy Development and Use Case Cocreation
Following Figure 1, an evaluation by focus groups is needed after the fulfillment of all ending conditions (ie, in step 6). Focus groups help gather more data than individual interviews, since experts respond to the input of others, triggering discussions and idea generation [Lauf F, Scheider S, Friese J, Kilz S, Radic M, Burmann A. Exploring design characteristics of data trustees in healthcare - taxonomy and archetypes. In: Proceedings of the 31st European Conference on Information Systems. 2023. Presented at: ECIS '23; June 11-16, 2023:33-57; Kristiansand, Norway. URL: https://www.researchgate.net/publication/370060215_Exploring_Design_Characteristics_of_Data_Trustees_in_Healthcare_-_Taxonomy_and_Archetypes23]. According to Szopinski et al [Szopinski D, Schoormann T, Kundisch D. Because your taxonomy is worth IT: towards a framework for taxonomy evaluation. In: Proceedings of the 27th European Conference on Information Systems. 2019. Presented at: ECIS '19; June 8-14, 2019:25-44; Stockholm-Uppsala, Sweden. URL: https://www.researchgate.net/publication/332711034_Because_your_taxonomy_is_worth_it_Towards_a_framework_for_taxonomy_evaluation39], focus groups are particularly suitable to assess the comprehensiveness, robustness, understandability, and extensibility of a taxonomy, as well as the shape of dimensions and characteristics. Members were recruited on the basis of the target audience of the taxonomy (see section about Research gap, objective, and questions). We ensured that they have substantial knowledge with regard to HMDCs, stemming from EU initiatives (Multimedia Appendix 1). Multimedia Appendix 3 lists the focus group members and their distribution to sessions.
In the fourth iteration, the focus group resulted in substantial changes of dimensions and characteristics, particularly regarding taxonomy elements of data accessibility and findability (Multimedia Appendix 2; Multimedia Appendix 3).
Ethical Considerations
In the run-up to the sessions, the experts received an information sheet explaining the study context, the procedure, and the approach to gathering and processing interview data. The experts were assured that no personal information would be disclosed and were informed about their right to opt out. During the focus group sessions, only anonymous interview data in the form of handwritten notes were collected. Neither video or voice recordings nor transcriptions of interview material were used. Experts did not receive any financial compensation. Hence, conducting focus groups did not require the official approval of an ethics review committee.
Results
Taxonomy for FAIR HMDCs
Overview
Table 4 shows the taxonomy containing 20 dimensions (Dn) and 101 characteristics (Cnm) structured alongside the FAIR data principles as meta-dimensions [Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):1-9. [FREE Full text] [CrossRef] [Medline]40]. For visualization, morphologies were considered (Multimedia Appendix 2). In Table 4, OR is used for nonexclusive characteristics, while XOR is used for exclusive ones.
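The distinction between OR (nonexclusive) and XOR (exclusive) dimensions can be made concrete with a small validation sketch in Python. It uses an excerpt of 3 dimensions from Table 4; the data structure, the function `validate`, the rule that a nonexclusive dimension needs at least one selected characteristic, and the sample classifications are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

# Excerpt of Table 4: dimension -> (exclusive?, allowed characteristics).
TAXONOMY_EXCERPT: Dict[str, Tuple[bool, Set[str]]] = {
    "D5: data sensitivity": (
        True,
        {"synthetic data", "anonymized data", "pseudonymized data", "personal data"},
    ),
    "D6: catalogue accessibility": (True, {"public", "hybrid", "private"}),
    "D8: access control": (
        False,
        {"catalog operator", "internal DAC", "external DAC", "none", "others"},
    ),
}


def validate(classification: Dict[str, Set[str]]) -> List[str]:
    """Return a list of problems; an empty list means the classification is valid."""
    problems = []
    for dimension, (exclusive, allowed) in TAXONOMY_EXCERPT.items():
        chosen = set(classification.get(dimension, set()))
        unknown = chosen - allowed
        if unknown:
            problems.append(f"{dimension}: unknown characteristics {sorted(unknown)}")
        if exclusive and len(chosen) != 1:
            problems.append(f"{dimension}: exactly one characteristic expected (XOR)")
        # Assumption: nonexclusive dimensions need at least one characteristic.
        if not exclusive and not chosen:
            problems.append(f"{dimension}: at least one characteristic expected (OR)")
    return problems


# Hypothetical classifications of two catalogs.
valid_catalog = {
    "D5: data sensitivity": {"pseudonymized data"},
    "D6: catalogue accessibility": {"public"},
    "D8: access control": {"catalog operator", "external DAC"},
}
invalid_catalog = {
    "D5: data sensitivity": {"synthetic data", "personal data"},  # violates XOR
    "D6: catalogue accessibility": {"public"},
    "D8: access control": set(),                                  # nothing selected
}

print(validate(valid_catalog))
print(validate(invalid_catalog))
```

Keeping the exclusivity flag next to each dimension means that the same check can be reused when dimensions or characteristics are added later.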
Dimension (Dn) | Characteristics (Cnm) | Exclusive or nonexclusive | ||
Findability | ||||
D1: data source | C1.1: patients’ health OR C1.2: medical procedures OR C1.3: medical products OR C1.4: others | Nonexclusive | ||
D2: managerial details | C2.1: holder OR C2.2: origin OR C2.3: collection OR C2.4: qualification OR C2.5: financials OR C2.6: others | Nonexclusive | ||
D3: data type | C3.1: admin XOR C3.2: primary care XOR C3.3: secondary care XOR C3.4: registries XOR C3.5: others | Exclusive | ||
D4: population information | C4.1: disease OR C4.2: family linkages OR C4.3: lifestyle factors OR C4.4: population OR C4.5: sociodemographic OR C4.6: catchment area coverage | Nonexclusive | ||
D5: data sensitivity | C5.1: synthetic data XOR C5.2: anonymized data XOR C5.3: pseudonymized data XOR C5.4: personal data | Exclusive | ||
Accessibility | ||||
D6: catalogue accessibility | C6.1: public XOR C6.2: hybrid XOR C6.3: private | Exclusive | ||
D7: dataset accessibility | C7.1: free XOR C7.2: formal request XOR C7.3: strictly limited XOR C7.4: others | Exclusive | ||
D8: access control | C8.1: catalog operator OR C8.2: internal DACa OR C8.3: external DAC OR C8.4: none OR C8.5: others | Nonexclusive | ||
Interoperability | ||||
D9: program discoverability | C9.1: Beacon OR C9.2: BBMRI-MIABISd OR C9.3: bioimage OR C9.4: CESSDAe OR C9.5: DCATf OR C9.6: ECRIN-CRMDRg OR C9.7: FairShairing OR C9.8: INSPIREh OR C9.9: PHIRIi OR C9.10: others | Nonexclusive | ||
D10: semantic interoperability | C10.1: CDISC-SDTMj OR C10.2: LOINCk OR C10.3: OMOPl OR C10.4: Oorphanet standards OR C10.5: SNOMEDm OR C10.6: others | Nonexclusive | ||
D11: interoperable communication | C11.1: DICOMn OR C11.2: HL7 FHIRo OR C11.3: IDMPp OR C11.4: ISO 800-110q OR C11.5: others | Nonexclusive | ||
D12: CDMr | C12.1: type OR C12.2: reference OR C12.3: release frequency | Nonexclusive | ||
D13: ETLs status | C13.1: planned XOR C13.2: in progress XOR C13.3: completed | Exclusive | ||
D14: vocabularies | C14.1: medicinal product OR C14.2: cause of death OR C14.3: quality of life measurement OR C14.4: prescription OR C14.5: dispensing OR C14.6: indication OR C14.7: procedures OR C14.8: genetic data OR C14.9: biomarker data OR C14.10: medical event | Nonexclusive | ||
Reusability | ||||
D15: collection methodology | C15.1: collection governance OR C15.2: collection process OR C15.3: dataset updates OR C15.4: others | Nonexclusive | ||
D16: collection events | C16.1: patient encounter OR C16.2: physical examination OR C16.3: diagnostics OR C16.4: treatment OR C16.5: progress note OR C16.6: communication OR C16.7: regulatory OR C16.8: others | Nonexclusive | ||
D17: data linkage | C17.1: strategy OR C17.2: variable OR C17.3: completeness OR C17.4: cross-reference OR C17.5: none | Nonexclusive | ||
D18: data preservation | C18.1: definite records XOR C18.2: indefinite records | Exclusive | ||
D19: publish | C19.1: approval needed XOR C19.2: no approval needed | Exclusive | ||
D20: informed consent | C20.1: not required XOR C20.2: general use XOR C20.3: all studies XOR C20.4: specific studies XOR C20.5: waiver XOR C20.6: others | Exclusive |
aFAIR: Findability, Accessibility, Interoperability, and Reusability.
bHMDC: health metadata catalog.
cDAC: Data Access Committee.
dBBMRI-MIABIS: Biobanking and Biomolecular Resources Research Infrastructure-Minimum Information About Biobank Data Sharing.
eCESSDA: Consortium of European Social Science Data Archives.
fDCAT: Data Catalog vocabulary.
gECRIN-CRMDR: European Clinical Research Infrastructure Network – Clinical Research Metadata Repository.
hINSPIRE: Infrastructure for Spatial Information in Europe.
iPHIRI: Population Health Research Infrastructure.
jCDISC-SDTM: Clinical Data Interchange Standards Consortium - Study Data Tabulation Model.
kLOINC: Logical Observation Identifiers Names and Codes.
lOMOP: Observational Medical Outcomes Partnership.
mSNOMED: Systematized Nomenclature of Medicine.
nDICOM: Digital Imaging and Communications in Medicine.
oHL7 FHIR: Health Level 7 Fast Healthcare Interoperability Resources.
pIDMP: Identification of Medicinal Products.
qISO 800-110: International Organization for Standardization 800-110.
rCDM: common data model.
sETL: extract, transform, load.
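The OR/XOR convention of Table 4 can be made concrete with a small validation sketch. The following Python excerpt is illustrative only: it encodes three of the 20 dimensions, and the names `Dimension`, `TAXONOMY`, and `validate` are hypothetical and not part of the study. It assumes that an exclusive (XOR) dimension takes exactly one characteristic and that a nonexclusive (OR) dimension takes one or more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One taxonomy dimension with its allowed characteristics."""
    code: str
    name: str
    characteristics: tuple
    exclusive: bool  # True = XOR (exactly one), False = OR (one or more)


# Excerpt of Table 4; the remaining dimensions would follow the same pattern.
TAXONOMY = {
    "D5": Dimension("D5", "data sensitivity",
                    ("synthetic data", "anonymized data",
                     "pseudonymized data", "personal data"), True),
    "D7": Dimension("D7", "dataset accessibility",
                    ("free", "formal request", "strictly limited", "others"), True),
    "D8": Dimension("D8", "access control",
                    ("catalog operator", "internal DAC", "external DAC",
                     "none", "others"), False),
}


def validate(profile):
    """Return a list of problems found in a catalog profile.

    `profile` maps a dimension code to the set of selected characteristics.
    """
    problems = []
    for code, chosen in profile.items():
        dim = TAXONOMY.get(code)
        if dim is None:
            problems.append(f"{code}: unknown dimension")
            continue
        unknown = set(chosen) - set(dim.characteristics)
        if unknown:
            problems.append(f"{code}: unknown characteristics {sorted(unknown)}")
        if dim.exclusive and len(chosen) != 1:
            problems.append(f"{code}: exclusive dimension needs exactly one characteristic")
        if not dim.exclusive and len(chosen) < 1:
            problems.append(f"{code}: nonexclusive dimension needs at least one characteristic")
    return problems
```

For example, `validate({"D5": {"synthetic data", "personal data"}})` would report one problem because data sensitivity (D5) is exclusive, whereas selecting several access control entities for D8 would pass.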
Data Findability
The meta-dimension prescribes that datasets orchestrated by an HMDC must be easily discoverable, requiring metadata assets to describe essential attributes of the decentral datasets [Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):1-9. [FREE Full text] [CrossRef] [Medline]40]. The dimension data source (D1) refers to abstract categories for data classification in the catalog system. Following European Medicines Agency guidelines [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43], and implementations in practice [EMIF catalogue. EMIF. URL: https://www.emif.eu/emif-catalogue/ [accessed 2024-10-19] 49,Cancer image Europe. EUCAIM Catalogue. URL: https://catalogue.eucaim.cancerimage.eu/#/ [accessed 2024-10-25] 50], patients’ health (C1.1) comprises datasets attributable to conditions. Examples are diseases, causes of death, prescriptions and dispensing of medicines, clinical measurements, genetic data, units of health care use, and all other similar patient-generated data, for example, wearables. Medical procedures (C1.2) encompass data describing hospital admission discharges, intensive care admissions, administration of vaccines or other injectables, medical operations, biomarker data, and diagnostic codes [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43,List of metadata for real world data catalogues. European Medicines Agency. 2022. 
URL: https://www.ema.europa.eu/en/documents/other/list-metadata-real-world-data-catalogues_en.pdf [accessed 2024-04-29] 44]. The latter includes, among others, the International Classification of Disease Code, the Major Comorbidity Code, and the Major Diagnostic Code. Medical products (C1.3) span categories like prescribed medicinal products for human use, contraception, indication for use, and medical device data. Others (C1.4) may refer to further data, for example, health care providers delivering diagnosis and treatment services [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]. The dimension managerial details (D2) captures crucial organizational metadata to be disclosed by the HMDC as part of a dataset’s metadata asset [Labadie C, Legner C, Eurich M, Fadler M. FAIR enough? Enhancing the usage of enterprise data with data catalogs. In: Proceedings of the 2020 IEEE 22nd Conference on Business Informatics. 2020. Presented at: CBI '20; June 22-24, 2020; Antwerp, Belgium. URL: https://ieeexplore.ieee.org/document/9140254 [CrossRef]24]. Above all, the data holder (C2.1) must be publicized, including contact details (ie, data steward). This entity sustains the record collection in an underlying dataset [Shi J, Zheng M, Yao L, Ge Y. DIR — a semantic information resource for healthcare datasets. In: Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine. 2017. Presented at: BIBM '17; November 13-16, 2017:805-810; Kansas City, MO. URL: https://ieeexplore.ieee.org/abstract/document/8217758 [CrossRef]60]. The origin (C2.2) of the data refers to the countries or geographical regions of their acquisition [Shi J, Zheng M, Yao L, Ge Y. DIR — a semantic information resource for healthcare datasets. 
In: Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine. 2017. Presented at: BIBM '17; November 13-16, 2017:805-810; Kansas City, MO. URL: https://ieeexplore.ieee.org/abstract/document/8217758 [CrossRef]60] and the language [List of metadata for real world data catalogues. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/other/list-metadata-real-world-data-catalogues_en.pdf [accessed 2024-04-29] 44]. The characteristic of collection (C2.3) details the acquisition dates as well as all data assemblage information, except collection methodology and events. If the dataset has received a formal qualification (C2.4), this should also be disclosed in the metadata asset [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]. The same holds for sources of finance (C2.5) having sponsored the dataset creation [Lauf F, Scheider S, Friese J, Kilz S, Radic M, Burmann A. Exploring design characteristics of data trustees in healthcare - taxonomy and archetypes. In: Proceedings of the 31st European Conference on Information Systems. 2023. Presented at: ECIS '23; June 11-16, 2023:33-57; Kristiansand, Norway. URL: https://www.researchgate.net/publication/370060215_Exploring_Design_Characteristics_of_Data_Trustees_in_Healthcare_-_Taxonomy_and_Archetypes23,McCoy D, Chand S, Sridhar D. Global health funding: how much, where it comes from and where it goes. Health Policy Plan. 2009;24(6):407-417. [CrossRef] [Medline]61], for example, data holder, public, industry, research, or patient organizations [List of metadata for real world data catalogues. European Medicines Agency. 2022. 
URL: https://www.ema.europa.eu/en/documents/other/list-metadata-real-world-data-catalogues_en.pdf [accessed 2024-04-29] 44]. Naturally, other (C2.6) metadata attributes specifying managerial details are conceivable. The dimension data type (D3) describes broader content-related categories applicable to the dataset. It mainly distinguishes datasets containing administrative (C3.1 [Yurkovich M, Avina-Zubieta JA, Thomas J, Gorenchtein M, Lacaille D. A systematic review identifies valid comorbidity indices derived from administrative health data. J Clin Epidemiol. 2015;68(1):3-14. [CrossRef] [Medline]62]), primary (C3.2) and secondary care (C3.3 [Pereira A, Almeida JR, Lopes RP, Oliveira JL. Querying semantic catalogues of biomedical databases. J Biomed Inform. 2023;137:104272. [FREE Full text] [CrossRef] [Medline]63]), registry (C3.4), and other (C3.5) data types. Furthermore, population information (D4) as a metadata attribute refers to the specifics of the records within a dataset [Bergeron J, Doiron D, Marcon Y, Ferretti V, Fortier I. Fostering population-based cohort data discovery: the Maelstrom Research cataloguing toolkit. PLoS One. 2018;13(7):e0200926. [FREE Full text] [CrossRef] [Medline]17]. The taxonomy narrows the dimension down to collected disease information (C4.1); particular population specifics (C4.2; eg, age groups); family linkages (C4.3; eg, household, parent-child, sibling, and not applicable); lifestyle factors (C4.4; eg, tobacco use, physical exercises, and diet); sociodemographic data (C4.5; eg, gender, ethnicity, education, and deprivation index); and catchment area coverage (C4.6) [Bergeron J, Doiron D, Marcon Y, Ferretti V, Fortier I. Fostering population-based cohort data discovery: the Maelstrom Research cataloguing toolkit. PLoS One. 2018;13(7):e0200926. [FREE Full text] [CrossRef] [Medline]17,Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. 
URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43,List of metadata for real world data catalogues. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/other/list-metadata-real-world-data-catalogues_en.pdf [accessed 2024-04-29] 44]. Data sensitivity (D5) addresses the identifiability of data subjects of whom records were collected. Records can be synthetic (C5.1), anonymized (C5.2), pseudonymized (C5.3), or contain personal data (C5.4) [Oliveira JL, Trifan A, Bastião Silva LA. EMIF catalogue: a collaborative platform for sharing and reusing biomedical data. Int J Med Inform. 2019;126:35-45. [CrossRef] [Medline]15,Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41].
Data Accessibility
Once datasets are findable in the HMDC, their accessibility must be facilitated. By their design, HMDCs must ensure legal and ethical compliance of all data access and sharing processes in their health care ecosystems [Almeida JR, Silva JM, Oliveira JL. A FAIR approach to real-world health data management and analysis. In: Proceedings of the 36th International Symposium on Computer-Based Medical Systems. 2023. Presented at: CBMS '23; June 22-24, 2023:892-897; L'Aquila, Italy. URL: https://ieeexplore.ieee.org/document/10178764 [CrossRef]19]. In particular, datasets must be retrievable by authorized users only, which implies rigorous access control functions [Swertz M, van Enckevort E, Oliveira JL, Fortier I, Bergeron J, Thurin NH, et al. Towards an interoperable ecosystem of research cohort and real-world data catalogues enabling multi-center studies. Yearb Med Inform. 2022;31(1):262-272. [FREE Full text] [CrossRef] [Medline]5,Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41]. Accordingly, the dimension catalogue accessibility (D6) refers to access modalities for data consumers to use the HMDC. HMDCs can be public (C6.1), allowing anyone to browse metadata assets and discover datasets. Alternatively, HMDCs can be private (C6.2), limited to a certain number of users who have been formally authorized by a dedicated authority. In addition, hybrid (C6.3) forms exist. Dataset accessibility (D7) describes the access modalities for data consumers to published datasets [Almeida JR, Oliveira JL. MONTRA2: a web platform for profiling distributed databases in the health domain. Inform Med Unlocked. 2024;45:101447. [CrossRef]30]. The access can be free (C7.1), implying that data must at least be anonymized or synthetic to comply with legal and ethical guidelines [Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. 
Electron Markets. 2023;33(1):1-16. [CrossRef]41,Brunswick D. Data privacy, data protection, and the importance of integration for GDPR compliance. Isaca J. 2019;1:14-27.64]. Datasets can also require a formal request (C7.2) with a Data Access Committee (DAC) or a data steward deciding upon access requests [Yang K, Jia X, Ren K, Zhang B, Xie R. DAC-MACS: effective data access control for multiauthority cloud storage systems. In: Proceedings of the 2013 IEEE Conference on Computer Communications Workshops. 2013. Presented at: INFOCOM '13; April 14-19, 2013:2895-2903; Turin, Italy. URL: https://ieeexplore.ieee.org/document/6567100 [CrossRef]65,Shabani M, Borry P. "You want the right amount of oversight": interviews with data access committee members and experts on genomic data access. Genet Med. 2016;18(9):892-897. [FREE Full text] [CrossRef] [Medline]66]. HMDCs frequently require such requests of data consumers to be approved by their ethics committees before processing them within the ecosystem [Shabani M, Borry P. "You want the right amount of oversight": interviews with data access committee members and experts on genomic data access. Genet Med. 2016;18(9):892-897. [FREE Full text] [CrossRef] [Medline]66]. Moreover, data access can be strictly limited (C7.3) to a demarcated group of data consumers, although members of such limited groups also need to make formal requests for data access [Shabani M, Borry P. "You want the right amount of oversight": interviews with data access committee members and experts on genomic data access. Genet Med. 2016;18(9):892-897. [FREE Full text] [CrossRef] [Medline]66]. This implies that other (C7.4) data access modalities exist, especially combinations of C7.1 to C7.3. Access control (D8) refers to the mechanisms implemented by HMDCs that facilitate the aforementioned decision-making by empowered entities [Almeida JR, Oliveira JL. MONTRA2: a web platform for profiling distributed databases in the health domain. Inform Med Unlocked. 
2024;45:101447. [CrossRef]30]. The dimension specifies the entities who determine whether data consumers receive the requested datasets and which kinds of processing activities they are allowed to perform [Munoz-Arcentales A, López-Pernas S, Pozo A, Alonso Á, Salvachúa J, Huecas G. An architecture for providing data usage and access control in data sharing ecosystems. Procedia Comput Sci. 2019;160:590-597. [CrossRef]67]. It distinguishes the catalog operator (C8.1); internal DACs on the side of the data holders (C8.2 [Yang K, Jia X, Ren K, Zhang B, Xie R. DAC-MACS: effective data access control for multiauthority cloud storage systems. In: Proceedings of the 2013 IEEE Conference on Computer Communications Workshops. 2013. Presented at: INFOCOM '13; April 14-19, 2013:2895-2903; Turin, Italy. URL: https://ieeexplore.ieee.org/document/6567100 [CrossRef]65,Shabani M, Borry P. "You want the right amount of oversight": interviews with data access committee members and experts on genomic data access. Genet Med. 2016;18(9):892-897. [FREE Full text] [CrossRef] [Medline]66]); external DACs (C8.3) that are run centrally by an independent third party [Lauf F, Scheider S, Friese J, Kilz S, Radic M, Burmann A. Exploring design characteristics of data trustees in healthcare - taxonomy and archetypes. In: Proceedings of the 31st European Conference on Information Systems. 2023. Presented at: ECIS '23; June 11-16, 2023:33-57; Kristiansand, Norway. URL: https://www.researchgate.net/publication/370060215_Exploring_Design_Characteristics_of_Data_Trustees_in_Healthcare_-_Taxonomy_and_Archetypes23,Shabani M, Borry P. "You want the right amount of oversight": interviews with data access committee members and experts on genomic data access. Genet Med. 2016;18(9):892-897. [FREE Full text] [CrossRef] [Medline]66]; the absence of access control (C8.4; ie, free data [C7.1]); and any other forms (C8.5). 
Generally, access control in HMDC designs is crucial for maintaining data security, confidentiality, and compliance with legal and ethical constraints [Munoz-Arcentales A, López-Pernas S, Pozo A, Alonso Á, Salvachúa J, Huecas G. An architecture for providing data usage and access control in data sharing ecosystems. Procedia Comput Sci. 2019;160:590-597. [CrossRef]67].
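The dependencies described above between data sensitivity (D5), dataset accessibility (D7), and access control (D8) can be expressed as simple consistency checks. The sketch below is a hedged illustration rather than a mechanism reported in the study: the function name `check_access_consistency` and the shortened string labels are hypothetical, and the third rule (that "none" cannot be combined with other access control entities) is an added assumption not stated in the text.

```python
def check_access_consistency(sensitivity, dataset_access, access_control):
    """Check a dataset's accessibility settings against the rules in the text.

    sensitivity: one of "synthetic", "anonymized", "pseudonymized", "personal" (D5)
    dataset_access: one of "free", "formal request", "strictly limited", "others" (D7)
    access_control: set of deciding entities, eg, {"internal DAC"} or {"none"} (D8)
    """
    issues = []
    # Free access (C7.1) implies data that are at least anonymized or synthetic.
    if dataset_access == "free" and sensitivity not in {"synthetic", "anonymized"}:
        issues.append("free access requires anonymized or synthetic data")
    # Absence of access control (C8.4) corresponds to free data (C7.1).
    if "none" in access_control and dataset_access != "free":
        issues.append("missing access control is only consistent with free access")
    # Assumption of this sketch: "none" excludes all other deciding entities.
    if "none" in access_control and len(access_control) > 1:
        issues.append("'none' cannot be combined with other access control entities")
    return issues
```

A pseudonymized dataset offered as free would be flagged, while an anonymized dataset with free access and no access control would pass without issues.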
Data Interoperability
A core objective of HMDCs is to enable data consumers to accumulate datasets across organizations effectively and to create meaningful connections and analyses [Kush RD, Warzel D, Kush MA, Sherman A, Navarro EA, Fitzmartin R, et al. FAIR data sharing: the roles of common data elements and harmonization. J Biomed Inform. 2020;107:103421. [FREE Full text] [CrossRef] [Medline]7]. This makes data interoperability crucial [Pine KH. The qualculative dimension of healthcare data interoperability. Health Informatics J. 2019;25(3):536-548. [FREE Full text] [CrossRef] [Medline]91]. To that end, HMDCs leverage specific standards described in this meta-dimension. Among these, programmatic discoverability (D9) refers to the ability of data consumers to programmatically query, access, and retrieve metadata assets and search for their attributes. It is defined by the joint action Towards European Health Data Space (TEHDAS) [EHDS semantic interoperability framework 2022. TEHDAS. URL: https://tehdas.eu/app/uploads/2023/10/tehdas-recommendations-to-enhance-interoperability.pdf [accessed 2024-04-29] 16] as the ability to identify, access, and understand health data by automated means. Associated approaches commonly involve application programming interfaces or similar programmatic methods to access and filter metadata in the HMDC [Almeida JR, Silva JM, Oliveira JL. A FAIR approach to real-world health data management and analysis. In: Proceedings of the 36th International Symposium on Computer-Based Medical Systems. 2023. Presented at: CBMS '23; June 22-24, 2023:892-897; L'Aquila, Italy. URL: https://ieeexplore.ieee.org/document/10178764 [CrossRef]19]. Following the TEHDAS community [EHDS semantic interoperability framework 2022. TEHDAS. URL: https://tehdas.eu/app/uploads/2023/10/tehdas-recommendations-to-enhance-interoperability.pdf [accessed 2024-04-29] 16], the dimension is narrowed down to the most frequent standards used by HMDCs. These are:
- Beacon (C9.1),
- Biobanking and biomolecular resources research infrastructure-minimum information about biobank data sharing (BBMRI-MIABIS; C9.2),
- Bio-image archive (C9.3),
- Consortium of European Social Science Data Archives (CESSDA; C9.4),
- Data catalog vocabulary (DCAT; C9.5),
- European Clinical Research Infrastructure Network – clinical research metadata repository (ECRIN-CRMDR; C9.6),
- FairSharing (C9.7),
- Infrastructure for Spatial Information in Europe (INSPIRE; C9.8),
- Population Health Research Infrastructure (PHIRI; C9.9), and
- Others (C9.10).
HMDCs typically adhere to one of those standards to ensure programmatic discoverability of published data offerings (ie, the metadata assets). The dimension semantic interoperability (D10) ensures that the precise format and meaning of datasets is preserved and understood, covering both semantic and syntactic aspects [Ngouongo SM, Löbe M, Stausberg J. The ISO/IEC 11179 norm for metadata registries: does it cover healthcare standards in empirical research? J Biomed Inform. 2013;46(2):318-327. [FREE Full text] [CrossRef] [Medline]68]. Similar to D9, the characteristics of this dimension encompass standards commonly applied by HMDCs. These are
- Clinical Data Interchange Standards Consortium - Study Data Tabulation Model (CDISC-SDTM; C10.1),
- Logical Observation Identifiers Names and Codes (LOINC; C10.2),
- Observational Medical Outcomes Partnership (OMOP; C10.3),
- Orphanet (C10.4),
- Systematized Nomenclature of Medicine (SNOMED; C10.5 [EHDS semantic interoperability framework 2022. TEHDAS. URL: https://tehdas.eu/app/uploads/2023/10/tehdas-recommendations-to-enhance-interoperability.pdf [accessed 2024-04-29] 16,Ngouongo SM, Löbe M, Stausberg J. The ISO/IEC 11179 norm for metadata registries: does it cover healthcare standards in empirical research? J Biomed Inform. 2013;46(2):318-327. [FREE Full text] [CrossRef] [Medline]68,de Mello BH, Rigo SJ, da Costa CA, da Rosa Righi R, Donida B, Bez MR, et al. Semantic interoperability in health records standards: a systematic literature review. Health Technol (Berl). 2022;12(2):255-272. [FREE Full text] [CrossRef] [Medline]69]), and
- Others (C10.6).
Subsequently, the dimension interoperable communication (D11) comprises approaches implemented by HMDCs to facilitate seamless and effective data sharing between data holders and consumers [Iroju O, Soriyan A, Gambo I, Olaleke J. Interoperability in healthcare: benefits, challenges and resolutions. Int J Innov Appl Stud. 2013;3(1):262-270.70]. Approaches typically used for interoperable communication are Digital Imaging and Communications in Medicine (DICOM; C11.1), Health Level 7 Fast Healthcare Interoperability Resources (HL7 FHIR; C11.2), Identification of Medicinal Products (IDMP; C11.3), and International Organization for Standardization (ISO) 800-110 (C11.4) [EHDS semantic interoperability framework 2022. TEHDAS. URL: https://tehdas.eu/app/uploads/2023/10/tehdas-recommendations-to-enhance-interoperability.pdf [accessed 2024-04-29] 16,Roehrs A, da Costa CA, da Rosa Righi R, Rigo SJ, Wichman MH. Toward a model for personal health record interoperability. IEEE J Biomed Health Inform. 2019;23(2):867-873. [CrossRef]71]. As for the previous dimensions, other standards (C11.5) are conceivable. For D9 to D11, detailed information is readily available on the web.
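To illustrate what programmatic discoverability (D9) means for a data consumer, the following minimal Python sketch filters metadata assets by attribute. It is an assumption-laden illustration and not the interface of any HMDC examined in the study: the records in `CATALOG` are invented, the function `search` is hypothetical, and the dictionary keys merely borrow property names from DCAT and Dublin Core (`dct:title`, `dcat:keyword`, `dct:conformsTo`, and `dct:accessRights`).

```python
from typing import Optional

# Invented metadata assets; keys borrow DCAT and Dublin Core property names.
CATALOG = [
    {
        "dct:title": "Regional cancer registry",
        "dcat:keyword": ["oncology", "registry"],
        "dct:conformsTo": ["OMOP"],
        "dct:accessRights": "formal request",
    },
    {
        "dct:title": "Synthetic primary care cohort",
        "dcat:keyword": ["primary care", "synthetic"],
        "dct:conformsTo": ["HL7 FHIR", "SNOMED"],
        "dct:accessRights": "free",
    },
    {
        "dct:title": "Imaging biobank",
        "dcat:keyword": ["imaging", "oncology"],
        "dct:conformsTo": ["DICOM"],
        "dct:accessRights": "strictly limited",
    },
]


def search(records, keyword: Optional[str] = None, conforms_to: Optional[str] = None):
    """Return titles of metadata assets matching all given filters."""
    hits = []
    for record in records:
        # Skip records that lack the requested keyword.
        if keyword is not None and keyword not in record["dcat:keyword"]:
            continue
        # Skip records that do not declare the requested standard.
        if conforms_to is not None and conforms_to not in record["dct:conformsTo"]:
            continue
        hits.append(record["dct:title"])
    return hits
```

In a real HMDC, such a filter would typically sit behind an application programming interface; here, `search(CATALOG, keyword="oncology", conforms_to="DICOM")` would return only the imaging biobank.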
The following dimensions deal with “data harmonization” as a crucial aspect of data interoperability. They comprise HMDC design elements for standardizing disparate datasets. The purpose is to ensure consistency and coherence of all datasets classifiable to the same data type (D3). Data harmonization aims to create a unified and cohesive view of datasets, enhancing their allocation, sharing, and use [Kush RD, Warzel D, Kush MA, Sherman A, Navarro EA, Fitzmartin R, et al. FAIR data sharing: the roles of common data elements and harmonization. J Biomed Inform. 2020;107:103421. [FREE Full text] [CrossRef] [Medline]7]. The common data model (CDM; D12) describes the specifications relating to the structured representation of data records within datasets [Swertz M, van Enckevort E, Oliveira JL, Fortier I, Bergeron J, Thurin NH, et al. Towards an interoperable ecosystem of research cohort and real-world data catalogues enabling multi-center studies. Yearb Med Inform. 2022;31(1):262-272. [FREE Full text] [CrossRef] [Medline]5]. The CDM has implications for the relationships between these records, as well as for the rules and possibilities for data use. It defines how data consumers can access and process datasets, thus providing the foundation for data consistency, interoperability, and orchestration [Lovestone S, EMIF Consortium. The European medical information framework: a novel ecosystem for sharing healthcare data across Europe. Learn Health Syst. 2020;4(2):e10214. [FREE Full text] [CrossRef] [Medline]10,Kent S, Burn E, Dawoud D, Jonsson P, Østby JT, Hughes N, et al. Common problems, common data model solutions: evidence generation for health technology assessment. Pharmacoeconomics. 2021;39(3):275-285. [FREE Full text] [CrossRef] [Medline]72]. The taxonomy distinguishes the CDM type (C12.1) [Lovestone S, EMIF Consortium. The European medical information framework: a novel ecosystem for sharing healthcare data across Europe. Learn Health Syst. 2020;4(2):e10214. 
[FREE Full text] [CrossRef] [Medline]10,Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43,Kent S, Burn E, Dawoud D, Jonsson P, Østby JT, Hughes N, et al. Common problems, common data model solutions: evidence generation for health technology assessment. Pharmacoeconomics. 2021;39(3):275-285. [FREE Full text] [CrossRef] [Medline]72], the CDM references (C12.2), for example, websites or publications, and the release frequency of CDM specification updates (C12.3) [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]. Furthermore, information about datasets on their transformation status (extract, transform, load [ETL]) to a CDM should be provided [List of metadata for real world data catalogues. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/other/list-metadata-real-world-data-catalogues_en.pdf [accessed 2024-04-29] 44]. This ETL status (D13) can be described as planned (C13.1), in progress (C13.2), or completed (C13.3), indicating the readiness of the dataset for use. Finally, HMDCs leverage vocabularies (D14) as sets of fixed terms, labels, or identifiers to describe and categorize the metadata assets [Pereira A, Almeida JR, Lopes RP, Oliveira JL. Querying semantic catalogues of biomedical databases. J Biomed Inform. 2023;137:104272. [FREE Full text] [CrossRef] [Medline]63]. Vocabularies facilitate understanding, discovery, and allocation of metadata with a consistently applied language [Pereira A, Almeida JR, Lopes RP, Oliveira JL. 
Querying semantic catalogues of biomedical databases. J Biomed Inform. 2023;137:104272. [FREE Full text] [CrossRef] [Medline]63,Ivanović M, Budimac Z. An overview of ontologies and data resources in medical domains. Expert Syst Appl. 2014;41(11):5158-5166. [CrossRef]73]. The dimension distinguishes 10 characteristics for classifying vocabularies on the basis of pertinent literature: medicinal product (C14.1), cause of death (C14.2), quality of life measuring (C14.3), prescription (C14.4), dispensing (C14.5), indication (C14.6), procedures (C14.7), genetic data (C14.8), biomarker data (C14.9), and medical event (C14.10) [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43,List of metadata for real world data catalogues. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/other/list-metadata-real-world-data-catalogues_en.pdf [accessed 2024-04-29] 44,Ivanović M, Budimac Z. An overview of ontologies and data resources in medical domains. Expert Syst Appl. 2014;41(11):5158-5166. [CrossRef]73].
Data Reusability
FAIR datasets must be created and documented in a way that allows reuse for different purposes. For HMDCs, this implies providing contextual information beyond the metadata dimensions associated with data findability [Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):1-9. [FREE Full text] [CrossRef] [Medline]40]. Collection methodology (D15) encompasses characteristics that are associated with how data records were created [Hentschel J. Contextuality and data collection methods: a framework and application to health service utilisation. J Dev Stud. 1999;35(4):64-94. [CrossRef]74]. Among them, collection governance (C15.1) addresses information about data capture, demonstrating legal and ethical compliance [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]. This includes data quality checks and validation activities [Tijhuis M, Finger JD, Slobbe L, Sund R, Tolonen H. Data collection. In: Marieke V, van Oers H, editors. Population Health Monitoring: Climbing the Information Pyramid. Cham, Switzerland. Springer; 2019:59-81.75]. The latter may also refer to the question of whether the dataset allows access to the actual records. Furthermore, the collection process (C15.2) outlines how records in the dataset were created [Hentschel J. Contextuality and data collection methods: a framework and application to health service utilisation. J Dev Stud. 1999;35(4):64-94. [CrossRef]74], for example, surveys, questionnaires, or data retrieval from hospital information systems. 
Dataset updates (C15.3) disclose refreshment dates of datasets, for instance, fixed dates around the year [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]. Naturally, the collection methodology can contain other (C15.4) use case–specific characteristics as additional metadata attributes. Similar to D15, collection events (D16) narrows down the categories of incidents having triggered the creation of a record in the dataset [Bergeron J, Doiron D, Marcon Y, Ferretti V, Fortier I. Fostering population-based cohort data discovery: the Maelstrom Research cataloguing toolkit. PLoS One. 2018;13(7):e0200926. [FREE Full text] [CrossRef] [Medline]17]. The dimension comprises the characteristics of patient encounter (C16.1; eg, interactions with health care providers); physical examination (C16.2; eg, patient’s health examined by a professional); diagnostics (C16.3; eg, results of medical condition checks); treatment (C16.4; eg, documentation of conditions and treatment plans); progress notes (C16.5; eg, changes in patients’ health status, responses to treatment, or modifications of care plans); communication (C16.6; eg, information exchanged by health care providers); regulatory (C16.7; eg, legally required documentation of patients’ care); and others (C16.8) [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43,List of metadata for real world data catalogues. European Medicines Agency. 2022. 
URL: https://www.ema.europa.eu/en/documents/other/list-metadata-real-world-data-catalogues_en.pdf [accessed 2024-04-29] 44].
The dimension data linkage (D17) describes whether and how a dataset was created by linking others [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43,List of metadata for real world data catalogues. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/other/list-metadata-real-world-data-catalogues_en.pdf [accessed 2024-04-29] 44,Bohensky MA, Jolley D, Sundararajan V, Evans S, Pilcher DV, Scott I, et al. Data linkage: a powerful research tool with potential problems. BMC Health Serv Res. 2010;10(1):346-354. [FREE Full text] [CrossRef] [Medline]76]. The metadata should disclose the linkage strategy (C17.1), which could be deterministic, probabilistic, or both. In addition, the linkage variable used (C17.2) should be published, along with the completeness of data linkage (C17.3). Ideally, the linked datasets should be cross-referenced (C17.4) and, if applicable, their availability in the HMDC highlighted. In case no data linkage was applied, no corresponding metadata attribute is provided (C17.5). Furthermore, data preservation (D18) indicates whether records in the dataset are preserved indefinitely (C18.1) or, if not (C18.2), for how long they are retained [Rasmussen KB, Blank G. The data documentation initiative: a preservation standard for research. Arch Sci. 2007;7(1):55-71. [CrossRef]77]. Publishing constraints (D19) informs data consumers whether an approval of the data holder is needed (C19.1) or not needed (C19.2) to publish results obtained from using the dataset [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. 
URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]. In the former case, the kind of approval and the approval process should be described. Finally, metadata assets of HMDCs should reveal whether informed consent (D20) was obtained or needs to be obtained for data processing [Angrist M. Eyes wide open: the personal genome project, citizen science and veracity in informed consent. Per Med. 2009;6(6):691-699. [FREE Full text] [CrossRef] [Medline]78]. Generally, the characteristics not required (C20.1), required for general use (C20.2), required for all studies (C20.3), required for specific studies (C20.4), waiver (C20.5), and other (C20.6) are recommendable [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43].
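To make the reusability dimensions described above more tangible, the following minimal sketch shows how D17 to D20 might be represented in a metadata asset. It is an illustration only and not a schema taken from the taxonomy or from any existing HMDC: the class names, the field names, the representation of linkage completeness as a share between 0 and 1, and the expression of the retention period in years are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class LinkageStrategy(Enum):
    """C17.1: the linkage strategy named in the text."""
    DETERMINISTIC = "deterministic"
    PROBABILISTIC = "probabilistic"
    BOTH = "both"


class InformedConsent(Enum):
    """D20 characteristics; the values are the taxonomy codes."""
    NOT_REQUIRED = "C20.1"
    REQUIRED_FOR_GENERAL_USE = "C20.2"
    REQUIRED_FOR_ALL_STUDIES = "C20.3"
    REQUIRED_FOR_SPECIFIC_STUDIES = "C20.4"
    WAIVER = "C20.5"
    OTHER = "C20.6"


@dataclass
class DataLinkage:
    """D17 attributes (hypothetical field names)."""
    strategy: LinkageStrategy                                  # C17.1
    linkage_variable: str                                      # C17.2
    completeness: float                                        # C17.3, assumed share in [0, 1]
    cross_references: List[str] = field(default_factory=list)  # C17.4


@dataclass
class ReusabilityMetadata:
    """Reusability part of a metadata asset (hypothetical structure)."""
    consent: InformedConsent                  # D20
    publication_approval_required: bool       # D19: True = C19.1, False = C19.2
    linkage: Optional[DataLinkage] = None     # D17: None corresponds to C17.5
    retention_years: Optional[int] = None     # D18: None = preserved indefinitely (C18.1)

    def characteristic_codes(self) -> List[str]:
        """Return the taxonomy characteristics this asset instantiates."""
        codes: List[str] = []
        if self.linkage is None:
            codes.append("C17.5")
        else:
            codes += ["C17.1", "C17.2", "C17.3"]
            if self.linkage.cross_references:
                codes.append("C17.4")
        codes.append("C18.1" if self.retention_years is None else "C18.2")
        codes.append("C19.1" if self.publication_approval_required else "C19.2")
        codes.append(self.consent.value)
        return codes


# Illustrative asset with made-up values.
example = ReusabilityMetadata(
    consent=InformedConsent.REQUIRED_FOR_ALL_STUDIES,
    publication_approval_required=True,
    linkage=DataLinkage(LinkageStrategy.DETERMINISTIC, "patient_id", 0.97, ["dataset-b"]),
    retention_years=10,
)
print(example.characteristic_codes())
```

Modeling consent as an enumeration reflects that exactly one D20 characteristic is chosen per asset in this sketch; whether D20 is exclusive in the taxonomy itself is not stated in this section.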
Cocreated HMDC Use Cases
Overview
The usability, effectiveness, and accuracy of the taxonomy are amplified by 4 “illustrative scenarios” for HMDCs that demonstrate how the FAIR dimensions and characteristics are reflected in real-world use cases [Szopinski D, Schoormann T, Kundisch D. Because your taxonomy is worth IT: towards a framework for taxonomy evaluation.
In: Proceedings of the 27th European Conference on Information Systems. 2019. Presented at: ECIS '19; June 8-14, 2019:25-44; Stockholm-Uppsala, Sweden. URL: https://www.researchgate.net/publication/332711034_Because_your_taxonomy_is_worth_it_Towards_a_framework_for_taxonomy_evaluation39]. These use cases facilitate the taxonomy’s tangibility and the ascertainment of its practical implications, while triangulating the results. As such, they add depth and context to the taxonomy [Szopinski D, Schoormann T, Kundisch D. Because your taxonomy is worth IT: towards a framework for taxonomy evaluation.
In: Proceedings of the 27th European Conference on Information Systems. 2019. Presented at: ECIS '19; June 8-14, 2019:25-44; Stockholm-Uppsala, Sweden. URL: https://www.researchgate.net/publication/332711034_Because_your_taxonomy_is_worth_it_Towards_a_framework_for_taxonomy_evaluation39]. Originally, 6 abstract application scenarios were derived from recommendations of EMA [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29]
43]. Building upon this, we continuously developed and refined those scenarios on the basis of the insights gained during the taxonomy design iterations in general and the focus groups in particular. In the latter, we relied on the experts’ reflections on what they expect from an HMDC and whether our dimensions and characteristics meet their expectations, contradict them, or miss out on certain aspects. In the section about the deductive design iterations, we have already described how the focus groups were conducted. We ensured that the experts possess extensive expertise relevant to HMDC designs, either from a development (ie, technology or legal) or a user perspective (ie, data consumer or provider; see Multimedia Appendix 3 for the list of focus group members and their distribution to sessions).
Study Planning
Use case 1 is as follows: a data consumer wants to identify suitable datasets for a planned study.
The HMDC must enable data consumers to effectively identify datasets for medical research studies [Oliveira JL, Trifan A, Bastião Silva LA. EMIF catalogue: a collaborative platform for sharing and reusing biomedical data. Int J Med Inform. 2019;126:35-45. [CrossRef] [Medline]15] by implementing the following process: First, a data consumer who wants to access the HMDC, needs to be authorized as a qualified user (D6; #I [Cancer image Europe. EUCAIM Catalogue. URL: https://catalogue.eucaim.cancerimage.eu/#/ [accessed 2024-10-25] 50]). Second, this authorized user must be able to browse and filter published metadata assets to discover relevant datasets that fulfill specifications of an intended study (#V and VII [EMIF catalogue. EMIF. URL: https://www.emif.eu/emif-catalogue/ [accessed 2024-10-19] 49,Cancer image Europe. EUCAIM Catalogue. URL: https://catalogue.eucaim.cancerimage.eu/#/ [accessed 2024-10-25] 50]). For example, detailed data type [Pereira A, Almeida JR, Lopes RP, Oliveira JL. Querying semantic catalogues of biomedical databases. J Biomed Inform. 2023;137:104272. [FREE Full text] [CrossRef] [Medline]63] or population information of the metadata assets should be disclosed to enable verifying the relevance of datasets (D3 and D4; #I, II, V, and VII [Bergeron J, Doiron D, Marcon Y, Ferretti V, Fortier I. Fostering population-based cohort data discovery: the Maelstrom Research cataloguing toolkit. PLoS One. 2018;13(7):e0200926. [FREE Full text] [CrossRef] [Medline]17,Maelstrom catalogue. Maelstrom Research. URL: https://www.maelstrom-research.org/page/catalogue [accessed 2024-11-26] 58]). Third, the HMDC must allow to check managerial details concerning information about the data holder, origin, collection, qualification, and financials (D2 [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. 
URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43,Cancer image Europe. EUCAIM Catalogue. URL: https://catalogue.eucaim.cancerimage.eu/#/ [accessed 2024-10-25] 50]), including the eligibility to receive synthetic, anonymized, pseudonymized, or personal data (D5; #IV [Oliveira JL, Trifan A, Bastião Silva LA. EMIF catalogue: a collaborative platform for sharing and reusing biomedical data. Int J Med Inform. 2019;126:35-45. [CrossRef] [Medline]15]). Subsequently, the HMDC must support the data consumer in performing a preliminary assessment of datasets regarding their relevance for the planned study (#III and VII). At this stage, it should be possible to establish a first list of candidates. Ideally, the data consumer can access links (ie, cross-references) within the metadata assets to identify former studies that were performed with the same dataset, addressing similar research questions (C17.4; #II and VII [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43,Bohensky MA, Jolley D, Sundararajan V, Evans S, Pilcher DV, Scott I, et al. Data linkage: a powerful research tool with potential problems. BMC Health Serv Res. 2010;10(1):346-354. [FREE Full text] [CrossRef] [Medline]76]). Such studies are typically accessed outside the contextual boundaries of HMDCs (#IV). Finally, depending on the governance modalities of selected datasets (D7), the HMDC must enable the data consumer to request data access (#I-VII [Almeida JR, Oliveira JL. MONTRA2: a web platform for profiling distributed databases in the health domain. Inform Med Unlocked. 2024;45:101447. [CrossRef]30]). 
Therefore, the HMDC needs to submit an official data order on behalf of the data consumer, containing specifications about the planned study and required documents, for example, protocols and ethical assessments (#I-III). With respect to the datasets’ accessibility constraints (D7) and their associated access control (D8) characteristics, the HMDC must forward data orders to the data holders (C8.2 [Shabani M, Borry P. "You want the right amount of oversight": interviews with data access committee members and experts on genomic data access. Genet Med. 2016;18(9):892-897. [FREE Full text] [CrossRef] [Medline]66]), forward them to external third parties (C8.3 [Lauf F, Scheider S, Friese J, Kilz S, Radic M, Burmann A. Exploring design characteristics of data trustees in healthcare - taxonomy and archetypes. In: Proceedings of the 31st European Conference on Information Systems. 2023. Presented at: ECIS '23; June 11-16, 2023:33-57; Kristiansand, Norway. URL: https://www.researchgate.net/publication/370060215_Exploring_Design_Characteristics_of_Data_Trustees_in_Healthcare_-_Taxonomy_and_Archetypes23]), or decide itself whether to permit or deny requests (C8.1 and C8.4; #I, III, and VI).
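The discovery and request steps of use case 1 can be sketched in a few lines of code. The sketch below is a simplified illustration under stated assumptions and does not describe any of the analyzed catalogs: the `MetadataAsset` structure, the function names, the string labels for the recipients, and the reduction of population information to a single label are hypothetical. Only the mapping of access control characteristics to recipients follows the text (C8.2 to data holders, C8.3 to external third parties, and C8.1 and C8.4 to a decision by the HMDC itself).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetadataAsset:
    """Heavily simplified metadata asset (hypothetical field names)."""
    dataset_id: str
    data_types: List[str]   # D3, eg, ["imaging", "claims"]
    population: str         # D4, reduced to one label for illustration
    access_control: str     # D8 characteristic code, "C8.1" to "C8.4"


def discover(assets: List[MetadataAsset], data_type: str, population: str) -> List[MetadataAsset]:
    """Filter published assets by data type and population (browse and filter step)."""
    return [
        asset for asset in assets
        if data_type in asset.data_types and asset.population == population
    ]


def route_data_order(asset: MetadataAsset) -> str:
    """Return who decides on a data order, based on the asset's access control code."""
    if asset.access_control == "C8.2":
        return "data_holder"
    if asset.access_control == "C8.3":
        return "external_third_party"
    if asset.access_control in ("C8.1", "C8.4"):
        return "hmdc"
    raise ValueError(f"Unknown access control characteristic: {asset.access_control}")


# Illustrative catalog content with made-up identifiers.
catalog = [
    MetadataAsset("ds-001", ["imaging"], "adults", "C8.2"),
    MetadataAsset("ds-002", ["claims"], "adults", "C8.1"),
    MetadataAsset("ds-003", ["imaging", "claims"], "children", "C8.3"),
]

candidates = discover(catalog, data_type="imaging", population="adults")
for asset in candidates:
    print(asset.dataset_id, "->", route_data_order(asset))
```

User authorization (D6) and the content of the data order itself (study specification, protocols, ethical assessments) are deliberately left out of the sketch.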
Study Assessment
Use case 2 is as follows: a dataset is mentioned in a conducted study. The data consumer wants to evaluate this study based on the suitability of the datasets used therein.
Given that the datasets used in a conducted study are available in the HMDC, data consumers must be enabled to verify, in retrospective, the suitability of these datasets (#I, II, and VII). To support such evaluations, the HMDC must provide different parts of the metadata asset, depending on the nature of the conducted study (#II and VII). For example, to assess the representativeness of the study population, the data consumer needs to examine qualitative metadata attributes (#II and VII), such as population information (C4.1-C4.5 [Bergeron J, Doiron D, Marcon Y, Ferretti V, Fortier I. Fostering population-based cohort data discovery: the Maelstrom Research cataloguing toolkit. PLoS One. 2018;13(7):e0200926. [FREE Full text] [CrossRef] [Medline]17]); data type (D3 [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43,Pereira A, Almeida JR, Lopes RP, Oliveira JL. Querying semantic catalogues of biomedical databases. J Biomed Inform. 2023;137:104272. [FREE Full text] [CrossRef] [Medline]63]); collection methodology (D15 [Hentschel J. Contextuality and data collection methods: a framework and application to health service utilisation. J Dev Stud. 1999;35(4):64-94. [CrossRef]74]); and collection events (D16 [Bergeron J, Doiron D, Marcon Y, Ferretti V, Fortier I. Fostering population-based cohort data discovery: the Maelstrom Research cataloguing toolkit. PLoS One. 2018;13(7):e0200926. [FREE Full text] [CrossRef] [Medline]17,Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]). 
In addition, quantitative metadata such as the percentage of the population covered in the catchment area (C4.6) should be disclosed by the HMDC assets [Bergeron J, Doiron D, Marcon Y, Ferretti V, Fortier I. Fostering population-based cohort data discovery: the Maelstrom Research cataloguing toolkit. PLoS One. 2018;13(7):e0200926. [FREE Full text] [CrossRef] [Medline]17,Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]. Furthermore, the data consumer might want to explore technical details to evaluate a study and its database, respectively (#I, II, and VII). Examples are the vocabularies used to define variables (D14 [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43,Pereira A, Almeida JR, Lopes RP, Oliveira JL. Querying semantic catalogues of biomedical databases. J Biomed Inform. 2023;137:104272. [FREE Full text] [CrossRef] [Medline]63]), the CDM according to which the used datasets are structured (D12 [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]), the ETL status (D13 [List of metadata for real world data catalogues. European Medicines Agency. 2022. 
URL: https://www.ema.europa.eu/en/documents/other/list-metadata-real-world-data-catalogues_en.pdf [accessed 2024-04-29] 44]) and, if applied, any data linkage strategies (D17 [Bohensky MA, Jolley D, Sundararajan V, Evans S, Pilcher DV, Scott I, et al. Data linkage: a powerful research tool with potential problems. BMC Health Serv Res. 2010;10(1):346-354. [FREE Full text] [CrossRef] [Medline]76]). Moreover, cross-references should be listed in the metadata assets of the HMDC (C17.4) so that data consumers can identify other studies in which the same dataset was leveraged and draw on their lessons learned (#VII [Bohensky MA, Jolley D, Sundararajan V, Evans S, Pilcher DV, Scott I, et al. Data linkage: a powerful research tool with potential problems. BMC Health Serv Res. 2010;10(1):346-354. [FREE Full text] [CrossRef] [Medline]76]). In doing so, the HMDC helps data consumers identify strengths and limitations of datasets used in conducted studies.
Study Creation and Data Benchmarking
Use case 3 is as follows: a data consumer writes a study protocol that requires describing the underlying datasets and comparing their characteristics.
An HMDC must enable data consumers to easily access standardized metadata information about datasets that need to be specified and compared in a study protocol to be written (#I-VII). For HMDCs, this requires making attribute values of metadata assets directly and easily retrievable for data consumers to facilitate an efficient description and comparison of datasets (particularly, D1-D5; #VII). As such, when writing a study protocol, the data consumer can simply provide links to the metadata assets available in the HMDC, alongside with all kinds of other information that is interesting in the protocol’s context (#III and VII [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]). Providing such links is also beneficial because study readers could, in addition to basic metadata information (D1-D5), be interested in collection methodologies (D15 [Hentschel J. Contextuality and data collection methods: a framework and application to health service utilisation. J Dev Stud. 1999;35(4):64-94. [CrossRef]74]) and events (D16 [Bergeron J, Doiron D, Marcon Y, Ferretti V, Fortier I. Fostering population-based cohort data discovery: the Maelstrom Research cataloguing toolkit. PLoS One. 2018;13(7):e0200926. [FREE Full text] [CrossRef] [Medline]17]), data preservation (D18 [Rasmussen KB, Blank G. The data documentation initiative: a preservation standard for research. Arch Sci. 2007;7(1):55-71. [CrossRef]77]), consent requirements (D20 [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43,Angrist M. 
Eyes wide open: the personal genome project, citizen science and veracity in informed consent. Per Med. 2009;6(6):691-699. [FREE Full text] [CrossRef] [Medline]78]) as well as technical (D9-D14) and compliance specifics (D6-D8). Thus, HMDCs strengthen transparency and reproducibility of studies by facilitating an effective creation of study protocols (#III and V). At the same time, they support data benchmarking by providing detailed standardized metadata, potentially reaching beyond published datasets (#I-VII).
Data Analysis
Use case 4 is as follows: a data consumer wants to benefit from the experience of others for the creation of a study’s programming script or statistical analysis.
If a study relies on a CDM, the HMDC should enable the data consumer to identify, for published datasets, the ETL procedure (D13) from the dataset to the CDM (D12) [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]. Irrespective of whether data holders have converted their entire datasets, or only an extraction thereof, this information can support the development of the study script (#VII [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]). If the HMDC publishes cross-references of datasets (C17.4), the data consumer is also facilitated in finding further studies having investigated the same topic or used a comparable research design (#I, II, and VII [Bohensky MA, Jolley D, Sundararajan V, Evans S, Pilcher DV, Scott I, et al. Data linkage: a powerful research tool with potential problems. BMC Health Serv Res. 2010;10(1):346-354. [FREE Full text] [CrossRef] [Medline]76]). These studies may disclose information on how to operationalize data variables, which offers the data consumer additional support in the development of a programming script (#I and VII). After data analysis, the HMDC may require the data consumer to record the developed script in a public repository and provide a link to the study protocol (see use case 3). This allows the HMDC to cross-reference it in the metadata assets of the datasets used (#V [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. 
URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]). In doing so, transparency, reproducibility, and quality of studies are supported. Importantly, before publishing any study results, the HMDC should enable the data consumer to check whether approvals of data holders, of whom datasets were obtained, are required (D19; #IV and V [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]).
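The two closing steps of use case 4, cross-referencing a published analysis script and checking publishing constraints, might look as follows in code. This is a minimal sketch under assumptions: the `DatasetAsset` structure, both function names, and the example URL are hypothetical and not taken from the EMA guidance or from any HMDC.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class DatasetAsset:
    """Simplified metadata asset (hypothetical field names)."""
    dataset_id: str
    approval_required: bool                                     # D19: True = C19.1, False = C19.2
    cross_references: List[str] = field(default_factory=list)   # C17.4


def register_script(assets: Iterable[DatasetAsset], used_ids: Set[str], script_url: str) -> None:
    """Add the script link to the cross-references of every dataset used in the study."""
    for asset in assets:
        if asset.dataset_id in used_ids and script_url not in asset.cross_references:
            asset.cross_references.append(script_url)


def pending_approvals(assets: Iterable[DatasetAsset], used_ids: Set[str], granted: Set[str]) -> List[str]:
    """List the used datasets whose holders must still approve publication."""
    return sorted(
        asset.dataset_id
        for asset in assets
        if asset.dataset_id in used_ids
        and asset.approval_required
        and asset.dataset_id not in granted
    )


# Illustrative data with made-up identifiers and a placeholder URL.
assets = [
    DatasetAsset("ds-001", approval_required=True),
    DatasetAsset("ds-002", approval_required=False),
    DatasetAsset("ds-003", approval_required=True),
]
used = {"ds-001", "ds-002"}

register_script(assets, used, "https://example.org/scripts/study-42")
print(pending_approvals(assets, used, granted=set()))
```

An empty result from `pending_approvals` would indicate that, with respect to D19, nothing in the metadata stands in the way of publishing the study results.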
Discussion
Principal Findings
The taxonomy detailed in this work provides 20 dimensions and 101 characteristics to develop FAIR HMDCs, representing initial, yet actionable design knowledge to answer RQ1. Comprehensive description is achieved when amplifying the taxonomy in the light of the cocreated use cases that entail real-world requirements (RQ2). The generated design knowledge provides added value because HMDCs facilitate effective and efficient use of RWD for generating RWE. Thereby, the obtained results accentuate the integration of FAIR principles into HMDCs to ensure “findability, accessibility, interoperability, and reusability” of the RWD circulating within the underlying health care data ecosystem [Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):1-9. [FREE Full text] [CrossRef] [Medline]40]. The design knowledge can be classified into scientific and managerial contributions, as outlined in the following sections.
Interpretation of Findings
The taxonomy’s scientific contributions extend previous work on data catalogs, paying particular attention to, first, their implementation as decentralized components within data ecosystems and, second, their application in health care contexts. Consequently, some of the design elements conceptualized in this study draw from prior research about metadata catalogs, as shown in the section Research gap, objective, and questions. Concurrently, they carry forward the development of HMDCs with respect to European initiatives pushing health care data ecosystems forward, for example, IDERHA, Elixir, HealthData@EU, EUCAIM, TEHDAS, or EHDEN (see Multimedia Appendix 1 for the leading European initiatives toward health care data ecosystems).
On the one hand, the conceptualized dimensions and characteristics of the taxonomy describe and classify attributes of “HMDC metadata assets.” First, they relate to attributes associated with data findability, such as data source, managerial details, the type of data records contained, data sensitivity, and population information. Second, they state crucial information pertaining to data reusability. This encompasses collection methodologies, data linkage, and preservation as well as consent and possible result publishing constraints. Third, the taxonomy emphasizes the need to specify data interoperability attributes in metadata assets. Among others, data standardization according to a CDM, the prescription of vocabularies, and the technological implementation of programmatic discoverability are important. On the other hand, the taxonomy contains design elements referring to more general data accessibility constraints pertaining to the “overall HMDC design.” They accentuate basic security and governance considerations regarding dataset and catalog accessibility as well as the access control framework. Conclusively, from a scientific viewpoint, the artifact provides fundamental design knowledge [vom Brocke J, Winter R, Hevner A, Maedche A. Special issue editorial –accumulation and evolution of design knowledge in design science research: a journey through time and space. J Assoc Inf Syst. 2020;21(3):520-544. [CrossRef]31] that unfolds broad implications and a solid starting point for future research.
Regarding managerial contributions, the taxonomy enables health care practitioners (see Introduction section for target audience) to navigate more effectively in the largely unexplored field of HMDCs, particularly focusing on their application in health care data ecosystems across Europe. It helps both researchers and practitioners to anchor and communicate their work [Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41]. The taxonomy also represents a support tool for developing HMDCs, where the illustrative scenarios assume an accentuated role. They showcase how the design elements are reflected in real-world use cases [Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29] 43]. In essence, these use cases amplify that the taxonomy supports common activities for planning, assessing, and conducting medical research studies as well as benchmarking and analyzing the underlying RWD.
Subsequently, the contributions of the study are discussed in the light of 3 major issues. These are (1) the exclusiveness of taxonomy characteristics; (2) the difference between HMDCs and centralized data catalogs; and (3) the absence of data quality and data-sharing incentives as explicit taxonomy dimensions.
Depending on the meta-dimension, the taxonomy contains nonexclusive characteristics, which can be combined, to facilitate the design of metadata assets, that is, data findability, interoperability, and reusability. In contrast, the taxonomy has mutually exclusive characteristics to classify and distinguish HMDC designs with respect to their data security and governance approaches, that is, data accessibility. This mixture of exclusive and nonexclusive dimensions can foster the understanding of health care practitioners, while allowing for an easy alteration of the taxonomy [Nickerson RC, Varshney U, Muntermann J. A method for taxonomy development and its application in information systems. Eur J Inf Syst. 2017;22(3):336-359. [CrossRef]37]. Such adaptability is vital because HMDCs represent a rapidly evolving and changing field, where new solutions emerge and vanish constantly.
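The distinction between exclusive and nonexclusive dimensions can be expressed as a simple validation rule. The sketch below assumes that D6 to D8 are the dimensions of the data accessibility meta-dimension, which is how they are used in the use cases but is not spelled out in this section; the function name and the dictionary representation of a design are likewise hypothetical.

```python
from typing import Dict, List, Set

# Assumption: D6 to D8 form the data accessibility meta-dimension,
# whose characteristics are mutually exclusive.
EXCLUSIVE_DIMENSIONS: Set[str] = {"D6", "D7", "D8"}


def exclusivity_violations(design: Dict[str, Set[str]]) -> List[str]:
    """Return the exclusive dimensions that do not hold exactly one characteristic."""
    return sorted(
        dimension
        for dimension, characteristics in design.items()
        if dimension in EXCLUSIVE_DIMENSIONS and len(characteristics) != 1
    )


# Illustrative classification of one HMDC design.
design = {
    "D6": {"C6.1"},
    "D8": {"C8.1", "C8.2"},      # two characteristics in an exclusive dimension
    "D16": {"C16.1", "C16.3"},   # nonexclusive dimension, combination allowed
}
print(exclusivity_violations(design))
```

Nonexclusive dimensions such as D16 are not checked, because any combination of their characteristics is admissible when describing a metadata asset.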
Furthermore, even though the objective of the taxonomy is not to differentiate between centralized and decentralized data catalogs, as distinguished in the Theoretical Background section, it pinpoints their fundamental design commonalities and differences. The meta-dimensions concerning the design of the metadata assets are conceivable for both approaches in health care contexts (ie, findability, interoperability, and reusability). The reason is that, despite datasets being stored centrally within intraorganizational data catalogs [Ehrlinger L, Schrott J, Melichar M, Kirchmayr N, Wöß W. Data catalogs: a systematic literature review and guidelines to implementation. In: Proceedings of the 2021 International Conference on Database and Expert Systems Applications. 2021. Presented at: DEXA '21; September 27-30, 2021; Virtual Event. URL: https://link.springer.com/chapter/10.1007/978-3-030-87101-7_15 [CrossRef]26], meaningful metadata need to be disclosed to their data users by means of the catalog offerings. Naturally, the same holds for decentralized catalogs. However, the taxonomy also shows design differences with respect to its meta-dimension data accessibility. While decentralized catalogs can have various combinations of characteristics in the associated dimensions, centralized health data catalogs exhibit one specific pattern of characteristics. They usually are exclusively private systems (C6.1) as their functionalities are only accessible to members of the operating organization. Similarly, dataset access is strictly limited to this specific group of predefined users (C7.3). Finally, access control lies solely with the organization operating the centralized catalog (C8.1).
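The specific pattern of centralized catalogs described above can be written down as a small check. This is a sketch for illustration only: the constant, the function name, and the dictionary representation of a design are assumptions, whereas the three characteristic codes (C6.1, C7.3, and C8.1) are taken from the text.

```python
from typing import Dict

# Pattern of centralized health data catalogs as described in the text:
# private system (C6.1), dataset access limited to predefined users (C7.3),
# and access control by the operating organization (C8.1).
CENTRALIZED_PATTERN: Dict[str, str] = {"D6": "C6.1", "D7": "C7.3", "D8": "C8.1"}


def matches_centralized_pattern(design: Dict[str, str]) -> bool:
    """Check whether a design shows the accessibility pattern of centralized catalogs."""
    return all(
        design.get(dimension) == characteristic
        for dimension, characteristic in CENTRALIZED_PATTERN.items()
    )


# Illustrative designs with one characteristic per accessibility dimension.
intraorganizational = {"D6": "C6.1", "D7": "C7.3", "D8": "C8.1"}
ecosystem_catalog = {"D6": "C6.1", "D7": "C7.3", "D8": "C8.2"}

print(matches_centralized_pattern(intraorganizational))
print(matches_centralized_pattern(ecosystem_catalog))
```

A decentralized catalog may coincide with the pattern in some dimensions, as the second design shows, but differs in at least one of them.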
As a last discussion point, the taxonomy contains neither dimensions associated with data quality nor incentives for data sharing. The reason is that these concepts, although important, are broad, multifaceted, and hardly explored, making a systematic categorization difficult. Generally, data quality involves subjective and context-dependent assessments [Wang RY, Strong DM. Beyond accuracy: what data quality means to data consumers. J Manag Inf Syst. 2015;12(4):5-33. [CrossRef]92], while incentives to share data are influenced by external, sociopolitical, and institutional factors [Oliveira M, Barros Lima GD, Farias Lóscio B. Investigations into data ecosystems: a systematic mapping study. Knowl Inf Syst. 2019;61(2):589-630. [CrossRef]28]. Typically, HMDCs do not disclose any data quality metrics as those can barely be quantified and are subject to applied data types, formats, and standards [Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41]. Rather, HMDCs publish test samples consisting of synthetic or fully anonymized data that do not justify a dimension in the taxonomy. Similarly, incentivizing data sharing, for example, via price tags for datasets or mandatory citations, represents an unsolved problem [Spiekermann S, Acquisti A, Böhme R, Hui KL. The challenges of personal data markets and privacy. Electron Markets. 2015;25(2):161-167. [CrossRef]93]. To circumnavigate this issue, HMDCs commonly rely on membership fees and public funding. The former restricts data access to members of the HMDC operating organization. The latter compensates data providers through public funds, usually applied in preliminary stages. Naturally, other business models exist. However, both data quality indicators and data-sharing incentives represent underdeveloped fields requiring future research beyond the study’s design perspective [Hevner AR, March ST, Park J, Ram S. 
Design science in information systems research. MIS Q. 2004;28(1):75-105. [CrossRef]33]. Consequently, these concepts are not a part of the taxonomy, because they cannot be defined in universally applicable dimensions. Arguably, their inclusion would have overcomplicated the taxonomy and undermined its focus on actionable HMDC design knowledge.
Limitations
The taxonomy is mainly subject to the following limitations. In the inductive iterations, results were derived from a potentially limited number of publications because of the emphasis on 4 main databases. Similarly, in the deductive iterations, the examined analysis objects might merely cover a snapshot of what was available at the time (ie, many analysis objects were still in progress), become outdated quickly, and not be conclusive. As for the SLR, the conclusiveness of analysis objects is particularly questionable because of, first, the focus on European ecosystem initiatives and, second, the possible neglect of many centralized health data catalogs. In the evaluative iterations, the experts were limited in number and might not have captured the full range of relevant perspectives on HMDCs. Furthermore, the research design has inherent limitations. As is common in qualitative research, taxonomy building requires substantial generalizations and simplifications of intricate and interdisciplinary content [Saldana J. The Coding Manual for Qualitative Researchers. Thousand Oaks, CA. SAGE Publications; 2021. 83]. Although countermeasures were taken (see Methods section), these factors imply that interpretative biases are inevitably incorporated into the results [Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16. [CrossRef]41], for example, when extracting design elements from public data. Moreover, as shown in the Theoretical Background section, new HMDCs must be expected to arise constantly, while others are likely to disappear with a high frequency. Hence, the taxonomy must be updated swiftly. To conclude, the taxonomy provides first actionable design knowledge about HMDCs but requires continuous triangulation of design elements by future research.
Conclusions
Despite these limitations, the scientific and managerial contributions of this study have broad implications, which are formulated here as recommendations for future research. Generally, HMDCs should be increasingly investigated in practice, for example, through more in-depth case studies. On the one hand, it is of utmost importance to keep track of the rapidly evolving HMDC-related initiatives in Europe. Their conceptual and technical advancements should be continuously analyzed and evaluated against the taxonomy's design elements to determine whether dimensions and characteristics need to be modified. On the other hand, by incorporating worldwide efforts toward health care data ecosystems and HMDCs, the scope of the taxonomy can be expanded and design knowledge beyond European jurisdictions can be created. In this regard, it is important to mention that the generated design knowledge about European HMDCs already has such global implications. The FAIR dimensions of the taxonomy capture fundamental characteristics of FAIR health data, making the taxonomy universally relevant. In other words, the taxonomy conveys generally conceivable options for using catalog functionalities and the underlying metadata assets. In addition, it outlines how to design those health metadata assets meaningfully. Therefore, despite its European focus, the taxonomy addresses global challenges with regard to health data sharing and metadata catalog design, underlining its broad implications. Nevertheless, further research is essential because HMDCs represent the fulcrum for allocating, exchanging, and using RWD to effectively generate RWE in emerging health care data ecosystems.
Acknowledgments
This research has emerged from the IDERHA (Integration of Heterogeneous Data and Evidence towards Regulatory and Health Technology Assessment Acceptance) project funded by the Innovative Health Initiative of the European Union (GAP-101112135) and the Digital Life Journey project funded by Fraunhofer-Gesellschaft (PN: 051-600000).
Data Availability
Data sharing is not applicable to this paper as no datasets were generated or analyzed during this study.
Authors' Contributions
SS conceptualized the study, conducted the investigation, developed the methodology, managed the project, supervised its execution, drafted the original manuscript, and contributed to reviewing and editing the manuscript. MKM contributed to the conceptualization and investigation, validated the findings, and participated in the review and editing of the manuscript.
Conflicts of Interest
None declared.
Multimedia Appendix 1
Leading European initiatives toward health care data ecosystems.
DOCX File, 30 KB
Multimedia Appendix 2
Intermediary results of the iterative research process and key references.
DOCX File, 70 KB
References
- Agarwal R, Gao GG, DesRoches C, Jha AK. The digital transformation of healthcare: current status and the road ahead. Inf Syst Res. 2010;21(4):796-809.
- Singh G, Schulthess D, Hughes N, Vannieuwenhuyse B, Kalra D. Real world big data for clinical research and drug development. Drug Discov Today. 2018;23(3):652-660.
- Remy L, Ivanović D, Theodoridou M, Kritsotaki A, Martin P, Bailo D, et al. Building an integrated enhanced virtual research environment metadata catalogue. Electron Libr. 2019;37(6):929-951.
- Feldman K, Johnson RA, Chawla NV. The state of data in healthcare: path towards standardization. J Healthc Inform Res. 2018;2(3):248-271.
- Swertz M, van Enckevort E, Oliveira JL, Fortier I, Bergeron J, Thurin NH, et al. Towards an interoperable ecosystem of research cohort and real-world data catalogues enabling multi-center studies. Yearb Med Inform. 2022;31(1):262-272.
- Bietz MJ, Bloss CS, Calvert S, Godino JG, Gregory J, Claffey MP, et al. Opportunities and challenges in the use of personal health data for health research. J Am Med Inform Assoc. 2016;23(e1):42-48.
- Kush RD, Warzel D, Kush MA, Sherman A, Navarro EA, Fitzmartin R, et al. FAIR data sharing: the roles of common data elements and harmonization. J Biomed Inform. 2020;107:103421.
- Wiertz S, Boldt J. Ethical, legal, and practical concerns surrounding the implemention of new forms of consent for health data research: qualitative interview study. J Med Internet Res. 2024;26:e52180.
- Molnár-Gábor F, Beauvais MJ, Bernier A, Jimenez MP, Recuero M, Knoppers BM. Bridging the European data sharing divide in genomic science. J Med Internet Res. 2022;24(10):e37236.
- Lovestone S, EMIF Consortium. The European medical information framework: a novel ecosystem for sharing healthcare data across Europe. Learn Health Syst. 2020;4(2):e10214.
- Manogaran G, Varatharajan R, Lopez D, Kumar PM, Sundarasekar R, Thota C. A new architecture of internet of things and big data ecosystem for secured smart healthcare monitoring and alerting system. Future Gener Comput Syst. 2018;82:375-387.
- Sharon T, Lucivero F. Introduction to the special theme: the expansion of the health data ecosystem – rethinking data ethics and governance. Big Data Soc. 2019;6(2):205395171985296.
- Witte AK. A review on digital healthcare ecosystem structure: identifying elements and characteristics. In: Proceedings of the 23rd Pacific Asia Conference on Information System. 2020. Presented at: PACIS '20; July 9-12, 2020:226-238; Dubai, UAE. URL: https://aisel.aisnet.org/pacis2020/228
- Scheider S, Lauf F, Möller F, Otto B. A reference system architecture with data sovereignty for human-centric data ecosystems. Bus Inf Syst Eng. 2023;65(5):577-595.
- Oliveira JL, Trifan A, Bastião Silva LA. EMIF catalogue: a collaborative platform for sharing and reusing biomedical data. Int J Med Inform. 2019;126:35-45.
- EHDS semantic interoperability framework 2022. TEHDAS. URL: https://tehdas.eu/app/uploads/2023/10/tehdas-recommendations-to-enhance-interoperability.pdf [accessed 2024-04-29]
- Bergeron J, Doiron D, Marcon Y, Ferretti V, Fortier I. Fostering population-based cohort data discovery: the Maelstrom Research cataloguing toolkit. PLoS One. 2018;13(7):e0200926.
- Ulrich H, Kock-Schoppenhauer A, Deppenwiese N, Gött R, Kern J, Lablans M, et al. Understanding the nature of metadata: systematic review. J Med Internet Res. 2022;24(1):e25440.
- Almeida JR, Silva JM, Oliveira JL. A FAIR approach to real-world health data management and analysis. In: Proceedings of the 36th International Symposium on Computer-Based Medical Systems. 2023. Presented at: CBMS '23; June 22-24, 2023:892-897; L'Aquila, Italy. URL: https://ieeexplore.ieee.org/document/10178764
- Derycke P, Kesisoglou I, Korsgaard T, Aage Huru A, Catsyne CA. Report on the landscape analysis of available metadata catalogues and the metadata standards in use. HaDEA & European Union. URL: https://ehds2pilot.eu/wp-content/uploads/2024/04/HealthData@EU-Pilot_MS6.1_FIN.pdf [accessed 2024-04-29]
- Peng Y, Bathelt F, Gebler R, Gött R, Heidenreich A, Henke E, et al. Use of metadata-driven approaches for data harmonization in the medical domain: scoping review. JMIR Med Inform. 2024;12:e52967.
- Lueschen G, van der Zee J. Health Systems in the European Union: Diversity, Convergence, and Integration: A Sociological and Comparative Analysis in Belgium, France, Germany, ... and Spain. München, Germany. Walter de Gruyter; 2016.
- Lauf F, Scheider S, Friese J, Kilz S, Radic M, Burmann A. Exploring design characteristics of data trustees in healthcare - taxonomy and archetypes. In: Proceedings of the 31st European Conference on Information Systems. 2023. Presented at: ECIS '23; June 11-16, 2023:33-57; Kristiansand, Norway. URL: https://www.researchgate.net/publication/370060215_Exploring_Design_Characteristics_of_Data_Trustees_in_Healthcare_-_Taxonomy_and_Archetypes
- Labadie C, Legner C, Eurich M, Fadler M. FAIR enough? Enhancing the usage of enterprise data with data catalogs. In: Proceedings of the 2020 IEEE 22nd Conference on Business Informatics. 2020. Presented at: CBI '20; June 22-24, 2020; Antwerp, Belgium. URL: https://ieeexplore.ieee.org/document/9140254
- Jahnke N, Otto B. Data catalogs in the enterprise: applications and integration. Datenbank Spektrum. Jun 21, 2023;23(2):89-96.
- Ehrlinger L, Schrott J, Melichar M, Kirchmayr N, Wöß W. Data catalogs: a systematic literature review and guidelines to implementation. In: Proceedings of the 2021 International Conference on Database and Expert Systems Applications. 2021. Presented at: DEXA '21; September 27-30, 2021; Virtual Event. URL: https://link.springer.com/chapter/10.1007/978-3-030-87101-7_15
- Gröger C. There is no AI without data. Commun ACM. Oct 25, 2021;64(11):98-108.
- Oliveira M, Barros Lima GD, Farias Lóscio B. Investigations into data ecosystems: a systematic mapping study. Knowl Inf Syst. 2019;61(2):589-630.
- Sohail SA, Bukhsh FA, van Keulen M. Multilevel privacy assurance evaluation of healthcare metadata. Appl Sci. 2021;11(22):10686.
- Almeida JR, Oliveira JL. MONTRA2: a web platform for profiling distributed databases in the health domain. Inform Med Unlocked. 2024;45:101447.
- vom Brocke J, Winter R, Hevner A, Maedche A. Special issue editorial – accumulation and evolution of design knowledge in design science research: a journey through time and space. J Assoc Inf Syst. 2020;21(3):520-544.
- van Aken JE. Valid knowledge for the professional design of large and complex design processes. Des Stud. 2005;26(4):379-404.
- Hevner AR, March ST, Park J, Ram S. Design science in information systems research. MIS Q. 2004;28(1):75-105.
- March ST, Storey VC. Design science in the information systems discipline: an introduction to the special issue on design science research. MIS Q. 2008;32(4):725-730.
- Cash PJ. Developing theory-driven design research. Des Stud. 2018;56:84-119.
- Kundisch D, Muntermann J, Oberländer AM, Rau D, Röglinger M, Schoormann T, et al. An update for taxonomy designers. Bus Inf Syst Eng. 2021;64(4):421-439.
- Nickerson RC, Varshney U, Muntermann J. A method for taxonomy development and its application in information systems. Eur J Inf Syst. 2017;22(3):336-359.
- Gregor S. The nature of theory in information systems. MIS Q. 2006;30(3):611-642.
- Szopinski D, Schoormann T, Kundisch D. Because your taxonomy is worth IT: towards a framework for taxonomy evaluation. In: Proceedings of the 27th European Conference on Information Systems. 2019. Presented at: ECIS '19; June 8-14, 2019:25-44; Stockholm-Uppsala, Sweden. URL: https://www.researchgate.net/publication/332711034_Because_your_taxonomy_is_worth_it_Towards_a_framework_for_taxonomy_evaluation
- Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):1-9.
- Scheider S, Lauf F, Geller S, Möller F, Otto B. Exploring design elements of personal data markets. Electron Markets. 2023;33(1):1-16.
- Alvarellos M, Sheppard HE, Knarston I, Davison C, Raine N, Seeger T, et al. Democratizing clinical-genomic data: how federated platforms can promote benefits sharing in genomics. Front Genet. Jan 10, 2022;13:1045450.
- Good practice guide for the use of the metadata catalogue of real-world data sources. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/good-practice-guide-use-metadata-catalogue-real-world-data-sources_en.pdf [accessed 2024-04-29]
- List of metadata for real world data catalogues. European Medicines Agency. 2022. URL: https://www.ema.europa.eu/en/documents/other/list-metadata-real-world-data-catalogues_en.pdf [accessed 2024-04-29]
- Data directory. BBMRI-ERIC. URL: https://directory.bbmri-eric.eu/ERIC/directory/#/catalogue [accessed 2024-11-02]
- An interactive catalogue of mental health and wellbeing measures in British cohort and longitudinal studies. Catalogue of Mental Health Measures. URL: https://www.cataloguementalhealth.ac.uk/ [accessed 2024-10-27]
- Data catalogue for healthcare. Compendium. URL: https://compendiumdatacatalog.com/data-catalog/ [accessed 2024-10-28]
- Findable, standardised data at scale through the EHDEN database catalogue. EHDEN Portal. URL: https://www.ehden.eu/ehden-portal/ [accessed 2024-10-22]
- EMIF catalogue. EMIF. URL: https://www.emif.eu/emif-catalogue/ [accessed 2024-10-19]
- Cancer image Europe. EUCAIM Catalogue. URL: https://catalogue.eucaim.cancerimage.eu/#/ [accessed 2024-10-25]
- Welcome to the one-stop shop that facilitates access to population health and health care data, information and expertise across Europe. European Health Information Portal. URL: https://www.healthinformationportal.eu/ [accessed 2024-10-18]
- Data catalogues for health. Fjelltopp. URL: https://www.fjelltopp.org/service/data-catalogues-for-health/ [accessed 2024-10-28]
- Data catalogues. Health RI. URL: https://catalogus.healthdata.nl/datasets [accessed 2024-10-29]
- Explore data sources. Helsedata. URL: https://helsedata.no/en/ [accessed 2024-10-29]
- Health data catalog. IQVIA. URL: https://www.iqvia.com/library/fact-sheets/iqvia-health-data-catalog [accessed 2024-09-25]
- Health pilot. Kraken. URL: https://www.krakenh2020.eu/pilots/health [accessed 2024-11-01]
- Precision medicine data catalog. Lifebit. URL: https://www.lifebit.ai/federated-data-catalogue [accessed 2024-11-02]
- Maelstrom catalogue. Maelstrom Research. URL: https://www.maelstrom-research.org/page/catalogue [accessed 2024-11-26]
- Trials data catalogue. Yoda. URL: https://yoda.yale.edu/trials-search/ [accessed 2024-10-19]
- Shi J, Zheng M, Yao L, Ge Y. DIR — a semantic information resource for healthcare datasets. In: Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine. 2017. Presented at: BIBM '17; November 13-16, 2017:805-810; Kansas City, MO. URL: https://ieeexplore.ieee.org/abstract/document/8217758
- McCoy D, Chand S, Sridhar D. Global health funding: how much, where it comes from and where it goes. Health Policy Plan. 2009;24(6):407-417.
- Yurkovich M, Avina-Zubieta JA, Thomas J, Gorenchtein M, Lacaille D. A systematic review identifies valid comorbidity indices derived from administrative health data. J Clin Epidemiol. 2015;68(1):3-14.
- Pereira A, Almeida JR, Lopes RP, Oliveira JL. Querying semantic catalogues of biomedical databases. J Biomed Inform. 2023;137:104272.
- Brunswick D. Data privacy, data protection, and the importance of integration for GDPR compliance. Isaca J. 2019;1:14-27.
- Yang K, Jia X, Ren K, Zhang B, Xie R. DAC-MACS: effective data access control for multiauthority cloud storage systems. In: Proceedings of the 2013 IEEE Conference on Computer Communications Workshops. 2013. Presented at: INFOCOM '13; April 14-19, 2013:2895-2903; Turin, Italy. URL: https://ieeexplore.ieee.org/document/6567100
- Shabani M, Borry P. "You want the right amount of oversight": interviews with data access committee members and experts on genomic data access. Genet Med. 2016;18(9):892-897.
- Munoz-Arcentales A, López-Pernas S, Pozo A, Alonso Á, Salvachúa J, Huecas G. An architecture for providing data usage and access control in data sharing ecosystems. Procedia Comput Sci. 2019;160:590-597.
- Ngouongo SM, Löbe M, Stausberg J. The ISO/IEC 11179 norm for metadata registries: does it cover healthcare standards in empirical research? J Biomed Inform. 2013;46(2):318-327.
- de Mello BH, Rigo SJ, da Costa CA, da Rosa Righi R, Donida B, Bez MR, et al. Semantic interoperability in health records standards: a systematic literature review. Health Technol (Berl). 2022;12(2):255-272.
- Iroju O, Soriyan A, Gambo I, Olaleke J. Interoperability in healthcare: benefits, challenges and resolutions. Int J Innov Appl Stud. 2013;3(1):262-270.
- Roehrs A, da Costa CA, da Rosa Righi R, Rigo SJ, Wichman MH. Toward a model for personal health record interoperability. IEEE J Biomed Health Inform. 2019;23(2):867-873.
- Kent S, Burn E, Dawoud D, Jonsson P, Østby JT, Hughes N, et al. Common problems, common data model solutions: evidence generation for health technology assessment. Pharmacoeconomics. 2021;39(3):275-285.
- Ivanović M, Budimac Z. An overview of ontologies and data resources in medical domains. Expert Syst Appl. 2014;41(11):5158-5166.
- Hentschel J. Contextuality and data collection methods: a framework and application to health service utilisation. J Dev Stud. 1999;35(4):64-94.
- Tijhuis M, Finger JD, Slobbe L, Sund R, Tolonen H. Data collection. In: Marieke V, van Oers H, editors. Population Health Monitoring: Climbing the Information Pyramid. Cham, Switzerland. Springer; 2019:59-81.
- Bohensky MA, Jolley D, Sundararajan V, Evans S, Pilcher DV, Scott I, et al. Data linkage: a powerful research tool with potential problems. BMC Health Serv Res. 2010;10(1):346-354.
- Rasmussen KB, Blank G. The data documentation initiative: a preservation standard for research. Arch Sci. 2007;7(1):55-71.
- Angrist M. Eyes wide open: the personal genome project, citizen science and veracity in informed consent. Per Med. 2009;6(6):691-699.
- Kitchenham B, Pearl Brereton O, Budgen D, Turner M, Bailey J, Linkman S. Systematic literature reviews in software engineering – a systematic literature review. Inf Softw Technol. 2009;51(1):7-15.
- Webster J, Watson RT. Analyzing the past to prepare for the future: writing a literature review. MIS Q. 2002;26(2):13-23.
- Mwita K. Factors influencing data saturation in qualitative studies. Int J Bus Soc. Jun 05, 2022;11(4):414-420.
- Proposal for a regulation - the European Health Data Space. Directorate-General for Health and Food Safety. 2022. URL: https://health.ec.europa.eu/publications/proposal-regulation-european-health-data-space_en#details [accessed 2024-11-07]
- Saldana J. The Coding Manual for Qualitative Researchers. Thousand Oaks, CA. SAGE Publications; 2021.
- Pratt MG. Fitting oval pegs into round holes. Organ Res Methods. 2007;11(3):481-509.
- Fruhwirth M, Rachinger M, Prlja E. Discovering business models of data marketplaces. In: Proceedings of the 53rd Hawaii International Conference on System Sciences. 2020. Presented at: HICSS '20; January 7-10, 2020:5738-5747; Maui, HI. URL: https://scholarspace.manoa.hawaii.edu/server/api/core/bitstreams/cf7bab54-478b-412a-8742-6b02f10dd7ca/content
- Database (part of Elixir infrastructure). Elixir BioSamples. URL: https://www.ebi.ac.uk/biosamples/ [accessed 2024-12-03]
- A registry of knowledgebases and repositories of data. Elixir FAIRsharing. URL: https://fairsharing.org/search?fairsharingRegistry=Database [accessed 2024-12-04]
- Global health data exchange. Institute for Health Metrics and Evaluation. URL: https://ghdx.healthdata.org/ [accessed 2024-10-29]
- MACH clinical and research dataset catalogue. Melbourne Academic Centre for Health. URL: https://figshare.unimelb.edu.au/MACH-catalogue?searchMode=1 [accessed 2024-11-24]
- Möller F, Stachon M, Azkan C, Schoormann T, Otto B. Designing business model taxonomies – synthesis and guidance from information systems research. Electron Mark. 2021;32(2):701-726.
- Pine KH. The qualculative dimension of healthcare data interoperability. Health Informatics J. 2019;25(3):536-548.
- Wang RY, Strong DM. Beyond accuracy: what data quality means to data consumers. J Manag Inf Syst. 2015;12(4):5-33.
- Spiekermann S, Acquisti A, Böhme R, Hui KL. The challenges of personal data markets and privacy. Electron Markets. 2015;25(2):161-167.
Abbreviations
CDM: common data model
DAC: Data Access Committee
E2C: empirical-to-conceptual
EHDEN: European Health Data and Evidence Network
EHDS: European Health Data Space
EHDS2: European Health Data Space 2
ETL: extract, transform, load
EU: European Union
EUCAIM: European Federation for Cancer Images
FAIR: findability, accessibility, interoperability, and reusability
HMDC: health metadata catalog
IDERHA: Integration of Heterogeneous Data and Evidence towards Regulatory and Health Technology Assessment Acceptance
IS: information systems
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
RQ: research question
RWD: real-world data
RWE: real-world evidence
SLR: structured literature review
TEHDAS: Towards European Health Data Space
Edited by A Mavragani; submitted 19.06.24; peer-reviewed by Y Arayici, M Gersch; comments to author 06.08.24; revised version received 20.08.24; accepted 12.12.24; published 18.02.25.
Copyright © Simon Scheider, Mostafa Kamal Mallick. Originally published in JMIR Formative Research (https://formative.jmir.org), 18.02.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.