Original Paper
Abstract
Background: Secondary investigations into digital health records, including electronic patient data from German medical data integration centers (DICs), pave the way for enhanced future patient care. However, only limited information is captured regarding the integrity, traceability, and quality of the (sensitive) data elements. This lack of detail diminishes trust in the validity of the collected data. From a technical standpoint, adhering to the widely accepted FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for data stewardship necessitates enriching data with provenance-related metadata. Provenance offers insight into whether a data element is ready for reuse and serves as a foundation for data governance.
Objective: The primary goal of this study is to augment the reusability of clinical routine data within a medical DIC for secondary utilization in clinical research. Our aim is to establish provenance traces that underpin the status of data integrity, reliability, and consequently, trust in electronic health records, thereby enhancing the accountability of the medical DIC. We present the implementation of a proof-of-concept provenance library integrating international standards as an initial step.
Methods: We adhered to a customized road map for a provenance framework and examined the data integration steps across the ETL (extract, transform, and load) phases. Following a maturity model, we derived requirements for a provenance library. Using this research approach, we formulated a provenance model with associated metadata and implemented a proof-of-concept provenance class. Furthermore, we seamlessly incorporated the internationally recognized World Wide Web Consortium (W3C) provenance standard, aligned the resultant provenance records with the interoperable health care standard Fast Healthcare Interoperability Resources, and presented them in various representation formats. Ultimately, we conducted a thorough assessment of provenance trace measurements.
Results: This study marks the inaugural implementation of integrated provenance traces at the data element level within a German medical DIC. We devised and executed a practical method that combines the robustness of quality- and health standard–guided (meta)data management practices. Our measurements indicate commendable pipeline execution times, attaining notable levels of accuracy and reliability in processing clinical routine data, thereby ensuring accountability in the medical DIC. These findings should inspire the development of additional tools aimed at providing evidence-based and reliable electronic health record services for secondary use.
Conclusions: The research method outlined for the proof-of-concept provenance class has been crafted to promote effective and reliable core data management practices. It aims to enhance biomedical data by imbuing it with meaningful provenance, thereby bolstering the benefits for both research and society. Additionally, it facilitates the streamlined reuse of biomedical data. As a result, the system mitigates risks, as data analysis without knowledge of the origin and quality of all data elements is rendered futile. While the approach was initially developed for the medical DIC use case, these principles can be universally applied throughout the scientific domain.
doi:10.2196/50027
Introduction
Provenance—a piece of metadata—is fundamental in the data life cycle because it expresses the traceability of the processed data and facilitates the reproducibility of results [
, ]. The availability of provenance throughout the data life cycle is deemed a crucial factor for maintaining trust in the data at all stages [ ]. The data life cycle encompasses data generation, processing, validation, analysis, reporting, and application for decision-making in any context, culminating in storage within a specified retention period [ ]. Medical data integration centers (DICs), particularly those established within the German Medical Informatics Initiative, must enhance accountability for their activities. This is particularly crucial for the methods used in extracting, transforming, and loading sensitive patient data from heterogeneous clinical routine systems into (standardized) research data repositories for subsequent secondary use [ ]. In this context, it is necessary to understand the limitations of the provided data [ ]. Collecting comprehensive and pertinent contextual provenance information along these processing pipelines is one approach to enhance the accountability of the medical DIC ( ). Provenance and integrity must be systematically evaluated and documented in routinely collected data sets to facilitate their reuse in clinical trials [ ]. Accountability means accepting responsibility for one's activities; in this context, it encompasses all procedures and processes in the data management pipelines [
]. This includes keeping the movement of data elements transparent and traceable. Provenance traces enable documentation of this movement and hence generate trust in the integrity and reliability of the data provided for secondary use. To achieve reproducibility [
] and integrity when exchanging data between academia and industry, researchers must adhere to essential research principles, particularly following good practice guidelines (eg, good clinical practice, good research/scientific practice, commonly referred to as GxP) [ ]. Ensuring and evaluating data integrity and data provenance are anticipated to be prerequisites for clinical trial data [ ]. For instance, the clinical research data quality standard ALCOA+ (Attributable, Legible, Contemporaneous, Original, and Accurate+) articulates enhanced data integrity properties and fundamentally contributes to provenance information [ ]. These properties pertain to attributable, legible, contemporaneous, original, accurate, complete, consistent, enduring, and available data characteristics [ ]. In addition to adhering to good scientific practice [
], heightened legal requirements such as compliance with the General Data Protection Regulation (GDPR) in the European Union, or contractual obligations, mandate evidence-based data processing for both deidentification and reidentification of data, encompassing the life cycle of the patient’s consent [ ]. A crucial factor in advancing these objectives is the metadata acquired from the data transformation and integration process throughout the data life cycle. The field of biological research has already acknowledged the significance of metadata, as outlined in ISO norms such as ISO/CD 20691 [
] and ISO/TC 276/WG5 on data processing and integration [ ]. ISO 20691, for example, specifies requirements for the consistent formatting and documentation of data and metadata. Furthermore, the FAIR (Findability, Accessibility, Interoperability, and Reusability) guiding principles for data management and data stewardship emphasize the overall relevance of metadata for the data itself, including those used in infrastructures and services [
]. Aspects of the FAIR recommendations explicitly address provenance capture. The “R1.2” FAIR principle demands machine-accessible and readable metadata, which include provenance information about data creation or generation. Related metadata accumulate not only during the data transformation itself but also within the software used [ ]. The principle “R1.3” expects metadata to adhere to domain-relevant community standards such as the HL7 Fast Healthcare Interoperability Resources (FHIR) or Dublin Core [ ]. FHIR is an internationally recognized standard that supports the exchange of data between different software systems within the health care sector [ ]. In this vein, the FHIR resource “Provenance” records the entities and processes involved in creating a specific resource. From a technical point of view, the FHIR Provenance resource is founded on the open W3C PROV standard (data model and ontology) [ ], the successor to the Open Provenance Model [ ]. Here, the concepts of linked entity, activity, and agent resources enable the establishment of a provenance model. Such resources can be described with the W3C Resource Description Framework (RDF) [ ]. RDF is a data model that is commonly stored in formats such as RDF/XML (.rdf) or JSON-LD (.json). Both serializations represent a knowledge graph. As of now, the capture of provenance in health care is not adequately or uniformly implemented in German medical DICs, as revealed in a recent study on their data management status [
]. The results demonstrated that provenance is indeed a factor strongly influenced by the maturity level of data management practices. Following complex transformations in the data integration process, the provenance of data elements is often lost, making it difficult or even impossible to assess the (measurement) quality of a data element. This reduction in traceability diminishes trust in the validity of the collected data. The primary objective of this study is to improve the reusability of clinical routine data within a medical DIC for its secondary application in clinical research. Our goal is to enhance processed clinical routine data by incorporating appropriate semantic metadata, a key requirement guided by the FAIR principles [
]. Furthermore, our intention is to bolster the accountability of our DIC by mitigating the risks associated with the reuse of compromised data in clinical research. To our knowledge, this is the first demonstration of provenance integration within a medical DIC.
Methods
Materials
We used test data to develop and test our provenance class. Test data elements were chosen to reflect the composition of a typical data integration repository. We created exemplary dummy data element definitions with comprehensive annotation (
). We defined 7 data element types and generated 100,000 data elements per type, yielding a total of 700,000 provenance records, using a Python (Python Software Foundation) script.
id=’syst_blood_pressure’,
name=’syst_blood_pressure’,
description=’Systolic Blood Pressure’,
source=’stg_sap_vitalis’,
source_variable=’SysBP’,
destination=’dwh_vitalis’,
destination_variable=’SBP’,
description_of_transformation=’copy’,
description_of_qualitycheck=’range check 80-160’,
status_log=’passed date 12.May2022’,
sop_name=’SOP p’,
sop_version=’v1.5’,
sop_status=’approved’,
steward_name=’no name given’)
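To illustrate how such dummy definitions could be generated in bulk, the following is a minimal sketch (not the published library): the container class and the 7 element type names are illustrative assumptions, chosen only to reproduce the scale described above (7 types x 100,000 elements = 700,000 records).

# Minimal sketch of a bulk generator for dummy data element definitions.
# The class and the element type names are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class DataElementDefinition:
    id: str
    name: str
    description: str
    source: str = 'stg_sap_vitalis'
    source_variable: str = 'SysBP'
    destination: str = 'dwh_vitalis'
    destination_variable: str = 'SBP'
    description_of_transformation: str = 'copy'
    description_of_qualitycheck: str = 'range check 80-160'
    status_log: str = 'passed date 12.May2022'
    sop_name: str = 'SOP p'
    sop_version: str = 'v1.5'
    sop_status: str = 'approved'
    steward_name: str = 'no name given'

# 7 illustrative element types; the actual types used in the study are not named in the text.
ELEMENT_TYPES = ['syst_blood_pressure', 'diast_blood_pressure', 'heart_rate',
                 'body_temperature', 'respiratory_rate', 'oxygen_saturation', 'body_weight']

def make_dummy_elements(n_per_type: int = 100_000):
    """Yield len(ELEMENT_TYPES) * n_per_type dummy definitions."""
    for element_type in ELEMENT_TYPES:
        for i in range(n_per_type):
            yield DataElementDefinition(
                id=f'{element_type}_{i}',
                name=element_type,
                description=element_type.replace('_', ' ').title(),
            )

print(sum(1 for _ in make_dummy_elements()), 'dummy data element definitions generated')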
Proof-of-Concept Solution
Following the tailor-made provenance framework [
], we developed a proof-of-concept provenance solution. This framework complements a standard software engineering cycle (requirements, design, coding, testing, and implementation) with insights from a comprehensive literature search and uses established works as a guide for users of the framework. The expanded requirements analysis is substantiated by the topics identified through the literature search. Details are described in .
Requirements Analysis
Overview
An interdisciplinary team of internal stakeholders in the University Medicine Mannheim-DIC (lead, medical experts, computer scientists, technical staff, and process owner of the ETL [extract, transform, and load] process) performed the requirements analysis for the research approach. Initially, we engaged in discussions, documented feedback, and obtained approval for our own data pipeline processes, based on the WH questions (what, when, where, who, why, how, which, whose). This was done to ensure accurate and risk-managed data processing pipelines. Our focus centered on questions related to data governance, annotation, documentation, interoperability, data integrity and accuracy, data sharing, and information technology operations. This emphasis aligns with a prior investigation on data management practices in German DICs [
], where these questions were identified as integral to tracing patient data through the DICs. Building on the previous steps, we initiated the process by visualizing the scope definition (system border and context) of the planned provenance tracking system. Using notation according to DeMarco [
], we generated a data flow diagram. Following this, we documented the resultant requirements, representing them in free text and as a unified modeling language (UML) class diagram to address various requirements perspectives [ ].
System Border and Context
The context view (
) is used to delineate the scope of our system, establishing the boundary between functionalities that are considered in or out of scope. The system to be modeled, known as the Provenance Information System Traces (PISA), is depicted as a circle in the center (outlined by the dotted red line in ). At the conceptual level, we established the system border to encompass all aspects within the object scope. We delineated the system context (depicted in green as a freehand drawing) with aspects (A to H) that impact the planned provenance tracking system in our medical DIC. The processes that were modeled had been previously defined by local stakeholders and were influenced by the processes of the Medical Informatics Initiative community [ ]. The core process, the ETL process (D), includes valid documents (G) (eg, statutes, standard operating procedures, European Union GDPR) and the involvement of stakeholders within and beyond the organizational unit (H), representing the primary focus of our development efforts. Existing software and hardware systems (A–C), as well as the processes of secondary usage for data request (E) and long-term archiving (F), are outside the scope of this study.
Data Flow
Given the multitude of processes within a DIC, we confined our focus to the requirements related to the data integration process (
; ETL, letter D). We scrutinized the data flow and derived a data flow diagram, illustrating the functional requirements perspective ( ). As part of the Medical Informatics Initiative, all DICs in Germany modeled a comparable, generic data flow. This data flow delineates the movement of data among processes (ETL), storage entities (staging area, data warehouse, FHIR server, and research data repository), and involved actors (staff in the DIC, researchers, and a trusted third party). Processes encapsulate functions responsible for transforming and processing data. These processes consume input data from diverse systems, manage these data, and convey the results to an output. Storage ensures data persistence, allowing processes to access the storage in read or write mode. Actors actively engage in information exchange with the system.
Requirements Description
In a previous publication, we conducted interviews with various German medical DICs [
]. Through these interviews, we identified the most crucial requirements, emphasizing assessments of data quality, traceability, and information capability. Additionally, transparency in processing steps, workflows, and data sets emerged as a significant consideration. Other identified requirements encompassed aspects such as debugging or performance evaluation. There was also a focus on compliance with regulations, reproducibility, support of the scientific utilization process, increased confidence in data, and clear regulation of responsible parties [ ]. In alignment with this study, we established preconditions and requirements along the data flow for implementing the provenance tracking system. We identified the intended features for the implementation of the PISA and derived the system’s requirements (
). In general, PISA should have the capability to trace the complete production history of a data element while incorporating domain-specific characteristics of the data element. These provenance traces for an individual data element must be captured along the presented data flow.
Number | Requirements (functional and nonfunctional) | Explanation |
1 | PISA must have the capability to track the complete processing history of a data element, and the provenance information must be stored in a database. This encompasses all derivation steps performed on data elements during their processing steps. | It includes all the information (metadata) required for producing a specific data set or a data element while preserving its data integrity status. This encompasses details such as data source, data destination, method, tools, software, and versions used. The benchmark should align with the “entities” and “activities” components of the W3C model. |
2 | PISA must possess the capability to trace organizational responsibilities and the means used. | It includes information (metadata) about all the involved agents in producing a data set or data elements, such as staff, standard operating procedures, and guidance. The benchmark should align with the “agent” components of the W3C model. |
3 | PISA must be analyzable by an authorized user and capable of producing diverse representations and export formats for the provenance traces. | Detailed provenance traces are accessible and exportable to support evaluation by users, including formats such as log files, FHIRb provenance, W3Cc RDFd/XML, and RDF/JSON-LD provenance. |
4 | PISA must be able to track the quality status and assessment of data elements. | The provenance information for a data element is expanded to include the quality status of the processed data element. |
5 | PISA must be able to track the status of the script execution. | At a minimum, the provenance information should encompass the verification status and time stamp of the processed scripts. |
7 | PISA must provide a high level of ease of use for ETLe programmers and should be usable without requiring in-depth knowledge of provenance terms and concepts. | PISA should facilitate easy integration into ETL pipelines with transfer interfaces, allowing seamless integration with established technologies. Moreover, it must be easy to install, for example, by supporting widely used and easily set up databases. |
8 | PISA must be time-efficient and capable of ensuring acceptable performance. | Time measurements per data element must take place and be evaluated to verify the feasibility of the proof-of-concept approach. |
9 | Verification by unit tests/code coverage >80% | Passed testing results. |
aPISA: Provenance Information System Traces.
bFHIR: Fast Healthcare Interoperability Resources.
cW3C: World Wide Web Consortium.
dRDF: Resource Description Framework.
eETL: extract, transform, and load.
Design and Architecture of the Provenance Class
Development of the Logical Data Model
Based on the aforementioned requirements (
) and the DIC maturity model [ ], we constructed the logical data model as a UML class diagram, identifying classes and their associations ( ).
Metadata Strategy
Our metadata strategy centered on characterizing the data elements and their associated artifacts throughout their processing pipeline.
In line with the requirements and the logical model, we extracted the pertinent provenance metadata and aligned this provenance profile with the W3C components entity, agent, and activity. Simultaneously, we diligently enforced documentation efforts and annotation, guided by good documentation practices such as the ALCOA(+) principles for the identified components [
]. The annotation process we implemented improved the comprehensibility and traceability of the processed data elements. The FAIR principles R1.2 and R1.3 guided us to enrich (R) data elements with meaningful (provenance) metadata. Consequently, we characterized data elements by collecting content-rich contextual and technical metadata that narrate the story of the entire data processing workflow and link to related artifacts (
). During the transformation processes, we documented quality procedures and incorporated coding practices and versioning information.
Levelc | Descriptiond | Possible mappinge | Exemplified outputf |
Data Governanceg | Name and version of the standard operating procedures or regulation (eg, “DIC_ETL-ST.pdf, v1, approved”) | .policy .agent.type | “policy” : [“http://example.org/policy/1234”], “location”: { “reference”: “DIC” }, |
Data Owner | Name of the (hospital) department and the responsible person owning the patient data (eg, physician or stakeholder name) | .authorization .agent .agent.type .agent.role .agent.who .agent.onBehalfOf | “authorization”: { “coding”: [ { “system”: “http://terminology.hl7.org/CodeSystem/v3-ActReason”, “code”: “TRANSRCH” } ] }, |
Data Steward | Name of the responsible data steward (eg, person who takes care of data management) | .location .agent .agent.type .agent.role .agent.who .agent.onBehalfOf | “agent”: { “who”: { “display”: “Hr. Koch” } } |
Data Store | Used input or created output data file as part of the processing pipeline (eg, name original source system and name target system) | .entity .entity.role .entity.what .target (as mapping from entity) | “entity”: { “what”: { “identifier”: [ { “system”: “urn:ietf:rfc:3986”, “value”: “243c773b-8936-407e-9c23-270d0ea49cc4”, “display”: “” } ] } } |
Data Script | Scripts or programs developed to process the data with a description of script version and name and creator (eg, etl_st.py v1 MZ) | .activity .basedOn .agent.type | “activity”: { “coding”: [ { “system”: “http://terminology.hl7.org/CodeSystem/iso-21089-lifecycle”, “code”: “averaging”, “display”: “Transform” } ] } “basedOn”: [ { “reference” : “ServiceRequest” } ] |
Data Element | Individual characteristics per data element during a processing step such as ID, name, description, source and destination information from Data Store Level, description of the transformation approach, description of quality check (testing and validation approach), privacy and security status, and information from Script Level | .entity .entity.role .entity.what .entity.agent | Schema as in Data Store Level |
Data Provenance | References to all other mentioned levels and testimony for quality (eg, “25, 3, 5, good, 2023-02-03 06:01:34”) | .id .occurredDateTime .recorded .patient .encounter .target | “id”: “id”, “occurredDateTime”: “timestamp”, “recorded”: “timestamp” |
Data Infrastructureg | Used hardware and software conditions during data processing | N/Ah | N/A |
aFHIR: Fast Healthcare Interoperability Resources.
bW3C: World Wide Web Consortium.
cLevel corresponds to the maturity level of the data integration center.
dDescription of the possible content or annotation.
ePossible mapping to the Health Level 7 FHIR resource “Provenance.”
fOne possible exemplified output extract as a serialization in FHIR JSON.
gNot yet or only partly implemented.
hN/A: not applicable.
Examples of expanded metadata elements are more detailed descriptions of the transformation, the quality check, and the status of the data element in scope, or the results of the used log files. The metadata gathering for provenance comprises both manual annotation and an automated collection process, representing a hybrid form of provenance [
].
Ontology
We organized, annotated, and represented information using WebProtégé 4.0.2 (Protege Team in the Biomedical Informatics Research Group at Stanford University), a tool designed for collaboratively creating complex ontologies [
]. The W3C PROV ontology and the fundamental relationships between entities, activities, and agents served as a framework for representing the provenance graph [ ]. More specifically, we mapped processes onto activities, actors onto agents, and input/output data onto entities. The attributes of the provenance data model were aligned with the attributes of the data set. An instantiation of the provenance model, reflecting the W3C PROV vocabulary and layout convention, is illustrated in . Additionally, the W3C PROV supports interoperable interchange of provenance in heterogeneous environments.
Implementation and Verification Approach
Finally, building on the preceding steps, we developed an open-source Python class “Data Provenance” with associated methods, and validated our approach in an exemplified data integration pipeline [
]. Provenance traces were mapped, as an example, onto W3C RDF/XML and the HL7 FHIR resource “Provenance” at its current maturity level (version R5). We utilized peewee (version 3.15.4), a Python object-relational mapping library that supports the binding of objects to relational databases such as SQLite, MySQL, or PostgreSQL [ ]. To visualize the provenance traces, we used the Mermaid plotting framework [ ]. The verification and validation approach for the developed provenance class involved an independent code review and unit tests to ensure that the code meets the requirements of the design. We assessed efficiency (storage space in kilobytes and computing time) and ensured the maintainability of the program (code structure, modularity, comments in code, currency, and comprehensibility of documentation).
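To give a feel for an ORM-backed provenance store of this kind, the following is a minimal sketch using peewee with SQLite. The model and field names are simplified placeholders and do not reproduce the published provenance class; only the general pattern (create tables, then append one provenance record per processed data element) follows the description above.

# Minimal sketch of a peewee-backed provenance store (illustrative only;
# model and field names are simplified and not those of the published library).
import datetime
from peewee import (SqliteDatabase, Model, AutoField, CharField,
                    DateTimeField, ForeignKeyField)

db = SqliteDatabase('provenance_demo.db')

class BaseModel(Model):
    class Meta:
        database = db

class DataElement(BaseModel):
    id = AutoField()
    name = CharField()
    source = CharField()
    source_variable = CharField()
    destination = CharField()
    destination_variable = CharField()
    description_of_transformation = CharField()
    description_of_qualitycheck = CharField()

class DataProvenance(BaseModel):
    id = AutoField()
    data_element = ForeignKeyField(DataElement, backref='provenance_records')
    quality_status = CharField(default='pending')   # placeholder, see "Future Work"
    recorded = DateTimeField(default=datetime.datetime.now)

db.connect()
db.create_tables([DataElement, DataProvenance])

# Record one provenance trace while "processing" a data element in an ETL step.
element = DataElement.create(
    name='syst_blood_pressure',
    source='stg_sap_vitalis', source_variable='SysBP',
    destination='dwh_vitalis', destination_variable='SBP',
    description_of_transformation='copy',
    description_of_qualitycheck='range check 80-160',
)
DataProvenance.create(data_element=element, quality_status='passed')
print(DataProvenance.select().count(), 'provenance record(s) stored')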
While creating provenance records, we conducted a runtime experiment to measure the performance of our developed class. We recorded the time that the program took to run for proper execution. The runtime environment comprised the operating system Ubuntu 22.04.2 LTS (Canonical Ltd.), 32 GB memory, and an 8-core Intel Xeon Platinum 8276 CPU @ 2.20-GHz computer.
As a runtime environment, we used a virtual machine running on top of this machine. The runtime period was defined as the duration when the program was actively running.
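The exact measurement code is not reproduced in the paper; the following sketch only illustrates one way such per-element timings could be collected with Python's perf_counter. The block sizes follow the text, while the write function is a hypothetical placeholder.

# Illustrative timing sketch (assumed procedure, not the authors' measurement code):
# time how long writing provenance records takes for blocks of data elements.
from time import perf_counter

BLOCK_SIZES = [1, 10, 100, 1_000, 10_000, 100_000]  # block sizes named in the text

def write_provenance_record(element_id: int) -> None:
    """Placeholder for the actual provenance write (eg, an ORM insert)."""
    _ = {'element': element_id, 'status': 'passed'}

for n in BLOCK_SIZES:
    start = perf_counter()
    for i in range(n):
        write_provenance_record(i)
    elapsed = perf_counter() - start
    print(f'{n:>7} elements: {elapsed:.4f} s total, {elapsed / n:.6f} s per element')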
We conducted measurements per data element and per provenance record on 9 virtual machines, each utilizing different data element block sizes (starting with 1, 10, 100, 1000, 10,000, and 100,000 up to 9, 90, 900, 9000, 90,000, and 900,000 data elements). For the analysis of runtime measurements, we used R version 4.2.0 (2022; R Foundation for Statistical Computing), and figures were generated using the ggplot2 package [
]. The code is available in a git repository under the Massachusetts Institute of Technology (MIT) license [
].
Ethical Considerations
Given the nature of the proof-of-concept study relying on dummy test data, ethics approval, informed consent, and deidentification were not applicable.
Results
Provenance Traces Representation
All the gathered provenance information is in a machine-readable format. Additionally, FHIR health care standards were used [
]. We developed a FHIR profile based on the “provenance” resource, resulting in a record that delineates the entities and processes involved in producing, delivering, or otherwise influencing that resource. This was accomplished by mapping the contextual and technical metadata to the corresponding resource provenance elements (
). Through the integration of all metadata levels, we facilitated the traceability of each data element. We illustrated this traceability using a data flow diagram and presented it in a human-readable text form. Additionally, the provenance information was exported into various formats such as FHIR-JSON, W3C-RDF/XML, W3C-RDF/JSON-LD, or a text-based log file. This approach aligns with data obtained in other studies [ ].
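As a purely illustrative sketch of what a serialized FHIR-JSON export for a single data element might look like, the snippet below hand-assembles a simplified FHIR R5 “Provenance” resource along the lines of the mapping table above. All values are dummy data, and the structure does not reproduce the authors' published profile.

# Simplified sketch: assembling a FHIR R5 "Provenance" resource for one processed
# data element, loosely following the mapping table above. Values are dummy data;
# this is not the authors' published FHIR profile.
import json
from datetime import datetime, timezone

def build_fhir_provenance(element_uuid: str, steward: str) -> dict:
    now = datetime.now(timezone.utc).isoformat()
    return {
        "resourceType": "Provenance",
        "target": [{"reference": f"Observation/{element_uuid}"}],
        "occurredDateTime": now,
        "recorded": now,
        "policy": ["http://example.org/policy/1234"],            # data governance level
        "activity": {
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/iso-21089-lifecycle",
                "code": "transform",
                "display": "Transform",
            }]
        },
        "agent": [{"who": {"display": steward}}],                 # data steward level
        "entity": [{
            "role": "source",
            "what": {"identifier": {"system": "urn:ietf:rfc:3986",
                                    "value": element_uuid}},
        }],
    }

record = build_fhir_provenance("243c773b-8936-407e-9c23-270d0ea49cc4", "Hr. Koch")
print(json.dumps(record, indent=2))  # FHIR-JSON export; an RDF export could use a library such as rdflib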
].Measurement of Provenance Traces
As anticipated, the specified provenance class successfully generated the database and the metadata tables according to the UML class diagram (illustrated in
). Provenance records were automatically appended to the provenance table throughout the execution of the exemplified data integration pipeline. We recorded runtime measurements of the algorithm, displayed separately for the storage duration of a data element and of a provenance record, as well as the corresponding growth of the database ( ). As evident, the runtime complexity of the algorithm per data element indicates a nearly linear relationship with the size of the input data. We observed an acceptable runtime duration ranging from 0.0039 to 0.02601 seconds per data element. However, when measuring the runtime for a provenance record, we encountered an increasing duration, ranging from 0.0271 to 0.1882 seconds. Given that our approach incorporates novel aspects, we were unable to find comparable studies for this measurement. Nevertheless, the data obtained here suggest that using this approach to establish provenance traces can yield accurate and timely information.
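Assuming the nearly linear scaling reported above, a rough back-of-envelope extrapolation (not a figure from the paper) shows what the measured per-element times would imply for the 700,000 test elements:

# Back-of-envelope extrapolation from the reported per-element runtimes
# (assumes the nearly linear scaling described above; not a result from the paper).
N_ELEMENTS = 700_000                       # size of the test data set used in this study
PER_ELEMENT_S = (0.0039, 0.02601)          # reported range, seconds per data element

for s in PER_ELEMENT_S:
    total_min = N_ELEMENTS * s / 60
    print(f'{s:.5f} s/element -> ~{total_min:,.0f} minutes for {N_ELEMENTS:,} elements')
# -> roughly 46 minutes at the lower bound and about 5 hours at the upper bound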
Verification and Validation
The validation status for our proof-of-concept provenance class is outlined in
. We anticipate that our results can be readily adopted for additional metadata components and seamlessly transferred to decision-making applications.
Requirement number | Validation result |
1 | Introduction of metadata for data elements and their processing collected automatically during ETLa job running in data flow. Relevant tables (DataProvenance, DataElement, and associated tables) in the provenance database were created and continuously updated during processing. |
2 | Organizational topics (DataGovernance, DataSteward, and DataOwner) were recorded in the provenance database and continuously updated during processing. |
3 | Provenance traces were created in different formats. Detailed provenance traces are accessible and exportable to support evaluation by users (eg, FHIRb provenance, W3Cc RDFd/XML, and RDF/JSON-LD provenance). |
4 | The quality status of a processed data element is tracked and currently presented with a placeholder value in the DataProvenance table (see the “Future Work” section). |
5 | The verification status of the used scripts and the time stamps were recorded in the table DataElement. More specific content-related provenance information needs to be added in a second step. This comprises detailed annotation about the performed transactions and can be used for handling inconsistencies and rules for conflict resolution (see the “Future Work” section). |
7 | Easy integration into the ETL pipeline: setup requires only 3 lines of code, and recording a data element requires 1 line (see the “Future Work” section; a hypothetical sketch follows this table). |
8 | Time measurements confirmed satisfying results. |
9 | We achieved a code coverage of >90%, confirming that the code is comprehensively verified (quality aspect for software). We successfully verified the provenance with unit tests and validated all results against the defined requirements. |
aETL: extract, transform, and load.
bFHIR: Fast Healthcare Interoperability Resources.
cW3C: World Wide Web Consortium.
dRDF: Resource Description Framework.
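As a purely hypothetical illustration of the lightweight ETL integration described for requirement 7 in the table above, an ETL script might interact with the provenance class roughly as follows. The tracker class here is a stand-in stub written for this sketch and is not the published API.

# Hypothetical sketch of the lightweight ETL integration described for requirement 7.
# The tracker class below is a stand-in stub, not the published provenance API.
class DataProvenanceTracker:
    def __init__(self, database: str, pipeline: str):
        self.database, self.pipeline, self.records = database, pipeline, []
    def record(self, element_id: str, status: str) -> None:
        self.records.append((element_id, self.pipeline, status))
    def close(self) -> None:
        print(f'{len(self.records)} provenance records written to {self.database}')

def run_etl(elements):
    tracker = DataProvenanceTracker('provenance_demo.db', 'etl_st.py v1')  # setup
    for element_id, value in elements:               # existing ETL loop
        transformed = value                          # "copy" transformation
        # ... load the transformed value into the target store ...
        tracker.record(element_id, status='passed')  # 1 line per data element
    tracker.close()                                  # finalize the provenance run

run_etl([('syst_blood_pressure_1', 122), ('syst_blood_pressure_2', 131)])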
Discussion
Principal Findings
Our study introduces the first ready-to-use library designed to record provenance information from clinical data processing pipelines in a German medical DIC. This current research extends previous work in provenance by using an approach that systematically combines detailed insights from medical, data management, and information technology operational experts. This method aims to facilitate the reuse of enriched patient data with precision and rigor. We demonstrated that our research approach successfully facilitates the implementation of traceability in the processing of data elements. This, in turn, contributes to the promotion of good data management and documentation practices, ultimately ensuring sufficient provenance quality. Furthermore, these good practices pave the way for the (automated) generation of annotations [
] and prevent poor data integrity, thereby enhancing data quality [ ]. Through this, we hypothesize that our work could contribute to the reliability and safety of quality-assured patient data for secondary use. Simultaneously, we mitigate the risks associated with the reuse of weak data in clinical research. We fulfilled the requirement for FAIR (Findability, Accessibility, Interoperability, and Reusability) provenance information by adhering to standards for syntactic and semantic interoperability, including JSON, W3C PROV, and FHIR mapping. Compared with the FHIR resource Provenance, we noted that our metadata recording offers significantly more detailed contextual information for each data element. We suggest that improvements to the FHIR Provenance resource, particularly for data within medical DICs, be deliberated and harmonized with existing FHIR resources such as “AuditEvent” or the “FiveWs Pattern” [
]. The strengths of this study are (1) the provision of provenance information for data elements with export options to interchange standard formats such as FHIR-JSON or W3C RDF/XML; (2) the simplicity of integrating this provenance class into ETL and other data pipelines; and (3) the extensibility of metadata components along with acceptable runtime measurements.
Related Work
In general, research on provenance and related management has progressed significantly in recent years. Numerous studies have been conducted, both domain specific and domain independent, focusing on provenance. Recently submitted scoping review results on provenance tracking have yielded valuable insights and provided an extensive summary of current approaches and criteria [
]. The scoping review revealed technical, implementation, and knowledge gaps, with a specific emphasis on modeling and metadata frameworks for (sensitive) scientific biomedical data. Moreover, the primary research focus was on workflow provenance. This involved the utilization of models such as the Open Provenance Model or the W3C PROV data model across various semantic levels and tools in scientific workflows or experiments, as demonstrated in frameworks such as BioWorkbench or the OpenPREDICT use case [ , ]. Additionally, other work has delved into different yet more general approaches for metadata usage and harvesting [ , ]. A systematic literature analysis on functional requirements for medical data integration outlined general requirements for data traceability and metadata management [ ]. While these prior efforts are crucial, they still lack the specific requirements and considerations tailored for a DIC use case. By contrast, our approach is finely tuned to the unique needs of a DIC, providing a comprehensive exploration of provenance that imparts medical meaning and understanding to the data elements, thereby enhancing their reusability.
Lessons Learned
We discovered that interdisciplinary competence profiles; fostering communication between medical experts, data stewards, and information technology developers; and establishing a common language were pivotal factors leading to significant progress in our specific DIC use case. Implementing proper data governance and comprehensive data management documentation, such as data management plans, would be instrumental in mitigating the risk of incorrect use of the data.
The lessons learned from our description could serve as motivation for other researchers aiming to establish FAIR-oriented provenance. This would not only advance the reuse of their research data and results but also underscore the importance of maintaining overall responsibility for the data, even after project funding concludes.
Future Work
Future work should also prioritize the development of a strategy for assessing data privacy, data integrity, and related quality of a data element. Integrating this information into the framework would enhance the expressiveness of the provenance information and enable the derivation of quality dimensions. For this reason, data elements may need to be accompanied by additional properties (refer to
) that are significant for interpretability, helping determine limitations or detect duplications for use in similar research studies. Addressing the adequacy and relevance of the data element for upcoming research questions aids in supporting interpretation and, consequently, the reuse of a data element, as already highlighted in a draft Food and Drug Administration guidance [ ]. To facilitate easy integration with other programming languages, we will provide an application programming interface. Future studies should also explore ways to enhance the script for generating the provenance class in alignment with the FAIR for Research Software Principles [
]. Determining appropriate software metadata that accurately describe the specific characteristics of the software is an essential aspect to be addressed [ ]. Before the future implementation and integration of the provenance class into real-world data integration processes, it is advisable to seek recommendations for risk measures. Factors such as the confidentiality level and security of provenance information, storage considerations, performance issues, and scalability should be carefully considered. In addition, it is crucial to consider experiences gained from maintaining metadata management and interoperable technologies, especially from professional data stewards. Ongoing exchanges with stakeholders and conducting usability evaluations are essential aspects that should be taken into account.
This work also contributes to a broader community effort, the “Minimal Requirements for Automated Provenance Information Enrichment” (MIRAPIE) project [
].
Limitations
As the library has only been tested with simulated data, the next step—testing in a real environment—is currently in preparation. Despite the straightforward ETL integration approach, we will carefully assess the complexity and associated costs of implementation within the medical DIC. We recognize the need to bolster the overall qualification and validation concept. We believe it is crucial to expand the current provenance class to one that is inspection- or audit-ready, although accreditation demands additional measures and efforts. Additionally, further scalability analysis should be incorporated into the research approach.
Trust involves more than just the provenance of data elements; it also implies correctness and security against malicious users. This challenge can only be addressed through technical access limitations and organizational measures. Nevertheless, automated provenance traces can contribute to building trust in the transformation and movement of data within the DIC. Moreover, they empower us to confidently assess the quality and validity of the original data points even after they have undergone complex transformations within a data warehouse.
Conclusions
We have designed, developed, and implemented provenance traces at the data element level for a German medical DIC, with the potential for extension at the national level. The described research method for the proof-of-concept provenance class has been crafted to promote effective and reliable core data management practices, enriching biomedical data with meaningful provenance. This, in turn, strengthens the benefits for research and society while simplifying the reuse of biomedical data. While the approach was initially developed for the medical DIC use case, these principles can be applied universally throughout the scientific domain. The implementation and analysis of provenance traces play a crucial role in minimizing risks associated with undetected or unintended data integrity breaches. Hence, provenance traces significantly contribute to building trust in routine clinical data and enhancing the accountability of a medical DIC. We are confident that by adhering to this advanced practice, the existing gaps between industry (pharmaceutical companies), service providers, and academia can be mitigated. Consequently, this can lead to an increase in the secondary use of (sensitive) patient data in clinical investigations.
The outcomes of our research prompt additional questions, particularly regarding how in-depth exploration of further provenance analysis can predict the quality of data using machine learning methods. The limitations identified in our study indicate the need for further investigations into provenance theory, standards, and practices in the clinical field.
Acknowledgments
This research is funded by the “Digitale Forschung” project of Baden-Wuerttemberg, Germany, and by the German Federal Ministry of Education and Research within the German Medical Informatics Initiative with the grant 01ZZ1801E (Medical Informatics in Research and Care in University Medicine). This work was part of the precondition for the first author to obtain the degree Dr. sc. hum. from Heidelberg University. For the publication fee, we acknowledge financial support by Deutsche Forschungsgemeinschaft within the funding program “Open Access Publikationskosten” as well as by Heidelberg University.
Data Availability
The code of the provenance class is provided in a git repository [
].
Authors' Contributions
KG contributed substantially to the methodology, coding, implementation, testing, validation, analysis, visualization, and interpretation of the data; drafted all sections of the manuscript, performed data curation, coordinated reviewing, incorporated the comments from the co-authors, and submitted the paper. DW contributed to the discussion of the general provenance concept, reviewed, and revised the manuscript. TG reviewed and revised the manuscript. FS contributed to the discussion of the methodology, performed a code review, supported implementation, reviewed, and revised the manuscript.
Conflicts of Interest
None declared.
References
- Metadata Basics. Dublin Core Metadata Initiative. 2018. URL: https://www.dublincore.org/resources/metadata-basics/ [accessed 2023-02-10]
- Douthit BJ, Del Fiol G, Staes CJ, Docherty SL, Richesson RL. A Conceptual Framework of Data Readiness: The Contextual Intersection of Quality, Availability, Interoperability, and Provenance. Appl Clin Inform. May 21, 2021;12(3):675-685. [FREE Full text] [CrossRef] [Medline]
- Gierend K, Krüger F, Genehr S, Hartmann F, Siegel F, Waltemath D, et al. Capturing provenance information for biomedical data and workflows: A scoping review. Research Square. Preprint posted online on February 09, 2023. [CrossRef]
- Zhang J, Symons J, Agapow P, Teo JT, Paxton CA, Abdi J, et al. Best practices in the real-world data life cycle. PLOS Digit Health. Jan 18, 2022;1(1):e0000003. [FREE Full text] [CrossRef] [Medline]
- Semler S, Wissing F, Heyder R. German Medical Informatics Initiative. Methods Inf Med. Jul 17, 2018;57(S 01):e50-e56. [CrossRef]
- Shin EY, Ochuko P, Bhatt K, Howard B, McGorisk G, Delaney L, et al. Errors in Electronic Health Record–Based Data Query of Statin Prescriptions in Patients With Coronary Artery Disease in a Large, Academic, Multispecialty Clinic Practice. JAHA. Apr 17, 2018;7(8):e007762. [CrossRef]
- Murray ML, Love SB, Carpenter JR, Hartley S, Landray MJ, Mafham M, et al. Data provenance and integrity of health-care systems data for clinical trials. The Lancet Digital Health. Aug 2022;4(8):e567-e568. [CrossRef]
- Emanuel EJ, Emanuel LL. What is accountability in health care? Ann Intern Med. Jan 15, 1996;124(2):229-239. [CrossRef] [Medline]
- Curcin V. Embedding data provenance into the Learning Health System to facilitate reproducible research. Learn Health Syst. Apr 27, 2017;1(2):e10019. [FREE Full text] [CrossRef] [Medline]
- Bongiovanni S, Purdue R, Kornienko O, Bernard R. Quality in Non-GxP Research Environment. In: Handb Exp Pharmacol. Cham, Switzerland. Springer International Publishing; 2020;1-17. [CrossRef]
- Sahoo SS, Valdez J, Rueschman M. Scientific Reproducibility in Biomedical Research: Provenance Metadata Ontology for Semantic Annotation of Study Description. AMIA Annu Symp Proc. 2016;2016:1070-1079. [FREE Full text] [Medline]
- Bargaje C. Good documentation practice in clinical research. Perspect Clin Res. Apr 2011;2(2):59-63. [FREE Full text] [CrossRef] [Medline]
- Guidelines for Safeguarding Good Research Practice. URL: https://wissenschaftliche-integritaet.de/en/code-of-conduct/ [accessed 2022-11-18]
- Debruyne C, Pandit HJ, Lewis D, O'Sullivan D. "Just-in-time" generation of datasets by considering structured representations of given consent for GDPR compliance. Knowl Inf Syst. Apr 15, 2020;62(9):3615-3640. [FREE Full text] [CrossRef] [Medline]
- ISO 20691:2022, Biotechnology. URL: https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/06/88/68848.html [accessed 2022-11-18]
- Standards by ISO/TC 276. URL: https://www.iso.org/committee/4514241/x/catalogue/ [accessed 2022-11-18]
- Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. Mar 15, 2016;3(1):160018. [FREE Full text] [CrossRef] [Medline]
- Lamprecht A, Garcia L, Kuzak M, Martinez C, Arcila R, Martin Del Pico E, et al. Towards FAIR principles for research software. DS. Jun 12, 2020;3(1):37-59. [FREE Full text] [CrossRef]
- HL7 FHIR Foundation enabling healthcare interoperability through FHIR. URL: https://fhir.org/ [accessed 2023-02-10]
- W3C PROV Overview. URL: https://www.w3.org/TR/prov-overview/ [accessed 2022-11-18]
- Moreau L, Clifford B, Freire J, Futrelle J, Gil Y, Groth P, et al. The Open Provenance Model core specification (v1.1). Future Generation Computer Systems. Jun 2011;27(6):743-756. [CrossRef]
- Sikos LF, Philp D. Provenance-Aware Knowledge Representation: A Survey of Data Models and Contextualized Knowledge Graphs. Data Sci Eng. May 08, 2020;5(3):293-316. [CrossRef]
- Gierend K, Freiesleben S, Kadioglu D, Siegel F, Ganslandt T, Waltemath D. The Status of Data Management Practices Across German Medical Data Integration Centers: Mixed Methods Study. J Med Internet Res. Nov 08, 2023;25:e48809. [FREE Full text] [CrossRef] [Medline]
- DeMarco T. Structured Analysis and System Specification. In: Broy M, Denert E, editors. Pioneers and Their Contributions to Software Engineering. Berlin, Heidelberg: Springer Berlin Heidelberg; 1979:255.
- Bornberg-Bauer E, Paton NW. Conceptual data modelling for bioinformatics. Brief Bioinform. Jun 01, 2002;3(2):166-180. [CrossRef] [Medline]
- Lim C, Lu S, Chebotko A, Fotouhi F. Prospective and Retrospective Provenance Collection in Scientific Workflow Environments. Presented at: IEEE International Conference on Services Computing; 2010; Miami, FL, USA. p. 449. [CrossRef]
- Provenance in Data Integration Center, WebProtégé. URL: https://webprotege.stanford.edu/#login [accessed 2023-05-10]
- Python. URL: https://www.python.org/ [accessed 2022-11-18]
- Peewee documentation. URL: https://docs.peewee-orm.com/en/latest/ [accessed 2022-11-18]
- Woodward M. Include diagrams in your Markdown files with Mermaid. The GitHub Blog. URL: https://github.blog/2022-02-14-include-diagrams-markdown-files-mermaid/ [accessed 2022-12-02]
- R: A language and environment for statistical computing. Vienna, Austria. R Foundation for Statistical Computing URL: https://www.R-project.org/ [accessed 2022-12-02]
- GitHub: kegieKG/Provenance-in-Data-Integration-Center. URL: https://github.com/kegieKG/Provenance-in-Data-Integration-Center [accessed 2023-05-10]
- Vorisek CN, Lehne M, Klopfenstein SAI, Mayer PJ, Bartschke A, Haese T, et al. Fast Healthcare Interoperability Resources (FHIR) for Interoperability in Health Research: Systematic Review. JMIR Med Inform. Jul 19, 2022;10(7):e35724. [FREE Full text] [CrossRef] [Medline]
- de Oliveira W, Braga R, David JMN, Stroele V, Campos F, Castro G. Visionary: a framework for analysis and visualization of provenance data. Knowl Inf Syst. Jan 04, 2022;64(2):381-413. [CrossRef]
- Mitchell SN, Lahiff A, Cummings N, Hollocombe J, Boskamp B, Field R, et al. FAIR data pipeline: provenance-driven data management for traceable scientific workflows. Philos Trans A Math Phys Eng Sci. Oct 03, 2022;380(2233):20210300. [FREE Full text] [CrossRef] [Medline]
- Mondelli M, Magalhães T, Loss G, Wilde M, Foster I, Mattoso M, et al. BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments. PeerJ. 2018;6:e5551. [FREE Full text] [CrossRef] [Medline]
- Celebi R, Rebelo Moreira J, Hassan A, Ayyar S, Ridder L, Kuhn T, et al. Towards FAIR protocols and workflows: the OpenPREDICT use case. PeerJ Comput Sci. 2020;6:e281. [FREE Full text] [CrossRef] [Medline]
- Gonçalves RS, Musen MA. The variable quality of metadata about biological samples used in biomedical experiments. Sci Data. Feb 19, 2019;6(1):190021. [FREE Full text] [CrossRef] [Medline]
- Bönisch C, Kesztyüs D, Kesztyüs T. Harvesting metadata in clinical care: a crosswalk between FHIR, OMOP, CDISC and openEHR metadata. Sci Data. Oct 28, 2022;9(1):659. [FREE Full text] [CrossRef] [Medline]
- Kinast B, Ulrich H, Bergh B, Schreiweis B. Functional Requirements for Medical Data Integration into Knowledge Management Environments: Requirements Elicitation Approach Based on Systematic Literature Analysis. J Med Internet Res. Feb 09, 2023;25:e41344. [FREE Full text] [CrossRef] [Medline]
- Girman CJ, Ritchey ME, Lo Re V. Real-world data: Assessing electronic health records and medical claims data to support regulatory decision-making for drug and biological products. Pharmacoepidemiol Drug Saf. Jul 03, 2022;31(7):717-720. [FREE Full text] [CrossRef] [Medline]
- Barker M, Chue Hong NP, Katz DS, Lamprecht A, Martinez-Ortiz C, Psomopoulos F, et al. Introducing the FAIR Principles for research software. Sci Data. Oct 14, 2022;9(1):622. [FREE Full text] [CrossRef] [Medline]
- Gierend K, Wodke J, Genehr S, Gött R, Henkel R, Krüger F, et al. TAPP: Defining standard provenance information for clinical research data and workflows - Obstacles and opportunities. Presented at: Companion Proceedings of the ACM Web Conference 2023 (WWW '23 Companion); April 30, 2023; Austin, TX, USA. New York, NY, USA: Association for Computing Machinery; 2023:1551-1554. [CrossRef]
Abbreviations
ALCOA: Attributable, Legible, Contemporaneous, Original, and Accurate
DIC: data integration center
ETL: extract, transform, and load
FAIR: Findability, Accessibility, Interoperability, and Reusability
FHIR: Fast Healthcare Interoperability Resources
GDPR: General Data Protection Regulation
MIRAPIE: Minimal Requirements for Automated Provenance Information Enrichment
MIT: Massachusetts Institute of Technology
PISA: Provenance Information System Traces
RDF: Resource Description Framework
UML: unified modeling language
W3C: World Wide Web Consortium
Edited by A Mavragani; submitted 16.06.23; peer-reviewed by T Miksa; comments to author 05.10.23; revised version received 25.10.23; accepted 01.11.23; published 07.12.23.
Copyright © Kerstin Gierend, Dagmar Waltemath, Thomas Ganslandt, Fabian Siegel. Originally published in JMIR Formative Research (https://formative.jmir.org), 07.12.2023.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.