Data Provenance

Tracing the origin, transformation, and accountability of information across systems

Lead Summary

Data provenance is the systematic documentation of where data came from, how it was produced, who or what transformed it, and how it moved through systems over time. It provides the verifiable chain of origin and methodology that allows reasoning systems, auditors, and human experts to assess data trustworthiness, reproduce results, and establish accountability. Unlike simple logging, provenance captures causal dependencies—it encodes why a piece of data has the value it does by linking it to the entities, activities, and agents responsible for its existence.

Provenance is now foundational infrastructure across a wide range of fields: it underlies the FAIR Guiding Principles for scientific data management, enables regulatory compliance in AI governance, powers misinformation detection in journalism, and supports clinical decision support in healthcare. The canonical formal standard is PROV-O, the W3C Provenance Ontology, published as a Recommendation in April 2013.

Definition & Scope

Data provenance specifically refers to the earliest instance and original source of data. It differs from data lineage, which encompasses the complete transformation history. As captured in distributed systems research, data lineage requires tracking: (1) source-to-derived relationships, (2) transformation operations applied, (3) temporal information about when transformations occurred, and (4) actor information about which service or user performed each operation. Provenance is thus the root claim in that lineage chain—the starting point from which lineage extends.

The scope of provenance spans multiple levels of abstraction:

Data-level: which datasets produced which derived datasets and through what transformations
Workflow-level: the structure and execution trace of computational pipelines
Claim-level: which sources support which specific assertions in a document or model output
Agent-level: which humans, software agents, or AI systems performed which decisions

Historical Development

The Archival Roots

Provenance as a principle originates in archival science, where it serves a foundational role: records from the same creator must be kept distinct from records of different creators to preserve their evidential value. The records continuum model, developed by Frank Upward and colleagues at Monash University in the 1990s, organized recordkeeping activities across institutional and temporal dimensions rather than treating provenance as a single point-in-time attribute. This model foregrounds organizational source identity and chain-of-custody across institutional boundaries—a conceptually different emphasis from the event-centric models that would later dominate computational provenance. Digital preservation has since expanded the challenge: the sheer volume of digital records and complexity of digital environments make tracking and preserving provenance information substantially harder than in paper-based archives.

The Open Provenance Model (2006–2011)

The computational era of provenance began with the Provenance Challenge series, initiated in 2006, which motivated cross-community development of a shared interchange format. This work culminated in the Open Provenance Model (OPM), finalized at v1.1 in June 2011. OPM represented provenance as a directed acyclic graph (DAG) of causal dependencies—a design choice inherited from the causality principle that outputs of a process cannot simultaneously serve as their own inputs. Luc Moreau, the key architect of OPM, went on to chair the W3C Provenance Working Group, ensuring conceptual continuity.

W3C PROV (2013)

The W3C Provenance Working Group formalized this community work into a suite of four normative Recommendations, all published on 30 April 2013:

PROV-DM: the abstract data model
PROV-O: the OWL2 ontology encoding of PROV-DM
PROV-N: a human-readable notation
PROV-CONSTRAINTS: a formal validation specification

Early adoption (2013–2018) was concentrated in semantic web, digital humanities, and scientific workflow management communities. A second wave of adoption began around 2020, driven by ML reproducibility requirements, AI governance pressures, and supply-chain verification demands.

Core Concepts

The Three-Class Model

PROV-O organizes provenance around three core classes:

Entities: real or hypothetical things with fixed aspects in physical or conceptual space (a dataset, a model file, a specific document version)
Activities: processes that occur over time and use or generate entities (a training run, a transformation step, a curation decision)
Agents: entities responsible for performing activities or generating other entities (a person, a software system, an organization)

These three classes, together with bundles (named sets of provenance assertions that are themselves entities, enabling provenance-of-provenance), form the semantic backbone of PROV-O as defined in the W3C specification.

Provenance information captures data quality, reliability, and trustworthiness assessments by recording who caused what entities to be processed by which activities.

Derivation Relationships

The core provenance narrative in PROV-O is expressed through causal relationships: an Entity was derived from another Entity, was generated by an Activity (which used other entities), and was attributed to an Agent through association or responsibility. These relations—generation, usage, derivation, attribution, association—connect the three primitives into a directed acyclic graph (DAG) of causal dependencies tracking how entities flow through processes under the influence of responsible agents.

Qualified Relations

PROV-O provides three tiers of expressiveness:

Starting Point terms (minimal: the three classes plus 9 core properties)
Expanded terms for richer descriptions
Qualified terms that reify binary relations into intermediate objects (Generation, Usage, Association), enabling attachment of contextual attributes such as precise timestamps, agent roles, location information, and execution plans

This systematic pattern applies across twelve core influence relations in the ontology. Qualified relations allow much richer semantic expressiveness than simple binary relations, enabling more accurate modeling of complex data lineage scenarios.

The DAG Constraint and Its Limitations

The acyclicity constraint in PROV's design reflects the causality principle: outputs of a process cannot simultaneously serve as its own inputs. This deterministic, linearity-preserving structure makes provenance diagrams tractable for reasoning and querying. However, it also constrains the model's ability to represent iterative or feedback-driven systems—iterative machine learning workflows, feedback loops, and iterative scientific processes where models are refined through repeated experimentation are poorly served by a strict DAG. PROV-O's bundle mechanism and separate trace recording provide workarounds but do not natively resolve this fundamental acyclic constraint.

The PROV Family and Key Extensions

The W3C PROV standard functions as a universal reference model designed with explicit extensibility points—subclassing and property specialization—that enable domain-specific overlays without breaking interoperability. A rich ecosystem of extensions has emerged:

Scientific Workflow Provenance

ProvONE: developed by the DataONE Cyberinfrastructure Working Group to capture detailed workflow execution traces, enabling interoperable representation of both workflow templates and execution traces across systems like Taverna, Kepler, VisTrails, Galaxy, and Pegasus
ProvONE+: extends ProvONE to explicitly model control-flow structure, enabling conformance checking against workflow specifications (prospective provenance) rather than only recording what happened (retrospective provenance)
CWLProv: a PROV-based standard for Common Workflow Language executions, enabling re-executable workflows with domain-specific information
OPMW-PROV: a hybrid ontology bridging the OPM and PROV lineages, extending both to capture execution traces of workflow templates

Machine Learning Provenance

PROV-ML: a W3C PROV-based model capturing end-to-end lineage across the entire ML lifecycle, incorporating the ML Schema to represent ML-specific artifacts such as models, datasets, and hyperparameters
MLflow2PROV: a tool that continuously extracts PROV-compliant provenance graphs from MLflow experiment tracking systems, capturing end-to-end ML pipeline lineage through integration with Git repositories

Agent and AI Provenance

PROV-AGENT (2025): a W3C PROV extension that tracks AI agent interactions by integrating the Model Context Protocol (MCP) to capture agent-specific artifacts—prompts, responses, model invocations, tool calls—into end-to-end workflow provenance; addresses non-deterministic behavior and hallucination propagation in LLM-based agents

Lightweight Domain Extensions

PAV (Provenance, Authoring and Versioning): specializes PROV-O by adding classes and properties for tracking authoring, curation, and digital creation roles—author, contributor, curator—essential for digital resource management
GDPRov: a GDPR-specific extension modeling consent and data lifecycle provenance
NIDM (Neuroimaging Data Model): a domain-specific extension for neuroimaging workflows

Service Infrastructure

PROV-AQ (Provenance Access and Query): defines HTTP-based mechanisms for accessing and querying provenance information on the web, supporting both simple provenance record lookup and complex query services

RDF Mechanisms for Provenance

Several complementary RDF-based approaches exist for attaching provenance metadata to data:

Named Graphs

Named graphs (RDF quadruples) are currently the most widely adopted approach, being compliant with the RDF 1.1 standard and queryable with SPARQL 1.1. A named graph is a set of triples identified by a URI, allowing reference to statement groups and their associated provenance. Named graphs support temporal tracking, multi-source integration, and knowledge base versioning at the graph scope—enabling applications to track how facts evolve over time without triple-level annotation.

RDF-Star

RDF-star extends RDF to allow statements about statements, enabling provenance metadata to be attached directly to individual triples rather than to groups. This design enables more efficient representation than PROV-O's qualified relation pattern, which requires verbose intermediate reification objects. Many triple store implementations now include dedicated indexes for RDF-star triples. RDF-star is complementary rather than competing with PROV-O: it operates at the native graph model level (general-purpose statement annotation), while PROV-O is a comprehensive domain ontology for expressing provenance of digital objects with causal semantics. A key open question in RDF-star standardization concerns opacity vs. transparency semantics: provenance use cases require distinguishing multiple instances of the same statement pattern originating from different sources, which opacity semantics support but transparency would collapse.

Singleton Properties

Singleton properties represent an alternative approach that creates unique property instances for each relationship context. They suffer from significant scalability issues—verbose representations negatively affect query performance—and named graphs are generally preferred for RDF provenance.

Expressiveness vs. Performance

Provenance representation systems face a fundamental tradeoff: adding specialized provenance concepts reduces query clause complexity and matches domain patterns more directly, but systems with richer provenance support typically experience higher query execution times. This tradeoff is not resolved by any current approach, but shifted depending on which dimension is prioritized.

Nanopublications and Atomic Provenance

Nanopublications represent a publishing format that embeds provenance as an integral architectural element rather than a post-hoc annotation. Each nanopublication consists of exactly three named RDF graphs:

Assertion Graph: the core scientific claim
Provenance Graph: metadata documenting how the assertion was derived, including sources, derivation methods, and supporting evidence
Publication Info Graph: metadata about the nanopublication itself—creator, creation timestamp, licensing

This tripartite structure, carried through named graphs identified by URIs, enables granular provenance tracking, versioning, and trust evaluation at the level of individual semantic units. In biomedical applications (DisGeNET for gene-disease associations, Huntington's disease case studies), the provenance separation between expert-curated and text-mined evidence enables evidence-based hypothesis generation by allowing researchers to distinguish high-confidence curated associations from literature-derived assertions.

Trusty URIs extend verifiability beyond individual nanopublications to entire reference trees: when a nanopublication links to other trusty URI-identified resources, integrity guarantees propagate transitively through hash-chaining. Modification of any resource in the tree would break the chain.

Recent extensions (2024–2025) add a fourth component—"knowledge provenance"—to capture multi-source evidence and conflicting claims within a single nanopublication, using the PROV-K ontology.

Domain Applications

Scientific Research and FAIR Data

The FAIR Guiding Principles (2016) establish that metadata must be associated with detailed provenance as part of the "Reusable" principle. This mandates machine-readable documentation of data origin, attribution, and derivation—typically in RDF or equivalent semantic vocabularies—enabling computational systems to automatically discover, validate, and trace knowledge lineage without human intervention. FAIR compliance inherently requires computational trust infrastructure based on provenance.

Research Objects (RO) bundle scientific artifacts—datasets, workflows, code, results—into semantically typed packages with explicit provenance relationships. RO-Crate extends this model to capture workflow execution provenance at multiple granularity levels, addressing reproducibility at the workflow and execution level rather than at the atomic assertion level.

Machine Learning and AI Reproducibility

Machine learning research faces a documented reproducibility crisis where only approximately half of published results can be independently replicated. Empirical studies identify insufficient metadata, lack of publicly available data, and incomplete study methods as the main barriers. Provenance tracking—recording data lineage, code and data versioning, and experiment metadata—represents a key technical solution, with tools like DVC integrating pipeline automation, data versioning, and provenance capture.

PROV-O adoption experienced resurgence from 2020–2026 driven by increased focus on ML/AI reproducibility, explainability, and supply-chain security verification. This second wave is characterized by practical implementation focus (MLflow2PROV, PROV-AGENT, AIBOM) rather than foundational ontology development.

The AI Bill of Materials (AIBOM) extends CycloneDX SBOM practices to capture AI-specific model lineage and governance metadata—models, datasets, code, hardware, training configurations, hyperparameters—providing audit-ready evidence for regulatory compliance with the EU AI Act and NIST AI Risk Management Framework.

Journalism and Content Authentication

Content provenance technologies enable source attribution and misinformation detection in journalism by recording cryptographic metadata about the origin, creation date, location, authorship, and edit history of audiovisual content. The Coalition for Content Provenance and Authenticity (C2PA) standardizes this infrastructure through Content Credentials—cryptographically signed data structures embedding an asset's origin, modification history, and a hash of the content itself, signed with the private key of the software or hardware that performed operations. This approach is particularly effective against misinformation that relies on recontextualizing or manipulating existing media content.

Cross-platform coalitions including Meta, Adobe, Microsoft, and Publicis Groupe are developing consistent metadata management approaches that influence campaign approval, ad delivery, and consumer trust—recognizing that unified standards are more effective than fragmented platform policies.

Healthcare and Clinical AI

Integrating source-verified provenance into clinical AI decision support systems enhances auditability and trustworthiness of machine-generated medical recommendations. An auditable framework for clinical AI incorporates retrieval-augmented generation with explicit data provenance, enabling clinicians to trace recommendations back to evidence sources and verify claim-evidence coherence. In healthcare, audit trails support clinical governance and reduce liability in high-stakes medical decision-making.

Supply Chains

Blockchain's cryptographic architecture enables end-to-end traceability and provenance recording throughout supply chains by maintaining immutable, tamper-proof records of transactions. Each transaction records details including origin, production, quality checks, and ownership transfers. The distributed ledger's consensus mechanism and hash-chaining ensure information cannot be altered retroactively. Model supply chain security specifically requires pre-execution build-time guarantees and run-time verification of model provenance, with tools providing automated lineage tracing, signature verification (Ed25519), and regulatory compliance mapping.

Data Mesh and Distributed Systems

In data mesh architectures, metadata management and data lineage tracking are critical success factors. Comprehensive data lineage provides visibility into data product dependencies and quality implications across autonomously-owned domain products. In distributed systems, data lineage is essential for debugging emergent failures and maintaining source-of-truth accountability because no single log captures all transformations.

Event sourcing provides an implementation pattern for provenance in distributed systems: by storing the complete sequence of immutable state-change events in an append-only log, event sourcing records every state transition as a discrete event with full provenance information (who, what, when, why), enabling the current state to be reconstructed by replaying events.

Archival Science and Cultural Heritage

Provenance research in museums addresses the restitution of Nazi-era looted art: institutions like the Museum of Applied Arts (MAK) in Vienna have been engaged in systematic provenance research since the 1990s, and Austria's Art Restitution Act (1998) formalized requirements for investigating collection histories. The archival principle that records from the same creator must be kept distinct preserves historical and evidential value in ways that require ongoing institutional commitment.

Agentic Systems and Claim-Level Auditability

As autonomous AI agents become more prevalent, provenance extends beyond data to encompass execution traces—chronological records of agent actions, decisions, and contextual state changes. Observability and auditability frameworks for agents capture:

Cognitive traces: reasoning steps, model outputs
Operational traces: tool calls, API responses
Contextual traces: system state, user inputs

Multi-agent systems further require provenance to track which agents generated which claims, how intermediate outputs were synthesized into final reports, and where conflicts or inconsistencies arose.

Semantic provenance graphs make explicit the claim-evidence relationship by encoding how retrieval, reasoning, and synthesis steps connect retrieved sources through to final claims in persistent, queryable structures. The Auditable Autonomous Research (AAR) standard proposes measuring auditability through provenance coverage, provenance soundness, contradiction transparency, and audit effort metrics—defined as independent verification with effort significantly lower than generation effort.

In agentic security contexts, provenance labeling and trust boundary tracking are architectural defenses that tag which parts of agent context originate from trusted sources (user input, vetted knowledge bases) versus untrusted sources (web content, external APIs). Maintaining explicit provenance metadata as data flows through tool chains allows agents to apply differential trust rules: privileged decisions use only trusted data, while untrusted data is isolated to data-only contexts.

Controversies & Debates

The Verification Problem

Current technical solutions for content provenance—watermarking, cryptographic signatures, C2PA content credentials—face fundamental limitations. They can be circumvented through image processing, produce contradictory signals (a single asset claiming both human authorship via C2PA and AI-generation via watermark), and scale poorly across the internet. The absence of reliable institutional verification infrastructure means audiences have few mechanisms besides distrust for navigating authentic content.

DAG Limitations in Iterative Systems

PROV-O's acyclic graph constraint is fundamentally misaligned with iterative machine learning workflows, feedback loops, and any scientific process where models are refined through repeated experimentation. While workarounds exist (bundles, separate trace recording), they do not natively resolve the constraint. This limits PROV-O's expressiveness for a significant class of modern computational workflows.

Benchmark Gaps

Despite widespread adoption of RDF triple stores for PROV-O provenance, the provenance research community lacks official benchmarks specifically designed to test PROV-O at scale. Existing RDF benchmarking datasets do not account for PROV-specific data model semantics and query patterns, preventing systematic evaluation of PROV-O's true verbosity and query performance characteristics compared to competing approaches.

Dataset Licensing Gaps in AI

A large-scale CMU audit of 1,800+ text datasets revealed widespread gaps in license compliance and attribution documentation across AI training data. Systematic provenance tracing across dataset lineage, source provenance, licensing compliance, and subsequent use is necessary for governance—yet the infrastructure to enforce it at scale remains incomplete.

Domain-Specific Extensions

Specialists in multiple scientific fields have developed domain-specific provenance standards and extensions that build on PROV-O:

Astronomy: The International Virtual Observatory Alliance's VO-PROV model; evaluated for telescope data processing pipelines
Biology/Biotechnology: ISO/DTS 23494-1 (approved 2020), the first international provenance standard for biotechnology, addressing lifecycle documentation of biological materials and research objects; the Provenir Ontology for genetic data
Bioinformatics: CWLProv for computational pipeline documentation in regulatory compliance contexts
Neuroimaging: NIDM (Neuroimaging Data Model) for brain imaging workflows

These extensions demonstrate that while PROV-O provides a universal, domain-agnostic reference model, no single generic model captures all domain-specific needs without specialization.