Nanopublications

Atomic, citable, provenance-rich units of scientific knowledge on the semantic web

Lead Summary

Nanopublications are the smallest independently citable units of scientific knowledge, designed to express, distribute, and track individual assertions rather than bundling many claims inside larger documents. Each nanopublication contains exactly one atomic claim—encoded as RDF triples—wrapped in mandatory provenance and attribution metadata and identified by a cryptographically verifiable URI. Over 10 million nanopublications are live on a decentralized, peer-to-peer server network spanning nine countries, forming the world's largest machine-readable, provenance-centric linked data resource for science.

The format emerged from the broader project of making scientific knowledge machine-actionable and FAIR (Findable, Accessible, Interoperable, Reusable). Where traditional scholarly publishing locks assertions inside narrative documents, nanopublications make each assertion independently resolvable, queryable, and attributable. Their adoption has concentrated heavily in life sciences and biomedical domains, though the format is technically domain-agnostic.

Historical Development

The nanopublication concept was defined and operationalized during a foundational period from 2013 to 2016, centered on Tobias Kuhn, Christopher Chichester, and Michel Dumontier. A sequence of landmark papers defined the format and built the infrastructure:

2013: "Broadening the scope of nanopublications" — established the core model and addressed the formalization question
2014: "Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data" — provided the cryptographic identity layer
2015: "Publishing without Publishers: a Decentralized Approach" — proposed and prototyped the server network
2016: "Decentralized provenance-aware publishing with nanopublications" in PeerJ Computer Science — the definitive peer-reviewed treatment of the combined system, authored by Kuhn, Chichester, Krauthammer, Queralt-Rosinach, Verborgh, Giannakopoulos, Ngonga Ngomo, Viglianti, and Dumontier

Concurrent with the theoretical work, biomedical deployments established the approach in production. The DisGeNET gene-disease database was published as nanopublications in 2015–2016, becoming the first major biomedical knowledge graph in nanopublication format. The Huntington's Disease experimental data case study demonstrated the model's value for making high-throughput experimental data—which otherwise ends up lost in supplementary materials—machine-readable and citable.

By 2018, a comprehensive review documented over 10 million nanopublications spanning 378,654,287 RDF triples, confirming that the system had moved beyond proof-of-concept into active deployment. More recent extensions (2024–2025) introduced a fourth component called "knowledge provenance" to capture multi-source evidence and conflicting claims.

Core Concepts

The Assertion-First Publishing Model

Nanopublications embody a fundamental shift from document-centric to assertion-centric publishing. In traditional scholarly publishing, claims are bundled inside narratives—a paper may contain dozens of assertions, none independently citable at the claim level. Nanopublications invert this: each assertion is the primary unit of publication. Researchers can publish minimal semantic contributions directly, without aggregating them into documents, enabling distributed and asynchronous knowledge contribution.

This shift enables a form of microattribution: each tiny publishable assertion is formally linked to its contributor via a unique scientific identity. A 2011 Nature Genetics study showed that microattribution increased the reporting of human genetic variants, leading to more comprehensive resources for describing human variation. Nanopublications create the infrastructure to apply this incentive mechanism to scientific claims generally.

FAIR Alignment

Nanopublications are structurally aligned with the FAIR Guiding Principles:

Findable: Each nanopublication receives a unique, persistent Trusty URI
Accessible: The decentralized server network provides open, redundant access
Interoperable: RDF and semantic web standards enable machine-to-machine exchange
Reusable: Provenance and publication info graphs document context, authorship, and derivation

The FAIR Guiding Principles explicitly require that metadata be associated with detailed provenance in machine-readable form as part of the "Reusable" criterion. Nanopublications are designed to satisfy this structurally, embedding provenance as an inseparable component rather than as a post-hoc annotation.

Components & Structure

A nanopublication is technically a four-named-graph RDF dataset, though it is typically described by its three semantic components.

Fig 1. The four-graph nanopublication structure

The Head Graph

The head graph acts as a container that identifies all three component graphs and provides the URI by which the entire nanopublication is referenced. It is what makes the three semantic graphs function as a single, coherent, publishable unit.

The Assertion Graph

The assertion graph contains the core scientific claim expressed as one or more RDF triples in subject–predicate–object form. This is the atomic, core intellectual content of the nanopublication—a single fact, relationship, or finding that can be uniquely identified and independently cited.

The Provenance Graph

The provenance graph describes the derivation and context of the assertion by linking to scientific methods, source data, or experimental procedures used to arrive at the finding. It enables downstream consumers to assess the credibility and reliability of the assertion by tracing its epistemic origins. In the DisGeNET deployment, for example, the provenance graph distinguishes expert-curated associations from text-mined ones—a distinction directly relevant for drug discovery decisions.
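One way to picture how a consumer might act on the provenance graph is a filter over evidence types. The sketch below is illustrative only: the record layout, the `evidence` labels, and the gene and disease names are hypothetical stand-ins, not DisGeNET's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopub:
    assertion: tuple  # (subject, predicate, object)
    evidence: str     # hypothetical provenance label: "curated" or "text-mined"
    source: str       # hypothetical description of where the assertion came from


def select_by_evidence(nanopubs, allowed):
    """Keep only the nanopublications whose provenance label is in `allowed`."""
    return [item for item in nanopubs if item.evidence in allowed]


nanopubs = [
    Nanopub(("gene:A", "associatedWith", "disease:X"), "curated", "expert database"),
    Nanopub(("gene:B", "associatedWith", "disease:X"), "text-mined", "literature mining"),
    Nanopub(("gene:C", "associatedWith", "disease:Y"), "curated", "expert database"),
]

# A cautious consumer might restrict a query to expert-curated associations.
curated_only = select_by_evidence(nanopubs, {"curated"})
for item in curated_only:
    print(item.assertion, "from", item.source)
```

Because the evidence type travels with each assertion, the filter needs no lookup in a separate database; in a real deployment the same selection would typically be expressed as a query over the provenance graph.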

The Publication Info Graph

The publication info graph contains metadata about the nanopublication itself: who created it (the creator's persistent identifier) and when (timestamp). This graph is explicitly separated from the provenance graph, distinguishing who published the nanopublication from how the assertion was derived. Across the 10+ million live nanopublications, this graph records an average of 4.4 creators per nanopublication and 47+ million creator mentions in total.

Serialization

TriG and N-Quads are the recommended RDF serialization formats for nanopublications. Both support the quad representation (subject–predicate–object–graph) necessary to express named graph structure. TriG provides compact prefix-based notation; N-Quads provides a line-oriented flat format suited to streaming and concatenation.
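The difference between the two serializations can be sketched with plain string handling. The example below assumes a hypothetical nanopublication at `http://example.org/np1` with made-up assertion, provenance, and creator IRIs; the `np:` schema terms and the PROV and Dublin Core predicates are the ones commonly used in nanopublications, but the dataset itself is invented for illustration and the TriG output omits prefix declarations.

```python
from collections import defaultdict

NP = "http://www.nanopub.org/nschema#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
BASE = "http://example.org/np1"  # hypothetical nanopublication URI


def iri(value):
    """Wrap an IRI in angle brackets, as both N-Quads and TriG expect."""
    return f"<{value}>"


HEAD, ASSERTION = iri(BASE + "#Head"), iri(BASE + "#assertion")
PROVENANCE, PUBINFO = iri(BASE + "#provenance"), iri(BASE + "#pubinfo")

# Each quad is (subject, predicate, object, graph).
quads = [
    (iri(BASE), RDF_TYPE, iri(NP + "Nanopublication"), HEAD),
    (iri(BASE), iri(NP + "hasAssertion"), ASSERTION, HEAD),
    (iri(BASE), iri(NP + "hasProvenance"), PROVENANCE, HEAD),
    (iri(BASE), iri(NP + "hasPublicationInfo"), PUBINFO, HEAD),
    (iri("http://example.org/geneA"), iri("http://example.org/associatedWith"),
     iri("http://example.org/diseaseX"), ASSERTION),
    (ASSERTION, iri("http://www.w3.org/ns/prov#wasDerivedFrom"),
     iri("http://example.org/study1"), PROVENANCE),
    (iri(BASE), iri("http://purl.org/dc/terms/creator"),
     iri("http://example.org/researcher1"), PUBINFO),
]


def to_nquads(quads):
    """One statement per line, with the graph name as the fourth term."""
    return "\n".join(f"{s} {p} {o} {g} ." for s, p, o, g in quads)


def to_trig(quads):
    """Group statements into one block per named graph."""
    by_graph = defaultdict(list)
    for s, p, o, g in quads:
        by_graph[g].append(f"  {s} {p} {o} .")
    return "\n".join(
        f"{g} {{\n" + "\n".join(lines) + "\n}" for g, lines in by_graph.items()
    )


print(to_nquads(quads))
print(to_trig(quads))
```

The N-Quads output repeats the graph name on every line, which is what makes it easy to stream and concatenate; the TriG output states each graph name once, and a real TriG file would normally add `@prefix` declarations to shorten the IRIs further.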

Mechanism & Process

Trusty URIs: Cryptographic Identity

Every nanopublication is identified by a Trusty URI—a URI that embeds a cryptographic hash computed over the entire content of the nanopublication. This design has several consequences:

Immutability by construction: Any change to the underlying content produces a different hash and therefore a different URI. No external write-protection mechanism is needed; identity and immutability are the same property.
100% verifiable: Any retrieved nanopublication can be verified against its identifier with certainty. A hash mismatch definitively indicates tampering or corruption.
Transitive integrity: When a nanopublication links to other trusty URI-identified resources, integrity guarantees propagate transitively. Modifying any resource in the reference chain breaks the hash chain.
Digital signature integration: Digital signatures can be embedded by computing the signature over the full nanopublication content, then computing the trusty URI hash as the final step to cover the signature itself, yielding a signed, immutable, verifiable artifact.

Trusty URIs use two modules: Module R operates at the RDF graph level (format-independent); Module F operates at the byte level for binary or unstructured files.
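The properties listed above follow from deriving the identifier from the content. The sketch below is a deliberately simplified stand-in, not the Trusty URI specification: it assumes statements are already given as strings, canonicalizes them by sorting, and appends a SHA-256 digest in URL-safe base64 after a dot. The function names and the identifier layout are hypothetical, and real RDF-level normalization is considerably more involved.

```python
import base64
import hashlib


def content_hash(statements):
    """Hash a collection of statement strings independently of their order."""
    canonical = "\n".join(sorted(statements)).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def make_identifier(base_uri, statements):
    """Append the content hash to a base URI (hypothetical layout)."""
    return f"{base_uri}.{content_hash(statements)}"


def verify(identifier, statements):
    """Check that the hash embedded in the identifier matches the content."""
    return identifier.rsplit(".", 1)[-1] == content_hash(statements)


original = ["<geneA> <associatedWith> <diseaseX> ."]
tampered = ["<geneA> <associatedWith> <diseaseY> ."]

np1 = make_identifier("http://example.org/np1", original)
print(verify(np1, original))   # content matches its identifier
print(verify(np1, tampered))   # any edit breaks the match

# Transitive integrity: a second artifact that cites np1 by its hashed identifier.
citing = [f"<http://example.org/claim2> <http://example.org/basedOn> <{np1}> ."]
np2 = make_identifier("http://example.org/np2", citing)

# Editing the first artifact gives it a new identifier, so np2 still points at
# the original content and cannot be silently redirected to the edited version.
np1_edited = make_identifier("http://example.org/np1", tampered)
print(np1_edited != np1)
```

Sorting before hashing is what makes the identifier independent of statement order in this toy version; it plays the role that graph-level normalization plays in a format-independent scheme.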

The Decentralized Server Network

Nanopublications are published on a peer-to-peer network of fifteen server instances across nine countries. The network has no single point of failure: nanopublications are automatically replicated across servers, and data persistence is independent of any individual server's uptime.

The immutability property of Trusty URIs directly simplifies distributed systems engineering: servers only add new entries, never update them. This eliminates concurrency control and identifier management problems endemic to distributed write systems. Identifier uniqueness is guaranteed cryptographically, not through centralized coordination.
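The add-only property can be illustrated with a toy in-memory model. Everything here is an assumption made for the sketch: the `Server` class, the use of a plain SHA-256 hex digest as the identifier, and the pull-based synchronization are not the network's actual protocol.

```python
import hashlib


def content_id(content):
    """Derive the identifier from the content itself (simplified)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class Server:
    """Toy append-only store: entries are added, never updated or deleted."""

    def __init__(self):
        self._store = {}

    def add(self, content):
        identifier = content_id(content)
        # setdefault never overwrites, so re-adding the same content is a no-op.
        self._store.setdefault(identifier, content)
        return identifier

    def identifiers(self):
        return set(self._store)

    def pull_from(self, peer):
        """Copy every entry the peer holds that this server lacks."""
        missing = peer._store.keys() - self._store.keys()
        for identifier in missing:
            self._store[identifier] = peer._store[identifier]
        return len(missing)


server_a, server_b = Server(), Server()
id_one = server_a.add("claim one")
id_two = server_b.add("claim two")

# Synchronization is a set union, so the order of the pulls does not matter.
server_a.pull_from(server_b)
server_b.pull_from(server_a)
print(server_a.identifiers() == server_b.identifiers())
```

Since two servers that receive the same content independently compute the same identifier, there is nothing to reconcile: merging reduces to copying whatever is missing.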

Publishing a nanopublication

A researcher uploads a nanopublication to any server on the network. Its Trusty URI is derived from its content rather than allocated by the server, so the receiving server only needs to store it locally and replicate it to peer servers. The same nanopublication can then be retrieved from any server using the same identifier, with cryptographic verification at retrieval time.
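Retrieval with verification might look like the following sketch. It assumes the identifier is a hash of the content, uses a plain SHA-256 hex digest in place of a real Trusty URI, and models each server as a dictionary; the server names and the `fetch_verified` helper are hypothetical.

```python
import hashlib


def content_id(content):
    """Simplified content-derived identifier (stand-in for a Trusty URI)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def fetch_verified(identifier, servers):
    """Try servers in order and return the first copy whose hash matches."""
    for name, store in servers.items():
        content = store.get(identifier)
        if content is not None and content_id(content) == identifier:
            return name, content
    raise LookupError(f"no server returned a valid copy of {identifier}")


good = "<geneA> <associatedWith> <diseaseX> ."
identifier = content_id(good)

servers = {
    "server-1": {},                                     # has not replicated it yet
    "server-2": {identifier: good.replace("X", "Y")},   # holds a corrupted copy
    "server-3": {identifier: good},                     # holds an intact copy
}

served_by, content = fetch_verified(identifier, servers)
print(served_by, content == good)
```

The client does not need to trust any particular server: a missing or corrupted copy is simply skipped, and the next server is tried.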

Notable Examples

DisGeNET

DisGeNET is a comprehensive database of human gene-disease associations that integrates expert-curated databases (UniProt, CTD, GAD, MGD) with text-mined associations. It was published as nanopublications with Trusty URIs and became the first major biomedical knowledge graph deployed in nanopublication format, demonstrating that distinguishing curated from text-mined evidence at the assertion level supports evidence-based hypothesis generation in drug discovery.

Huntington's Disease Case Study

A case study applying nanopublications to Huntington's Disease experimental data addressed a documented failure mode in science: high-throughput experimental data archived as supplementary material in arbitrary formats becomes lost from scientific discourse, elusive to automated search and processing. Nanopublications provided the attribution incentive and semantic structure to make that data reusable.

Biodiversity Data Journal

The Biodiversity Data Journal (BDJ) deployed a production nanopublications workflow through the ARPHA Writing Tool, enabling authors to add nanopublication sections during manuscript preparation. The system integrates controlled vocabularies from ChecklistBank, Catalogue of Life, GBIF, GenBank/ENA, BOLD, and Darwin Core for species occurrences, taxonomy, and biotic interactions—representing the first large-scale, journal-integrated nanopublication deployment in biodiversity publishing.

OpenBiodiv

OpenBiodiv is an RDF-based biodiversity knowledge graph that extracts information from 5,000+ scholarly articles and integrates it using the OpenBiodiv-O ontology and GBIF taxonomic backbone, updated daily via Apache Kafka streaming. It enables complex cross-database queries that were previously impossible, exemplifying nanopublication principles at ecosystem scale.

COVID-19 Knowledge Graphs

The COVID-19 period accelerated deployment of nanopublication-style structured data. CovidPubGraph contains 268 million RDF triples and is linked to 9 other datasets through over 1 million semantic links, following FAIR and Linked Data principles for rapid knowledge synthesis during a crisis.

Nanopublications vs. Micropublications

Micropublications extend nanopublications by explicitly modeling complete scientific arguments: a focal claim paired with supporting or contradicting evidence, qualifications, and rebuttals. Nanopublications represent atomic assertions with provenance; micropublications capture the structured argumentation that natural language scholarly communication uses to surround those assertions.

Nanopublications vs. Research Objects (RO-Crate)

Research Objects bundle complete experimental workflows—datasets, code, configuration, results, and documentation—into semantically typed packages, addressing reproducibility at the workflow and execution level. Nanopublications operate at the level of individual assertions. The two are complementary: a Research Object might cite nanopublications as evidence sources while itself being cited as the provenance behind new nanopublications.

Nanopublications vs. Research Knowledge Graphs

Research Knowledge Graphs (RKGs) represent scholarly information at the meta-level of articles, contributions, and fields—publication metadata, authorship, institutions, concepts. They are centralized or federated infrastructure for scholarly discovery. Nanopublications operate inside the research, at the level of individual scientific claims and their evidence chains. The Open Research Knowledge Graph (ORKG) consumes nanopublications as inputs.

Nanopublications vs. CiTO

CiTO (Citation Typing Ontology) enables machine-readable description of citation intent and rhetorical relationships between scholarly works. It operates at the level of citations between papers, not at the level of individual assertions within papers. Both CiTO and nanopublications are expressed in RDF; CiTO belongs to the SPAR ontology suite for semantic scholarly communication.

Controversies & Debates

The Formalization Barrier

Requiring scientific results to be expressed in formal RDF/OWL is the most consistently documented barrier to nanopublication adoption. Scientists present conclusions in natural language; many find formalization either unrealistic or unduly restrictive. A 2023 field study found that even authors who participated in a nanopublication-based peer review process acknowledged the formalization requirement as a barrier to routine use. The foundational 2013 paper recognized this and proposed allowing informal English annotations alongside formal assertions as a transitional measure.

Domain Concentration

Despite being designed as a domain-agnostic format, nanopublications are in practice overwhelmingly concentrated in the life sciences. The 10+ million live nanopublications predominantly describe genes, proteins, diseases, drugs, and biological pathways. A 2023 proof-of-concept on physician suicide explicitly noted that "nanopublications remain to be adopted in broader scientific publishing in medicine." Physics and chemistry have seen essentially no uptake.

Humanities Incompatibility

Humanities scholarship is particularly poorly suited to nanopublications. Humanities arguments and data are inextricably tied to discursive context that cannot be fully captured in formal RDF schemas. The interpretive plurality, historical context, and nuance that are epistemologically central to humanities inquiry conflict with the fixed schema and formal expression requirements of the nanopublication model. A case study using nanopublications for a period gazetteer (PeriodO) found the practical solution required deliberately simplifying the schema to only what was "necessitated by the practical needs" of users—a conscious retreat from nanopublication ideals.

Metadata Overhead

Each nanopublication's four-graph structure creates significant metadata overhead: auxiliary information about the structure, plus repetitive provenance data, multiplies the volume of RDF triples stored and transmitted. The 2018 comprehensive review explicitly noted this "explosion in the number of triples" was "probably hindering further adoption." The tradeoff between infrastructure cost and enhanced scientific attribution and integrity remains unresolved.
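A rough calculation gives a sense of the scale. The triple total is the figure quoted from the 2018 review; the nanopublication count is rounded down to exactly 10 million, which is an assumption, so the resulting average is an upper bound. The minimal layout (a four-triple head graph, one assertion triple, one provenance triple, and two publication info triples for creator and timestamp) is likewise an illustrative assumption rather than a measured figure.

```python
total_triples = 378_654_287   # figure quoted from the 2018 review
nanopub_count = 10_000_000    # assumption: "over 10 million" rounded down

# Upper bound on the average, since the true count is above 10 million.
avg_triples_per_nanopub = total_triples / nanopub_count

# Assumed minimal nanopublication: triples per named graph.
minimal_layout = {"head": 4, "assertion": 1, "provenance": 1, "pubinfo": 2}
minimal_total = sum(minimal_layout.values())
overhead_triples = minimal_total - minimal_layout["assertion"]

print(f"average triples per nanopublication: at most {avg_triples_per_nanopub:.1f}")
print(f"minimal nanopublication: {minimal_total} triples for "
      f"{minimal_layout['assertion']} assertion triple")
print(f"structural and metadata triples per assertion triple: {overhead_triples}")
```

Under these assumptions a single-triple claim carries seven additional triples, which illustrates why the overhead becomes noticeable once millions of small assertions are published.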

Dependency on a Struggling Ecosystem

Nanopublications are built on top of the Semantic Web and RDF/Linked Data infrastructure, an ecosystem that has itself faced persistent adoption barriers and limited large-scale uptake over two decades. Steep learning curves for RDF and SPARQL, a lack of major vendor support, the complexity of building accurate ontologies, and an organizational preference for simpler database technologies all constrain nanopublication adoption.

Current Status

Nanopublications are positioned as components of the European Open Science Cloud (EOSC) ecosystem, framed as FAIR Digital Objects aligned with the EOSC Interoperability Framework. Research has established "remarkable congruence" between proposed FAIR Digital Object specifications and the existing nanopublication infrastructure.

Recent work (2024–2025) extends the core model with a "knowledge provenance" fourth component, captured via the PROV-K ontology, to represent multi-source evidence and conflicting claims—moving from a model that represents individual assertions to one that can represent contested knowledge landscapes.

The provenance-based trust model continues to be studied critically. Provenance metadata and trust networks cannot alone guarantee source reliability or verify numeric validity: computational trust systems may weight sources based on stylistic consistency rather than methodological rigor, and well-documented provenance trails can embed systematic errors or intentional deception. Supplementary verification—statistical validation, comparative source evaluation, human expert judgment—remains necessary.

Key Takeaways

Nanopublications are the smallest independently citable units of scientific knowledge: Each nanopublication contains exactly one atomic claim encoded as RDF triples, wrapped in mandatory provenance and attribution metadata. Over 10 million are live on a decentralized peer-to-peer server network, forming the world's largest machine-readable, provenance-centric linked data resource for science.
Assertion-centric publishing inverts the traditional document-centric model: Rather than bundling claims inside narrative documents, nanopublications make each assertion independently resolvable, queryable, and attributable. This enables distributed and asynchronous knowledge contribution through microattribution.
Trusty URIs provide cryptographic identity and immutability: Every nanopublication is identified by a URI that embeds a cryptographic hash of its entire content. Any change produces a different hash and URI, guaranteeing immutability by construction rather than through external write-protection mechanisms.
Adoption is concentrated in life sciences despite domain-agnostic design: The 10+ million live nanopublications predominantly describe genes, proteins, diseases, drugs, and biological pathways. Physics, chemistry, and humanities scholarship have seen minimal uptake, largely due to formalization barriers and domain-specific incompatibilities.
The semantic web infrastructure creates both opportunity and constraint: Nanopublications depend on RDF, SPARQL, and linked data technologies, an ecosystem with persistent adoption barriers and limited vendor support. This constrains scaling while enabling the machine-actionable semantic representation that powers FAIR compliance.

Further Exploration

Foundational Papers

Decentralized provenance-aware publishing with nanopublications — Kuhn et al. (2016) — the definitive peer-reviewed treatment of the complete system
Broadening the scope of nanopublications — Kuhn et al. (2013) — original scope and formalization discussion
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data — The cryptographic identity layer

Empirical & Case Studies

Nanopublications: A Growing Resource of Provenance-Centric Scientific Linked Data — Dumontier et al. (2018) — comprehensive review with network-scale data
Publishing DisGeNET as Nanopublications — The landmark biomedical production deployment
Nanopublications tailored to biodiversity data — Biodiversity Data Journal integration
Semantic micro-contributions with decentralized nanopublication services — The case for assertion-centric publishing

Infrastructure & Access

Nanopublication Guidelines — The normative specification
Search, access, and explore life science nanopublications on the Web — Practical access and discovery infrastructure

Barriers & Debates

Lowering Barriers to RDF and Linked Data Adoption — The broader ecosystem barriers constraining nanopublication deployment
Humanities and Nanopublications: The PeriodO Case Study — Why humanities scholarship is poorly suited to the model

Quick reference

Type: Publishing model / data format
Field: Semantic web, scholarly communication
Foundational period: 2013–2016
Key figures: Tobias Kuhn, Michel Dumontier, Christopher Chichester
Technical basis: RDF named graphs, Trusty URIs
Network scale: 10+ million published nanopublications
Primary domain: Life sciences and biomedical research
FAIR alignment: Findable, Accessible, Interoperable, Reusable
Serialization: TriG, N-Quads