1.2 FAIR data

The FAIR data principles applied to Soil data

Paul van Genuchten

2023-09-15

Findable - Accessible - Interoperable - Reusable

go-fair.org

Findable

  • The first step in (re)using data is to find them.
  • Metadata and data should be easy to find for both humans and computers.
  • Machine-readable metadata are essential for automatic discovery of datasets and services.
  • Persistent identification

Persistent identification

Persistent identification, for continued findability

  • Consider that a proper id can outlive a project (or organisation)
  • Choice of domain and path (owned, authoritative, neutral, prevent names)
  • Set up an identification proxy (doi.org/w3id.org)

Metadata

  • Metadata describes title, abstract, author of a resource
  • Facilitate findability and understand if a resource is relevant
  • Can organize resources in groups by tagging

Standards for metadata exchange

  • Facilitate the exchange of resources between communities
  • Protocols to exchange metadata:
Community Standard Protocol
Open data/Sematic web DCAT SPARQL
Science Datacite OAI-PMH
Geospatial iso19115 CSW
Earth observation STAC STAC
Search engines Schema.org json-ld/microdata
Ecology EML

Catalogue

  • Records are brought into a catalogue, where they can be searched and assessed
  • Catalogues can exchange records to increase discoverability
  • Catalogues can cross borders between communities by transforming metadata to relevant community standards and protocols

Search engines

  • Search engines crawl the content of catalogues
  • If a catalogue supports schema.org annotations, the content can also be extracted in a structured way
  • Example

Accessibility

  • (Meta)data are retrievable by their identifier using a standardised communications protocol
  • Metadata are accessible, even when the data are no longer available

Persistence

  • Move the resource to a shared environment (backup)
  • Consider a URL strategy
  • Use a facade identifier (DOI)

Data lifecycle

  • Consider upfront when to remove a resource (10 yrs?)
  • What happens to the URI of a resource which is archived?
  • Metadata should stay available, even if the data are no longer

Repository software

  • Webdav (or webserver software)
  • Zenodo, Dataverse
  • Document Management Systems (DMS)
  • Cloud storage (google drive, dropbox, Amazon, Sharepoint)

testing tools

  • Automated link checking
  • Usage logs filter on status 404 & referer
  • Google Search Console notifies broken links

Interoperable

  • (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  • (Meta)data use vocabularies that follow FAIR principles
  • (Meta)data include qualified references to other (meta)data

Universal formats

Facilitates accessing a resource with commonly available tooling, or be refactored if a software is abandoned

  • Proprietary vs Open (eg. ecw vs tiff)
  • De facto vs Formalised (eg. YAML vs XML)
  • Binary vs text based (eg. shapefile vs GeoJSON)
  • Cloud optimised vs Cloud native vs traditional (eg. COG vs GeoZarr vs GeoTiff)
  • Embedded metadata

Adopt common vocabularies

Adopting a standardised model enables aggregation of data.

  • Relational models
  • UML/GML models
  • Semantic web ontologies

Relevant vocabularies

  • ISO28258 / INSPIRE / GLOSIS Web Ontology
  • Agrovoc
  • WRB / FAO Soil Classification Guidelines

Reusable

  • (Meta)data are richly described with a plurality of accurate and relevant attributes

  • (Meta)data are released with a clear and accessible data usage license

  • (Meta)data are associated with detailed provenance

  • (Meta)data meet domain-relevant community standards