Metadata at the source
The FAIR principles are designed with the academic community in mind. Researchers sharing information. However the FAIR principles can also be applied within an organisation to improve data management, even on your personal computer. If you return for example to an archived project 2 years later to recover some data sources.
Many organisations organise their documents and datasets on a central network storage or database. These resources are usually clustered in organisational units, projects and/or years. Some files and database tables in that central storage contain embedded metadata, such as the name, size, date, author, location etc. This information supports users in understanding the context of the data source. Especially if that data at some point is migrated from its original context.
For those formats which do not have embedded metadata, or in order to capture additional metadata aspects. We endorse the creation of a sidecar
metadata file for every resource, a dedicated metadata file sharing the name of the datasource. This approach is for example common in the ESRI community, where a .shp.xml
is created alongside any .shp
file, which captures some metadata elements.
Locate on your local computer or network drive a random shapefile. Does the file have a .shp.xml sidecar file? Else find another shape of tiff file (look for *.shp.xml
). The xml file may be very limited, but in most cases at least some processing steps and the data model of the shapefile are mentioned.
Through the embedded metadata and sidecar concept, we endorse data scientists to document their data at the source. Since they know best how the data is produced and how it should be used.
Standardised metadata models
For optimal interoperability, it is important to agree within your group on the metadata standard(s) to use in sidecar files. ESRI software for example provides an option to select the output model of the metadata. QGIS has various plugins, such as GeoCat Bridge, to work with various metadata models.
Does your organisation endorse a metadata model to describe data sources? Are your aware of tooling which can support you in creation of metadata in this model?
The MCF format
Within the geopython community a metadata format is used called the metadata control file (MCF). Aim of the format is ease of use, while providing export options to various metadata models. Many metadate models are based on XML, which makes them quite hard to read by humans. MCF is based on YAML, a text based format using indents to cluster elements. In this workshop we are using the MCF format due to its simplicity and natural fit with the use cases. A minimal sample of MCF is:
mcf:
version: 1.0
metadata:
identifier: 9c36a048-4d28-453f-9373-94c90e101ebe
hierarchylevel: dataset
date: 2023-05-10
identification:
title: My favourite dataset
abstract: A sample dataset record to highlight the options of MCF
...
Describing a resource
When describing a resource, consider which user groups are expected to read the information. This analyses will likely impact the style of writing in the metadata. The UK Geospatial Commission has published some practical recommendations on this topic.
When tagging the dataset with keywords, preferably use keywords from controlled vocabularies like Agrovoc, Wikipedia, etc. Benefit of controlled vocabularies is that the term is not ambigue and it can be made available in multiple languages.
MCF editing
MCF documents can best be written in a text editor like Visual Studio Code. Consider to install the YAML plugin for instant YAML validation.
Another option to create and update mcf files is via MDME. MDME is a web based software package providing a dynamic metadata edit form. An operational package is available at osgeo.github.io. Notice that if you install the package locally, you can customize the application to your organisational needs.
Imagine a dataset you have recently worked with. Then open mdme and populate the form, describing that dataset. Now save the MCF file so we can later place it in a sample data repository.
Notice that MDME also offers capabilities to export directly as iso19139, it uses a webservice based on the tools used in this workshop.
Summary
In this section, you are introduced to a data management approach which maintains metadata at the location where the datasets are maintained, using a minimal, standards complient approach. You are introduced to the MCF metadata format. In the next section, we will go into more detail on interacting with the MCF format.