Interact with data repositories
In this section a crawler tool is introduced which lets you interact with the metadata in a file-based data repository. For this exercise we have prepared a minimal data repository containing a number of Excel, Shapefile and TIFF files. Unzip the repository to a location on disk.
The root folder of the repository already contains a minimal MCF file, index.yml. (MCF, Metadata Control File, is the YAML-based metadata format used by pygeometa.) This file contains some generic metadata properties which are used if a file within the repository does not include them. The tool we use is able to inherit metadata properties from this index.yml file through the file hierarchy.
Open index.yml and customise the contact details. Later you will notice that these details are applied to all datasets which do not provide contact details themselves. Consider adding additional index.yml files in other folders to override the values of the top-level index.yml.
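As a minimal sketch, assuming the pygeometa MCF conventions for contact blocks (the role and field names may vary between pygeometa versions; the organisation and e-mail values below are placeholders), index.yml could contain:

contact:
  pointOfContact:
    organization: Example Soil Institute
    individualname: Jane Doe
    email: info@example.org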
Environment setup
The tool we will use is based on Python. It has some specific dependencies which are best installed via Conda. Conda creates a virtual Python environment, so any activity will not interfere with the base Python environment of your machine. Notice that the section Python/GDAL via Docker below describes an alternative approach without installing Python locally.
If you don’t have Conda, you can install Anaconda or Miniconda and consider reading the getting started guide.
Now start a command line or PowerShell with conda enabled (or add conda to your PATH). On Windows, look for the Anaconda Prompt in the Start menu. First we will navigate to the folder in which we unzipped the sample data repository. Make sure you are not in the data directory itself but one level above it.
cd {path-where-you-unzipped-zipfile}
We will create a virtual environment pgdc (using Python 3.11) for our project and activate it.
conda create --name pgdc python=3.11
conda activate pgdc
Notice that you can deactivate this environment with conda deactivate; you will then return to the main Python environment. The tools we install below will not be available in the main environment.
Install the dependencies for the tool:
conda install -c conda-forge gdal
conda install -c conda-forge pysqlite3
conda install -c conda-forge pandas
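Alternatively, the three dependencies can be installed in a single command:

conda install -c conda-forge gdal pysqlite3 pandas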
Now we will install the crawler tool, GeoDataCrawler. The tool is under active development at ISRIC and facilitates many of our data workflows. It is powered by some popular metadata and transformation libraries: OWSLib, pygeometa and GDAL.
pip install geodatacrawler
Verify the different crawling options by typing:
crawl-metadata --help
Python/GDAL via Docker
In case you have difficulties setting up Python with GDAL on your local machine (or just want to try it out), an alternative approach is available using Python via Docker. Docker is a virtualisation technology which runs isolated containers on your computer.
- First install Docker.
- Then, in a blank folder, create two files:
Dockerfile
FROM mambaorg/micromamba:1.5.3
# copy the conda environment definition into the image
COPY --chown=$MAMBA_USER:$MAMBA_USER env.yaml /tmp/env.yaml
RUN micromamba install --yes --file /tmp/env.yaml
# install the GDAL, pandas and pysqlite3 dependencies from conda-forge
RUN micromamba install -y -n base -c conda-forge gdal pandas pysqlite3
RUN micromamba clean --all --yes
# activate the environment during the build, so pip installs into it
ARG MAMBA_DOCKERFILE_ACTIVATE=1
RUN pip install geodatacrawler
- and
env.yaml
name: base
channels:
  - conda-forge
dependencies:
  - pyopenssl=20.0.1
  - python=3.9.1
  - requests=2.25.1
- Build the geodatacrawler image (an image is a blueprint from which containers are created). Navigate with PowerShell or the command line to the folder you have just created.
docker build -t org/metatraining .
- Now navigate to the folder where you unzipped the data repository and use the Docker image to run the crawler:
docker run -it --rm org/metatraining crawl-metadata --help
For advanced Docker commands there are some differences between the Windows command line, Windows PowerShell and Linux. Use the relevant syntax for your system.
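For illustration: in the classic Windows command line (cmd) the current working directory is referenced as %cd% rather than $(pwd) (bash) or ${PWD} (PowerShell), so a hypothetical cmd variant of the crawler invocations shown in the next sections would look like:

docker run -it --rm -v "%cd%:/tmp" org/metatraining crawl-metadata --mode=init --dir=/tmp/data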
Initial MCF files
The first task for the tool is to create, for every data file in our repository, a sidecar MCF file based on the metadata embedded in the resource.
Local (conda):

crawl-metadata --mode=init --dir=data

Docker (Linux/macOS shell):

docker run -it --rm -v "$(pwd):/tmp" \
  org/metatraining crawl-metadata --mode=init --dir=/tmp/data

Docker (Windows PowerShell):

docker run -it --rm -v "${PWD}:/tmp" `
  org/metatraining crawl-metadata `
  --mode=init --dir=/tmp/data
Notice that for each resource a {dataset}.yml file has been created. Open a .yml file in a text editor and review the content.
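For reference, the generated sidecar files follow the pygeometa MCF structure; a hypothetical excerpt (identifier and title invented for illustration) might look like:

mcf:
  version: 1.0
metadata:
  identifier: soil-ph-sample   # hypothetical identifier
identification:
  title: Soil pH sample dataset   # hypothetical title derived from the resource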
Update MCF
The update mode is meant to be run at intervals; it will update the MCF files if changes have been made to a resource.
Local (conda):

crawl-metadata --mode=update --dir=data

Docker (Linux/macOS shell):

docker run -it --rm -v "$(pwd):/tmp" \
  org/metatraining crawl-metadata --mode=update --dir=/tmp/data

Docker (Windows PowerShell):

docker run -it --rm -v "${PWD}:/tmp" `
  org/metatraining crawl-metadata `
  --mode=update --dir=/tmp/data
In certain cases the update mode will also import metadata from remote URLs. This happens, for example, if the dataset URI is a DOI: the update mode will then fetch the metadata of the DOI and push it into the MCF.
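For example, a {dataset}.yml containing a DOI in the dataseturi property of the metadata section (placeholder DOI shown) would trigger this behaviour:

metadata:
  dataseturi: https://doi.org/10.5281/zenodo.0000000   # placeholder DOI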
Export MCF
Finally we want to export the MCFs to actual ISO 19139 metadata, to be loaded into a catalogue like pycsw, GeoNetwork or CKAN.
Local (conda):

crawl-metadata --mode=export --dir=data --dir-out=export --dir-out-mode=flat

Docker (Linux/macOS shell):

docker run -it --rm -v "$(pwd):/tmp" \
  org/metatraining crawl-metadata --mode=export --dir=/tmp/data \
  --dir-out=/tmp/export --dir-out-mode=flat

Docker (Windows PowerShell):

docker run -it --rm -v "${PWD}:/tmp" `
  org/metatraining crawl-metadata `
  --mode=export --dir=/tmp/data `
  --dir-out=/tmp/export --dir-out-mode=flat
Open one of the exported XML files and check whether the contact information from step 1 is present.
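To give an idea of what to look for: in ISO 19139 the contact details are typically encoded within a gmd:CI_ResponsibleParty element; the organisation name below is a placeholder:

<gmd:contact>
  <gmd:CI_ResponsibleParty>
    <gmd:organisationName>
      <gco:CharacterString>Example Soil Institute</gco:CharacterString>
    </gmd:organisationName>
  </gmd:CI_ResponsibleParty>
</gmd:contact>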
Summary
In this section you have been introduced to the geodatacrawler library. In the next section we will look at catalogue publication.