GIT and CI/CD

Author

Paul van Genuchten

Published

May 9, 2023

This page introduces a number of generic Git features and vendor add-ons that can support communities in the efficient co-creation of content. The page focuses mainly on Continuous Integration & Deployment (CI/CD), but contains many external links that introduce other aspects of Git. Considering the previous materials, a relevant CI/CD case is a set of tasks that runs after a change to some of the MCF documents in a data repository: validate the MCF files, convert them to ISO 19139, and push them to a catalogue.

GIT content versioning

At its core, Git is a version management system traditionally used for maintaining software code. If you have never worked with Git before, have a look at this Git & Github explanation. Some users interact with Git via the command line (shell). However, excellent graphical user interfaces exist for working with Git repositories, such as GitHub Desktop, the Git client within Visual Studio, TortoiseGit, SmartGit, and many others.

These days Git-based coding platforms such as GitHub, GitLab and Bitbucket offer various services on top of Git to facilitate the co-creation of digital assets. Those services include authentication, issue management, release management, forks, pull requests and CI/CD. The types of digital assets maintained via Git range from software, deployment scripts, configuration files, documents, website content and metadata records to actual datasets. Git is most effective with text-based formats, which explains the popularity of formats like CSV, YAML and Markdown.

CI/CD

Continuous Integration & Deployment (CI/CD) describes a process in which changes to software or configuration are automatically tested and deployed to a relevant environment. These processes are commonly facilitated by Git hosting platforms. Every commit to the Git repository can trigger an action that runs a set of tasks.

Github Pages exercise

This exercise introduces the CI/CD topic by setting up a basic Markdown website on GitHub Pages, maintained through Git. Markdown is a popular format for storing annotated text in Git. The site will be based on Quarto, one of many platforms that generate a website from a Markdown repository.

Create a new repository in your github account, for example ‘My first CMS’. Tick the ’’
Before adding any content, create a branch ‘gh-pages’ on the repository. This branch will later contain the generated HTML sources of the website.
Create the files docs/index.md and docs/about.md. Start each file with a header:

---
title: Hello World
author: Peter Pan
date: 2023-11-11
---

Add some markdown content to each page (under the header), for example:

# Welcome

Welcome to *my website*.

- I hope you enjoy it.
- Visit also my [about](./about.md) page.

Now click on Actions in the GitHub menu. Notice that GitHub has already set up a workflow to publish our content using Jekyll; the site should already be available at https://user.github.io/repo.
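The ‘gh-pages’ branch from the steps above can also be created from the command line instead of the GitHub web interface. The following is a minimal sketch, assuming a local clone that has at least one commit and no uncommitted changes. The function name create_gh_pages_branch is made up for this example. It creates an empty (orphan) branch, a common choice for GitHub Pages, so that the generated HTML does not share history with the Markdown sources.

```shell
# Create an empty 'gh-pages' branch in a local clone and return to the
# branch you started from. Hypothetical helper, not part of Git or Quarto.
# WARNING: 'git reset --hard' discards uncommitted changes, so commit first.
create_gh_pages_branch() {
  local repo_dir="$1"
  local start_branch
  # Remember the branch we are on (for example 'main').
  start_branch=$(git -C "$repo_dir" rev-parse --abbrev-ref HEAD) || return 1
  # Start a branch without history and empty its index and working tree.
  git -C "$repo_dir" checkout --quiet --orphan gh-pages || return 1
  git -C "$repo_dir" reset --quiet --hard || return 1
  # An empty commit makes the branch exist so that it can be pushed.
  git -C "$repo_dir" commit --quiet --allow-empty -m "Initialise gh-pages branch" || return 1
  # Go back to the branch that holds the Markdown sources.
  git -C "$repo_dir" checkout --quiet "$start_branch"
}
```

After running create_gh_pages_branch . inside the clone, git push origin gh-pages uploads the new branch to GitHub.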

Using Quarto

In LSC-hubs we’ve selected an alternative to Jekyll, called Quarto. To activate Quarto you need to configure a number of items yourself.

Create a file _quarto.yml in the new Git repository, with this content:

project:
  type: website
website:
  title: "hello world"
  navbar:
    left:
      - href: index.md
        text: Home
      - about.md
format:
  html:
    theme: cosmo
    toc: true

Remove the existing workflow that GitHub generated, via Actions, Workflows, Remove.
First you need to allow the workflow-runner to make changes on the repository. For this, open Settings, Actions, General. Scroll down to Workflow permissions. Tick the Read and write permissions and click Save. If the option is grayed out, you first need to allow this feature in your organization.
Then, from Actions, select New workflow, then set up a workflow yourself.
On the next page we will create a new workflow script, which is stored in the repository at /.github/workflows/main.yml.

name: Docs Deploy

on:
  push:
    branches: 
      - main

jobs:
  build-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v3
      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2
        with:
          # TinyTeX is only needed when rendering PDF output
          tinytex: true
      - name: Publish to GitHub Pages (and render)
        uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: gh-pages
          path: docs
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Save the file. Via Actions you can follow the progress of the workflow at every push to the repository.
In the logs, notice how a runner is initialised, the source code is checked out, the Quarto dependency is installed, and the build is made and pushed to the gh-pages branch.

Notice that the syntax for defining workflows differs between CI/CD platforms, but they generally follow a similar pattern. For GitHub, identify the following in the file above:

It defines the events at which the workflow should trigger (in this case push events on the main branch).
A build job is defined, which indicates the runner image (runs-on) to run the job on and then lists the steps to execute.
The final step uses a Quarto facility to render the site and publish the output to the gh-pages branch of the repository.

The above setup is well suited to co-creating a documentation repository for your community. Users can visit the source code via the edit on GitHub link and suggest improvements via issues or pull requests. Notice that this tutorial is also maintained as Markdown in Git.

Update catalogue from GIT CI-CD

For this scenario we need a database in the cloud, reachable by GitHub workflows, to host our records. For the training we suggest using a trial account at elephantsql.com.

At elephantsql, create a new account.
Then create a new Instance of type Tiny (free).
Click on the instance and note the relevant connection string (URL) and password.
Connect your instance of pycsw to this database instance by updating pycsw.cfg and following the instructions at Catalogue publication.
Verify in the ElephantSQL dashboard that the records are correctly loaded.

We will now publish our records from Github to our database.

Create a new repository on GitHub for the records.
Make sure git-scm (or a GUI tool like GitKraken or SmartGit) is installed on your system.
Clone (download) the repository to a local folder.

git clone https://github.com/username/records-repo.git

Copy the mcf files, which have been generated in Catalogue publication, to a datasets folder in the cloned repository.
Stage and commit the files:

git add -A && git commit -m "Your Message"

Before you can push your changes to GitHub, you need to set up authentication. Generally, two options are possible:

- Using a personal access token
- Or using an SSH public key

git push origin main
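For the SSH option, the clone has to use the SSH address of the repository instead of the HTTPS address used by git clone above. The following is a minimal sketch, assuming a key pair already exists (for example created with ssh-keygen -t ed25519) and its public key has been added to your GitHub account. The function name switch_remote_to_ssh is made up for this example.

```shell
# Point the 'origin' remote of a clone at the SSH address of the same
# GitHub repository. Hypothetical helper for illustration.
switch_remote_to_ssh() {
  local repo_dir="$1"
  local https_url ssh_url
  https_url=$(git -C "$repo_dir" remote get-url origin) || return 1
  # https://github.com/username/records-repo.git
  #   becomes git@github.com:username/records-repo.git
  ssh_url=$(printf '%s\n' "$https_url" | sed -E 's|^https://github\.com/|git@github.com:|')
  git -C "$repo_dir" remote set-url origin "$ssh_url" || return 1
  # Print the new address so that it can be checked.
  git -C "$repo_dir" remote get-url origin
}
```

Run switch_remote_to_ssh . inside the clone. Afterwards, ssh -T git@github.com should greet you by user name if the key is set up correctly, and the push no longer asks for a password.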

We’ll now set up CI/CD to publish the records.

Place the pycsw.cfg file in the root of the repository (including the postgres database connection)
Create a new custom workflow file with this content:

name: Records Deploy

on: 
  push:
    paths:
      - '**'

defaults:
  run:
    working-directory: .

jobs:
  build:
    name: Build and Deploy Records
    runs-on: ubuntu-latest
    steps:
        - uses: actions/checkout@v3
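        - uses: actions/setup-python@v4
          with:
              # Quoted so that YAML does not read the version as a number
              python-version: '3.9'
        - name: Install dependencies
          run: |
            # -y answers the confirmation prompts, which cannot be answered in a CI run
            sudo add-apt-repository -y ppa:ubuntugis/ppa
            sudo apt-get update
            sudo apt-get install -y gdal-bin
            sudo apt-get install -y libgdal-dev
            ogrinfo --version
            # The Python bindings must match the GDAL version reported above
            pip install GDAL==3.4.3
            pip install geodatacrawler pycsw sqlalchemy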
        - name: Crawl metadata
          run: |
            export pgdc_webdav_url=http://localhost/collections/metadata:main/items
            export pgdc_canonical_url=https://github.com/pvgenuchten/data-training/tree/main/datasets/
            crawl-metadata --dir=./datasets --mode=export --dir-out=/tmp
        - name: Publish records
          run: |   
            pycsw-admin.py delete-records --config=./pycsw.cfg -y
            pycsw-admin.py load-records --config=./pycsw.cfg  --path=/tmp

Verify that the records are loaded in pycsw (through PostgreSQL).
Change or add some records in Git, and verify that the changes are published (this may take some time).

Normally, you would not store a database connection string in a config file published on GitHub. Instead, GitHub offers secrets to hold this type of sensitive information.
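One way to keep the connection string out of the repository is to commit only a template of pycsw.cfg and fill in the database URL during the workflow run. The sketch below assumes a template file containing the placeholder __DATABASE_URL__ and a repository secret that the workflow step passes in as the environment variable DATABASE_URL (for example with env: DATABASE_URL: ${{ secrets.DATABASE_URL }}). The placeholder, the variable name and the function name are assumptions for this example, not pycsw or GitHub conventions.

```shell
# Write a config file from a template, replacing the placeholder
# __DATABASE_URL__ with the value of the DATABASE_URL environment variable.
# Placeholder and variable name are assumptions made for this sketch.
render_config() {
  local template="$1" output="$2"
  if [ -z "${DATABASE_URL:-}" ]; then
    echo "DATABASE_URL is not set" >&2
    return 1
  fi
  # Escape characters that are special in a sed replacement (\, & and the
  # delimiter |); connection strings may contain them in the password.
  local escaped
  escaped=$(printf '%s' "$DATABASE_URL" | sed -e 's/[\\&|]/\\&/g')
  sed "s|__DATABASE_URL__|${escaped}|" "$template" > "$output"
}
```

In the workflow, a step that runs render_config pycsw.cfg.template pycsw.cfg before the Publish records step would produce the file that pycsw-admin.py reads, without the connection string ever being committed.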

Cross linking catalogue and GIT

While users are browsing the catalogue (or this page), they may find irregularities in the content. They can flag this as an issue in the relevant Git repository. A nice feature is to add a link in the catalogue page which brings them back to the relevant mcf in the git repository. With proper authorisations they can instantly improve the record, or suggest an improvement via an issue or pull request.

Summary

In this section you learned about using Actions in GitHub (CI/CD). In the next section we dive into data publication. Notice that you can also use Git CI/CD mechanisms to deploy or evaluate metadata and data services.