Git and CI/CD
This page introduces a number of generic Git functionalities and vendor add-ons which can support communities in efficient co-creation of content. The page mainly focuses on the Continuous Integration & Deployment (CI/CD) functionality, but contains many external links to introduce other aspects of Git. Considering the previous materials, a relevant CI/CD case is a set of tasks that runs after a change to some of the mcf documents in a data repository: validate the mcf files, convert them to iso19139 and push them to a catalogue.
Git content versioning
At its core, Git is a version management system traditionally used for maintaining software code. In case you have never worked with Git before, have a look at this Git & GitHub explanation. Some users interact with Git via the command line (shell). However, excellent graphical user interfaces exist to work with Git repositories, such as GitHub Desktop, a Git client within Visual Studio, TortoiseGit, SmartGit, and many others.
These days Git-based coding communities like GitHub, GitLab and Bitbucket offer various services on top of Git to facilitate co-creation of digital assets. Those services include authentication, issue management, release management, forks, pull requests and CI/CD. The types of digital assets maintained via Git vary from software, deployment scripts, configuration files, documents, website content and metadata records up to actual datasets. Git is most effective with text-based formats, which explains the popularity of formats like CSV, YAML and Markdown.
CI/CD
Continuous Integration & Deployment describes a process in which changes in software or configuration are automatically tested and deployed to a relevant environment. These processes are commonly facilitated by Git environments. With every commit to the Git repository an action is triggered which runs some tasks.
GitHub Pages exercise
This exercise introduces the CI/CD topic by setting up a basic Markdown website in GitHub Pages, maintained through Git. Markdown is a popular format to store text with annotations on Git. The site will be based on Quarto. Quarto is one of many platforms to generate a website from a Markdown repository.
Create a new repository in your github account, for example ‘My first CMS’. Tick the ’’
Before we add any content, create a branch ‘gh-pages’ on the repository. This branch will later contain the generated HTML sources of the website.
Create the files docs/index.md and docs/about.md. Start each file with a header:
---
title: Hello World
author: Peter pan
date: 2023-11-11
---
Add some markdown content to each page (under the header), for example:
# Welcome
Welcome to *my website*.
- I hope you enjoy it.
- Visit also my [about](./about.md) page.
- Now click on Actions in the GitHub menu. Notice that GitHub has already set up a workflow to publish our content using Jekyll; it should already be available at https://user.github.io/repo.
Using Quarto
In LSC-hubs we’ve selected an alternative to Jekyll, called Quarto. In order to activate Quarto you need to set up a number of items yourself.
- Create a file _quarto.yml in the new Git repository, with this content:
project:
  type: website
website:
  title: "hello world"
  navbar:
    left:
      - href: index.md
        text: Home
      - about.md
format:
  html:
    theme: cosmo
    toc: true
- Remove the existing workflow generated by GitHub, via Actions, Workflows, Remove.
- First you need to allow the workflow runner to make changes to the repository. For this, open Settings, Actions, General. Scroll down to Workflow permissions. Tick Read and write permissions and click Save. If the option is grayed out, you first need to allow this feature in your organization.
- Then, from Actions, select New workflow, then set up a workflow yourself.
- On the next page we will create a new workflow script, which is stored in the repository at /.github/workflows/main.yml.
name: Docs Deploy
on:
  push:
    branches:
      - main
jobs:
  build-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v3
      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2
        with:
          tinytex: true
          path: docs
      - name: Publish to GitHub Pages (and render)
        uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: gh-pages
          path: docs
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- Save the file. Via Actions you can follow the progress of the workflow at every push to the repository.
- In the logs, notice how a container is initialised, the source code is checked out, the Quarto dependency is installed, and the build is made and pushed to the gh-pages branch.
Notice that the syntax to define workflows is different for every CI/CD platform; however, they generally follow a similar pattern. For GitHub, identify in the file above:
- It defines at what events the workflow should trigger (in this case at push events).
- A build job is triggered, which indicates a container image (runs-on) to run the job in, then triggers some steps.
- The final step triggers a facility of Quarto to publish its output to a GitHub repository.
The above setup is well suited for co-creating a documentation repository for your community. Users can visit the source code via the edit on GitHub link and suggest improvements via issues or pull requests. Notice that this tutorial is also maintained as Markdown in Git.
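The pattern identified in the workflow file above (a trigger, then a job that runs its steps in order) can be sketched in a few lines of Python. This is only a toy model to illustrate the idea, not how GitHub implements it: the dictionary layout and the function names `should_trigger` and `run_workflow` are assumptions made up for this example.

```python
# Toy model of a CI/CD workflow: a trigger plus an ordered list of steps.
# The dict layout loosely mirrors the workflow file but is purely illustrative.


def should_trigger(workflow, event, branch):
    """Return True if the event and branch match the workflow's trigger."""
    trigger = workflow["on"].get(event)
    if trigger is None:
        return False
    branches = trigger.get("branches")
    # No branch filter means the workflow runs for every branch.
    return branches is None or branch in branches


def run_workflow(workflow, event, branch):
    """Run the steps in order and stop at the first failing step."""
    executed = []
    if not should_trigger(workflow, event, branch):
        return executed
    for name, step in workflow["steps"]:
        executed.append(name)
        if not step():
            break  # later steps (for example publishing) are skipped
    return executed


workflow = {
    "on": {"push": {"branches": ["main"]}},
    "steps": [
        ("checkout", lambda: True),
        ("render", lambda: True),
        ("publish", lambda: True),
    ],
}

print(run_workflow(workflow, "push", "main"))     # all three steps run
print(run_workflow(workflow, "push", "feature"))  # nothing runs
```

The point of the sketch is the ordering: because a failing step stops the job, a broken build never reaches the publish step.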
Update catalogue from Git CI/CD
For this scenario we need a database in the cloud to host our records, one which is reachable by GitHub workflows. For the training we suggest using a trial account at elephantsql.com.
- At elephantsql, create a new account.
- Then create a new Instance of type Tiny (free).
- Click on the instance and notice the relevant connection string (URL) and password.
- Connect your instance of pycsw to this database instance, by updating pycsw.cfg and following the instructions at Catalogue publication.
- Verify in the elephantsql dashboard that the records are correctly loaded.
We will now publish our records from Github to our database.
- Create a new repository on Github for the records
- Make sure git-scm (or a GUI tool like GitKraken or SmartGit) is installed on your system.
- Clone (download) the repository to a local folder.
git clone https://github.com/username/records-repo.git
- Copy the mcf files, which have been generated in Catalogue publication, to a datasets folder in the cloned repository.
- Commit the files
git add -A && git commit -m "Your Message"
Before you can push your changes to GitHub, you need to set up authentication. Generally, 2 options are possible:
- Using a personal access token
- Or using an SSH public key
git push origin main
We’ll now set up CI/CD to publish the records.
- Place the pycsw.cfg file in the root of the repository (including the postgres database connection)
- Create a new custom workflow file with this content:
name: Records Deploy
on:
  push:
    paths:
      - '**'
defaults:
  run:
    working-directory: .
jobs:
  build:
    name: Build and Deploy Records
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: 3.9
      - name: Install dependencies
        run: |
          sudo add-apt-repository ppa:ubuntugis/ppa
          sudo apt-get update
          sudo apt-get install gdal-bin
          sudo apt-get install libgdal-dev
          ogrinfo --version
          pip install GDAL==3.4.3
          pip install geodatacrawler pycsw sqlalchemy
      - name: Crawl metadata
        run: |
          export pgdc_webdav_url=http://localhost/collections/metadata:main/items
          export pgdc_canonical_url=https://github.com/pvgenuchten/data-training/tree/main/datasets/
          crawl-metadata --dir=./datasets --mode=export --dir-out=/tmp
      - name: Publish records
        run: |
          pycsw-admin.py delete-records --config=./pycsw.cfg -y
          pycsw-admin.py load-records --config=./pycsw.cfg --path=/tmp
- Verify that the records are loaded on pycsw (through postgres)
- Change or add some records in Git, and verify that the changes are published (this may take some time).
Normally, we would not add a database connection string to a config file posted on GitHub. Instead, GitHub offers secrets to capture this type of information.
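One possible way to keep the connection string out of the repository is sketched below. It assumes the workflow exposes a repository secret as an environment variable (for example via an `env:` entry such as `PYCSW_DATABASE_URL: ${{ secrets.PYCSW_DATABASE_URL }}`; the secret name is hypothetical) and that pycsw.cfg is an INI-style file with a `database` key in a `[repository]` section. Check both assumptions against your own setup; the helper `inject_database_url` is not part of pycsw.

```python
import configparser
import os
import tempfile


def inject_database_url(cfg_path, env_var="PYCSW_DATABASE_URL"):
    """Write the connection string from an environment variable into the config."""
    url = os.environ.get(env_var)
    if not url:
        raise RuntimeError(f"environment variable {env_var} is not set")
    # interpolation=None keeps '%' characters in URLs or passwords literal.
    config = configparser.ConfigParser(interpolation=None)
    config.read(cfg_path)
    if not config.has_section("repository"):
        config.add_section("repository")
    config.set("repository", "database", url)
    # Note: configparser drops comments when it rewrites the file.
    with open(cfg_path, "w") as handle:
        config.write(handle)


# Demo with a throwaway config file and a made-up connection string.
with tempfile.TemporaryDirectory() as tmp:
    demo_cfg = os.path.join(tmp, "pycsw.cfg")
    with open(demo_cfg, "w") as handle:
        handle.write("[repository]\ndatabase=sqlite:///records.db\ntable=records\n")
    os.environ["PYCSW_DATABASE_URL"] = "postgresql://user:secret@example.org/db"
    inject_database_url(demo_cfg)
    check = configparser.ConfigParser(interpolation=None)
    check.read(demo_cfg)
    print(check.get("repository", "database"))
```

In a workflow, such a script would run as a step before the records are published, so the committed pycsw.cfg only needs a placeholder value.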
Cross linking catalogue and Git
While users are browsing the catalogue (or this page), they may find irregularities in the content. They can flag this as an issue in the relevant Git repository. A nice feature is to add a link in the catalogue page which brings them back to the relevant mcf in the git repository. With proper authorisations they can instantly improve the record, or suggest an improvement via an issue or pull request.
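As a rough illustration of such a link, the sketch below combines a base URL (like the canonical URL exported in the records workflow) with the path of a record's source file relative to the datasets folder. The function name `source_link` and the example file path are hypothetical; how the catalogue actually stores or renders this link depends on the tooling you use.

```python
from urllib.parse import quote


def source_link(canonical_url, relative_path):
    """Build a link to a record's source file from a base URL and a relative path."""
    base = canonical_url.rstrip("/")
    # quote() escapes spaces and other unsafe characters but keeps the slashes.
    path = quote(relative_path.lstrip("/"))
    return f"{base}/{path}"


# The file path below is a made-up example.
print(source_link(
    "https://github.com/pvgenuchten/data-training/tree/main/datasets/",
    "soil/index.yml",
))
```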
Summary
In this section you learned about using Actions in GitHub (CI/CD). In the next section we dive into data publication. Notice that you can also use Git CI/CD mechanisms to deploy or evaluate metadata and data services.