GitLab on Kubernetes

Continuing my GitLab and Kubernetes (k8s) odyssey from k8s @ Debian, I’ve learned two things:

  1. DNS problems with Alpine Linux
  2. SSL certificate hell

DNS problems with Alpine Linux

Many Docker images are based on Alpine Linux, which is smaller than most Debian images. But they use a different C library named musl, which is smaller but more restricted than glibc. Among other things, it uses a different DNS resolver, which sometimes fails for me.

The symptoms are that the GitLab runner fails the initial git clone:

Getting source from Git repository
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/work/XXX/.git/
Created fresh repository.
fatal: unable to access 'https://git.XXX.XXX.de/XXX/XXX.git/': Could not resolve host: git.XXX.XXX.de

Strangely this does not happen every time, but mostly after the cluster has been idle for some time. Manually restarting the failed job usually fixed the problem, as the host was then resolvable.

There is a bug where the initial DNS lookup is delayed by 5 s or more. I can trigger it with the following command:

kubectl run -it --rm --restart=Never busybox --image=alpine:3.12 -- nslookup git.XXX.XXX.de
...
;; connection timed out; no servers could be reached
...

Search order

Looking at /etc/resolv.conf you will find this:

nameserver 169.254.25.10
search default.svc.cluster.local svc.cluster.local cluster.local pmhahn.XXX
options ndots:5

The ndots:5 option leads to the following (strange) situation: as git.XXX.XXX.de has fewer than 5 dots, the resolver will try to resolve the following names in order:

  1. git.XXX.XXX.de.default.svc.cluster.local
  2. git.XXX.XXX.de.svc.cluster.local
  3. git.XXX.XXX.de.cluster.local
  4. git.XXX.XXX.de.pmhahn.XXX
  5. git.XXX.XXX.de.

Only the last one will succeed, but the others take time to fail first.
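The expansion above can be sketched in a few lines of Python. This is a simplified model of the search-list logic, not the actual musl or glibc code; the function name candidate_names and the host git.example.com are placeholders for illustration, and the redacted local search domain is left out.

```python
def candidate_names(name, search, ndots=5):
    """Return the names a stub resolver would try, in order (simplified)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: the name is tried as-is first. Simplification: some
        # resolvers fall back to the search list afterwards, others do not.
        return [absolute]
    # Too few dots: every search domain is tried before the name itself.
    return [f"{name}.{domain}" for domain in search] + [absolute]


# Search list as in the resolv.conf above (placeholder host name).
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in candidate_names("git.example.com", search):
    print(candidate)
```

With only two dots in the name, every search domain is tried first, which is why each lookup of an external host costs several failing queries before the real one.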

This should be avoidable by appending a trailing dot to the host name, as in https://git.XXX.XXX.de./, but that did not work for me: using clone_url = "https://git.XXX.XXX.de./" in config.toml did not improve the situation.

IPv6

Our environment is IPv6-ready, but not every host has an IPv6 address. Notably, our Git server does not have one.

When the name is resolved, both IPv4 and IPv6 addresses are queried. The musl DNS resolver seems to perform both queries in parallel, re-using the same UDP socket. There seems to be a known problem with this, as it confuses the Linux connection tracking code. The NXDOMAIN answer for IPv6 is then also used for IPv4 and DNS resolution fails.

NodeLocal DNSCache

I’ve already enabled NodeLocal DNS, which was recommended by this article. Initially this looked promising, but now the problems are back.

My hack

For the GitLab k8s runner there is currently no way to override ndots:5. Therefore I set up a pre_clone_script to rewrite /etc/resolv.conf:

envVars:
  - name: RUNNER_PRE_CLONE_SCRIPT
    value: 'd="$(grep ^nameserver /etc/resolv.conf)";echo "$d" >/etc/resolv.conf'
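To make the effect of that one-liner explicit, here is a small Python sketch of the same transformation. The function name keep_nameservers is made up for illustration, and the sample input mirrors the resolv.conf shown earlier without the redacted domain. Only the nameserver lines survive, so the search list and ndots:5 are gone and a name like git.XXX.XXX.de is looked up directly.

```python
def keep_nameservers(resolv_conf):
    # Mimics: grep ^nameserver /etc/resolv.conf, then echo of the result
    # (echo always appends one trailing newline).
    kept = [line for line in resolv_conf.splitlines()
            if line.startswith("nameserver")]
    return "\n".join(kept) + "\n"


original = """\
nameserver 169.254.25.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
"""
print(keep_nameservers(original), end="")
```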

This improved the situation for some time, but again I see failures.

Revisiting the nslookup experiment from above, I’ve now extended this to:

envVars:
  - name: RUNNER_PRE_CLONE_SCRIPT
    value: 'd="$(grep ^nameserver /etc/resolv.conf)";echo "$d" >/etc/resolv.conf;for i in 1 2 3 4 5;do getent hosts git.XXX.XXX.de&&break;done'

which is supposed to retry DNS resolution up to 5 times before the real job starts. This will probably not work for the helper image, but let’s see.
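For comparison, the same retry idea as a Python sketch. It is not what the runner executes, only an illustration: the function name resolve_with_retry and the pause between attempts are my additions (the shell loop above retries immediately), and the resolver is passed in as a parameter so the logic can be tried without a network.

```python
import socket
import time


def resolve_with_retry(host, attempts=5, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve host, retrying on failure; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return resolver(host, None)
        except socket.gaierror:
            if attempt == attempts:
                raise
            # Assumption: a short pause gives a slow resolver time to recover.
            time.sleep(delay)
```

Calling resolve_with_retry("git.example.com") with a placeholder host name either returns the address list from the first successful lookup or raises socket.gaierror after the fifth failure.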

PS: My k8s cluster is using Calico as its CNI.

SSL certificate hell

You should secure your connections by using https://. But this is more complicated if you have a private installation and are using self-signed SSL certificates.

  1. dockerd will eventually pull images from your private registry.
  2. Your GitLab k8s runner needs to communicate with your GitLab server.
  3. Scripts running in your pipeline may want to communicate with those services.

Getting this to work in all situations is a long process:

Docker on host

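If the Docker daemon on the host does not trust the CA, pulling images from the private registry fails:

ERROR: Job failed: image pull failed: Back-off pulling image "docker-registry.XXX.XXX.de/phahn/ucs-minbase:latest"

Installing the CA certificate on the host and restarting Docker fixes this:

install -m 0444 ucs-root-ca.crt /usr/local/share/ca-certificates/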
update-ca-certificates
systemctl restart docker.service

Docker in docker

To be able to build new Docker images with GitLab k8s runners, which are themselves running as Docker containers, I’m using Docker in Docker (dind). Normally you can directly use the Docker image docker:dind for that, but that one does not have your CA certificate. Therefore I build my own with the following Dockerfile:

FROM docker:dind
COPY ucs-root-ca.crt /usr/local/share/ca-certificates/ca.crt
RUN update-ca-certificates

The image is built by this:

docker build -t docker-registry.XXX.XXX.de/ucs/docker:dind .

It is used in .gitlab-ci.yml like this:

.docker:
  services:
    - name: docker-registry.XXX.XXX.de/ucs/docker:dind
      alias: docker
  variables:
    DOCKER_HOST: tcp://docker:2375/
    DOCKER_DRIVER: overlay2
  tags:
    - docker
  image: docker:stable

some job:
  extends: .docker
  script:
    - docker build .

k8s GitLab runner

A k8s GitLab runner consists of multiple containers:

  1. The permanent container registered with GitLab
  2. A container for each build job as specified with image:.
  3. Additional containers for the services as specified with services:.
  4. A helper container to handle Git, artifacts and cache operations.

Previously I was using this in values.yaml:

envVars:
  - name: CI_SERVER_TLS_CA_FILE
    value: /home/gitlab-runner/.gitlab-runner/certs/ucs-root-ca.crt
  - name: CONFIG_FILE
    value: /home/gitlab-runner/.gitlab-runner/config.toml

This worked for cloning and most other operations, but failed for LFS:

LFS: Get https://git.XXX.XXX.de/XXX/XXX/ansible-playbooks.git/gitlab-lfs/objects/e16be6c5cd3e3b7e79dd09b3dd0662cdb4d70df7d1d38e56c3a6504e132e9f06: x509: certificate signed by unknown authority

For the helper you need an additional trick to insert the CA certificate. Using Volumes you can easily do that:

runners:
  config: |
    [[runners]]
      clone_url = "https://git.XXX.XXX.de./"
      [runners.kubernetes]
        image = "docker-registry.XXX.XXX.de/phahn/ucs-minbase:latest"
        imagePullPolicy = "always"
        privileged = true
        [[runners.kubernetes.volumes.secret]]
          name = "ca"
          mount_path = "/etc/gitlab-runner/certs/"
          read_only = true
          [runners.kubernetes.volumes.secret.items]
            "ucs-root-ca.crt" = "ca.crt"

Here ca refers to the k8s secret, which I already use for kubespray:

# kubectl describe secret ca
Name:         ca
Namespace:    default
...
Type:  Opaque
...
Data
====
ucs-root-ca.crt:  2537 bytes

Pipeline

For Debian-based images inside the pipeline, install the CA certificate at the start of the job:

apt-get update -qq
apt-get install -q --assume-yes ca-certificates
install -m 0444 ucs-root-ca.crt /usr/local/share/ca-certificates/
update-ca-certificates

Written on November 11, 2020