Speedup Debian package building
For my employer Univention I want to set up a Continuous Integration system to build Debian packages. We have been using our own build system called Repo-NG, based on pbuilder, which shows its age. I tried several tricks to improve the build speed, as we have to build many packages and many of them multiple times.
Improving pbuilder speed
So far we have been using pbuilder, which in its original form uses compressed tape archives (.tar.gz) to provide clean build environments. For each build the archive is extracted to a new location, in which the package build happens. This is slow on slow disks and was my first vector for optimization.
Using unsafe dpkg
dpkg has an option to disable its synchronization:
echo "force-unsafe-io" > /etc/dpkg/dpkg.cfg.d/force-unsafe-io
This is already used during the initial setup, but removed after the Debian installer has finished. Enabling it in throw-away build environments is a first optimization.
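With pbuilder this can be done via a hook script; a minimal sketch, assuming a configured --hookdir (the file name D05unsafe-io is my choice; D hooks run inside the chroot before the build dependencies are installed):

#!/bin/sh
# hypothetical hook: <hookdir>/D05unsafe-io
# enable dpkg's unsafe I/O inside the throw-away build chroot
echo force-unsafe-io > /etc/dpkg/dpkg.cfg.d/force-unsafe-io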
When we moved Repo-NG from physical servers into virtual machines, we gave each VM a scratch volume using QEMU's unsafe caching mode (cache=unsafe). This filters out all sync() calls that would flush data to disk, which greatly improves the time needed to set up the required build dependencies, as sync() is used a lot while dpkg installs packages.
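Roughly, the VM setup looks like this; the image names and sizes are examples only:

# create a throw-away scratch image for the build area
qemu-img create -f qcow2 scratch.qcow2 50G
# cache=unsafe makes QEMU ignore flush requests from the guest,
# so sync() inside the VM becomes almost free
qemu-system-x86_64 -enable-kvm -m 4096 \
    -drive file=builder.qcow2,if=virtio \
    -drive file=scratch.qcow2,if=virtio,cache=unsafe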
An alternative is to use eatmydata. This works quite well most of the time, but I remember having problems with some packages, which failed to build. My main concern is that I have to install that library into each build environment, which then is no longer minimal.
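For reference, a sketch of what that looks like inside the throw-away environment (which is exactly the extra installation step I dislike):

apt-get -qq install eatmydata          # has to be installed into every build environment
eatmydata apt-get -qq build-dep .      # fsync()/sync() become no-ops via LD_PRELOAD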
Another alternative is using seccomp to filter out sync operations. This also works quite well, but some integration tests fail because they notice a discrepancy between the requested and actual sync mode. It also breaks when doing cross-compiles, but on the other hand it has the benefit that I can set it up outside the build environment.
For our piuparts system I’m using a very large tmpfs file system backed by lots of swap space. As long as everything fits into RAM it is ultra-fast, as most of the time everything happens in RAM only.
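The setup is essentially just a huge tmpfs plus enough swap as a safety net; a sketch, assuming /srv/piuparts as the working directory and a spare block device /dev/vdb:

mkswap /dev/vdb && swapon /dev/vdb               # plenty of swap in case RAM runs out
mount -t tmpfs -o size=64G tmpfs /srv/piuparts   # everything stays in RAM while it fits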
On my personal systems I have been using btrfs with btrfsbuilder.
The idea is to use btrfs's built-in writable snapshot feature instead of extracting tar files all the time. This works well with fast disks, but the work to manage the meta-data costs lots of time: as btrfs must be prepared to stay consistent in case of crashes, it uses sync() a lot, which makes it slow. Combining it with one of the other methods to reduce the sync() load improves this.
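The per-build workflow then looks roughly like this, assuming the master environment lives in the btrfs subvolume /srv/build/base:

btrfs subvolume snapshot /srv/build/base /srv/build/job-42   # instant writable copy
# ... chroot into /srv/build/job-42 and build the package ...
btrfs subvolume delete /srv/build/job-42                     # throw all changes away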
sbuild already describes a setup where the base system is kept unpacked in the file system and each build gets a tmpfs overlay for its data.
This sounds like an excellent solution as it combines the best of two worlds:
- no time to unpack the tar file each time
- use the ultra fast tmpfs for all temporary build data
I have not yet investigated this, but it is on my to-do list. In the past btrfs could not be used with overlayfs, but this has been fixed since Linux kernel version 4.15. My idea would be to create a new snapshot for each build in addition to using overlayfs, so that I can update the master version at any time, even while other builds are still running on an old snapshot.
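A sketch of the overlay part, with all paths made up: the unpacked base (or btrfs snapshot) is the read-only lower layer and a tmpfs takes all writes:

mkdir -p /tmp/overlay /srv/build/job-42
mount -t tmpfs tmpfs /tmp/overlay                 # all build writes stay in RAM
mkdir -p /tmp/overlay/upper /tmp/overlay/work
mount -t overlay overlay \
    -o lowerdir=/srv/chroot/base,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
    /srv/build/job-42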
Variants of pbuilder
There is cowbuilder, which also uses already extracted build environments. It then uses an LD_PRELOAD wrapper to implement copy-on-write in user space, which works quite well.
Personally I don’t like this approach as I fear corruption: if the library does not intercept all calls correctly or a process bypasses the library, changes might leak back into the master environment and taint all following builds.
Another variant is qemubuilder, which uses QEMU to set up a virtual machine for each build. QEMU has a built-in snapshot feature to protect the master image from changes: all changes go into an overlay image, which can be placed on tmpfs for extra speed. Using QEMU also allows cross-building for other architectures.
pbuilder itself also includes an LVM builder. It uses the snapshot feature of the Logical Volume Manager built into the Linux kernel: each base image is set up as a separate logical volume and each build gets its own writable snapshot.
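A sketch of one build cycle, assuming the base environment is kept in the logical volume vg0/base:

lvcreate --snapshot --size 4G --name job-42 vg0/base   # copy-on-write snapshot of the base
mount /dev/vg0/job-42 /srv/build/job-42
# ... build inside /srv/build/job-42 ...
umount /srv/build/job-42
lvremove -f vg0/job-42                                  # discard all changes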
Multiple things make Docker attractive:
- it is easy to create a clean environment with just a single command
- it already implements caching
If you build many packages multiple times, most of the time goes into preparing the build environment:
- determining which packages are required
- downloading them
- installing them
- configuring them
The first step to reduce the time needed for that is to create Docker images for building packages. For each UCS release I have multiple Docker images:
- a minbase image consisting of Essential: yes packages only. Currently this image is created by running debootstrap --variant=minbase. Doing it this way gets you an image which still includes lots of extra stuff not needed for Docker. In the future I want to adapt the approach from slim images.
- a build image with build-essential already installed. This includes the dpkg-dev package, which is always required for building Debian packages.
Those images need to be rebuilt each time one of the packages installed in the image gets an update. It’s on my to-do list to automate this in the future.
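A sketch of how such images can be created today; suite, mirror and image names are examples only:

# minimal root file system imported as the minbase image
debootstrap --variant=minbase stretch ./rootfs http://deb.debian.org/debian
tar -C ./rootfs -c . | docker import - ucs-minbase:example
# derive the build image with the tool chain for building packages on top
printf 'FROM ucs-minbase:example\nRUN apt-get -qq update && apt-get -qq install build-essential dpkg-dev && apt-get clean\n' |
    docker build -t ucs-devbase:example -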
The ad-hoc approach for building packages is like this:
docker run --rm -ti -v "$PWD":/work -w /work "$base_image" sh
apt-get -qq update       # optional: Update Packages files
apt-get -qq upgrade      # optional: Upgrade already installed packages
apt-get -qq build-dep .  # Install build dependencies
dpkg-buildpackage -uc -us -b  # build package
exit 0
The good thing about this approach is that you can easily use it inside a build pipeline:
- GitLab (and others) are already based on using Docker images.
- It’s easy to install the required build dependencies.
- You have to find a way to get the build artifacts archived. Using Job artifacts might be one option, but see Repository hosting below for more options.
This naive approach has the following drawbacks:
- you run the build as the user root inside the Docker container. To fix this you have to create a dedicated user inside the container and switch to it with su.
- the time and network bandwidth spent downloading and installing the dependencies are wasted afterwards. This has to be re-done for each new build.
- the build artifacts are lost, as they are put outside your current working directory. You need to set up a second volume for the parent directory to receive the artifacts.
A less naive invocation addressing the first and third points looks like this:
docker run \
    --add-host updates.knut.univention.de:192.168.0.10 \
    -v "$TMPDIR":/work \
    -v "$PWD":/work/src \
    -e BUID="$UID" \
    --rm -ti "$base_image" sh
apt-get -qq update       # optional: Update Packages files
apt-get -qq upgrade      # optional: Upgrade already installed packages
apt-get -qq build-dep .  # Install build dependencies
adduser --home /work/src --shell /bin/bash --uid "$BUID" --gecos '' --disabled-password --no-create-home build
exec su -c 'exec dpkg-buildpackage -uc -us -b -rfakeroot' build
(I need that --add-host argument as my Docker images use that host name for our Debian package repository, but I run that image on my notebook behind a VPN, so default DNS resolution of that name does not work.)
An alternative is to use something like the following Dockerfile:
ARG ucs=latest
FROM docker-registry.knut.univention.de/phahn/ucs-devbase:$ucs AS build
RUN install -o nobody -g nogroup -d /work/src
WORKDIR /work/src
COPY --chown=nobody:nogroup debian/control debian/control
RUN apt-get -qq update && apt-get -qq build-dep . ; apt-get clean
USER nobody
COPY --chown=nobody:nogroup . .
RUN dpkg-buildpackage -uc -us -b -rfakeroot

FROM scratch
COPY --from=build /work/*.deb /
The idea is to
- use the previously prepared Docker image,
- get the required build dependencies from debian/control and install them,
- build the package,
- copy the built binary packages into an empty new Docker image.
I can use it like this from the directory containing my extracted source tree:
docker build --add-host updates.knut.univention.de:192.168.0.10 -t deb -f ~/misc/Docker/Dockerfile.build .
docker image save deb | tar -x -f - -O --wildcards '*/layer.tar' | tar tf -
This works quite well, as Docker's built-in caching mechanism is used to cache the build environment: as long as debian/control does not change, there is no need to set it up again in most cases.
What’s missing here is again the tracking of changed packages in that image.
Here’s some old performance data collected from compiling Linux kernel 3.2.39. The system had an Intel Core i7 with 8 GiB RAM and a single 500 GiB SATA disk. A 100 GiB LVM volume was used either as an ext4 file system or as additional swap space.
The data above was a little bit surprising, since I expected tmpfs to perform better. My reading is that only the unpack time improved, because there most operations are performed in RAM. The compile itself performed worse, since running 8 compilers in parallel needs lots of RAM, which creates memory pressure and forces lots of data to be transferred between RAM and the hard disk. This needs more investigation, especially with smaller packages and using less parallelism.
pbuilder shows its age: it is a collection of shell scripts with multiple issues. Despite the name, pdebuild prepares the source package on your host system, for which it installs all the build dependencies on your host! You can disable that by using --use-pdebuild-internal, but this is not the default! As soon as you use that you get a new bunch of shell quoting errors, as many more things then need to be done inside the chroot.
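For completeness, such an invocation looks roughly like this; the path to the base tarball is an example:

# prepare the source package inside the chroot instead of on the host
pdebuild --use-pdebuild-internal -- --basetgz /var/cache/pbuilder/base.tgz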
Debian’s official build system uses sbuild, which looks more robust.
Improving repository services
Speeding up package indexing
We use apt-ftparchive internally with lots of tricks to get it up to speed. Most important is the option --db to use a database for caching the extracted package data.
Historically we used dpkg -I to extract that data into a text file and concatenated those fragments into the resulting Packages file. This happened over NFS and resulted in too many stat() calls to implement proper cache invalidation: the time-stamp of each .deb had to be compared to the time-stamp of its cache fragment.
We also switched to -o APT::FTPArchive::AlwaysStat=false to further reduce the number of stat() calls. The downside of this is that you have to be very careful that the triple (pkg-name, pkg-version, pkg-arch) is unique for each package and never re-used for a file with different content. (Good when your packages are reproducible; ours are not.)
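Putting both options together, a generation run looks roughly like this; all paths are examples:

apt-ftparchive --db /var/cache/repo/packages.db \
    -o APT::FTPArchive::AlwaysStat=false \
    packages pool/ > dists/ucs/main/binary-amd64/Packages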
Another improvement was the use of Python's os.scandir() instead of os.listdir(). The latter requires an extra stat() on every entry to distinguish files from directories, which added another round of stat() calls. This is very fast while that data is still cached in the Linux kernel's dentry cache from the last run, but abysmally slow in the morning after updatedb has trashed that cache.
For previous projects I’ve used reprepro, both personally and in my company, to host packages built by CI. Currently I’m investigating the move to aptly, which has a very powerful REST API. This allows it to be used via cURL until GitLab implements its own Debian Package Registry.
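A sketch of what that looks like with aptly's REST API; host, repository and file names are made up:

# upload a package into aptly's file staging area, then add it to a local repository
curl -X POST -F file=@example_1.0_amd64.deb http://aptly.example.org:8080/api/files/incoming
curl -X POST http://aptly.example.org:8080/api/repos/ucs/file/incoming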
I should repeat the compilation test with the different variants.