Faster Debian packages indexing
You can install Debian packages using dpkg -i $pkg.deb, but this low-level tool does not resolve inter-package dependencies.
This is the job of APT, the Advanced Packaging Tool.
It usually works on a set of packages, which are shipped in a package repository.
The repository has a Packages file, which lists all binary packages included in the repository.
Basically it contains the concatenated package meta data from all packages.
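To make the "concatenated meta data" idea concrete, here is a minimal sketch of reading such an index in Python. It assumes the usual control-file layout (one "Field: value" per line, continuation lines starting with whitespace, stanzas separated by blank lines). The sample data and the function name parse_packages are made up for illustration.

```python
def parse_packages(text):
    """Split a Packages-style index into one dict per package stanza."""
    stanzas = []
    current = {}
    last_field = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current stanza.
            if current:
                stanzas.append(current)
            current, last_field = {}, None
        elif line[0] in " \t" and last_field is not None:
            # Continuation line: it belongs to the previous field.
            current[last_field] += "\n" + line.strip()
        else:
            name, _, value = line.partition(":")
            last_field = name.strip()
            current[last_field] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


# Hypothetical sample data, not taken from a real repository.
SAMPLE = """\
Package: hello
Version: 2.10-2
Architecture: amd64
Description: example package
 A longer description line.

Package: hello-doc
Version: 2.10-2
Architecture: all
"""

packages = parse_packages(SAMPLE)
for pkg in packages:
    print(pkg["Package"], pkg["Version"], pkg["Architecture"])
```

Real index files carry many more fields (file name, size, checksums), but the stanza structure is the same.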
Tools
dpkg
Basically you can get the meta data from dpkg -I $pkg.deb for a package file or dpkg -s $pkg for an installed package.
For multiple packages you have to repeat that for each package.
dpkg-scanpackages
There is also dpkg-scanpackages, which does this by scanning the given directory recursively.
In addition it supports so-called override files:
They are needed to overwrite the meta data of packages after a package has already been built.
This usually happens with new Debian releases, when the package did not change but needs updated meta data.
So instead of rebuilding the package just for that, the Section or Priority can be changed easily.
For source packages there is dpkg-scansources, which is used to generate the Sources files.
apt-ftparchive
apt-ftparchive is an improved version, which can do much more.
It combines dpkg-scanpackages with dpkg-scansources, but can also generate the Release file.
In addition to all those Packages and Sources files it may also list other files, for example translation or icon files.
The file is often associated with a Release.gpg file containing the GnuPG signature required for checking the security chain.
Newer releases use an InRelease file, which contains the signature inline so that the file and its signature can be checked atomically.
Caching
On top of that apt-ftparchive also adds caching.
Basically all tools from above still use dpkg -I to do the heavy lifting.
Forking such a process for a large number of packages becomes inefficient very quickly.
So the cache stores that data for each path. The next time the index is re-built, the meta data from the cache is used if available.
The cache uses a simple Berkeley database.
The file can be specified with -d / --db / -o APT::FTPArchive::DB.
Uniqueness
For Debian package repositories there is a very important rule:
In ideal world, tuple (architecture, name, version) should identify unique package.
This allows tools to uniquely identify packages using that triple, which forms the package file name: ${name}_${version}_${architecture}.${type}
Many tools rely on this invariant and break badly if the name is reused for a file with different content.
This easily happens if a package is re-built, but is not reproducible.
In that case apt-ftparchive would re-use the old meta data for the new content.
This usually breaks when such a package is first downloaded by a client, as then the file checksums no longer match!
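The file name pattern ${name}_${version}_${architecture}.${type} can be sketched as a round trip in Python. This is only an illustration under the assumption that none of the three parts contains an underscore. The names PackageFile, parse_filename and build_filename are invented here and are not part of any Debian tool.

```python
from typing import NamedTuple


class PackageFile(NamedTuple):
    name: str
    version: str
    architecture: str
    type: str


def parse_filename(filename):
    """Split 'name_version_arch.type' into its four parts."""
    stem, _, ext = filename.rpartition(".")
    # Raises ValueError if the stem does not consist of exactly three parts.
    name, version, architecture = stem.split("_")
    return PackageFile(name, version, architecture, ext)


def build_filename(pkg):
    """Inverse of parse_filename()."""
    return f"{pkg.name}_{pkg.version}_{pkg.architecture}.{pkg.type}"


pkg = parse_filename("hello_2.10-2_amd64.deb")
print(pkg)
print(build_filename(pkg))
```

Two different builds of the same triple map to the same file name, which is exactly the situation where a cache keyed by path returns stale data.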
Invalidation
This can be solved in two ways:
- Make sure to never re-use the triple, respectively the filename.
- Invoke apt-ftparchive with -o APT::FTPArchive::AlwaysStat=true: This will store the package file's modification time-stamp with the cache entry. The cached entry is then only used if the file's time-stamp is still the same.
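The difference between the two modes can be sketched with a small in-memory cache in Python. This is not how apt-ftparchive is implemented. The class MetadataCache, its extract callback and the always_stat flag are hypothetical stand-ins for the behaviour described above.

```python
import os


class MetadataCache:
    """Path-keyed cache; with always_stat the mtime must match as well."""

    def __init__(self, extract, always_stat=False):
        self.extract = extract          # expensive function: path -> meta data
        self.always_stat = always_stat
        self.entries = {}               # path -> (mtime_ns or None, meta data)

    def get(self, path):
        # One stat() per file is the price for validating the entry.
        mtime = os.stat(path).st_mtime_ns if self.always_stat else None
        cached = self.entries.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]
        metadata = self.extract(path)
        self.entries[path] = (mtime, metadata)
        return metadata
```

Without always_stat the cache never touches the file again after the first extraction, so a re-built file under the same name keeps its old meta data. With always_stat a changed modification time forces a fresh extraction.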
Performance issues
Enabling AlwaysStat may create a performance problem:
For each binary package file apt-ftparchive now needs to do a stat() call to get the i-node information.
The Linux kernel will cache that data internally, but doing this for 58k files with a cold cache will take some time.
This gets much worse if you do this over NFS, as there each stat() call takes a round-trip to the server.
Without AlwaysStat only a single listdir() call per directory should be needed.
Everything else only requires looking up the returned file names in the cache, which mostly happens in memory.
File Tree Walk
Actually you will notice that apt-ftparchive performs abysmally in the cold-cache case.
It still does stat() calls for all files even when AlwaysStat is disabled.
The culprit here is ftw() (see ftw(3)), the C library function used to implement the File Tree Walk:
To walk the directory recursively it must check the type of each returned directory entry:
- For a directory it must recurse.
- For a file it must do the cache lookup.
- Symbolic links may be skipped or followed.
- Other types like device files, UNIX sockets and named pipes should be ignored.
d_type
You will find that find -type f performs a lot better even in the cold-cache case.
Normally a directory entry just maps the name to the i-node number.
But most Linux file systems nowadays implement a performance optimization:
They also store the i-node type directly within the directory entry itself.
This is then returned by readdir() as d_type.
If present the call to lstat() can be skipped.
Combining this with AlwaysStat=false makes apt-ftparchive really fast.
But as ftw() is used internally you have to replace this with your own find.
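The same idea is available in Python through os.scandir(), whose entries answer is_dir() and is_file() from the directory entry type where the file system provides it and only fall back to a stat call otherwise. The following is a minimal sketch of such a walk. The function name walk_files is made up for illustration.

```python
import os


def walk_files(top):
    """Yield paths of regular files below top without following symlinks."""
    stack = [top]
    while stack:
        directory = stack.pop()
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    # Directory: descend into it later.
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    # Regular file: candidate for the cache lookup.
                    yield entry.path
                # Symlinks, device files, sockets and named pipes are skipped.


if __name__ == "__main__":
    for path in walk_files("."):
        print(path)
```

The printed list can be written to a file and handed to a tool that accepts a file list, which avoids a second directory walk.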
Fast apt-ftparchive
You can give apt-ftparchive a list of files instead, which prevents it from using ftw() itself.
This does not work with packages and sources, but only with generate.
Depending on the file type you specify those files using FileList and SourceFileList.
They work in the sections TreeDefault, Tree and BinDirectory.
Build file lists
Depending on your desired output format you have to separate the files by architecture and (micro) type manually:
find amd64 all source -maxdepth 1 \
-name \*_amd64.deb -fprint .files/amd64.apt -fprint .files/amd64.deb , \
-name \*_amd64.udeb -fprint .files/amd64.apt -fprint .files/amd64.udeb , \
-name \*_all.deb -fprint .files/all.apt -fprint .files/amd64.deb , \
-name \*_all.udeb -fprint .files/all.apt -fprint .files/amd64.udeb , \
-name \*.dsc -fprint .files/source.dsc
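For readers who prefer to generate the lists from a script, the classification done by the find command above can be sketched in Python. The rule table mirrors the -name patterns and -fprint targets of that command. The names RULES and build_file_lists are invented for this example, and writing the lists into .files/ is left out.

```python
from collections import defaultdict

# (file name suffix, list files that receive the path), as in the find command.
RULES = [
    ("_amd64.deb", ("amd64.apt", "amd64.deb")),
    ("_amd64.udeb", ("amd64.apt", "amd64.udeb")),
    ("_all.deb", ("all.apt", "amd64.deb")),
    ("_all.udeb", ("all.apt", "amd64.udeb")),
    (".dsc", ("source.dsc",)),
]


def build_file_lists(paths):
    """Group package paths into the per-architecture and per-type lists."""
    lists = defaultdict(list)
    for path in paths:
        for suffix, targets in RULES:
            if path.endswith(suffix):
                for target in targets:
                    lists[target].append(path)
                break  # the suffixes are mutually exclusive
    return dict(lists)


if __name__ == "__main__":
    example = ["amd64/hello_1.0_amd64.deb", "all/doc_1.0_all.deb", "source/hello_1.0.dsc"]
    for name, members in sorted(build_file_lists(example).items()):
        print(name, members)
```

Note that architecture-independent packages end up in both all.apt and the amd64 lists, just as in the shell version.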
Common configuration
Put this and the following sections into a dist.conf file:
Dir {
ArchiveDir ".";
OverrideDir ".override/";
CacheDir ".cache/";
FileListDir ".files/";
};
It is then to be used with apt-ftparchive generate dist.conf.
You have to manually create some directories:
mkdir -p dists/dist/main/binary-amd64
mkdir -p dists/dist/main/source
mkdir -p dists/dist/main/debian-installer/binary-amd64
Build flat Packages file
Build a Packages or Sources file to be included with deb [trusted=yes] file:///.../ amd64/ and deb ... all/:
BinDirectory "amd64" {
// InternalPrefix "<PREFIX>/";
// BinOverride "";
// SrcOverride "";
// ExtraOverride "";
// SrcExtraOverride "";
Packages "amd64/Packages";
BinCacheDB "db.amd64";
FileList "amd64.apt";
};
BinDirectory "all" {
Packages "all/Packages";
BinCacheDB "db.all";
FileList "all.apt";
};
BinDirectory "source" {
Sources "source/Sources";
SrcCacheDB "db.source";
FileList "source.dsc";
};
Build dists Packages file
Build a Packages or Sources file to be included with deb [trusted=yes] file:///.../ dist main:
TreeDefault {
BinCacheDB "db.$(ARCH)";
SrcCacheDB "db.$(ARCH)";
FileList "$(ARCH).deb";
SourceFileList "$(ARCH).dsc";
};
Tree "dists/dist" {
Sections "main"; // contrib non-free
Architectures "amd64 source";
SrcDirectory "source/";
Packages::Extensions ".deb";
};
Tree "dists/dist/" {
Sections "main"; // contrib non-free
Architectures "amd64";
FileList "$(ARCH).udeb";
Packages::Extensions ".udeb";
Packages "$(DIST)/$(SECTION)/debian-installer/binary-$(ARCH)/Packages";
};
Summary
- Do not use ftw() on large directories.
- In Python use scandir() instead of listdir().
- apt-ftparchive still does a lot of readlink() calls, which need more investigation.
- The cache lookup uses the path as given; make sure to not prefix it with ./ only in some cases, as this leads to duplicate cache entries.
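The last point is easy to guard against when the file lists are generated by a script. A minimal sketch, assuming POSIX-style paths; the helper name normalize is hypothetical, and the normalization happens in your own list generator, not inside apt-ftparchive:

```python
import posixpath


def normalize(path):
    """Collapse './' prefixes and duplicate slashes so equal files get equal names."""
    return posixpath.normpath(path)


# The same file, once with and once without the './' prefix.
raw = ["./amd64/hello_1.0_amd64.deb", "amd64/hello_1.0_amd64.deb"]

unique_raw = set(raw)                                # two distinct strings
unique_normalized = {normalize(p) for p in raw}      # one string

print(len(unique_raw), len(unique_normalized))
```

Whichever spelling you pick, use it consistently across all runs that share a cache database.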
Appendix
There are some other knobs for tuning:
Hash algorithms
Calculating the different hash sums takes time. You can enable / disable them individually by specifying the following options:
apt::ftparchive::md5 "<BOOL>";
apt::ftparchive::sha1 "<BOOL>";
apt::ftparchive::sha256 "<BOOL>";
apt::ftparchive::sha512 "<BOOL>";
This can also be configured for Packages and Sources individually:
apt::ftparchive::packages::<ALGO> "<BOOL>";
apt::ftparchive::sources::<ALGO> "<BOOL>";
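Why the number of enabled algorithms matters can be seen in a small Python sketch: the file is read once, but every enabled algorithm has to process every byte. This is an illustration using hashlib, not the code apt-ftparchive runs. The function name file_hashes and its parameters are assumptions made for this example.

```python
import hashlib


def file_hashes(path, algorithms=("md5", "sha1", "sha256", "sha512"), chunk_size=1 << 20):
    """Read the file once and feed every chunk to each enabled algorithm."""
    digests = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

Calling file_hashes(path, algorithms=("sha256",)) corresponds to disabling the other three options and does roughly a quarter of the update calls.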
Compression formats
You can also configure the compression formats:
- The uncompressed files: .
- gzip
- bzip2
- lz4
- lzma
- zstd
- xz
Contents
Each binary package ships directories and files.
You can get their paths from dpkg -c $pkg.deb or dpkg -L $pkg for a single package.
These paths are collected in the file Contents.
This is useful if you want to know which packages ship which files.
Extracting this data, storing it in the cache, putting it into the file and compressing it takes some time.
This must be enabled explicitly with --contents or -o apt::ftparchive::contents=true.
Internals: Cache format
The format of the cache file is an internal detail of apt-ftparchive.
Normally you should not use it yourself, but knowing the format helps with debugging.
The Berkeley database contains several entries per file path.
Their key is built by appending a : and a two-letter code depending on the type to the path:
- :st (stat): File statistics for all files like .deb, .udeb, .dsc, .orig.tar.gz, .debian.diff.gz, .debian.tar.gz. It uses the file's time of last modification mtime for cache validation. The file size size and the hashes calculated using md5, sha1, sha256 and sha512 are stored within. Additional flags indicate if other database records exist. Be aware that this entry is architecture and version dependent! As of 2021 it is 152 bytes for x86_64.
- :cl (Control): Debian binary package control data for binary packages .deb and .udeb. This is used to build the Packages files.
- :cn (Content): Debian binary package content list of files. This is used to build the Contents files.
- :cs (Source): Debian source package control data .dsc. This is used to build the Sources files.
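As a debugging aid, the key scheme described above can be written down in a few lines of Python. This only follows the description in this section (path, a colon, a two-letter code). The exact byte encoding inside the database is not covered here, and the names RECORD_TYPES and cache_key are invented for this sketch.

```python
# Two-letter record codes as listed above.
RECORD_TYPES = {
    "stat": "st",
    "control": "cl",
    "contents": "cn",
    "source": "cs",
}


def cache_key(path, record):
    """Build the lookup key for one record type of one file path."""
    return f"{path}:{RECORD_TYPES[record]}"


for record in RECORD_TYPES:
    print(cache_key("amd64/hello_1.0_amd64.deb", record))
```

Because the path is part of the key, ./amd64/hello_1.0_amd64.deb and amd64/hello_1.0_amd64.deb produce different keys for the same file.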