Debug 101: Linux audit framework
In the UMC-server vs. python-notifier pullcord we had the situation, that an unknown process kept killing other processes from time to time and we had no idea, which process it was.
We tried several things:
- Add code to
/etc/init.d/
scripts to log events. - Add code to the
SIGTERM
signal handler of UMC-server and other services to log its invocation. - Run
ps fax
on a regular basis to find the culprit by “accident”.
This never showed a process forking a kill -SIGTERM $PID
, so we needed a bigger hammer.
strace
strace
can be used to trace the system calls a process is doing.
(System calls are the “doors” between user-space and kernel-space and do the transition from unprivileged user-space code to privileged kernel code.)
- Using
strace "$cmd"
traces the system calls done by command$cmd
. - If you also want to (recursively) trance its child processes you have to use the
-f
argument. You also need this if the process is multi-threaded. - Using
-o "$file"
the output gets written to the named file. For multiple processes you can give-f
twice, so each process gets its own file (with suffix.$PID
); otherwise all system calls of all processes are collected in only that single file. - Using
-p "$PID"
you can attach to an already running process. Only that single process is traced, not its already running child processes. But you can use-f
again to trace newly created child processed. - You can use
-e t=process
to limit tracing to system calls from the “process” category. Other categories like “file” exists as well. - Instead of categories you also can name individual system calls or exclude them, e.g.
-e '!futex'
. - By default only the first 32 characters of strings are logged.
Use
-s 256
to increase that.
audit
The Linux kernel includes an audit framework, which you can think of as strace
on speed.
The Linux Audit framework can do a lot more than system call tracing:
It was designed for CAPP-compliant auditing.
- Log any failed or successful access to a file like
/etc/shadow
. - Intrusion Prevention System can use it to block processes from performing malicious actions.
systemd
logs the start and stop of services through it.- PAM & Co. log important events when users log in or log out or change their permissions.
This even allows to keep track of users using
sudo
orsu
. - SELinux and AppArmor report and log violations via Audit.
First you have to install Debian package auditd
, which for UCS is in “unmaintained”.
ucr set repository/online/unmaintained=yes
apt-get -qq update
apt-get -q –assume-yes install auditd
For the pullcord case we needed to find the process doing a sys_kill
with a1=SIGTERM
.
This is done by adding the following rule to file /etc/audit/rules.d/bug52518.rules
:
-a always,exit -F arch=b64 -S kill -F a1=15 -k Bug52518
This tells the Audit system to always emit an event, when the architecture b64 system call kill is used with argument SIGTERM
(a1=15).
The event is emitted on exit (after the Linux kernel already has executed its code) and should be tagged with key Bug52518
.
In addition to that we also want to track processes doing an sys_exit
or sys_execve
, so we append a second rule:
-a always,exit -F arch=b64 -S exit -S execve -k Bug52518
Here two system calls are used in the same rule, which is more efficient then adding individual rules. (Performance easily becomes a problem as each additional rule slows down all system calls, so the complete system with all of its processes and threads. Anecdote from customer 26: too many rules led to to an release updates taking many hours instead of minutes.)
The individual audit rule snippets must be concatenated to a single file /etc/audit/audit.rules
and loaded into the Linux kernel, which is done by running augenrules --load
.
auditd vs journald
Afterwards the audit events are both logged into /var/log/audit/audit.log
and to “systemd journal”.
To prevent the events being logged twice by journald
and auditd
, disable the first by running systemctl mask systemd-journald-audit.socket
.
Most log events consists of multiple lines, which are linked via their common Audit ID. The events are logged as RAW events, which makes them hard to read. For example the time stamp is in seconds since the UNIX epoch and process titles are printed hex-encoded:
type=SYSCALL msg=audit(1610524812.243:7148742): arch=c000003e syscall=62 success=yes exit=0 a0=1312 a1=f a2=1 a3=20 items=0 ppid=1 pid=5965 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="univention-mana" exe="/usr/
type=OBJ_PID msg=audit(1610524812.243:7148742): opid=4882 oauid=-1 ouid=0 oses=-1 ocomm="univention-mana"
type=PROCTITLE msg=audit(1610524812.243:7148742): proctitle=2F7573722F62696E2F707974686F6E322E37002F7573722F7362696E2F756E6976656E74696F6E2D6D616E6167656D656E742D636F6E736F6C652D736572766572007374617274
ausearch
can be used to work with the log:
--interpret
converts the entries to a more human readable format.- Use
--key Bug52518
to limit the output to those events tagged with the specified key. - You can use
--start 12.01.2021 12:00:00
and--end 13.01.2021 13:00:00
to limit the time frame.
audit spam
In addition to our manually configured system call events many other services emit Audit events as well, which will clutter the log:
systemd
logs the start and stop of services.PAM
logs session start and stop events.
They can be filtered out by using the exclude
list in /etc/audit/rules.d/bug52518.rules
:
-a always,exclude -F msgtype=PROCTITLE -F msgtype=SYSCALL -F msgtype=PATH -F msgtype=CWD -F msgtype=EXECVE -F msgtype=SERVICE_START -F msgtype=SERVICE_STOP -F msgtype=CONFIG_CHANGE
-a never,exclude
We add the named message types to always to white list them, all other types are black listed by catching them with never.
Analysis
Use the above method we were able to capture (among others) the following event:
# ausearch -i -a 7148742 -if audit.log.1
----
type=PROCTITLE msg=audit(13.01.2021 09:00:12.243:7148742) : proctitle=/usr/bin/python2.7 /usr/sbin/univention-management-console-server start
type=OBJ_PID msg=audit(13.01.2021 09:00:12.243:7148742) : opid=4882 oauid=unset ouid=root oses=-1 ocomm=univention-mana
type=SYSCALL msg=audit(13.01.2021 09:00:12.243:7148742) : arch=x86_64 syscall=kill success=yes exit=0 a0=0x1312 a1=SIGTERM a2=0x1 a3=0x20 items=0 ppid=1 pid=5965 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=univention-mana exe=/usr/bin/python2.7 key=Bug_52518
SYSCALL
tells us, that processpid=5965
is dong asyscall=kill
witha1=SIGTERM
on process0x1312
.- From
PROCTITLE
we get the information, that the process doing this kill is/usr/sbin/univention-management-console-server
. - Form
OBJ_PID
we also get the information, that the process being killed ispid=4882=0x1312
: It’s a process running asouid=root
and it’s abbreviated name isocomm=univention-mana
.
Stupidly the name is abbreviated, so we need to lookup the complete name.
As we also enabled tracing sys_execve
we can use that to get the full command line.
This is easily done by running the following command filtering our log for the first event associated with that process ID 4882
:
# ausearch -i -if audit.log.1 -p 4882 --just-one
----
type=PROCTITLE msg=audit(13.01.2021 09:00:12.243:7148745) : proctitle=/usr/bin/python2.7 /usr/sbin/univention-management-console-web-server start
type=SYSCALL msg=audit(13.01.2021 09:00:12.243:7148745) : arch=x86_64 syscall=exit a0=EXIT_SUCCESS a1=0x7ff2669099c0 a2=0x3c a3=0x7c0 items=0 ppid=1 pid=4882 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=univention-mana exe=/usr/bin/python2.7 key=Bug_52518
Et voilà:
We only need proctitle
to see, that umc-server
is killing umc-**web**-server
.
Those two processes are unrelated, as they have no immediate parent/child association.
Investigating other usages of sys_kill
by using the same technique showed umc-server
to also kill many other processes like s4c
, univention-portal
and all other services using /usr/bin/python2.7
.
gdb
Next we did a source code audit of umc-server
, but found “nothing”:
We only were able to construct an obscure case, were the new multi-process setup used os.kill()
, but that mode was not enabled at both customers.
At least we know where to look and next used gdb
to finally nail the pin:
Using gdb -p $(pgrep -f univention-management-console-server)
we attached to the running process.
Next we told the debugger to break on the process using system call 62 (sys_kill
) iself by doing a catch syscall 62
.
We also want to cache the process being killed by SIGTERM
, so we also trap on that using catch signal 15
.
After resuming the program with continue
we only had to wait some more to be interrupted again.
Thread 1 "univention-mana" hit Catchpoint 1 (signal SIGTERM), 0x00007fb21d8f5317 in kill () at ../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: Datei oder Verzeichnis nicht gefunden.
Running backtrace
immediately showed the C-level traceback, but you can even get more information when you install the Debian package python-dbg
:
It does not only provide the “debugging symbols” for the Python interpreter, which map the machine code back to source code lines, but also extend gdb
with some useful commands to debug Python programs from within gdb
itself.
You can get a Python traceback by using py-bt
:
(gdb) py-bt
Traceback (most recent call first):
File "/usr/lib/python2.7/dist-packages/notifier/popen.py", line 270, in __killall
os.kill(pid, signal)
File "/usr/lib/python2.7/dist-packages/notifier/__init__.py", line 104, in __call__
return self._function(*tmp, **self._kwargs)
File "/usr/lib/python2.7/dist-packages/notifier/nf_generic.py", line 259, in step
if not timer[CALLBACK]():
File "/usr/lib/python2.7/dist-packages/notifier/nf_generic.py", line 304, in loop
step()
File "/usr/sbin/univention-management-console-server", line 232, in run
notifier.loop()
File "/usr/lib/python2.7/dist-packages/daemon/runner.py", line 186, in _start
self.app.run()
File "/usr/lib/python2.7/dist-packages/daemon/runner.py", line 267, in do_action
func(self)
File "/usr/sbin/univention-management-console-server", line 285, in
umc_daemon.do_action()
Nailed it:
umc-server
is using python-notifier, where its module for handling sub processes is killing the wrong “children”.
And you also can have a look at Python local variables using py-locals
(reformatted for improved readability):
self = <RunIt(
_RunIt__stderr=None,
stdout=None,
_cmd=[
'/usr/bin/python2.7',
'/usr/sbin/univention-management-console-module',
'-m', 'updater',
'-s', '/var/run/univention-management-console/613-1610957913744.socket',
'-d', '2',
'-l', 'de_DE.UTF-8'],
binary='/usr/bin/python2.7',
_Process__dead=False,
_Provider__signals={
'finished': ,
'killed': <Signal(_Signal__callbacks=[], name='killed') at remote 0x7fb1fc4e9890>},
pid=23255,
_name='python2.7',
_Process__kill_timer=101,
_RunIt__stdout=None,
stderr=None,
child=None,
stopping=True,
_shell=False) at remote 0x7fb1fc4ee990>
signal = 15
unify_name =
appname = 'usrbinpython27'
cmdline_filenames = [
'/proc/1-5,7-29,41-42,74-79,82-83,85,89,91,105,107,109-110,112,115,120-121,124-125/cmdline',
'...(truncated)
cmdline_filename = '/proc/613/cmdline'
fd =
cmdline = '/usr/bin/python2.7\x00/usr/sbin/univention-management-console-server\x00start\x00'
pid = 61