Over the past few months I’ve become super interested in the container security movement. SELinux and AppArmor are incredible LSMs for implementing mandatory access control (MAC) policies, and seccomp (SECure COMPuting with filters) can be layered on top of MAC to further limit which system calls a process can issue. The combination of MAC policies and system call filtering has some amazing potential for admins who want to minimize the attack surface on their Linux servers.
Seccomp policies are enforced in the Linux kernel and are installed by way of the prctl(2) and seccomp(2) system calls. If you have a recent kernel you can check your kernel config file to see if seccomp is enabled:
$ grep SECCOMP /boot/config-$(uname -r)
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
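If you want to make the same check from a script, something along these lines should work. This is only a sketch: seccomp_filter_enabled is a hypothetical helper name, and it assumes your distribution ships the kernel config under /boot as shown above.

```shell
# Succeed if the given kernel config file enables seccomp filter mode.
# seccomp_filter_enabled is a hypothetical helper, not part of any tool.
seccomp_filter_enabled() {
    config_file="${1:-/boot/config-$(uname -r)}"
    grep -q '^CONFIG_SECCOMP_FILTER=y$' "$config_file" 2>/dev/null
}

if seccomp_filter_enabled; then
    echo "seccomp filter mode is available"
else
    echo "seccomp filter mode not found in the kernel config"
fi
```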
To use this feature to limit which system calls a process can issue, you can run strace or perf to generate a list. I wrote a simple Bourne shell script named create-seccomp-profiles to assist with this process. The script is an adaptation of Brendan Gregg’s syscount script and uses perf to record the system calls a process issues. To generate a list of system calls for a specific process you can run the script with the “-l” (list system calls) and “-c” (command to interrogate) options:
$ create-seccomp-profiles -l -c nginx
System calls captured by perf:
socket socketpair bind listen connect
sendto setsockopt sendmsg recvmsg io_setup
eventfd2 epoll_create epoll_ctl epoll_pwait statfs
dup2 poll getdents ioctl fcntl
mkdir unlink newstat newfstat lseek
read pread64 pwrite64 access open
openat close mprotect brk munmap
set_robust_list futex setitimer setgroups setgid
setuid getpid geteuid newuname prlimit64
prctl sysinfo rt_sigprocmask rt_sigaction rt_sigsuspend
exit_group wait4 set_tid_address mmap arch_prctl
To create an initial profile (which will most likely need to be tweaked) to add to a systemd unit file you can run create-seccomp-profiles with the “-s” (create systemd output) option:
$ create-seccomp-profiles -s -c nginx
SystemCallFilter=socket socketpair bind listen connect setsockopt sendmsg recvmsg io_setup eventfd2 epoll_create epoll_ctl epoll_pwait statfs dup2 getdents ioctl fcntl mkdir unlink newstat newfstat lseek read pread64 pwrite64 access open openat close mprotect brk munmap set_robust_list futex setitimer setgroups setgid setuid getpid geteuid newuname prlimit64 prctl sysinfo rt_sigprocmask rt_sigaction rt_sigsuspend exit_group wait4 set_tid_address mmap arch_prctl
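If you collected your system call list some other way (strace, for example), the step from a multi-column list to a single SystemCallFilter= line is easy to reproduce. The following is a rough sketch and not the code from create-seccomp-profiles; make_syscall_filter is a hypothetical helper name, and it sorts and de-duplicates the names so two profiles are easier to diff.

```shell
# Turn a whitespace-separated list of syscall names on stdin (any number
# of columns or lines) into a single SystemCallFilter= line.
# make_syscall_filter is a hypothetical helper, not the author's script.
make_syscall_filter() {
    tr -s '[:space:]' '\n' |   # one name per line
        grep -v '^$' |         # drop empty lines
        sort -u |              # sort and remove duplicates
        paste -s -d ' ' - |    # join back into one space-separated line
        sed 's/^/SystemCallFilter=/'
}

printf '%s\n' 'socket bind listen' 'read  write close' 'read' | make_syscall_filter
```

The example input prints SystemCallFilter=bind close listen read socket write.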
To ensure that you get all of the possible system calls you need to simulate load that matches what you would see on a live server. This will ensure that dynamically loaded modules (ones loaded via dlopen() for example) will load and run. To apply the seccomp policy to a systemd service you can use the systemctl edit option:
$ systemctl edit nginx
[Service]
SystemCallFilter=socket socketpair bind listen connect setsockopt sendmsg recvmsg io_setup eventfd2 epoll_create epoll_ctl epoll_pwait statfs dup2 getdents ioctl fcntl mkdir unlink newstat newfstat lseek read pread64 pwrite64 access open openat close mprotect brk munmap set_robust_list futex setitimer setgroups setgid setuid getpid geteuid newuname prlimit64 prctl sysinfo rt_sigprocmask rt_sigaction rt_sigsuspend exit_group wait4 set_tid_address mmap arch_prctl
The edit option will create an override.conf in /etc/systemd/system/<SERVICE_NAME>.service.d. To reload the service with the new seccomp profile you can use the systemctl daemon-reload and restart options:
$ systemctl daemon-reload
$ systemctl restart nginx
If everything went as planned your service (nginx in this example) should start up and run as usual. If you happened to miss a system call the service will enter the failed state and a message will be written to the journal. This entry can be viewed with journalctl:
$ journalctl -n 20 -l
Nov 26 16:34:14 localhost.localdomain systemd[1]: Starting The nginx HTTP and reverse proxy server...
Nov 26 16:34:14 localhost.localdomain audit[3861]: SECCOMP auid=4294967295 uid=0 gid=0 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=3861 comm="rm" exe="/usr/bin/rm" sig=31 arch=c000003e syscall=5 compat=0 ip=0x7f39ead623c2 code=0x0
Nov 26 16:34:14 localhost.localdomain audit[3861]: ANOM_ABEND auid=4294967295 uid=0 gid=0 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=3861 comm="rm" exe="/usr/bin/rm" sig=31 res=1
Nov 26 16:34:14 localhost.localdomain systemd[1]: Started Process Core Dump (PID 3862/UID 0).
Nov 26 16:34:14 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@1-3862-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Nov 26 16:34:14 localhost.localdomain audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=nginx comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Nov 26 16:34:14 localhost.localdomain systemd[1]: nginx.service: Control process exited, code=killed status=31
Nov 26 16:34:14 localhost.localdomain systemd[1]: Failed to start The nginx HTTP and reverse proxy server.
Nov 26 16:34:14 localhost.localdomain systemd[1]: nginx.service: Unit entered failed state.
Nov 26 16:34:14 localhost.localdomain systemd[1]: nginx.service: Failed with result 'signal'.
Nov 26 16:34:14 localhost.localdomain systemd-coredump[3863]: Process 3861 (rm) of user 0 dumped core.
Stack trace of thread 3861:
#0 0x00007f39ead623c2 __GI___fxstat (ld-linux-x86-64.so.2)
#1 0x00007f39ead56ecb _dl_sysdep_read_whole_file (ld-linux-x86-64.so.2)
#2 0x00007f39ead5dcd8 _dl_load_cache_lookup (ld-linux-x86-64.so.2)
#3 0x00007f39ead4e5d2 _dl_map_object (ld-linux-x86-64.so.2)
#4 0x00007f39ead536b2 openaux (ld-linux-x86-64.so.2)
#5 0x00007f39ead6137b _dl_catch_error (ld-linux-x86-64.so.2)
#6 0x00007f39ead539fc _dl_map_object_deps (ld-linux-x86-64.so.2)
#7 0x00007f39ead48ce7 dl_main (ld-linux-x86-64.so.2)
#8 0x00007f39ead603d1 _dl_sysdep_start (ld-linux-x86-64.so.2)
#9 0x00007f39ead46f68 _dl_start (ld-linux-x86-64.so.2)
#10 0x00007f39ead45ed8 _start (ld-linux-x86-64.so.2)
Note the SECCOMP entry in the output above. It contains the failure message as well as a syscall field indicating which system call wasn’t allowed. To see which system call name the system call number maps to you can run ausyscall with the numeric system call identifier:
$ ausyscall 5
fstat
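When a profile is missing several calls, picking the numbers out of the journal by hand gets tedious. Here is a small sketch that pulls the syscall numbers out of SECCOMP audit records; denied_syscalls is a hypothetical helper name, and the sample line is a shortened copy of the journal entry above.

```shell
# Print the unique syscall numbers found in SECCOMP audit records on stdin.
# denied_syscalls is a hypothetical helper name.
denied_syscalls() {
    sed -n '/ SECCOMP /s/.* syscall=\([0-9][0-9]*\).*/\1/p' | sort -un
}

# Sample record, shortened from the journal output shown earlier.
sample='Nov 26 16:34:14 localhost.localdomain audit[3861]: SECCOMP auid=4294967295 uid=0 gid=0 pid=3861 comm="rm" exe="/usr/bin/rm" sig=31 arch=c000003e syscall=5 compat=0 ip=0x7f39ead623c2 code=0x0'
printf '%s\n' "$sample" | denied_syscalls
```

On a live system you would feed it the output of journalctl instead of the sample line and pass each number it prints to ausyscall.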
Once you know the system call that caused the service to fail you can append it to the SystemCallFilter entry in the systemd unit override file and restart the process using the steps above. If the process starts up successfully you can check the status proc entry to verify that seccomp is active:
$ grep -i seccomp /proc/34256/status
Seccomp: 2
This entry has one of three values: 0 (seccomp is disabled), 1 (strict mode) or 2 (filter mode).
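If you want to check a handful of processes at once, the lookup is easy to script. This is a minimal sketch: seccomp_mode is a hypothetical helper name, the mode numbers are the ones documented in proc(5), and the example inspects the current shell rather than a real service.

```shell
# Map the Seccomp field of /proc/<pid>/status to a mode name (see proc(5)).
# seccomp_mode is a hypothetical helper name.
seccomp_mode() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

# Report the mode of the current shell. The value is empty (and reported
# as "unknown") if the status file or the Seccomp field is missing.
value=$(awk '/^Seccomp:/ {print $2}' "/proc/$$/status" 2>/dev/null || true)
echo "pid $$: $(seccomp_mode "$value")"
```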
The create-seccomp-profiles script is very much a work in progress and I have a few items to tackle over the coming weeks:
This was a good first start and I learned a ton about seccomp while researching this exciting topic.