I was doing some testing this week and received the following error when I tried to access a volume inside a container:
$ touch /haproxy/i
touch: cannot touch 'i': Permission denied
When I checked the system logs I saw the following error:
Sep 28 18:40:23 kub1 audit[8881]: AVC avc: denied { write } for pid=8881 comm="touch" name="haproxy" dev="sda1" ino=655362 context=system_u:system_r:container_t:s0:c324,c837 tcontext=unconfined_u:object_r:default_t:s0 tclass=dir permissive=0
The docker container was started with the “-v” option to bind mount a directory from the host:
$ docker run -d -v /haproxy:/haproxy --restart unless-stopped
The error shown above was generated because I didn’t tell my orchestration tool to apply an SELinux label to the volume I was trying to map into the container. In the SELinux world, processes and file system objects are given contexts that describe their purpose. The kernel then uses these contexts to decide whether a process is allowed to access a file object. To allow a docker container to access a volume on an SELinux-enabled host, you need to attach the “z” or “Z” flag to the volume mount. These flags are thoroughly described in the docker-run manual page:
“To change a label in the container context, you can add either of two suffixes :z or :Z to the volume mount. These suffixes tell Docker to relabel file objects on the shared volumes. The z option tells Docker that two containers share the volume content. As a result, Docker labels the content with a shared content label. Shared volume labels allow all containers to read/write content. The Z option tells Docker to label the content with a private unshared label. Only the current container can use a private volume.”
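The lowercase flag is the one you want when more than one container needs the same host directory. Here is a rough sketch of what that looks like (the container names, the /shared directory, and the sleep command are placeholders, not part of my setup):
$ docker run -d --name web1 -v /shared:/shared:z fedora:26 sleep infinity
$ docker run -d --name web2 -v /shared:/shared:z fedora:26 sleep infinity
Both containers end up with read/write access to /shared because the volume content is labeled with a shared SELinux label rather than a per-container one.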
When I added the “Z” suffix to the volume everything worked as expected:
$ docker run -d -v /haproxy:/haproxy:Z --restart unless-stopped
My haproxy instances fired up, life was grand, and the haproxy containers were distributing connections to my back-end servers. Then a question hit me: how does this work under the covers? I started reading and came across two (one & two) excellent posts by Dan Walsh. When a container starts, the processes comprising that container are labeled with an SELinux context. You can run ‘ps -eZ’ or ‘docker inspect …’ to view the context of a container:
$ docker run --name gorp --rm -it -v /foo:/foo fedora:26 /bin/sh
$ docker inspect -f '{{ .ProcessLabel }}' gorp
system_u:system_r:container_t:s0:c31,c878
$ ps -eZ | grep $(docker inspect -f '{{ .State.Pid }}' gorp)
system_u:system_r:container_t:s0:c31,c878 20197 pts/5 00:00:00 sh
In order for the process to write to a volume, the volume needs to be labeled with an SELinux context that the process context has access to. This is the purpose of the ‘[zZ]’ flags. If you start a container without the z flag, you will receive a permission denied error because the SELinux level of the volume and the level of the process don’t match (you can read more about levels here). This may be easier to illustrate with an example. If I start a docker container and mount a volume without the “z” flag, we can see that the SELinux levels are different:
$ docker run --name gorp --rm -it -v /foo:/foo fedora:26 /bin/sh
$ docker inspect -f '{{ .ProcessLabel }}' gorp
system_u:system_r:container_t:s0:c21,c30
$ ls -ladZ /foo
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0:c135,c579 4096 Sep 29 12:22 /foo
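The z/Z flags are essentially relabeling the host directory for you. If you wanted to line the labels up by hand, something like the following chcon invocation should do the trick (a sketch that reuses the MCS categories from the process label above):
$ chcon -R -t container_file_t -l s0:c21,c30 /foo
After the relabel, the directory’s level matches the categories assigned to the container process, which is exactly what the “:Z” suffix automates.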
If we tell docker to label the volume with the correct SELinux context prior to performing the bind mount, the levels are updated so the container process can access the volume. Here is another example:
$ docker run --name gorp --rm -it -v /foo:/foo:Z fedora:26 /bin/sh
$ docker inspect -f '{{ .ProcessLabel }}' gorp
system_u:system_r:container_t:s0:c126,c135
$ ls -ladZ /foo
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0:c126,c135 4096 Sep 30 10:42 /foo
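To double check that the categories line up, a quick write test from a second terminal should now succeed (a sketch; the container name comes from the example above):
$ docker exec gorp touch /foo/selinux-test
$ ls -Z /foo/selinux-test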
The contexts that apply to docker are defined in the lxc_contexts file:
$ cat /etc/selinux/targeted/contexts/lxc_contexts
process = "system_u:system_r:container_t:s0"
content = "system_u:object_r:virt_var_lib_t:s0"
file = "system_u:object_r:container_file_t:s0"
ro_file="system_u:object_r:container_ro_file_t:s0"
sandbox_kvm_process = "system_u:system_r:svirt_qemu_net_t:s0"
sandbox_kvm_process = "system_u:system_r:svirt_qemu_net_t:s0"
sandbox_lxc_process = "system_u:system_r:container_t:s0"
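Docker also exposes knobs to adjust these contexts on a per-container basis. If you want a container to run with a specific MCS level instead of a randomly assigned one, the --security-opt option can be used (a sketch; the categories below are arbitrary):
$ docker run --rm -it --security-opt label=level:s0:c100,c200 fedora:26 /bin/sh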
It’s really interesting how these items are stitched together. You can read more about how this works here and here. You can also read Dan’s article describing why it’s important to leave SELinux enabled.
When developing Ansible playbooks, a common pattern is to run a command and use its output in a later task. Here is a simple example:
---
- hosts: localhost
  connection: local
  tasks:
    - name: Check if mlocate is installed
      command: dnf info mlocate
      register: mlocate_output
    - name: Update the locate database
      command: updatedb
      when: '"No matching Packages to list" in mlocate_output.stderr'
In the first task, dnf will run and the output from the command will land on either STDOUT or STDERR. But how do you know which one? One way is to add a debug task to your playbook:
---
- hosts: localhost
  connection: local
  tasks:
    - name: Check if mlocate is installed
      command: dnf info mlocate
      register: mlocate_output
    - name: Print the contents of mlocate_output
      debug:
        var: mlocate_output
Once the task runs you can view the stderr and stdout fields to see which of the two is populated:
TASK [Print the contents of mlocate_output]
ok: [localhost] => {
    "mlocate_output": {
        "changed": true,
        "cmd": [
            "dnf",
            "info",
            "mlocate"
        ],
        "delta": "0:00:31.239145",
        "end": "2017-09-27 16:39:46.919038",
        "rc": 0,
        "start": "2017-09-27 16:39:15.679893",
        "stderr": "",
        "stderr_lines": [],
        "stdout": "Last metadata expiration check: 0:43:16 ago on Wed 27 Sep 2017 03:56:05 PM EDT.\nInstalled Packages\nName : mlocate\nVersion : 0.26\nRelease : 16.fc26\nArch : armv7hl\nSize : 366 k\nSource : mlocate-0.26-16.fc26.src.rpm\nRepo : @System\nFrom repo : fedora\nSummary : An utility for finding files by name\nURL : https://fedorahosted.org/mlocate/\nLicense : GPLv2\nDescription : mlocate is a locate/updatedb implementation. It keeps a database\n : of all existing files and allows you to lookup files by name.\n : \n : The 'm' stands for \"merging\": updatedb reuses the existing\n : database to avoid rereading most of the file system, which makes\n : updatedb faster and does not trash the system caches as much as\n : traditional locate implementations.",
.....
In the output above we can see that stderr is empty and stdout contains the output from the command. While this works fine, it requires you to write a playbook and wait for it to run to get feedback. Strace can provide the same information and in most cases is much quicker. To get the same information we can pass the command as an argument to strace and limit the output to just write(2) system calls:
$ strace -yy -s 8192 -e trace=write dnf info mlocate
.....
write(1, "Description : mlocate is a locate/updatedb implementation. It keeps a database ofn : all existing files and allows you to lookup files by name.n : n : The 'm' stands for "merging": updatedb reuses the existing database to avoidn : rereading most of the file system, which makes updatedb faster and does notn : trash the system caches as much as traditional locate implementations.", 442Description : mlocate is a locate/updatedb implementation. It keeps a database of
.....
The first argument to write(2) is the file descriptor being written to. In this case that’s STDOUT. This took less than 2 seconds to run and by observing the first argument to write you know which file descriptor the application is writing to.
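If strace isn't handy, plain shell redirection will answer the same question. Whatever still shows up on the terminal after each of these commands was written to the other stream:
$ dnf info mlocate 1>/dev/null     # anything still printed went to stderr
$ dnf info mlocate 2>/dev/null     # anything still printed went to stdout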
During a playbook run I was presented with the following error:
failed: [localhost] (item=[u'yum']) => {"failed": true, "item": ["yum"], "msg": "python2 yum module is needed for this module"}
The role that was executing had a task similar to the following:
- name: Install rsyslog packages
  yum: pkg={{item}} state=installed update_cache=false
  with_items:
    - rsyslog
  notify: Restart rsyslog service
The system I was trying to update was running Fedora 26, which uses the dnf package manager. Dnf is built on top of Python 3, and Fedora 26 no longer includes the yum Python 2 bindings by default (if you want to keep using the Ansible yum module, you can create a task to install the yum package first). Switching the task to use package instead of yum remedied the issue. Here is the updated task:
- name: Install rsyslog packages
  package: pkg={{item}} state=installed
  with_items:
    - rsyslog
  notify: Restart rsyslog service
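The other route mentioned above, keeping the yum module and installing the yum package so the Python 2 bindings are present, is a one-liner on the managed host (a sketch, only needed if you really want to stick with the yum module):
$ sudo dnf install -y yum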
The issue was easy to recognize after reading through the yum module source code. Posting this here in case it helps others.
This weekend I spent some time cleaning up a number of Dockerfiles and getting them integrated into my build system. Docker provides the ADD and COPY commands to take the contents from a given source and copy them into your container. On the surface both commands appear to do the same thing, but there is one slight difference. The COPY command works solely on files and directories:
The COPY instruction copies new files or directories from <src> and adds them to the filesystem of the container at the path <dest>.
While the ADD command supports files, directories AND remote URLs:
The ADD instruction copies new files, directories or remote file URLs from <src> and adds them to the filesystem of the image at the path <dest>.
The additional feature provided by ADD allows you to retrieve remote resources and stash them in your container for use by your applications:
ADD http://prefetch.net/path/to/stuff /stuff
Some of the Dockerfiles I’ve read through on github have done some extremely interesting things with ADD and remote resource retrieval. Noting this here for future reference.
This past week I created a new XFS file system to test a feature and received the following error when I tried to mount it:
$ mount /dev/mapper/lv01 /disks/lv01
mount: wrong fs type, bad option, bad superblock on /dev/mapper/lv01,
missing codepage or helper program, or other error
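The mount error itself is generic, so the kernel log is usually the first place to look for the specific XFS complaint:
$ dmesg | tail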
The file system was created with the default options and the output of xfs_info looked good:
meta-data=/dev/mapper/lv01       isize=256    agcount=4, agsize=6553600 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
The system had an existing XFS file system mounted, so I decided to compare the metadata from the working file system with the file system I had just created. This showed one subtle difference:
meta-data=/dev/mapper/vg-var     isize=256    agcount=4, agsize=131072 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=524288, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
The working file system had both the crc and ftype flags set to 0, while the non-working file system had crc set to 0 and ftype set to 1 (the defaults for the version of xfsprogs I was using). Re-creating the file system with ftype=0 allowed it to mount. But I’m not one to sit back, enjoy a victory, and move on. I wanted to know WHY it failed and what the implications of setting ftype to 0 are. The mkfs.xfs manual page provides the following description of the ftype flag:
ftype=value This feature allows the inode type to be stored in the directory structure so that the readdir(3) and getdents(2) do not need to look up the inode to determine the inode type.
The value is either 0 or 1, with 1 signifying that filetype information will be stored in the directory structure. The default value is 0.
When CRCs are enabled via -m crc=1, the ftype functionality is always enabled. This feature can not be turned off for such filesystem configurations.
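For reference, re-creating the file system with the older directory format looks roughly like this (a sketch; -f forces mkfs.xfs to overwrite the existing file system on the device from the example above):
$ mkfs.xfs -f -n ftype=0 /dev/mapper/lv01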
I took this information and started to dig around to see when this feature was added to the Linux kernel. In Linux kernel 3.10 the XFS file system team added the foundation to perform CRC32 metadata checksums. You can read a great write up on this on the LKML. In Linux kernel 3.13 Ben Myers submitted a number of XFS patches which included the “add the inode directory type support to XFS_IOC_FSGEOM” change. This change added support to allow the inode type to be stored in the structure used to represent a directory in XFS.
The system I was running contained a version of xfsprogs that supported this feature but the xfs kernel module didn’t. Hence the file system creation worked but the kernel barfed when it tried to parse the file system metadata (the Linux kernel xfs_mount.c source was extremely useful for understanding what the error meant). Fortunately newer versions of xfsprogs enable CRC metadata checksumming by default which is a nice step towards achieving what ZFS introduced to the world several years ago. If you are interested in learning more about XFS I would highly recommend adding XFS Filesystem Structure to your reading queue.