Using docker volumes on SELinux-enabled servers

I was doing some testing this week and received the following error when I tried to access a volume inside a container:

$ touch /haproxy/i
touch: cannot touch ‘i’: Permission denied

When I checked the system logs I saw the following error:

Sep 28 18:40:23 kub1 audit[8881]: AVC avc:  denied  { write } for pid=8881 comm="touch" name="haproxy" dev="sda1" ino=655362 context=system_u:system_r:container_t:s0:c324,c837 tcontext=unconfined_u:object_r:default_t:s0 tclass=dir permissive=0

The docker container was started with the “-v” option to bind mount a directory from the host:

$ docker run -d -v /haproxy:/haproxy --restart unless-stopped haproxy:1.7.9

The error shown above was generated because I didn't tell my orchestration tool to apply an SELinux label to the volume I was trying to map into the container. In the SELinux world, processes and file system objects are given contexts to describe their purpose, and the kernel consults these contexts to decide whether a process is allowed to access a file object. To allow a docker container to access a volume on an SELinux-enabled host you need to attach the "z" or "Z" flag to the volume mount. These flags are thoroughly described in the docker-run manual page:

"To change a label in the container context, you can add either of two suffixes :z or :Z to the volume mount. These suffixes tell Docker to relabel file objects on the shared volumes. The z option tells Docker that two containers share the volume content. As a result, Docker labels the content with a shared content label. Shared volume labels allow all containers to read/write content.  The Z option tells Docker to label the content with a private unshared label. Only the current container can use a private volume."

When I added the “Z” suffix to the volume everything worked as expected:

$ docker run -d -v /haproxy:/haproxy:Z --restart unless-stopped haproxy:1.7.9

My haproxy instances fired up, life was grand and the haproxy containers were distributing connections to my back-end servers. Then a question hit me: how does this work under the covers? I started reading and came across two (one & two) excellent posts by Dan Walsh. When a container starts, the processes comprising that container are labeled with an SELinux context. You can run 'ps -eZ' or 'docker inspect …' to view the context of a container:

$ docker run --name gorp --rm -it -v /foo:/foo fedora:26 /bin/sh

$ docker inspect -f '{{ .ProcessLabel }}' gorp
system_u:system_r:container_t:s0:c31,c878

$ ps -eZ | grep $(docker inspect -f '{{ .State.Pid }}' gorp)
system_u:system_r:container_t:s0:c31,c878 20197 pts/5 00:00:00 sh

In order for a process to be able to write to a volume, the volume needs to be labeled with an SELinux context that the process context can access. This is the purpose of the '[zZ]' flags. If you start a container without the z flag you will receive a permission denied error because the SELinux volume level and the process level don't match (you can read more about levels here). This may be easier to illustrate with an example. If I start a docker command and mount a volume without the "z" flag we can see that the SELinux levels are different:

$ docker run --name gorp --rm -it -v /foo:/foo fedora:26 /bin/sh

$ docker inspect -f '{{ .ProcessLabel }}' gorp
system_u:system_r:container_t:s0:c21,c30

$ ls -ladZ /foo
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0:c135,c579 4096 Sep 29 12:22 /foo

If we tell docker to label the volume with the correct SELinux context prior to performing the bind mount the levels are updated to allow the container process to access the volume. Here is another example:

$ docker run --name gorp --rm -it -v /foo:/foo:Z fedora:26 /bin/sh

$ docker inspect -f '{{ .ProcessLabel }}' gorp
system_u:system_r:container_t:s0:c126,c135

$ ls -ladZ /foo
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0:c126,c135 4096 Sep 30 10:42 /foo
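The relabeling that docker performs for the ":Z" flag can also be approximated by hand with the standard SELinux tools. Here is a rough sketch (run as root) that reuses the directory and MCS level from the example above; on a real system you would match the level shown in the container's process label:

```
# Manually apply the container file type and the container's MCS level
# to the host directory (roughly what the :Z flag does for us)
chcon -R -t container_file_t -l s0:c126,c135 /foo

# Verify the new label
ls -ladZ /foo

# When the container is gone, restore the directory's default label
restorecon -R -v /foo
```

This is mostly useful for understanding what is happening; letting docker manage the labels via ":z"/":Z" is the less error-prone option.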

The contexts that apply to docker are defined in the lxc_contexts file:

$ cat /etc/selinux/targeted/contexts/lxc_contexts
process = "system_u:system_r:container_t:s0"
content = "system_u:object_r:virt_var_lib_t:s0"
file = "system_u:object_r:container_file_t:s0"
ro_file = "system_u:object_r:container_ro_file_t:s0"
sandbox_kvm_process = "system_u:system_r:svirt_qemu_net_t:s0"
sandbox_lxc_process = "system_u:system_r:container_t:s0"

It’s really interesting how these items are stitched together. You can read more about how this works here and here. You can also read Dan’s article describing why it’s important to leave SELinux enabled.

Which file descriptor (STDOUT, STDERR, etc.) is my application writing to?

When developing ansible playbooks a common pattern is to run a command and use the output in a future task. Here is a simple example:

---
- hosts: localhost
  connection: local
  tasks:
    - name: Check if mlocate is installed
      command: dnf info mlocate
      register: mlocate_output

    - name: Update the locate database
      command: updatedb
      when: '"No matching Packages to list" in mlocate_output.stderr'

In the first task dnf will run and its output will be written to either STDOUT or STDERR. But how do you know which one? One way is to add a debug statement to your playbook:

---
- hosts: localhost
  connection: local
  tasks:
    - name: Check if mlocate is installed
      command: dnf info mlocate
      register: mlocate_output

    - name: Print the contents of mlocate_output
      debug:
        var: mlocate_output

Once the task runs you can view the stderr and stdout fields to see which of the two is populated:

TASK [Print the contents of mlocate_output] ************************************************************************
ok: [localhost] => {
    "mlocate_output": {
        "changed": true, 
        "cmd": [
            "dnf", 
            "info", 
            "mlocate"
        ], 
        "delta": "0:00:31.239145", 
        "end": "2017-09-27 16:39:46.919038", 
        "rc": 0, 
        "start": "2017-09-27 16:39:15.679893", 
        "stderr": "", 
        "stderr_lines": [], 
        "stdout": "Last metadata expiration check: 0:43:16 ago on Wed 27 Sep 2017 03:56:05 PM EDT.\nInstalled Packages\nName         : mlocate\nVersion      : 0.26\nRelease      : 16.fc26\nArch         : armv7hl\nSize         : 366 k\nSource       : mlocate-0.26-16.fc26.src.rpm\nRepo         : @System\nFrom repo    : fedora\nSummary      : An utility for finding files by name\nURL          : https://fedorahosted.org/mlocate/\nLicense      : GPLv2\nDescription  : mlocate is a locate/updatedb implementation.  It keeps a database\n             : of all existing files and allows you to lookup files by name.\n             : \n             : The 'm' stands for \"merging\": updatedb reuses the existing\n             : database to avoid rereading most of the file system, which makes\n             : updatedb faster and does not trash the system caches as much as\n             : traditional locate implementations.", 
.....

In the output above we can see that stderr is empty and stdout contains the output from the command. While this works fine, it requires you to write a playbook and wait for it to run to get feedback. Strace can provide the same information and in most cases is much quicker. To get the same information we can pass the command as an argument to strace and limit the output to just write(2) system calls:

$ strace -yy -s 8192 -e trace=write dnf info mlocate
.....
write(1, "Description  : mlocate is a locate/updatedb implementation.  It keeps a database of\n             : all existing files and allows you to lookup files by name.\n             : \n             : The 'm' stands for \"merging\": updatedb reuses the existing database to avoid\n             : rereading most of the file system, which makes updatedb faster and does not\n             : trash the system caches as much as traditional locate implementations.", 442Description  : mlocate is a locate/updatedb implementation.  It keeps a database of
.....

The first argument to write(2) is the file descriptor being written to. In this case that’s STDOUT. This took less than 2 seconds to run and by observing the first argument to write you know which file descriptor the application is writing to.
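If strace isn't handy, plain shell redirection answers the same question: silence one stream at a time and see which one the text disappears from. A small sketch using a made-up command that writes one line to each stream:

```shell
# A command that writes one line to stdout and one to stderr
cmd='echo to-stdout; echo to-stderr 1>&2'

# Keep only stdout: if the text survives here, it was written to fd 1
sh -c "$cmd" 2>/dev/null

# Keep only stderr: duplicate fd 2 onto fd 1, then discard the original stdout
sh -c "$cmd" 2>&1 >/dev/null
```

The redirection order in the second command matters: "2>&1" is processed before ">/dev/null", so stderr ends up on the original stdout while stdout itself is discarded.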

Working around the ansible "python2 yum module is needed for this module" error

During a playbook run I was presented with the following error:

failed: [localhost] (item=[u'yum']) => {"failed": true, "item": ["yum"], "msg": "python2 yum module is needed for this  module"}

The role that was executing had a task similar to the following:

- name: Install rsyslog packages
  yum: pkg={{item}} state=installed update_cache=false
  with_items:
    - rsyslog
  notify: Restart rsyslog service

The system I was trying to update was running Fedora 26, which uses the dnf package manager. Dnf is built on top of Python 3, and Fedora 26 no longer includes the yum Python 2 bindings by default (if you want to use the ansible yum module you can create a task to install the yum package). Switching the task to use package instead of yum remedied the issue. Here is the updated task:

- name: Install rsyslog packages
  package: pkg={{item}} state=installed 
  with_items:
    - rsyslog
  notify: Restart rsyslog service
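Another option on Fedora is to bypass the yum module entirely and use ansible's dnf module, which works with the Python 3 bindings that ship with the OS. A sketch of the equivalent task:

```yaml
- name: Install rsyslog packages
  dnf:
    name: rsyslog
    state: present
  notify: Restart rsyslog service
```

The package module shown above has the advantage of being distribution-agnostic, while the dnf module exposes dnf-specific options if you need them.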

The issue was easy to recognize after reading through the yum module source code. Posting this here in case it helps others.

The subtle differences between the docker ADD and COPY commands

This weekend I spent some time cleaning up a number of Dockerfiles and getting them integrated into my build system. Docker provides the ADD and COPY commands to take the contents from a given source and copy them into your container. On the surface both commands appear to do the same thing but there is one slight difference. The COPY command works solely on files and directories:

The COPY instruction copies new files or directories from <src> and adds them to the filesystem of the container at the path <dest>.

While the ADD command supports files, directories AND remote URLs:

The ADD instruction copies new files, directories or remote file URLs from <src> and adds them to the filesystem of the image at the path <dest>.

The additional feature provided by ADD allows you to retrieve remote resources and stash them in your container for use by your applications:

ADD http://prefetch.net/path/to/stuff /stuff

Some of the Dockerfiles I’ve read through on github have done some extremely interesting things with ADD and remote resource retrieval. Noting this here for future reference.
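For reference, here is a contrived Dockerfile fragment (all file names made up) showing the two instructions side by side. One additional ADD behavior worth knowing from the Dockerfile documentation: if the source is a local tar archive in a recognized compression format, ADD automatically unpacks it into the destination:

```
# COPY only understands local files and directories
COPY haproxy.cfg /usr/local/etc/haproxy/haproxy.cfg

# ADD can also fetch remote URLs ...
ADD http://prefetch.net/path/to/stuff /stuff

# ... and auto-extracts recognized local tar archives into the destination
ADD site.tar.gz /var/www/
```

Note that remote URLs fetched with ADD are NOT auto-extracted; that behavior only applies to local archives.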

Creating XFS file systems on older Linux kernels

This past week I created a new XFS file system to test a feature and received the following error when I tried to mount it:

$ mount /dev/mapper/lv01 /disks/lv01
mount: wrong fs type, bad option, bad superblock on /dev/mapper/lv01,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

The file system was created with the default options and the output of xfs_info looked good:

meta-data=/dev/mapper/lv01 isize=256    agcount=4, agsize=6553600 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The system had an existing XFS file system mounted, so I decided to compare the metadata from the working file system with the one I just created. This showed one subtle difference:

meta-data=/dev/mapper/vg-var isize=256    agcount=4, agsize=131072 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=524288, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The working file system had the crc and ftype flags set to 0, while the non-working file system had crc set to 0 and ftype set to 1 (these are the defaults for the version of xfsprogs I was using). Re-creating the file system with ftype=0 allowed the file system to mount. But I’m not one to sit back, enjoy a victory, and move on; I wanted to know WHY it failed and the implications of setting ftype to 0. The mkfs.xfs manual page provides the following description of the ftype flag:

"ftype=value
        This feature allows the inode type to be stored 
        in the directory structure so that the readdir(3)  
        and  getdents(2)  do  not need to look up the inode 
        to determine the inode type.

        The value is either 0 or 1, with 1 signifying
        that filetype information will be stored in
        the directory structure. The default value is 0.

        When  CRCs  are  enabled via -m crc=1, the ftype 
        functionality is always enabled. This feature can 
        not be turned off for such filesystem configurations."

I took this information and started to dig around to see when this feature was added to the Linux kernel. In Linux kernel 3.10 the XFS file system team added the foundation to perform CRC32 metadata checksums. You can read a great write up on this on the LKML. In Linux kernel 3.13 Ben Myers submitted a number of XFS patches which included the “add the inode directory type support to XFS_IOC_FSGEOM” change. This change added support to allow the inode type to be stored in the structure used to represent a directory in XFS.

The system I was running contained a version of xfsprogs that supported this feature but the xfs kernel module didn’t. Hence the file system creation worked but the kernel barfed when it tried to parse the file system metadata (the Linux kernel xfs_mount.c source was extremely useful for understanding what the error meant). Fortunately newer versions of xfsprogs enable CRC metadata checksumming by default which is a nice step towards achieving what ZFS introduced to the world several years ago. If you are interested in learning more about XFS I would highly recommend adding XFS Filesystem Structure to your reading queue.
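For future reference, forcing the older v4 on-disk format at mkfs time looks roughly like this (device name taken from the example above; -f overwrites whatever is on the device, so double-check the target before running anything like it):

```
# Disable metadata CRCs and on-disk ftype so the file system uses
# the older v4 format that pre-3.13 kernels understand
mkfs.xfs -f -m crc=0 -n ftype=0 /dev/mapper/lv01
```

mkfs.xfs prints the file system geometry when it finishes, so you can confirm crc=0 and ftype=0 in its output before attempting the mount.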

A couple of quick and easy ways to display JSON data on the Linux command line

I interact with RESTful services daily and periodically need to review the JSON objects exposed through one or more endpoints. There are several Linux utilities that can take a JSON object and print the object in an easily readable form. The pygmentize utility (available in the python-pygments package) can be fed a JSON object via a file or STDIN:

$ curl http://bind:8080/json 2>/dev/null | pygmentize |more
{
  "json-stats-version":"1.2",
  "boot-time":"2017-09-09T11:56:04.442Z",
  "config-time":"2017-09-09T11:56:04.520Z",
  "current-time":"2017-09-09T12:10:36.054Z",
  "version":"9.11.2",
  .....
}

In the output above I’m retrieving a JSON object from the Bind statistics server and feeding it to pygmentize via STDIN. Pygmentize takes the object it is given and produces a nicely formatted JSON object on STDOUT.

Pygmentize is super handy but the real power house of the JSON command line processors is jq. This amazing utility has numerous features which allow you to retrieve keys, values, objects and arrays and apply complex filters and operations to these elements. In its simplest form jq will take a JSON object and produce pretty output:

$ curl http://bind:8080/json 2>/dev/null | jq '.' |more
{
  "json-stats-version": "1.2",
  "boot-time": "2017-09-09T11:56:04.442Z",
  "config-time": "2017-09-09T11:56:04.520Z",
  "current-time": "2017-09-09T12:24:43.982Z",
  "version": "9.11.2",
  .....
}
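If jq isn’t installed, Python’s built-in json.tool module is another quick way to pretty-print an object. Here it is fed a small inline document rather than the Bind endpoint, just to show the mechanics:

```shell
# json.tool ships with Python, so it is available almost everywhere
echo '{"json-stats-version":"1.2","version":"9.11.2"}' | python3 -m json.tool
```

This only pretty-prints; for filtering and transformations you still want jq.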

To see the real power of jq we need to observe how to use operations and filters on a JSON object. The Bind statistics server produces a JSON object similar to the following (this was heavily edited to conserve space):

{
  "json-stats-version":"1.2",
  "boot-time":"2017-09-09T11:56:04.442Z",
  "config-time":"2017-09-09T11:56:04.520Z",
  "current-time":"2017-09-09T13:40:59.901Z",
  "version":"9.11.2",
  "rcodes":{
    "NOERROR":7798,
    "FORMERR":0,
    "SERVFAIL":2,
    "NXDOMAIN":166,
    "NOTIMP":0,
    "REFUSED":161,
    "YXDOMAIN":0,
    "YXRRSET":0,
    "NXRRSET":0,
    "NOTAUTH":0,
    "NOTZONE":0,
  },
  "qtypes":{
    "A":7023,
    "NS":1,
    "PTR":153,
    "MX":1,
    "AAAA":950
  },
  .....
}

Let’s say you wanted to view the number of A, NS, PTR and MX records queried. We can use a jq filter to grab the qtypes object and pass it through a second filter to retrieve the values of the A, NS, PTR and MX keys:

$ curl http://bind:8080/json 2>/dev/null \
          | jq -r '.qtypes| "\(.A) \(.NS) \(.PTR) \(.MX)"'
7023 1 153 1

In this example I am using string interpolation to turn the values of A, NS, PTR and MX into a string, which is then printed on STDOUT. Jq also has a number of useful math operations that can be applied to the values in a JSON object. To sum the totals of the various failure response codes in the rcodes object we can use the addition operator:

$ curl http://192.168.1.2:8080/json 2>/dev/null \
          | jq -r '.rcodes| .NXDOMAIN + .SERVFAIL \
                  + .REFUSED + .FORMERR'
8135

In this example I am retrieving the values of the NXDOMAIN, SERVFAIL, REFUSED and FORMERR keys and summing them with the addition operator. If you are new to jq or JSON I would highly suggest reading the jq manual and Introducing JSON. These are excellent resources!