Docker has a pluggable storage architecture that currently includes six drivers:
AUFS - the original docker storage driver.
OverlayFS - driver built on top of overlayfs.
Btrfs - driver built on top of btrfs.
Device Mapper - driver built on top of the device mapper.
ZFS - driver built on top of the ZFS file system.
VFS - a VFS-layer driver that isn't considered suitable for production.
If you have docker installed you can run ‘docker info’ to see which driver you are using:
$ docker info | grep "Storage Driver:"
Storage Driver: devicemapper
Picking the right driver isn't straightforward given how fast docker and the storage drivers are evolving. The docker documentation has some excellent suggestions, and you can't go wrong sticking with the most widely used drivers. I have hit a couple of bugs with the overlayfs driver, and I have never bothered with the devicemapper driver backed by loopback files (as opposed to devicemapper with direct LVM) because of Jason's post.
My biggest storage lesson from the past year (learned, as usual, by hitting bugs) is to give docker a chunk of dedicated storage. This space can reside in your root volume group, in a dedicated volume group, or in a partition. To use a dedicated volume group you can add "VG=VOLUME_GROUP" to /etc/sysconfig/docker-storage-setup:
$ cat /etc/sysconfig/docker-storage-setup
VG="docker"
To use a dedicated disk you can add "DEVS=BLOCK_DEVICE" to /etc/sysconfig/docker-storage-setup:
$ cat /etc/sysconfig/docker-storage-setup
DEVS="/dev/sdb"
If either of these variables is set, docker-storage-setup will create an LVM thin pool that docker will use to layer images. This layering is the foundation that docker containers are built on top of.
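Under the hood this is plain LVM thin provisioning. The commands below are a rough sketch of what docker-storage-setup does for you; the volume names, sizes, and flags are illustrative, and the tool picks its own values:
$ lvcreate -n docker-pool -l 60%FREE docker
$ lvcreate -n docker-poolmeta -L 412M docker
$ lvconvert -y --thinpool docker/docker-pool --poolmetadata docker/docker-poolmeta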
If you change VG or DEVS while docker is operational you will need to back up your images, clean out /var/lib/docker, and then run docker-storage-setup to apply the changes. The following shows what happens if docker-storage-setup is run without any options set:
$ docker-storage-setup
Rounding up size to full physical extent 412.00 MiB
Logical volume "docker-poolmeta" created.
Logical volume "docker-pool" created.
THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Converted docker/docker-pool to thin pool.
Logical volume docker/docker-pool changed.
This creates the data and metadata volumes in the root volume group and updates the docker configuration (a quick sanity check is shown below). If anyone is using the btrfs or zfs storage drivers, shoot me a note to let me know what your experience has been.
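To sanity check the result you can confirm the thin pool exists and that docker's storage options reference it. This is a sketch based on my devicemapper setup; docker-storage-setup writes the options to /etc/sysconfig/docker-storage on my systems, and the exact options it generates vary by version:
$ lvs -o lv_name,vg_name,lv_attr,lv_size docker
$ grep thinpooldev /etc/sysconfig/docker-storage
The thin pool shows up with an lv_attr string starting with "t", and the dm.thinpooldev storage option should point docker at it.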
A while back I wrote a blog entry showing how to get tape drive statistics with systemtap. That script wasn't very reliable and I would frequently see it crash after collecting just a few samples. Thanks to the work of some amazing Linux kernel engineers I no longer have to touch systemtap. Recent kernels expose a number of incredibly useful tape statistics through the /sys file system:
$ pwd
/sys/class/scsi_tape/nst0/stats
$ ls -l
total 0
-r--r--r-- 1 root root 4096 Oct 10 16:15 in_flight
-r--r--r-- 1 root root 4096 Oct 10 16:15 io_ns
-r--r--r-- 1 root root 4096 Oct 10 16:15 other_cnt
-r--r--r-- 1 root root 4096 Oct 10 15:30 read_byte_cnt
-r--r--r-- 1 root root 4096 Oct 10 15:30 read_cnt
-r--r--r-- 1 root root 4096 Oct 10 16:15 read_ns
-r--r--r-- 1 root root 4096 Oct 10 16:15 resid_cnt
-r--r--r-- 1 root root 4096 Oct 10 15:30 write_byte_cnt
-r--r--r-- 1 root root 4096 Oct 10 15:30 write_cnt
-r--r--r-- 1 root root 4096 Oct 10 16:15 write_ns
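These files are cumulative counters, so computing throughput is just a matter of sampling them twice and dividing the deltas by the sampling interval. Here is a minimal Python sketch that does this for the read and write byte counters (the drive name and interval are assumptions; adjust them for your environment):
#!/usr/bin/env python
import time
STATS = "/sys/class/scsi_tape/nst0/stats"  # drive to watch
INTERVAL = 1.0  # seconds between samples
def read_counter(name):
    """Return the current value of a cumulative sysfs counter."""
    with open("%s/%s" % (STATS, name)) as f:
        return int(f.read())
while True:
    rd1, wr1 = read_counter("read_byte_cnt"), read_counter("write_byte_cnt")
    time.sleep(INTERVAL)
    rd2, wr2 = read_counter("read_byte_cnt"), read_counter("write_byte_cnt")
    print("read %.1f kB/s, write %.1f kB/s" % ((rd2 - rd1) / 1024.0 / INTERVAL,
                                               (wr2 - wr1) / 1024.0 / INTERVAL))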
There is also a tapestat utility in the sysstat package that can be used to summarize these statistics:
$ tapestat -z 1
Linux 2.6.32-642.1.1.el6.x86_64 (wolfie) 10/10/2016 _x86_64_ (24 CPU)
Tape:     r/s     w/s   kB_read/s   kB_wrtn/s   %Rd   %Wr   %Oa   Rs/s   Ot/s
st0         0     370           0       94899     0    22    22      0      0
st1         0     367           0       93971     0    18    19      0      0
st2         0     315           0       80885     0    19    19      0      0
st3         0      27           0        6979     0     1     1      0      0
Tape:     r/s     w/s   kB_read/s   kB_wrtn/s   %Rd   %Wr   %Oa   Rs/s   Ot/s
st0         0     648           0      165888     0    30    30      0      0
st2         0     362           0       92928     0    17    17      0      0
This is a useful addition and I no longer have to worry about systemtap croaking when I’m tracking down issues.
The gmetad process on my Ganglia server has been a bit finicky lately. Periodically it segfaults, which prevents new metrics from making their way into the RRD databases it manages:
[14745149.528104] gmetad[24286]: segfault at 0 ip 00007fb498c413c1 sp 00007fb48db40358 error 4 in libc-2.17.so[7fb498ade000+1b7000]
Luckily, the gmetad service runs under systemd, which provides a Restart directive to revive failed processes. You can take advantage of this nifty feature by adding "Restart=always" to your unit files:
$ cat /usr/lib/systemd/system/gmetad.service
[Unit]
Description=Ganglia Meta Daemon
After=network.target
[Service]
Restart=always
ExecStart=/usr/sbin/gmetad -d 1
[Install]
WantedBy=multi-user.target
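After adding the Restart directive you need to tell systemd to re-read its configuration and restart the service so the change takes effect:
$ systemctl daemon-reload
$ systemctl restart gmetad.service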
Now each time gmetad pukes, systemd will automatically restart it. Hopefully I will get some time in the next few weeks to go through the core file and see why it keeps puking. Until then, this band-aid should work rather nicely.
This past weekend I spent a good deal of time playing with riemann. Riemann is a powerful stream processor and I'm hoping to use it to correlate and analyze metrics from disparate data sources. After downloading and installing it I received the following error:
$ ./riemann
ERROR [2016-10-10 12:21:36,614] main - riemann.bin - Couldn't start
java.util.concurrent.ExecutionException: java.lang.UnsatisfiedLinkError:
/tmp/libnetty-transport-native-epoll3719688563306389605.so:
/tmp/libnetty-transport-native-epoll3719688563306389605.so:
failed to map segment from shared object
Not exactly what I was expecting on our first date, but I guess riemann plays hard to get. :) To make a little more sense out of this error I fired up strace to retrieve the errno value and see which system call was failing:
$ strace -f ./riemann
.....
[pid 9826] open("/tmp/libnetty-transport-native-epoll27354233456383270.so", O_RDONLY|O_CLOEXEC) = 45
[pid 9826] read(45, "\177ELF"..., 832) = 832
[pid 9826] fstat(45, {st_mode=S_IFREG|0644, st_size=63749, ...}) = 0
[pid 9826] mmap(NULL, 2146168, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 45, 0) = -1 EPERM (Operation not permitted)
.....
Now that hits the spot! mmap() is returning EPERM, which the mmap(2) manual page describes as:
“EPERM The prot argument asks for PROT_EXEC but the mapped area belongs to a file on a filesystem that was mounted no-exec.”
One problem solved! The /tmp file system is mounted with the noexec flag, which prevents mmap() from mapping the file with PROT_EXEC set. Here's how /tmp is mounted:
$ mount | grep "/tmp"
tmpfs on /tmp type tmpfs (rw,noexec)
I’m not one to disable security measures so I went looking for a workaround. After reading through the netty source code for 20 minutes I came across this nugget:
NativeLibraryLoader.java
f = toDirectory(SystemPropertyUtil.get("java.io.tmpdir"));
if (f != null) {
    logger.debug("-Dio.netty.tmpdir: " + f + " (java.io.tmpdir)");
    return f;
}
Netty uses the java.io.tmpdir property to craft the temporary file location. Further digging showed that riemann passes properties to the Java runtime through the EXTRA_JAVA_OPTS variable:
exec java $EXTRA_JAVA_OPTS $OPTS -cp "$JAR" riemann.bin "$COMMAND" "$CONFIG"
Properties can be passed to the Java runtime with the "-D" option, so to make riemann happy I set the EXTRA_JAVA_OPTS environment variable to "-Djava.io.tmpdir=/path/to/new/tmpdir" and fired off riemann:
$ export EXTRA_JAVA_OPTS="-Djava.io.tmpdir=/sfw/riemann/latest/tmp"
$ cd $RIEMANN_HOME && ./riemann
Bingo! Riemann started and I was able to start working with it. Clojure takes me back to my Lisp programming days in college. Glad vim has () matching built in! :)
This post is added as a reference for myself.
I've been reading a lot about writing clean, readable and performant Python code. One feature that I now adore is generators. In the words of Python guru David Beazley, a generator is:
“a function that produces a sequence of results instead of a single value”
David's presentation is an incredible overview of this amazing feature and I've been taking advantage of it whenever I can. Here is one case where I was able to use it today:
#!/usr/bin/env python
import os
import re
SYSFS_TAPE_PATH = "/sys/class/scsi_tape/"
def find_drives():
    """
    Yield the tape drives present on the system
    """
    for drive in os.listdir(SYSFS_TAPE_PATH):
        if re.match(r'^nst[0-9]+$', drive):
            yield drive
for drive in find_drives():
    print(drive)
Running the script prints all of the tape drives on the system:
$ ./generator
nst0
nst1
nst2
nst3
In the past I would have created a list and returned it after processing all of the drives in the scsi_tape directory. With generators that is no longer necessary, and I find the code much more readable. A huge thanks to David for the great presentation! Hopefully one day I will get to take one of his awesome Python programming classes.
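For comparison, here is roughly what the pre-generator version of find_drives() would have looked like, building up a list and returning it at the end (an illustrative sketch that reuses the imports and SYSFS_TAPE_PATH constant from the script above, not the original code):
def find_drives():
    """
    Return a list of the tape drives present on the system
    """
    drives = []
    for drive in os.listdir(SYSFS_TAPE_PATH):
        if re.match(r'^nst[0-9]+$', drive):
            drives.append(drive)
    return drives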