How the docker container creation process works (from docker run to runc)


Over the past few months I’ve been investing a good bit of personal time studying how Linux containers work. Specifically, what does docker run actually do. In this post I’m going to walk through what I’ve observed and try to demystify how all the pieces fit togther. To start our adventure I’m going to create an alpine container with docker run:

$ docker run -i -t --name alpine alpine ash

This container will be used in the output below. When the docker run command is invoked it parses the options passed on the command line and creates a JSON object to represent the object it wants docker to create. The object is then sent to the docker daemon through the /var/run/docker.sock UNIX domain socket. We can use the strace utility to observe the API calls:

$ strace -s 8192 -e trace=read,write -f docker run -d alpine

[pid 13446] write(3, "GET /_ping HTTP/1.1\r\nHost: docker\r\nUser-Agent: Docker-Client/1.13.1 (linux)\r\n\r\n", 79) = 79
[pid 13442] read(3, "HTTP/1.1 200 OK\r\nApi-Version: 1.26\r\nDocker-Experimental: false\r\nServer: Docker/1.13.1 (linux)\r\nDate: Mon, 19 Feb 2018 16:12:32 GMT\r\nContent-Length: 2\r\nContent-Type: text/plain; charset=utf-8\r\n\r\nOK", 4096) = 196
[pid 13442] write(3, "POST /v1.26/containers/create HTTP/1.1\r\nHost: docker\r\nUser-Agent: Docker-Client/1.13.1 (linux)\r\nContent-Length: 1404\r\nContent-Type: application/json\r\n\r\n{\"Hostname\":\"\",\"Domainname\":\"\",\"User\":\"\",\"AttachStdin\":false,\"AttachStdout\":false,\"AttachStderr\":false,\"Tty\":false,\"OpenStdin\":false,\"StdinOnce\":false,\"Env\":[],\"Cmd\":null,\"Image\":\"alpine\",\"Volumes\":{},\"WorkingDir\":\"\",\"Entrypoint\":null,\"OnBuild\":null,\"Labels\":{},\"HostConfig\":{\"Binds\":null,\"ContainerIDFile\":\"\",\"LogConfig\":{\"Type\":\"\",\"Config\":{}},\"NetworkMode\":\"default\",\"PortBindings\":{},\"RestartPolicy\":{\"Name\":\"no\",\"MaximumRetryCount\":0},\"AutoRemove\":false,\"VolumeDriver\":\"\",\"VolumesFrom\":null,\"CapAdd\":null,\"CapDrop\":null,\"Dns\":[],\"DnsOptions\":[],\"DnsSearch\":[],\"ExtraHosts\":null,\"GroupAdd\":null,\"IpcMode\":\"\",\"Cgroup\":\"\",\"Links\":null,\"OomScoreAdj\":0,\"PidMode\":\"\",\"Privileged\":false,\"PublishAllPorts\":false,\"ReadonlyRootfs\":false,\"SecurityOpt\":null,\"UTSMode\":\"\",\"UsernsMode\":\"\",\"ShmSize\":0,\"ConsoleSize\":[0,0],\"Isolation\":\"\",\"CpuShares\":0,\"Memory\":0,\"NanoCpus\":0,\"CgroupParent\":\"\",\"BlkioWeight\":0,\"BlkioWeightDevice\":null,\"BlkioDeviceReadBps\":null,\"BlkioDeviceWriteBps\":null,\"BlkioDeviceReadIOps\":null,\"BlkioDeviceWriteIOps\":null,\"CpuPeriod\":0,\"CpuQuota\":0,\"CpuRealtimePeriod\":0,\"CpuRealtimeRuntime\":0,\"CpusetCpus\":\"\",\"CpusetMems\":\"\",\"Devices\":[],\"DiskQuota\":0,\"KernelMemory\":0,\"MemoryReservation\":0,\"MemorySwap\":0,\"MemorySwappiness\":-1,\"OomKillDisable\":false,\"PidsLimit\":0,\"Ulimits\":null,\"CpuCount\":0,\"CpuPercent\":0,\"IOMaximumIOps\":0,\"IOMaximumBandwidth\":0},\"NetworkingConfig\":{\"EndpointsConfig\":{}}}\n", 1556) = 1556
[pid 13442] read(3, "HTTP/1.1 201 Created\r\nApi-Version: 1.26\r\nContent-Type: application/json\r\nDocker-Experimental: false\r\nServer: Docker/1.13.1 (linux)\r\nDate: Mon, 19 Feb 2018 16:12:32 GMT\r\nContent-Length: 90\r\n\r\n{\"Id\":\"b70b57c5ae3e25585edba898ac860e388582391907be4070f91eb49f4db5c433\",\"Warnings\":null}\n", 4096) = 281

Now here is were the real fun begins. Once the docker daemon receives the request it will parse the output and contact containerd via the gRPC API to set up the container runtime using the options passed on the command line. We can use the ctr utility to observe this interaction:

$ ctr --address "unix:///run/containerd.sock" events

TIME                           TYPE                           ID                             PID                            STATUS
time="2018-02-19T12:10:07.658081859-05:00" level=debug msg="Calling POST /v1.26/containers/create" 
time="2018-02-19T12:10:07.676706130-05:00" level=debug msg="container mounted via layerStore: /var/lib/docker/overlay2/2beda8ac904f4a2531d72e1e3910babf145c6e68dfd02008c58786adb254f9dc/merged" 
time="2018-02-19T12:10:07.682430843-05:00" level=debug msg="Calling POST /v1.26/containers/d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f/attach?stderr=1&stdin=1&stdout=1&stream=1" 
time="2018-02-19T12:10:07.683638676-05:00" level=debug msg="Calling GET /v1.26/events?filters=%7B%22container%22%3A%7B%22d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f%22%3Atrue%7D%2C%22type%22%3A%7B%22container%22%3Atrue%7D%7D" 
time="2018-02-19T12:10:07.684447919-05:00" level=debug msg="Calling POST /v1.26/containers/d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f/start" 
time="2018-02-19T12:10:07.687230717-05:00" level=debug msg="container mounted via layerStore: /var/lib/docker/overlay2/2beda8ac904f4a2531d72e1e3910babf145c6e68dfd02008c58786adb254f9dc/merged" 
time="2018-02-19T12:10:07.885362059-05:00" level=debug msg="sandbox set key processing took 11.824662ms for container d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f" 
time="2018-02-19T12:10:07.927897701-05:00" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"start-container\", Id:\"d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f\", Status:0x0, Pid:\"\", Timestamp:(*timestamp.Timestamp)(0xc420bacdd0)}" 
2018-02-19T17:10:07.927795344Z start-container                d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f                                0
time="2018-02-19T12:10:07.930283397-05:00" level=debug msg="libcontainerd: event unhandled: type:\"start-container\" id:\"d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f\" timestamp:<seconds:1519060207 nanos:927795344 > " 
time="2018-02-19T12:10:07.930874606-05:00" level=debug msg="Calling POST /v1.26/containers/d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f/resize?h=35&w=115" 

Setting up the container runtime is a pretty substantial undertaking. Namespaces need to be configured, the Image needs to be mounted, security controls (app armor profiles, seccomp profiles, capabilities) need to be enabled, etc , etc. You can get a pretty good idea of everything that is required to set up the runtime by reviewing the output of docker inspect containerid and the config.json runtime specification file (more on that in a moment).

Containerd doesn’t actually create the container runtime. It sets up the environment and then invokes containerd-shim to start the container runtime via the configured OCI runtime (controlled with the containerd “–runtime” option) . For most modern systems the container runtime is based on runc. We can see this first hand with the pstree utility:

$ pstree -l -p -s -T

systemd,1 --switched-root --system --deserialize 24
  ├─docker-containe,19606 --listen unix:///run/containerd.sock --shim /usr/libexec/docker/docker-containerd-shim-current --start-timeout 2m --debug
  │   ├─docker-containe,19834 93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /var/run/docker/libcontainerd/93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /usr/libexec/docker/docker-runc-current

Since pstree truncates the process name we can verify the PIDs with ps:

$ ps auxwww | grep [1]9606

root     19606  0.0  0.2 685636 10632 ?        Ssl  13:01   0:00 /usr/libexec/docker/docker-containerd-current --listen unix:///run/containerd.sock --shim /usr/libexec/docker/docker-containerd-shim-current --start-timeout 2m --debug

$ ps auxwww | grep [1]9834

root     19834  0.0  0.0 527748  3020 ?        Sl   13:01   0:00 /usr/libexec/docker/docker-containerd-shim-current 93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /var/run/docker/libcontainerd/93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /usr/libexec/docker/docker-runc-current

When I first started researching the interaction between dockerd, containerd and the shim I wasn’t real sure what purpose the shim served. Luckily Google took me to a great write up by Michael Crosby. The shim serves a couple of purposes:

  1. It allows you to run daemonless containers.
  2. STDIO and other FDs are kept open in the event that containerd and docker die.
  3. Reports the containers exit status to containerd.

The first and second bullet points are super important. These features allows the container to be decoupled from the docker daemon allowing dockerd to be upgraded or restarted w/o impacting the running containers. Nifty! I mentioned that the shim is responsible for kicking off runc to actually run the container. Runc needs two things to do its job: a specification file and a path to a root file system image (the combination of the two is referred to as a bundle). To see how this works we can create a rootfs by exporting the alpine docker image:

$ mkdir -p alpine/rootfs

$ cd alpine

$ docker export d1a6d87886e2 | tar -C rootfs -xvf -

time="2018-02-19T12:54:13.082321231-05:00" level=debug msg="Calling GET /v1.26/containers/d1a6d87886e2/export" 
.dockerenv
bin/
bin/ash
bin/base64
bin/bbconfig
.....

The export option takes a container if which you can find in the docker ps -a output. To generate a specificationfile you can use the runc spec command:

$ runc spec

This will create a specification file named config.json in your current directory. This file can be customized to suit your needs and requirements. Once you are happy with the file you can run runc with the rootfs directory as its sole argument (the container configuration will be read from the file config.json file):

$ runc run rootfs

This simple example will spawn an alpine ash shell:

$ runc run rootfs

/ # cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.7.0
PRETTY_NAME="Alpine Linux v3.7"
HOME_URL="http://alpinelinux.org"
BUG_REPORT_URL="http://bugs.alpinelinux.org"

Being able to create containers and play with the runc runtime specification is incredibly powerful. You can evaluate different apparmor profiles, test out Linux capabilities and play around with every facet of the container runtime environment without needing to install docker. I just barely scratched the surface here and would highly recommend reading through the runc and containerd documentation. Super cool stuff!

This article was posted by Matty on 2018-02-19 09:42:34 -0500 EST