How the docker container creation process works (from docker run to runc)

This article was posted by Matty on 2018-02-19 09:42:34 -0500

Over the past few months I’ve been investing a good bit of personal time studying how Linux containers work. Specifically, I wanted to know what docker run actually does. In this post I’m going to walk through what I’ve observed and try to demystify how all the pieces fit together. To start our adventure I’m going to create an alpine container with docker run:

$ docker run -i -t --name alpine alpine ash

This container will be used in the output below. When the docker run command is invoked it parses the options passed on the command line and creates a JSON object describing the container it wants the docker daemon to create. The object is then sent to the docker daemon through the /var/run/docker.sock UNIX domain socket. We can use the strace utility to observe the API calls:

$ strace -s 8192 -e trace=read,write -f docker run -d alpine

[pid 13446] write(3, "GET /_ping HTTP/1.1\r\nHost: docker\r\nUser-Agent: Docker-Client/1.13.1 (linux)\r\n\r\n", 79) = 79
[pid 13442] read(3, "HTTP/1.1 200 OK\r\nApi-Version: 1.26\r\nDocker-Experimental: false\r\nServer: Docker/1.13.1 (linux)\r\nDate: Mon, 19 Feb 2018 16:12:32 GMT\r\nContent-Length: 2\r\nContent-Type: text/plain; charset=utf-8\r\n\r\nOK", 4096) = 196
[pid 13442] write(3, "POST /v1.26/containers/create HTTP/1.1\r\nHost: docker\r\nUser-Agent: Docker-Client/1.13.1 (linux)\r\nContent-Length: 1404\r\nContent-Type: application/json\r\n\r\n{\"Hostname\":\"\",\"Domainname\":\"\",\"User\":\"\",\"AttachStdin\":false,\"AttachStdout\":false,\"AttachStderr\":false,\"Tty\":false,\"OpenStdin\":false,\"StdinOnce\":false,\"Env\":[],\"Cmd\":null,\"Image\":\"alpine\",\"Volumes\":{},\"WorkingDir\":\"\",\"Entrypoint\":null,\"OnBuild\":null,\"Labels\":{},\"HostConfig\":{\"Binds\":null,\"ContainerIDFile\":\"\",\"LogConfig\":{\"Type\":\"\",\"Config\":{}},\"NetworkMode\":\"default\",\"PortBindings\":{},\"RestartPolicy\":{\"Name\":\"no\",\"MaximumRetryCount\":0},\"AutoRemove\":false,\"VolumeDriver\":\"\",\"VolumesFrom\":null,\"CapAdd\":null,\"CapDrop\":null,\"Dns\":[],\"DnsOptions\":[],\"DnsSearch\":[],\"ExtraHosts\":null,\"GroupAdd\":null,\"IpcMode\":\"\",\"Cgroup\":\"\",\"Links\":null,\"OomScoreAdj\":0,\"PidMode\":\"\",\"Privileged\":false,\"PublishAllPorts\":false,\"ReadonlyRootfs\":false,\"SecurityOpt\":null,\"UTSMode\":\"\",\"UsernsMode\":\"\",\"ShmSize\":0,\"ConsoleSize\":[0,0],\"Isolation\":\"\",\"CpuShares\":0,\"Memory\":0,\"NanoCpus\":0,\"CgroupParent\":\"\",\"BlkioWeight\":0,\"BlkioWeightDevice\":null,\"BlkioDeviceReadBps\":null,\"BlkioDeviceWriteBps\":null,\"BlkioDeviceReadIOps\":null,\"BlkioDeviceWriteIOps\":null,\"CpuPeriod\":0,\"CpuQuota\":0,\"CpuRealtimePeriod\":0,\"CpuRealtimeRuntime\":0,\"CpusetCpus\":\"\",\"CpusetMems\":\"\",\"Devices\":[],\"DiskQuota\":0,\"KernelMemory\":0,\"MemoryReservation\":0,\"MemorySwap\":0,\"MemorySwappiness\":-1,\"OomKillDisable\":false,\"PidsLimit\":0,\"Ulimits\":null,\"CpuCount\":0,\"CpuPercent\":0,\"IOMaximumIOps\":0,\"IOMaximumBandwidth\":0},\"NetworkingConfig\":{\"EndpointsConfig\":{}}}\n", 1556) = 1556
[pid 13442] read(3, "HTTP/1.1 201 Created\r\nApi-Version: 1.26\r\nContent-Type: application/json\r\nDocker-Experimental: false\r\nServer: Docker/1.13.1 (linux)\r\nDate: Mon, 19 Feb 2018 16:12:32 GMT\r\nContent-Length: 90\r\n\r\n{\"Id\":\"b70b57c5ae3e25585edba898ac860e388582391907be4070f91eb49f4db5c433\",\"Warnings\":null}\n", 4096) = 281
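To make the exchange above concrete, here is a minimal Python sketch that builds the same kind of create request by hand and can write it to the daemon's socket. It is not how the docker client is implemented. The function names are made up for illustration, and it assumes the daemon fills in defaults for every field left out of the JSON body.

```python
import json
import socket

DOCKER_SOCKET = "/var/run/docker.sock"  # socket path mentioned in the post
API_VERSION = "v1.26"                   # API version seen in the strace output


def build_create_request(image, cmd=None):
    """Return the raw HTTP bytes for POST /containers/create (hypothetical helper)."""
    # Assumption: omitted fields (HostConfig and friends) get daemon defaults.
    body = json.dumps({"Image": image, "Cmd": cmd}).encode("utf-8")
    head = (
        f"POST /{API_VERSION}/containers/create HTTP/1.1\r\n"
        "Host: docker\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


def send_request(request, path=DOCKER_SOCKET):
    """Write a request to the daemon's UNIX domain socket and return the raw reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(request)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the daemon closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

On a host where you are allowed to open the socket, send_request(build_create_request("alpine", ["ash"])) should come back with a reply similar to the 201 Created response in the trace.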

Now here is where the real fun begins. Once the docker daemon receives the request it will parse it and contact containerd via the gRPC API to set up the container runtime using the options passed on the command line. We can use the ctr utility to observe this interaction:

$ ctr --address "unix:///run/containerd.sock" events

TIME                           TYPE                           ID                             PID                            STATUS
time="2018-02-19T12:10:07.658081859-05:00" level=debug msg="Calling POST /v1.26/containers/create" 
time="2018-02-19T12:10:07.676706130-05:00" level=debug msg="container mounted via layerStore: /var/lib/docker/overlay2/2beda8ac904f4a2531d72e1e3910babf145c6e68dfd02008c58786adb254f9dc/merged" 
time="2018-02-19T12:10:07.682430843-05:00" level=debug msg="Calling POST /v1.26/containers/d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f/attach?stderr=1&stdin=1&stdout=1&stream=1" 
time="2018-02-19T12:10:07.683638676-05:00" level=debug msg="Calling GET /v1.26/events?filters=%7B%22container%22%3A%7B%22d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f%22%3Atrue%7D%2C%22type%22%3A%7B%22container%22%3Atrue%7D%7D" 
time="2018-02-19T12:10:07.684447919-05:00" level=debug msg="Calling POST /v1.26/containers/d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f/start" 
time="2018-02-19T12:10:07.687230717-05:00" level=debug msg="container mounted via layerStore: /var/lib/docker/overlay2/2beda8ac904f4a2531d72e1e3910babf145c6e68dfd02008c58786adb254f9dc/merged" 
time="2018-02-19T12:10:07.885362059-05:00" level=debug msg="sandbox set key processing took 11.824662ms for container d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f" 
time="2018-02-19T12:10:07.927897701-05:00" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"start-container\", Id:\"d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f\", Status:0x0, Pid:\"\", Timestamp:(*timestamp.Timestamp)(0xc420bacdd0)}" 
2018-02-19T17:10:07.927795344Z start-container                d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f                                0
time="2018-02-19T12:10:07.930283397-05:00" level=debug msg="libcontainerd: event unhandled: type:\"start-container\" id:\"d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f\" timestamp:<seconds:1519060207 nanos:927795344 > " 
time="2018-02-19T12:10:07.930874606-05:00" level=debug msg="Calling POST /v1.26/containers/d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f/resize?h=35&w=115"

Setting up the container runtime is a pretty substantial undertaking. Namespaces need to be configured, the image needs to be mounted, security controls (AppArmor profiles, seccomp profiles, capabilities) need to be enabled, and so on. You can get a pretty good idea of everything that is required to set up the runtime by reviewing the output of docker inspect containerid and the config.json runtime specification file (more on that in a moment).
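The docker inspect output is long, so a small filter helps when you only care about the isolation and security settings. The sketch below is one way to do that in Python. The key names are the HostConfig fields visible in the create request earlier in this post, and the function name is hypothetical.

```python
import json

# HostConfig keys that appear in the create request shown earlier in the post.
SECURITY_KEYS = (
    "Privileged", "CapAdd", "CapDrop", "SecurityOpt", "ReadonlyRootfs",
    "PidMode", "IpcMode", "UTSMode", "UsernsMode", "NetworkMode",
)


def security_summary(inspect_json):
    """Map container ID -> selected HostConfig settings from `docker inspect` output."""
    summary = {}
    # docker inspect prints a JSON array with one object per container.
    for container in json.loads(inspect_json):
        host_config = container.get("HostConfig", {})
        summary[container.get("Id", "<unknown>")] = {
            key: host_config.get(key) for key in SECURITY_KEYS
        }
    return summary
```

You could feed it the output of docker inspect alpine, for example via subprocess.check_output(["docker", "inspect", "alpine"]). Keys that are missing from the output show up as None.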

Containerd doesn’t actually create the container runtime. It sets up the environment and then invokes containerd-shim to start the container runtime via the configured OCI runtime (controlled with the containerd “--runtime” option). For most modern systems the container runtime is based on runc. We can see this first hand with the pstree utility:

$ pstree -l -p -s -T

systemd,1 --switched-root --system --deserialize 24
  ├─docker-containe,19606 --listen unix:///run/containerd.sock --shim /usr/libexec/docker/docker-containerd-shim-current --start-timeout 2m --debug
  │   ├─docker-containe,19834 93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /var/run/docker/libcontainerd/93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /usr/libexec/docker/docker-runc-current

Since pstree truncates the process name we can verify the PIDs with ps:

$ ps auxwww | grep [1]9606

root     19606  0.0  0.2 685636 10632 ?        Ssl  13:01   0:00 /usr/libexec/docker/docker-containerd-current --listen unix:///run/containerd.sock --shim /usr/libexec/docker/docker-containerd-shim-current --start-timeout 2m --debug

$ ps auxwww | grep [1]9834

root     19834  0.0  0.0 527748  3020 ?        Sl   13:01   0:00 /usr/libexec/docker/docker-containerd-shim-current 93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /var/run/docker/libcontainerd/93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /usr/libexec/docker/docker-runc-current
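The shim command line in the ps output carries three positional arguments. Reading them as container ID, per-container state directory and runtime binary is my interpretation of the output above, not something taken from the shim's documentation. Under that assumption, a small Python sketch can pull them apart (the names ShimArgs and parse_shim_cmdline are made up):

```python
import shlex
from typing import NamedTuple


class ShimArgs(NamedTuple):
    container_id: str
    bundle_dir: str   # assumed meaning: per-container directory handed to the shim
    runtime: str      # assumed meaning: OCI runtime binary the shim will execute


def parse_shim_cmdline(cmdline):
    """Split a shim command line into the three positional arguments seen above."""
    argv = shlex.split(cmdline)
    if len(argv) != 4:
        raise ValueError(f"expected shim path plus 3 arguments, got {len(argv)} fields")
    _shim_path, container_id, bundle_dir, runtime = argv
    return ShimArgs(container_id, bundle_dir, runtime)
```

Feeding it the command column from the second ps listing returns the container ID, the /var/run/docker/libcontainerd/... directory and the docker-runc-current path as separate fields.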

When I first started researching the interaction between dockerd, containerd and the shim I wasn’t real sure what purpose the shim served. Luckily Google took me to a great write up by Michael Crosby. The shim serves a couple of purposes:

It allows you to run daemonless containers.
STDIO and other FDs are kept open in the event that containerd and docker die.
It reports the container’s exit status to containerd.

The first and second bullet points are super important. These features allow the container to be decoupled from the docker daemon, so dockerd can be upgraded or restarted without impacting the running containers. Nifty!

I mentioned that the shim is responsible for kicking off runc to actually run the container. Runc needs two things to do its job: a specification file and a path to a root file system (the combination of the two is referred to as a bundle). To see how this works we can create a rootfs by exporting the file system of the alpine container:

$ mkdir -p alpine/rootfs

$ cd alpine

$ docker export d1a6d87886e2 | tar -C rootfs -xvf -

time="2018-02-19T12:54:13.082321231-05:00" level=debug msg="Calling GET /v1.26/containers/d1a6d87886e2/export" 
.dockerenv
bin/
bin/ash
bin/base64
bin/bbconfig
.....

The export option takes a container ID, which you can find in the docker ps -a output. To generate a specification file you can use the runc spec command:

$ runc spec

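This will create a specification file named config.json in your current directory. This file can be customized to suit your needs and requirements. Once you are happy with the file you can start the container with runc run, which takes a container name as its sole argument (I reused the name rootfs here). The container configuration is read from the config.json file in the current directory, and the root.path setting in that file, which runc spec sets to rootfs by default, tells runc where the root file system lives: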

$ runc run rootfs

This simple example will spawn an alpine ash shell:

$ runc run rootfs

/ # cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.7.0
PRETTY_NAME="Alpine Linux v3.7"
HOME_URL="http://alpinelinux.org"
BUG_REPORT_URL="http://bugs.alpinelinux.org"
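Customizing config.json by hand works fine, but the edits are easy to script. The sketch below changes the command the container runs and toggles the read-only root file system. It assumes the process.args and root.readonly fields as I understand them from the OCI runtime specification, so check the names against the file runc spec generated for you. Both function names are hypothetical.

```python
import copy
import json


def customize_spec(spec, args, readonly=True):
    """Return a copy of an OCI runtime spec with a new command and rootfs mode."""
    new_spec = copy.deepcopy(spec)  # leave the caller's dictionary untouched
    new_spec.setdefault("process", {})["args"] = list(args)
    new_spec.setdefault("root", {})["readonly"] = readonly
    return new_spec


def rewrite_config(path, args, readonly=True):
    """Apply customize_spec() to a config.json file in place."""
    with open(path) as handle:
        spec = json.load(handle)
    with open(path, "w") as handle:
        json.dump(customize_spec(spec, args, readonly), handle, indent=4)
```

For example, rewrite_config("config.json", ["ash", "-c", "cat /etc/os-release"]) would make the next runc run print the release file instead of dropping into a shell.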

Being able to create containers and play with the runc runtime specification is incredibly powerful. You can evaluate different AppArmor profiles, test out Linux capabilities and play around with every facet of the container runtime environment without needing to install docker. I just barely scratched the surface here and would highly recommend reading through the runc and containerd documentation. Super cool stuff!
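As one example of testing Linux capabilities, the following Python sketch removes a single capability from every capability set in a loaded config.json. It assumes the sets live under process.capabilities as lists of CAP_* strings, which matches my reading of the OCI runtime specification. The helper name is made up.

```python
import copy


def drop_capability(spec, capability):
    """Return a copy of the spec with `capability` removed from every capability set."""
    new_spec = copy.deepcopy(spec)
    # Assumed layout: {"process": {"capabilities": {"bounding": [...], ...}}}
    cap_sets = new_spec.get("process", {}).get("capabilities", {})
    for name, caps in cap_sets.items():
        cap_sets[name] = [cap for cap in caps if cap != capability]
    return new_spec
```

Load config.json with json.load, pass the result through drop_capability(spec, "CAP_NET_BIND_SERVICE"), write it back and start the container again to see how the workload behaves without that capability.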

Exporting Docker Images For Analysis

This article was posted by Matty on 2018-02-17 10:55:17 -0500

I was recently looking into a docker problem and wanted to see the actual contents of the image that was having issues. Typically you can grab the Dockerfile that was used to build the image and use docker build to recreate it. I wasn’t able to locate the Dockerfile for this image so I decided to extract the image to a local file system for analysis. This was super easy to do with the docker export command.

To illustrate how this is done, let’s say you want to extract the contents of the Kubernetes heapster image to a directory named rootfs. If you create the rootfs directory, run docker export with the container ID (you can get this from docker inspect and docker ps) and pipe that to tar, the rootfs directory will be populated with the image contents once the command completes:

$ docker export $(docker ps -a | awk '/heapster/ && !/POD/ {print $1}') | tar -C rootfs -xvf -

.dockerenv
dev/
dev/console
dev/pts/
dev/shm/
etc/
etc/hostname
etc/hosts
etc/mtab
etc/resolv.conf
etc/ssl/
etc/ssl/certs/
etc/ssl/certs/ca-certificates.crt
eventer
heapster
proc/
run/
run/secrets/
sys/
var/
var/run/
var/run/secrets/
var/run/secrets/kubernetes.io/
var/run/secrets/kubernetes.io/serviceaccount/

This has also been super useful for studying how runc works under the covers.
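If you only want a quick overview of what an export contains, you don't have to unpack it. Here is a minimal Python sketch that counts the archive members under each top-level name. It assumes you have the bytes written by docker export in memory, and the function name is hypothetical.

```python
import io
import tarfile
from collections import Counter


def top_level_counts(tar_bytes):
    """Count archive members under each top-level name of an exported tarball."""
    counts = Counter()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as archive:
        for member in archive:
            name = member.name
            if name.startswith("./"):  # some tar writers prefix members with ./
                name = name[2:]
            counts[name.split("/", 1)[0]] += 1
    return dict(counts)
```

The bytes can come from subprocess.run(["docker", "export", container_id], stdout=subprocess.PIPE, check=True).stdout. This holds the whole archive in memory, so it is only a reasonable approach for small images.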

Notes from episode 26 of TGIK: Helm Yeah!

This article was posted by Matty on 2018-02-17 08:00:00 -0500

Over the past few months I’ve been trying to learn everything there is to know about Kubernetes. Kubernetes is an amazing technology for deploying and scaling containers, though it comes at a cost. It’s an incredibly complex piece of software and there are a ton of bells and whistles to become familiar with. One way that I’ve found for coming up to speed is Joe Beda’s weekly TGIK live broadcast. This occurs each Friday at 4PM EST and is CHOCK full of fantastic information. In episode twenty-six Joe handed the microphone to Kris Nova to discuss helm.

Here are some of my takeaways from this episode:

The kooper project is a Go library for creating Kubernetes operators and controllers.
You can watch Kubernetes build events on the prow website.
Helm is a tool for managing pre-configured Kubernetes resources. These resources are packaged into charts.
GitHub has a list of the available charts.
There are two components to helm:
- The server which is referred to as tiller.
- The helm client.
Helm installation instructions:
- Download the helm client to your admin station(s).
- Copy it to a known location.
- Run helm init to initialize the client and server.
- Run helm repo update to update your local repository.
You can list the built-in helm repositories with the repo command:
$ helm repo list
You can update repository information with the repo command:
$ helm repo update
The helm search command can be used to find charts:
$ helm search wordpress
The install command can be used to install a chart:
$ helm install stable/wordpress --name foo --set wordpressUsername=foo,wordpressPassword=1234
The draft project can be used to deploy applications to Kubernetes. Draft uses helm behind the covers.
Charts consist of one or more YAML files to define the resources (chartname/templates) and their configuration (chartname/values.yaml).
Helm utilizes the built-in Go template language and extends it further via the sprig project.
Helm installs with complete global cluster access by default. Be careful!
Tiller uses ConfigMaps to store release information.
RBAC support via helm init is being discussed at the Helm summit. You currently need to add it manually.
Bitnami provided a great write up on the security concerns surrounding helm.
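The helm install example in the notes above passes chart values as a comma-separated --set string. When the values come from a script, it can be handy to build that string from a dictionary. The sketch below does that in Python. The helper names are made up, and escaping literal commas with a backslash reflects my understanding of the --set syntax, so verify it against the Helm documentation for your version.

```python
def build_set_arg(values):
    """Join key/value pairs into the comma-separated form passed to --set."""
    pairs = []
    for key, value in values.items():
        # Commas separate pairs, so a literal comma inside a value is escaped.
        escaped = str(value).replace(",", "\\,")
        pairs.append(f"{key}={escaped}")
    return ",".join(pairs)


def helm_install_command(chart, release, values):
    """Build the argv for the install command form shown in the notes."""
    return ["helm", "install", chart, "--name", release, "--set", build_set_arg(values)]
```

Calling helm_install_command("stable/wordpress", "foo", {"wordpressUsername": "foo", "wordpressPassword": 1234}) reproduces the argument list of the command from the notes, which you could hand to subprocess.run.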

Things I need to learn more about:

Need to study how helm fits into CI/CD pipelines.
Need to create a chart from scratch to see what is involved.
I am petrified of public container images and it looks like helm makes heavy use of them. Need to study the security implications associated with using helm.

Notes from episode 11 of TGIK: Upgrading to 1.8 with kubeadm

This article was posted by Matty on 2018-02-16 16:00:00 -0500

Over the past few months I’ve been trying to learn everything there is to know about Kubernetes. Kubernetes is an amazing technology for deploying and scaling containers, though it comes at a cost. It’s an incredibly complex piece of software and there are a ton of bells and whistles to become familiar with. One way that I’ve found for coming up to speed is Joe Beda’s weekly TGIK live broadcast. This occurs each Friday at 4PM EST and is CHOCK full of fantastic information. In episode eleven Joe talks about upgrading with kubeadm.

Here are some of my takeaways from the episode:

The release notes describe new features and gotchas related to the upgrade process.
APIs can change and move over time. Important to review the release notes and test before upgrading.
Kubernetes versions its APIs at three stability levels:
- Alpha level - Work in progress w/o any guarantees.
- Beta level - Well tested, features won’t be dropped but details may change.
- Stable level - Production ready and will be available for the foreseeable future.
Newer releases of Kubernetes will expire the bootstrap tokens after 24 hours.
Self hosting is the term used when Kubernetes manages Kubernetes.
The kubeadm deployment manifests are stored in /etc/kubernetes/manifests.
The kubelet picks up changes to files in /etc/kubernetes/manifests and uses these to bootstrap the control plane. The API server isn’t contacted when static pod definitions are used.
Kubelet checkpointing (once it’s released) will allow the current cluster state to be saved. If a cluster goes dark and comes back these checkpoints will be used to restore the cluster to a known state.
You can find the control plane versions from the image tags:
$ kubectl get pods -n kube-system kube-controller-manager-kub1 -o yaml | grep image:
Here is the actual upgrade process:
- Review the release notes for Kubernetes and your network solution.
- Plan the upgrade with kubeadm upgrade plan.
- Perform the upgrade with kubeadm upgrade apply VERSION_TO_UPGRADE_2.
- Upgrade kubelet on each worker.
- Upgrade the kubectl utility.
The kubectl version command shows the API server version:
$ kubectl version --short
SIG cluster lifecycle is where kubeadm changes are discussed.
If you are upgrading from an older version you may need to pass the flags you passed to kubeadm init to the upgrade command. Newer releases store this information in a ConfigMap.
You can view the upgrade ConfigMap with kubectl:
$ kubectl get configmaps -n kube-system kubeadm-config -o yaml
The kube-proxy runs as a DaemonSet across all of your nodes.
Docker is split out into three pieces:
- dockerd is the daemon responsible for managing images and containers.
- docker-containerd provides the actual abstractions to manage containers and container networking.
- docker-containerd-shim provides the actual runtime (typically via runc) to run a container.
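The three API stability levels listed in the notes above show up in the apiVersion string of every manifest, which makes it easy to scan your YAML for alpha or beta APIs before an upgrade. Here is a minimal Python sketch of that check. It assumes version names follow the vN, vNalphaM and vNbetaM pattern, and the function name is hypothetical.

```python
import re

# Matches v1, v1alpha1, v2beta3 and similar version names.
_VERSION_RE = re.compile(r"^v(\d+)(?:(alpha|beta)(\d+))?$")


def api_stability(api_version):
    """Classify an apiVersion such as 'apps/v1beta1' as alpha, beta or stable."""
    version = api_version.rsplit("/", 1)[-1]  # drop the API group, if any
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized API version: {api_version!r}")
    return match.group(2) or "stable"
```

Running it over the apiVersion field of each manifest gives a quick list of objects that deserve a closer look in the release notes.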

Things I need to learn more about:

Play with kubeadm HA.
Keep reading through the API documentation.
Read the container runtime specification.

Notes from episode 10 of TGIK: Ingress with TLS

This article was posted by Matty on 2018-02-16 15:00:00 -0500

Over the past few months I’ve been trying to learn everything there is to know about Kubernetes. Kubernetes is an amazing technology for deploying and scaling containers, though it comes at a cost. It’s an incredibly complex piece of software and there are a ton of bells and whistles to become familiar with. One way that I’ve found for coming up to speed is Joe Beda’s weekly TGIK live broadcast. This occurs each Friday at 4PM EST and is CHOCK full of fantastic information. In episode ten Joe discusses Ingress and kube-lego.

Here are some of my takeaways from the episode:

Ingress is an API object that manages external access to one or more services in a cluster.
API groups are used to break up the API into small units. These units can be extended, deprecated, etc. without impacting other APIs.
Pod priorities allow you to ensure that pods continue to run at the expense of lower priority pods. Priorities also impact the scheduling order.
GeoDNS allows you to deliver an answer to a DNS query that takes the user’s location into account.
A service type of LoadBalancer is used to configure an external load balancer (e.g., AWS ELB).
Several ingress controllers are currently available:
- Nginx
- HAProxy
- Heptio’s contour controller
- Traefik
- Linkerd
- Several others
Ingress controllers take a set of Ingress objects and use them to configure the underlying proxy (e.g., nginx). The Ingress controller runs separately from kube-controller-manager.
You can show labels with the “--show-labels” option:
$ kubectl get nodes -o wide --show-labels
Ingress controllers use config maps to publish status. Controllers also perform a leader election so only one controller updates the status.
Ingress controllers are configured with the service name to forward requests to.
Ingress rules contain the criteria (e.g., hostname of prefetch.net, URI, etc.) used to forward requests.
The nginx Ingress controller reads the Ingress objects, translates them to nginx configuration files and then reloads nginx to pick up the changes.
The nginx Ingress controller is also responsible for starting the nginx daemon processes.
You can observe the status of an ingress controller with the logs command:
$ kubectl logs -n kube-system ingress-controller-xxx-xxxx
The nginx configuration is part of the GitHub ingress-nginx project. Sample configurations can be retrieved from the deploy directory.
The Ingress object is baked into Kubernetes.
Ingress works across all namespaces.
The kube-lego project (now known as cert-manager) can be used to automate the installation and renewal of Let’s Encrypt certificates.
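To picture how the Ingress rules from the notes above pick a backend, here is a deliberately simplified Python model: match on the hostname, then choose the rule with the longest matching path prefix. This is a sketch of the idea only, not the algorithm any particular ingress controller uses. The rule layout and the service names are invented for illustration.

```python
def route(rules, host, path):
    """Return the service for the longest matching path prefix, or None."""
    best = None
    for rule in rules:
        if rule["host"] != host:
            continue  # the hostname must match exactly in this simplified model
        if not path.startswith(rule["path"]):
            continue
        if best is None or len(rule["path"]) > len(best["path"]):
            best = rule
    return best["service"] if best else None


# Hypothetical rules: one site with a separate backend for /api.
RULES = [
    {"host": "prefetch.net", "path": "/", "service": "blog"},
    {"host": "prefetch.net", "path": "/api", "service": "api"},
]
```

With these rules a request for prefetch.net/api/v1 goes to the api service, prefetch.net/index.html goes to blog, and a request for any other hostname matches nothing.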

Things I need to learn more about:

Set up an nginx ingress controller.
Study the objects that are created and how the controller interacts with services.
Play around with cert-manager to provision Let’s Encrypt certificates.