You've heard about systemd, right? The init system written (not only) by the famous Lennart Poettering, which is trying to eat all those nice ancient tools and system parts (just kidding 😉), but which also makes the lives of many system administrators waaaaay easier. You've probably also heard about Docker, the tool which currently leads the world of Linux containers (and will probably soon do the same in the Windows world). So, where do these two projects meet?
Scott Collier asked me for a status update on where we are with "running systemd in Docker containers" in Fedora, and to let him know how to actually make it work. If you've searched Google before, you've probably found Dan Walsh's article about running systemd inside a container. Some things have changed, some (hopefully) will change sooner or later, but I can tell you that it's now quite easy to run services in a Docker container under systemd.
First things first. To be able to run systemd we need a few things: a cgroup tree, /run and /tmp as mount points (preferably on tmpfs), the environment variable container set to "docker", no fstab and no mount units, and a slightly tweaked dbus.service. That's it. Some of them are on you; we took care of the rest in the Fedora base images.
Let's start with the cgroup tree. It's simple: systemd touches /sys/fs/cgroup and expects it to be populated. As the kernel won't populate cgroups in a container, we need to mount the tree from the host. Easy, right? Sadly, this cannot be done automatically, as Docker tries to stay above such distribution-specific modifications (which is good... mostly). So you need to add
-v /sys/fs/cgroup:/sys/fs/cgroup:ro
to your docker create/run cmdline.
Next, /run and /tmp. We can blame both systemd and Docker for not being able to solve this for us automatically. Either systemd would have to stop requiring /run and /tmp to be mount points, or Docker would have to provide volumes for them by default. I think I understand both points of view: for Docker it's again a distribution-specific change, and at the same time it's a sane default for systemd to require /tmp and /run to be really temporary. So how do we get around this? Let's add another volume to our image (there is a PR for Docker to do it automatically). Unlike the cgroup mount, this one does not have to point to any specific location on the host. So the command line solution would be
-v /run -v /tmp
or in Dockerfile
VOLUME ["/run", "/tmp"]
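If you want to double-check that these mounts are really in place, a small shell helper can be run inside the container before systemd starts. This is only a sketch: the names is_mountpoint and check_systemd_mounts are mine, and the optional argument (an alternative mounts file) exists purely to make the functions easy to try out. Nothing here is provided by Docker or systemd.

```shell
#!/bin/sh
# Sketch: report whether a path is a mount point according to a mounts table.
# is_mountpoint is a hypothetical helper; the second argument defaults to
# /proc/self/mounts and is overridable only for testing.
is_mountpoint() {
    path=$1
    mounts=${2:-/proc/self/mounts}
    # Field 2 of every line in the mounts table is the mount point.
    awk -v p="$path" '$2 == p { found = 1 } END { exit !found }' "$mounts"
}

# Check the three locations discussed so far; returns non-zero if any
# of them is not a mount point.
check_systemd_mounts() {
    table=${1:-/proc/self/mounts}
    rc=0
    for dir in /sys/fs/cgroup /run /tmp; do
        if is_mountpoint "$dir" "$table"; then
            echo "ok: $dir"
        else
            echo "missing: $dir"
            rc=1
        fi
    done
    return $rc
}
```

Calling check_systemd_mounts with no arguments inspects the live mounts table of the current process.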
There are ways for systemd to figure out where and how it runs. It checks a bunch of things, and one of them is the environment variable $container. It can have a few values (e.g. lxc), but here, for obvious reasons, we want it set to docker. So on the command line you would need
-e container=docker
or in a Dockerfile
ENV container docker
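To see what value PID 1 actually received, you can read its environment from inside the container. The helper below is a sketch under my own assumptions: container_type is a made-up name, and the file argument is there only so the function can be pointed at a test file instead of /proc/1/environ (which usually needs root to read).

```shell
#!/bin/sh
# Sketch: print the value of the "container" variable from a NUL-separated
# environment dump. container_type is a hypothetical helper; the file
# argument defaults to the environment of PID 1.
container_type() {
    environ=${1:-/proc/1/environ}
    # Entries in /proc/<pid>/environ are separated by NUL bytes.
    tr '\000' '\n' < "$environ" | sed -n 's/^container=//p'
}
```

Inside a container started as described above, container_type should print docker. On a system where systemd is already up, systemd-detect-virt --container is another way to ask the same question.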
There is another variable systemd can use. It's called $container_uuid and it is used to set /etc/machine-id. That can be very useful because, for example, it identifies your container in journald. Wouldn't it be awesome if the Docker daemon set this up automatically when the container is created? There is a (closed) PR on Docker for this.
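Until something like that lands, you can pass the variable by hand. A minimal sketch, assuming a Linux host (or uuidgen as a fallback) and assuming that the image has no /etc/machine-id filled in yet; new_container_uuid and uuid_flag are hypothetical helper names of mine, not Docker features.

```shell
#!/bin/sh
# Sketch: produce a fresh UUID and the matching -e flag for docker run.
# new_container_uuid and uuid_flag are hypothetical helpers.
new_container_uuid() {
    if [ -r /proc/sys/kernel/random/uuid ]; then
        # The Linux kernel returns a new random UUID on every read.
        cat /proc/sys/kernel/random/uuid
    else
        # Fallback for hosts without that file; normalise to lowercase.
        uuidgen | tr 'A-F' 'a-f'
    fi
}

uuid_flag() {
    echo "-e container_uuid=$1"
}
```

You would then splice the flag into the command line, for example docker run $(uuid_flag "$(new_container_uuid)") followed by the rest of your options.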
Docker containers drop the CAP_SYS_ADMIN capability, which is good for security but bad for systemd. It tries to do some mounting on startup and, as expected, fails. The easiest way to get rid of these failures is 1) to remove /etc/fstab and 2) to mask the mount units which systemd ships (I've found these in a Fedora base image: dev-hugepages.mount, sys-fs-fuse-connections.mount). Both are done in the %post section of fedora-base-docker.ks (which is used to build the base image).
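If you build your own base image and need to do the same, the two steps boil down to something like the sketch below. It is not the actual content of the kickstart file; mask_unit and strip_mounts are hypothetical helpers, and the root argument is there so you can try them on a scratch directory first. Masking is done by hand here (a symlink to /dev/null), which is what systemctl mask creates for you.

```shell
#!/bin/sh
# Sketch: drop fstab and mask mount units under a given root directory.
# mask_unit and strip_mounts are hypothetical helpers.
mask_unit() {
    root=$1
    unit=$2
    mkdir -p "$root/etc/systemd/system"
    # A unit file symlinked to /dev/null is treated by systemd as masked.
    ln -sf /dev/null "$root/etc/systemd/system/$unit"
}

strip_mounts() {
    target=${1:-}
    rm -f "$target/etc/fstab"
    for u in dev-hugepages.mount sys-fs-fuse-connections.mount; do
        mask_unit "$target" "$u"
    done
}
```

Inside an image build you would normally just run rm -f /etc/fstab and systemctl mask with the unit names instead.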
The last piece, dbus.service, again has something to do with capabilities. The D-Bus service tries to change its OOM score via OOMScoreAdjust in its unit file, which fails. But this time it fails quite badly: sometimes the container dies completely, sometimes systemd says it's logging too fast and freezes, but in all cases the container is basically useless. It should be fixed in the latest systemd builds in Fedora, but I still hit this in the fedora:21 image. To solve this for your containers, please add this line to your Dockerfile
RUN cp /usr/lib/systemd/system/dbus.service /etc/systemd/system/; sed -i 's/OOMScoreAdjust=-900//' /etc/systemd/system/dbus.service
OK, hopefully I've now convinced you that running your services in containers under systemd is easy. Sadly, what you need at the moment is to create another layer over the Fedora base image. You can do that with this Dockerfile:
FROM fedora
MAINTAINER Vaclav Pavlin <email@example.com>
RUN yum -y update; yum clean all
RUN systemctl mask systemd-remount-fs.service dev-hugepages.mount sys-fs-fuse-connections.mount systemd-logind.service getty.target console-getty.service
RUN cp /usr/lib/systemd/system/dbus.service /etc/systemd/system/; sed -i 's/OOMScoreAdjust=-900//' /etc/systemd/system/dbus.service
VOLUME ["/sys/fs/cgroup", "/run", "/tmp"]
ENV container=docker
CMD ["/usr/sbin/init"]
Build it, for example, like this:
docker build -t fedora:systemd .
Or use an image I've prepared for you on Docker Hub:
The following command will do the work:
docker run -it --rm -v /sys/fs/cgroup:/sys/fs/cgroup:ro fedora:systemd
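If your image does not declare the volumes and the variable itself, all the flags from this post have to go on the command line. The sketch below just assembles them into one string so you can look at the command before running it; systemd_run_cmd is a hypothetical helper of mine, and the default image name is simply the tag used in the build step.

```shell
#!/bin/sh
# Sketch: print a docker run command carrying the flags discussed above.
# systemd_run_cmd is a hypothetical helper; it prints, it does not run.
systemd_run_cmd() {
    image=${1:-fedora:systemd}
    echo "docker run -it --rm" \
        "-v /sys/fs/cgroup:/sys/fs/cgroup:ro" \
        "-v /run -v /tmp" \
        "-e container=docker" \
        "$image"
}
```

Once the printed line looks right, something like sh -c "$(systemd_run_cmd my-image)" runs it.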
Some lines in the Dockerfile above will become redundant once F22 is released, so I'll probably update the article when we get there.
By the way, you probably want to continue with the next post: Running services with docker and systemd.