Understand what is and isn’t provided inside containerd. This document provides the full scope of the project
Historical background on why networking was left out of containerd
containerd-shim: after runc starts the container, runc exits (so no long-running runtime process is responsible for our containers). The shim is the component that sits between containerd and runc to make this possible. Containers do not die when dockerd or containerd dies, because they are ‘attached’ to the containerd-shim process. The job of the containerd-shim process is to monitor the container's stdin/stdout and to report back the exit code returned when the container exits
Some of the containerd Makefile tasks:
The following are some explanations of the containerd source code:
Examples of how to use containerd
{
"ID": "ubuntulatest",
"Labels": {
"io.containerd.image.config.stop-signal": "SIGTERM"
},
"Image": "docker.io/library/ubuntu:latest",
"Runtime": {
"Name": "io.containerd.runc.v2",
"Options": {
"type_url": "containerd.runc.v1.Options"
}
},
"SnapshotKey": "ubuntulatest",
"Snapshotter": "overlayfs",
"CreatedAt": "2020-01-01T00:24:30.509643667Z",
"UpdatedAt": "2020-01-01T00:24:30.509643667Z",
"Extensions": null,
"Spec": {
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/bin/bash"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"cwd": "/",
"capabilities": {
"bounding": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"effective": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"inheritable": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"permitted": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
]
},
"rlimits": [
{
"type": "RLIMIT_NOFILE",
"hard": 1024,
"soft": 1024
}
],
"noNewPrivileges": true
},
"root": {
"path": "rootfs"
},
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620",
"gid=5"
]
},
{
"destination": "/dev/shm",
"type": "tmpfs",
"source": "shm",
"options": [
"nosuid",
"noexec",
"nodev",
"mode=1777",
"size=65536k"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/run",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
}
],
"linux": {
"resources": {
"devices": [
{
"allow": false,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 3,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 8,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 7,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 0,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 5,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 9,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 1,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 136,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 2,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 10,
"minor": 200,
"access": "rwm"
}
]
},
"cgroupsPath": "/default/ubuntulatest",
"namespaces": [
{
"type": "pid"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
},
{
"type": "network"
}
],
"maskedPaths": [
"/proc/acpi",
"/proc/asound",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/sys/firmware",
"/proc/scsi"
],
"readonlyPaths": [
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
}
}
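The device rules under linux.resources.devices in the spec above are easier to read once the major/minor numbers are mapped to device nodes. The following is a minimal Python sketch, not part of containerd: the table uses the standard Linux character device numbers, and the helper name describe_rule is made up for illustration.

```python
# Well-known Linux character device numbers: (major, minor) -> device node.
KNOWN_CHAR_DEVICES = {
    (1, 3): "/dev/null",
    (1, 5): "/dev/zero",
    (1, 7): "/dev/full",
    (1, 8): "/dev/random",
    (1, 9): "/dev/urandom",
    (5, 0): "/dev/tty",
    (5, 1): "/dev/console",
    (5, 2): "/dev/ptmx",
    (10, 200): "/dev/net/tun",
}


def describe_rule(rule):
    """Return a one-line description of a device rule taken from the spec."""
    verb = "allow" if rule["allow"] else "deny"
    if "type" not in rule:
        # A rule without type/major/minor applies to every device.
        return f"{verb} all devices ({rule['access']})"
    major, minor = rule.get("major"), rule.get("minor")
    if major == 136 and minor is None:
        # Major 136 is the pseudo-terminal slave range; no minor means "any".
        name = "/dev/pts/* (any minor)"
    else:
        fallback = f"{rule['type']} {major}:{minor}"
        name = KNOWN_CHAR_DEVICES.get((major, minor), fallback)
    return f"{verb} {name} ({rule['access']})"


# A few of the rules from the spec above.
rules = [
    {"allow": False, "access": "rwm"},
    {"allow": True, "type": "c", "major": 1, "minor": 3, "access": "rwm"},
    {"allow": True, "type": "c", "major": 136, "access": "rwm"},
    {"allow": True, "type": "c", "major": 10, "minor": 200, "access": "rwm"},
]
for rule in rules:
    print(describe_rule(rule))
```

Read this way, the list starts by denying everything and then allows only a small set of standard devices.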
ID TIMESTAMP
ubuntulatest 2020-01-01 00:29:16.295322149 +0000 UTC
METRIC VALUE
memory.usage_in_bytes 1433600
memory.limit_in_bytes 9223372036854771712
memory.stat.cache 0
cpuacct.usage 19273664
cpuacct.usage_percpu [205817 2386548 4771841 148926 5529908 227097 2037362 0 2107441 0 1017213 841511]
pids.current 1
pids.limit 0
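In the metrics above, cpuacct.usage is the total CPU time consumed (in nanoseconds) and cpuacct.usage_percpu splits it per CPU. A quick Python check, with the values copied from the output and variable names chosen here for illustration, shows that the per-CPU values add up to the total:

```python
# Values copied from the metrics output above.
cpuacct_usage = 19273664
usage_percpu = [205817, 2386548, 4771841, 148926, 5529908, 227097,
                2037362, 0, 2107441, 0, 1017213, 841511]

total = sum(usage_percpu)
print(total == cpuacct_usage)  # True: per-CPU values sum to the total
print(f"{cpuacct_usage / 1e6:.2f} ms of CPU time on {len(usage_percpu)} CPUs")
```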
{
"id": "ubuntulatest",
"bundle": "/run/containerd/io.containerd.runtime.v2.task/default/ubuntulatest",
"pid": 16032,
"status": 2,
"stdin": "/run/containerd/fifo/249557939/ubuntulatest-stdin",
"stdout": "/run/containerd/fifo/249557939/ubuntulatest-stdout",
"stderr": "/run/containerd/fifo/249557939/ubuntulatest-stderr",
"exited_at": "0001-01-01T00:00:00Z"
}
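The "status" field in the task output above is numeric. The sketch below decodes it under two assumptions that should be checked against the containerd version in use: that the number follows the Status enum of containerd's task API (UNKNOWN=0, CREATED=1, RUNNING=2, STOPPED=3, PAUSED=4, PAUSING=5), and that an "exited_at" of 0001-01-01T00:00:00Z is Go's zero time, meaning the task has not exited. The names TaskStatus and summarize are made up for this example.

```python
from enum import IntEnum


class TaskStatus(IntEnum):
    # Assumed to mirror the Status enum of containerd's task API.
    UNKNOWN = 0
    CREATED = 1
    RUNNING = 2
    STOPPED = 3
    PAUSED = 4
    PAUSING = 5


GO_ZERO_TIME = "0001-01-01T00:00:00Z"  # Go's zero time.Time, i.e. "not set"


def summarize(task):
    """Return (status name, has_exited) for a task dict like the one above."""
    status = TaskStatus(task["status"]).name
    has_exited = task["exited_at"] != GO_ZERO_TIME
    return status, has_exited


task = {"id": "ubuntulatest", "pid": 16032, "status": 2,
        "exited_at": "0001-01-01T00:00:00Z"}
print(summarize(task))
```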
docker.io/library/hello-world:latest: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:4fe721ccc2e8dc7362278a29dc660d833570ec2682f4e4194f4ee23e415e1064: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:92c7f9c92844bbbb5d0a101b22f7c2a7949e40f8ea90c8b3bc396879d95e899a: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:1b930d010525941c1d56ec53b97bd057a67ae1865eebf042686d2a2d18271ced: done |++++++++++++++++++++++++++++++++++++++|
config-sha256:fce289e99eb9bca977dae136fbe2a82b6b7d4c372474c9235adc1741675f587e: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 3.7 s total: 4.8 Ki (1.3 KiB/s)
unpacking linux/amd64 sha256:4fe721ccc2e8dc7362278a29dc660d833570ec2682f4e4194f4ee23e415e1064...
containerd utilises a kernel feature called the ‘reaper’ (the child subreaper, set with prctl PR_SET_CHILD_SUBREAPER) to reparent the container process to the shim:
nanik 7741 3230 0 2019 ? 00:02:54 \_ /usr/libexec/gnome-terminal-server
nanik 7750 7741 0 2019 pts/1 00:00:00 | \_ bash
.....
.....
nanik 19294 7741 0 13:45 pts/5 00:00:00 | \_ bash
root 5123 19294 0 17:51 pts/5 00:00:00 | | \_ sudo ./ctr run -t docker.io/library/ubuntu:latest u13
root 5124 5123 0 17:51 pts/5 00:00:00 | | \_ ./ctr run -t docker.io/library/ubuntu:latest u13
nanik 18313 7741 0 16:11 pts/11 00:00:00 | \_ bash
.....
.....
.....
.....
.....
root 5884 3230 0 17:52 ? 00:00:00 \_ /usr/bin/containerd-shim-runc-v2 -namespace default -id u13 -address /run/containerd/containerd.sock
root 5906 5884 0 17:52 ? 00:00:00 \_ /bin/bash
.....
.....
The shim is executed out-of-process (with exec(..)), and the following arguments are used to execute it:
0 = {string} "-namespace"
1 = {string} "default"
2 = {string} "-address"
3 = {string} "/run/containerd/containerd.sock"
4 = {string} "-publish-binary"
5 = {string} "/tmp/___containerd"
6 = {string} "-id"
7 = {string} "u8"
8 = {string} "-debug"
9 = {string} "start"
The file /tmp/___containerd contains the containerd executable.
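The argument list in the debugger dump above can be written down as a small helper. This is only a Python sketch that reproduces the observed order of arguments. It is not containerd's own code (which is Go), and the function name shim_args is hypothetical.

```python
def shim_args(namespace, address, publish_binary, container_id, debug=False):
    """Build the shim argument list in the order seen in the debugger dump."""
    args = [
        "-namespace", namespace,
        "-address", address,
        "-publish-binary", publish_binary,
        "-id", container_id,
    ]
    if debug:
        args.append("-debug")
    args.append("start")  # the action the shim is asked to perform
    return args


args = shim_args("default", "/run/containerd/containerd.sock",
                 "/tmp/___containerd", "u8", debug=True)
for index, value in enumerate(args):
    print(index, "=", value)
```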
Comment by Michael Crosby about the shim:
The shim allows for daemonless containers. It basically sits as the parent of the container's process to facilitate a few things.
First it allows the runtimes, i.e. runc, to exit after it starts the container. This way we don't have to have the long running runtime processes for containers. When you start mysql you should only see the mysql process and the shim.
Second it keeps the STDIO and other fds open for the container in case containerd and/or docker both die. If the shim was not running then the parent side of the pipes or the TTY master would be closed and the container would exit.
Finally it allows the container's exit status to be reported back to a higher level tool like docker without having to be the actual parent of the container's process and do a wait.
containerd uses FIFOs for reporting events and exit codes, and also for stdout and stdin:
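The last point in the comment above relies on the fact that only the parent of a process can wait on it and collect its exit status. A minimal Python sketch of that parent role follows. It is illustrative only: the real shim is written in Go and does considerably more, and the function name run_and_collect is made up.

```python
import subprocess
import sys


def run_and_collect(exit_code):
    """Start a child process, wait for it, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
    )
    # Only the parent of a process can wait() on it. The shim plays this
    # role for the container's process, so higher level tools do not have to.
    return child.wait()


print(run_and_collect(7))
```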
/run/containerd/fifo/195093460/<something_something>_stdout
/run/containerd/fifo/195093460/<something_something>_stdin
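A FIFO (named pipe) looks like a file on disk but carries a byte stream between two processes. The Python sketch below sends a message through a throwaway FIFO to show the mechanism. It needs a Unix-like system because it uses os.mkfifo, and the file name and the helper fifo_roundtrip are hypothetical, not something containerd creates.

```python
import os
import tempfile
import threading


def fifo_roundtrip(message):
    """Send bytes through a named pipe and return what the reader received."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo-stdout")  # hypothetical FIFO name
        os.mkfifo(path)

        def writer():
            # Opening for write blocks until the read side is opened too.
            with open(path, "wb") as pipe:
                pipe.write(message)

        thread = threading.Thread(target=writer)
        thread.start()
        with open(path, "rb") as pipe:
            data = pipe.read()  # returns once the writer closes its end
        thread.join()
        return data


print(fifo_roundtrip(b"hello from the container\n"))
```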
How is ‘runc’ used/executed inside containerd? The following are some explanations:
0 = {string} "-namespace"
1 = {string} "default"
2 = {string} "-address"
3 = {string} "/run/containerd/containerd.sock"
4 = {string} "-publish-binary"
5 = {string} "/tmp/___containerd"
6 = {string} "-id"
7 = {string} "u8"
8 = {string} "-debug"
9 = {string} "start"
The following is the log output (debug logging was added to trace the source code) when ‘containerd-shim-runc-v2’ is running:
time="2020-01-02T23:38:00.438682328+11:00" level=info msg=setupDumpStacks...
time="2020-01-02T23:38:00.438927931+11:00" level=info msg="calling newServer..."
time="2020-01-02T23:38:00.439033554+11:00" level=info msg="registering ttrpc server"
time="2020-01-02T23:38:00.439097725+11:00" level=info msg="calling serve..."
time="2020-01-02T23:38:00.439198636+11:00" level=info msg="calling handleSignals..."
time="2020-01-02T23:38:00.439220268+11:00" level=info msg="nanik starting signal loop" namespace=default path=/run/containerd/io.containerd.runtime.v2.task/default/u67 pid=17162
time="2020-01-02T23:38:00.439861913+11:00" level=info msg="Create is called inside RegisterTaskService"
time="2020-01-02T23:38:00.439974576+11:00" level=info msg="container NANIK "
time="2020-01-02T23:38:00.440214591+11:00" level=info msg="--- CreateTaskRequest &CreateTaskRequest{ID:u67,Bundle:/run/containerd/io.containerd.runtime.v2.task/default/u67,Rootfs:[&types.Mount{Type:overlay,Source:overlay,Target:,Options:[workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/133/work upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/133/fs lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/4/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/3/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/2/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1/fs],XXX_unrecognized:[],}],Terminal:true,Stdin:/run/containerd/fifo/383837931/u67-stdin,Stdout:/run/containerd/fifo/383837931/u67-stdout,Stderr:,Checkpoint:,ParentCheckpoint:,Options:&types1.Any{TypeUrl:containerd.runc.v1.Options,Value:[],XXX_unrecognized:[],},XXX_unrecognized:[],}"
time="2020-01-02T23:38:00.440329033+11:00" level=info msg="--- rootfs /run/containerd/io.containerd.runtime.v2.task/default/u67/rootfs"
time="2020-01-02T23:38:00.440350982+11:00" level=info msg="--- opts.BinaryName "
time="2020-01-02T23:38:00.440367236+11:00" level=info msg="--- opts.Bundle /run/containerd/io.containerd.runtime.v2.task/default/u67"
time="2020-01-02T23:38:00.441364775+11:00" level=info msg="--- calling p.create with context.Background.WithValue(type namespaces.namespaceKey, val default).WithValue(type metadata.mdOutgoingKey, val <not Stringer>).WithValue(type ttrpc.metadataKey, val <not Stringer>).WithValue(type shim.OptsKey, val <not Stringer>).WithValue(type log.loggerKey, val <not Stringer>).WithCancel.WithCancel AND ... "
time="2020-01-02T23:38:00.441809346+11:00" level=info msg="------- inside init.gocontext.Background.WithValue(type namespaces.namespaceKey, val default).WithValue(type metadata.mdOutgoingKey, val <not Stringer>).WithValue(type ttrpc.metadataKey, val <not Stringer>).WithValue(type shim.OptsKey, val <not Stringer>).WithValue(type log.loggerKey, val <not Stringer>).WithCancel.WithCancel/run/containerd/io.containerd.runtime.v2.task/default/u67&{<nil> /run/containerd/io.containerd.runtime.v2.task/default/u67/init.pid 0xc000154520 false false false []}" runtime=io.containerd.runc.v2
time="2020-01-02T23:38:00.503607036+11:00" level=info msg="Start is called inside RegisterTaskService"
time="2020-01-02T23:38:00.503640676+11:00" level=info msg="v2/service Start"
time="2020-01-02T17:33:35.454812846+11:00" level=info msg="v2/service Delete"
The final function that will execute ‘runc’ is inside containerd/go-runc/runc.go
func (r *Runc) Create(context context.Context, id, bundle string, opts *CreateOpts) error {}
Logging code was added inside the Create(..) function, and the following is the output:
--- args [create --bundle /run/containerd/io.containerd.runtime.v2.task/default/u67]
--- cmd /home/nanik/AndroidProjects/docker/docker/runc --root /run/containerd/runc/default --log /run/containerd/io.containerd.runtime.v2.task/default/u67/log.json --log-format json create --bundle /run/containerd/io.containerd.runtime.v2.task/default/u67 --pid-file /run/containerd/io.containerd.runtime.v2.task/default/u67/init.pid --console-socket /tmp/pty415594316/pty.sock u67
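The logged command line above has a regular shape: global options (--root, --log, --log-format), the create subcommand, its own options, and finally the container ID. The Python sketch below rebuilds it from its parts. It mirrors the logged output only and is not how go-runc builds the command internally, and the helper name runc_create_cmd is hypothetical.

```python
import shlex


def runc_create_cmd(runc_bin, root, bundle, console_socket, container_id):
    """Assemble a 'runc create' command line shaped like the logged one."""
    return [
        runc_bin,
        "--root", root,                      # runtime root directory
        "--log", f"{bundle}/log.json",
        "--log-format", "json",
        "create",
        "--bundle", bundle,
        "--pid-file", f"{bundle}/init.pid",
        "--console-socket", console_socket,
        container_id,
    ]


bundle = "/run/containerd/io.containerd.runtime.v2.task/default/u67"
cmd = runc_create_cmd(
    "/home/nanik/AndroidProjects/docker/docker/runc",
    "/run/containerd/runc/default",
    bundle,
    "/tmp/pty415594316/pty.sock",
    "u67",
)
print(shlex.join(cmd))
```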
The command used to execute 'runc' is as follows:
"/home/nanik/AndroidProjects/docker/docker/runc --root /run/containerd/runc/default --log /run/containerd/io.containerd.runtime.v2.task/default/u67/log.json --log-format json create --bundle /run/containerd/io.containerd.runtime.v2.task/default/u67 --pid-file /run/containerd/io.containerd.runtime.v2.task/default/u67/init.pid --console-socket /tmp/pty415594316/pty.sock u67"
To use runc to list the Docker containers that are running:
sudo ./runc --root /run/docker/runtime-runc/moby list
ID PID STATUS BUNDLE CREATED OWNER
f182f95645673b94af95495ea4c2a7c0f58dcce523f3d4e7174d7e482e136e08 12212 running /run/containerd/io.containerd.runtime.v1.linux/moby/f182f95645673b94af95495ea4c2a7c0f58dcce523f3d4e7174d7e482e136e08 2020-01-05T21:28:05.371871841Z root
The runtime (runc) uses a so-called runtime root directory to store and obtain information about containers. Under this root directory, runc places sub-directories (one per container), and each of them contains a state.json file, where the container state description resides.
The default location for the runtime root directory is either /run/runc (for non-rootless containers) or $XDG_RUNTIME_DIR/runc (for rootless containers). The latter also usually points to somewhere under /run (e.g. /run/user/$UID/runc).
When the container engine invokes runc, it may override the default runtime root directory and specify a custom one (the --root option of runc). Docker uses this possibility, e.g. on my box, it specifies /run/docker/runtime-runc/moby as the runtime root.
That said, to make runc list see your Docker containers, you have to point it to Docker's runtime root directory by specifying the --root option. Also, given that Docker containers are not rootless by default, you will need the appropriate privileges to access the runtime root (e.g. with sudo).
So, that's how this should work:
$ docker run -d alpine sleep 1000
4acd4af5ba8da324b7a902618aeb3fd0b8fce39db5285546e1f80169f157fc69
$ sudo runc --root /run/docker/runtime-runc/moby/ list
ID PID STATUS BUNDLE CREATED OWNER
4acd4af5ba8da324b7a902618aeb3fd0b8fce39db5285546e1f80169f157fc69 18372 running /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/4acd4af5ba8da324b7a902618aeb3fd0b8fce39db5285546e1f80169f157fc69 2019-07-12T17:33:23.401746168Z root
As to images, you cannot make runc see them, as it has no notion of an image at all. Instead, it operates on bundles. Creating the bundle (e.g. based on an image) is the responsibility of the caller (in your case, containerd).
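The rule for locating the runtime root described above can be sketched in a few lines of Python. This follows the description in the text rather than runc's source, so treat the fallback for a rootless user without XDG_RUNTIME_DIR as a simplification. The function names are made up for this example.

```python
import posixpath


def default_runc_root(euid, environ):
    """Return the default runc runtime root as described in the text above."""
    xdg = environ.get("XDG_RUNTIME_DIR")
    if euid != 0 and xdg:
        # Rootless: state lives under the user's runtime directory.
        return posixpath.join(xdg, "runc")
    # Root, or (simplification) no XDG_RUNTIME_DIR available.
    return "/run/runc"


def state_file(root, container_id):
    """Path of the state.json that runc keeps for one container."""
    return posixpath.join(root, container_id, "state.json")


print(default_runc_root(0, {}))
print(default_runc_root(1000, {"XDG_RUNTIME_DIR": "/run/user/1000"}))
print(state_file(
    "/run/docker/runtime-runc/moby",
    "4acd4af5ba8da324b7a902618aeb3fd0b8fce39db5285546e1f80169f157fc69",
))
```

An engine that passes --root, as Docker does, bypasses this default entirely, which is why runc list needs the same --root value to find those containers.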
Docker CLI (docker) - /usr/bin/docker
“Docker” refers to the whole set of Docker tools, which started out as a monolith. Now docker-cli is only responsible for user-friendly communication with Docker.
So commands like docker build … and docker run … are handled by the Docker CLI and result in calls to the dockerd API.
Dockerd - /usr/bin/dockerd
The Docker daemon (dockerd) listens for Docker API requests and manages the host’s container life-cycles by utilizing containerd.
dockerd can listen for Docker Engine API requests via three different types of socket: unix, tcp, and fd. By default, a unix domain socket is created at /var/run/docker.sock, requiring either root permission or docker group membership. On systemd-based systems, you can communicate with the daemon via systemd socket activation by using dockerd -H fd://.
There are many configuration options for the daemon, which are worth checking if you work with Docker (dockerd).
My impression is that dockerd is there to serve all the features of the Docker (or Docker EE) platform, while the actual container life-cycle management is “outsourced” to containerd.
containerd - /usr/bin/docker-containerd
containerd was introduced in Docker 1.11 and has since taken over the main responsibility for managing the container life-cycle. containerd is the executor for containers, but it has a wider scope than just executing containers. It also takes care of:
Image push and pull
Managing storage
Executing containers by calling runc with the right parameters
Managing network primitives for interfaces
Managing network namespaces, so that containers can join existing namespaces
containerd fully leverages the OCI runtime specification, the image format specifications and the OCI reference implementation (runc). Because of its massive adoption, containerd is the industry standard for implementing OCI. It is currently available for Linux and Windows.
RunC - /usr/bin/docker-runc
runc (OCI runtime) can be seen as a component of containerd.
runc is a command line client for running applications packaged according to the OCI format and is a compliant implementation of the OCI spec.
Containers are configured using bundles. A bundle for a container is a directory that includes a specification file named “config.json” and a root filesystem. The root filesystem contains the contents of the container.
Assuming you have an OCI bundle you can execute the container
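Since a bundle is just a directory with a config.json and a root filesystem, it is easy to check whether a directory looks like one before handing it to runc. The Python sketch below does only that structural check and does not validate the config against the OCI specification. The helper name looks_like_bundle is hypothetical.

```python
import json
import os
import tempfile


def looks_like_bundle(path):
    """Check for config.json plus the root filesystem it points to."""
    config_path = os.path.join(path, "config.json")
    if not os.path.isfile(config_path):
        return False
    with open(config_path) as f:
        config = json.load(f)
    root_path = config.get("root", {}).get("path")
    if not root_path:
        return False
    # A relative root.path is resolved against the bundle directory;
    # os.path.join leaves an absolute root.path untouched.
    return os.path.isdir(os.path.join(path, root_path))


# Build a throwaway bundle to exercise the check.
with tempfile.TemporaryDirectory() as bundle:
    print(looks_like_bundle(bundle))  # False: the directory is still empty
    os.mkdir(os.path.join(bundle, "rootfs"))
    with open(os.path.join(bundle, "config.json"), "w") as f:
        json.dump({"ociVersion": "1.0.1-dev", "root": {"path": "rootfs"}}, f)
    print(looks_like_bundle(bundle))  # True: config.json and rootfs exist
```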
containerd-ctr - /usr/bin/docker-containerd-ctr
(docker-)containerd-ctr is a barebones CLI (ctr) designed specifically for development and debugging, for direct communication with containerd. It is included in the containerd releases, which makes it less interesting for Docker users.
containerd-shim - /usr/bin/docker-containerd-shim
The shim allows for daemonless containers. According to Michael Crosby, it basically sits as the parent of the container’s process to facilitate a few things.
First it allows the runtimes, i.e. runc, to exit after it starts the container. This way we don't have to have the long running runtime processes for containers.
Second it keeps the STDIO and other fds open for the container in case containerd and/or docker both die. If the shim was not running then the parent side of the pipes or the TTY master would be closed and the container would exit.
Finally it allows the container's exit status to be reported back to a higher level tool like docker without having to be the actual parent of the container's process and do a wait.
Complete interaction between docker cli, dockerd, containerd, containerd-shim and runc
dockerd is sent POST Containers Create
↳ dockerd finds the requested image
↳ A container object is created and stored for future use
↳ Directories on the file system are setup for use by the container
dockerd is sent a POST Containers Start
↳ An OCI spec is created for the container
↳ containerd is contacted to create the container
↳ containerd stores the container spec in a database
↳ containerd is contacted to start the container
↳ containerd creates a task for the container
↳ The task uses a shim to call runc create
↳ containerd starts the task
↳ The task uses the shim to call runc start
↳ The shim / containerd continue to monitor the container until completion
The following is a step-by-step example of how to run an OCI-compliant image using runc. We are going to use Docker in this example.
docker run -it ubuntu:latest /bin/bash
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ebfbbecaf715 ubuntu:latest "/bin/bash" 38 minutes ago Up 38 minutes zen_kirch
sudo env "PATH=$PATH" exportrootfs.sh -u 0 -r 65536 ebf
rootfs
├── bin
├── boot
├── dev
├── etc
├── home
├── lib
├── lib64
├── media
├── mnt
├── opt
├── proc
├── root
├── run
├── sbin
├── srv
├── sys
├── tmp
├── usr
└── var
runc spec
Open config.json and modify the args to the following:
"args": [
"/bin/bash"
],
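The same change to config.json can be scripted instead of made by hand. The following Python sketch rewrites process.args in place and leaves everything else in the file untouched. The helper name set_process_args is made up for this example.

```python
import json


def set_process_args(config_path, args):
    """Rewrite process.args in an OCI config.json in place."""
    with open(config_path) as f:
        config = json.load(f)
    # Replace only the command line; other process settings are preserved.
    config.setdefault("process", {})["args"] = list(args)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
        f.write("\n")
```

Calling set_process_args("config.json", ["/bin/bash"]) in the directory where runc spec was run would produce the args shown above.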
Execute the container using the following:
sudo env "PATH=$PATH" runc run anycontainername
You will see bash running
root@runc:/# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"
root@runc:/#
As can be seen, runc does not know how to pull or prepare an image. It only knows that there is a root filesystem and a config.json that it needs to run. The Ubuntu container run in the above example has no networking, as this is taken care of by other projects and not by runc.
runc uses prestart hooks to run other applications required as part of container setup, as shown here. The config.json
.
.
.
"hooks": {
"prestart" : [
{
"path" : "/path/to/netns",
"args" : [
"",
"--state-dir", "/path/to/netns/netns-state"
]
}
]
},
.
.
.
shows the prestart hook that will be executed to set up the networking state using the netns executable. The netns tool is part of the genuinetools project.
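A hook entry like the one in the fragment above can also be added programmatically. The Python sketch below appends a prestart hook to a config held as a dictionary. The paths are the same placeholders used in the fragment, and the helper name add_prestart_hook is hypothetical.

```python
def add_prestart_hook(config, path, args):
    """Append a prestart hook entry to an OCI config dict and return it."""
    hooks = config.setdefault("hooks", {})
    hooks.setdefault("prestart", []).append({"path": path, "args": list(args)})
    return config


config = {"ociVersion": "1.0.1-dev"}
add_prestart_hook(
    config,
    "/path/to/netns",  # placeholder path, as in the fragment above
    ["", "--state-dir", "/path/to/netns/netns-state"],
)
print(config["hooks"]["prestart"][0]["path"])
```

Existing hooks are kept because the entry is appended to the list rather than replacing it.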