containerd for Ops and Admins

Systemd
Base Configuration
Plugin Configuration
- Linux Runtime Plugin
- Bolt Metadata Plugin

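containerd is a simple daemon that can run on any system. It provides a minimal configuration with settings for the daemon itself and, where needed, for the plugins it uses.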

containerd --help
NAME:
   containerd -
                    __        _                     __
  _________  ____  / /_____ _(_)___  ___  _________/ /
 / ___/ __ \/ __ \/ __/ __ `/ / __ \/ _ \/ ___/ __  /
/ /__/ /_/ / / / / /_/ /_/ / / / / /  __/ /  / /_/ /
\___/\____/_/ /_/\__/\__,_/_/_/ /_/\___/_/   \__,_/
high performance container runtime
USAGE:
   containerd [global options] command [command options] [arguments...]
VERSION:
   1.4.6
DESCRIPTION:
containerd is a high performance container runtime whose daemon can be started
by using this command. If none of the *config*, *publish*, or *help* commands
are specified, the default action of the **containerd** command is to start the
containerd daemon in the foreground.
A default configuration is used if no TOML configuration is specified or located
at the default file location. The *containerd config* command can be used to
generate the default configuration for containerd. The output of that command
can be used and modified as necessary as a custom configuration.
COMMANDS:
   config    information on the containerd config
   publish   binary to publish events to containerd
   oci-hook  provides a base for OCI runtime hooks to allow arguments to be injected.
   help, h   Shows a list of commands or help for one command
GLOBAL OPTIONS:
   --config value, -c value     path to the configuration file (default: "/etc/containerd/config.toml")
   --log-level value, -l value  set the logging level [trace, debug, info, warn, error, fatal, panic]
   --address value, -a value    address for containerd's GRPC server
   --root value                 containerd root directory
   --state value                containerd state directory
   --help, -h                   show help
   --version, -v                print the version

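A few daemon-level options can be set with flags, but most of containerd's configuration lives in the configuration file. The default path of that file is /etc/containerd/config.toml; you can change it with the --config (-c) flag when starting the daemon.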

Systemd

If you use systemd as your init system (as most modern Linux systems do), the service file needs a few modifications.

[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target
[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Delegate=yes
KillMode=process
[Install]
WantedBy=multi-user.target

Delegate=yes and KillMode=process are the two most important changes to make in the [Service] section.

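Delegate allows containerd and its runtimes to manage the cgroups of the containers they create. Without this option, systemd will try to move the processes into its own cgroups, which prevents containerd and its runtimes from properly accounting for the containers' resource usage.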

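When containerd is shut down, systemd by default looks in the named cgroup and kills every process it knows about for the service. This is not what we want. We want to be able to upgrade containerd and let existing containers keep running without interruption. Setting KillMode to process ensures that systemd kills only the containerd daemon and not its child processes, such as shims and containers.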

The following systemd-run command starts containerd in the same way:

sudo systemd-run -p Delegate=yes -p KillMode=process /usr/local/bin/containerd
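
As a rough illustration of the two settings discussed above, the following Python sketch checks the text of a unit file for Delegate=yes and KillMode=process. The function name `check_containerd_unit` is hypothetical, and the check is deliberately simplified: it treats the unit file as plain INI and only accepts the literal values shown in this document.

```python
import configparser


def check_containerd_unit(unit_text):
    """Return a list of problems found in a containerd unit file (simplified check)."""
    parser = configparser.ConfigParser(
        interpolation=None,   # unit files may contain literal '%' specifiers
        strict=False,         # systemd allows repeated keys
        delimiters=("=",),    # do not split on ':' inside URLs or paths
    )
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)

    service = parser["Service"] if parser.has_section("Service") else {}
    problems = []
    # Assumption: only the literal "yes" is accepted here.
    if service.get("Delegate", "no") != "yes":
        problems.append("Delegate=yes is missing from [Service]")
    # systemd's default KillMode is control-group, which would also kill shims.
    if service.get("KillMode", "control-group") != "process":
        problems.append("KillMode=process is missing from [Service]")
    return problems
```

An empty list means both settings are present; each missing setting adds one message.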

Base Configuration

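The containerd configuration file contains settings for the persistent and runtime storage locations, as well as the grpc, debug, and metrics addresses for the various APIs.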

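Only a few settings matter for operations. The first is oom_score. Because containerd manages many containers, we need to make sure that containers are killed before the containerd daemon itself runs out of memory. We do not want to make containerd unkillable, but we do want to lower its score to the level of other system daemons.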
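
To verify the effect of oom_score on a running daemon, one option is to read the kernel's per-process value. The sketch below assumes a Linux host and assumes that containerd applies oom_score to `/proc/<pid>/oom_score_adj`; the helper names and the `proc_root` parameter are hypothetical and exist only to make the example easy to try.

```python
import os


def read_oom_score_adj(pid, proc_root="/proc"):
    """Read the OOM score adjustment the kernel holds for a process."""
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path) as f:
        return int(f.read().strip())


def is_protected_but_killable(score):
    """True if the process is less likely to be killed, yet not exempt.

    The kernel accepts values from -1000 to 1000; -1000 exempts the
    process from the OOM killer entirely, which is not what we want.
    """
    return -1000 < score < 0
```

With the oom_score = -999 used in this document's configuration, `is_protected_but_killable(-999)` is true, whereas -1000 would make the daemon unkillable.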

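containerd exports its own metrics, as well as container-level metrics, in the Prometheus metrics format under /v1/metrics. Currently, Prometheus only supports TCP endpoints, so the metrics address should be a TCP address that your Prometheus infrastructure can scrape.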
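
A minimal sketch of how a scraper might build the endpoint URL from the configured metrics address, rejecting socket paths since only TCP is supported. The helper names `metrics_url` and `fetch_metrics` are hypothetical; the /v1/metrics path is the one named above.

```python
from urllib.request import urlopen


def metrics_url(address):
    """Build the scrape URL for a TCP metrics address such as 127.0.0.1:1234."""
    host, sep, port = address.rpartition(":")
    if address.startswith("/") or not sep or not host or not port.isdigit():
        raise ValueError(f"not a TCP host:port address: {address!r}")
    if not 0 < int(port) <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"http://{address}/v1/metrics"


def fetch_metrics(address, timeout=5.0):
    """Fetch the raw Prometheus text exposition from a running daemon."""
    with urlopen(metrics_url(address), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_metrics("127.0.0.1:1234")` requires a running daemon configured with that metrics address; `metrics_url` alone can be used to validate a configuration value.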

containerd uses two different storage locations on a host: one for persistent data and one for runtime state.

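root is used to store any type of persistent data: snapshots, content, metadata for containers and images, and any plugin data. The root directory is also namespaced for the plugins that containerd loads. Each plugin has its own directory where it stores data. containerd itself has no persistent data of its own to store; its functionality comes from the plugins that are loaded.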

/var/lib/containerd/
├── io.containerd.content.v1.content
│   ├── blobs
│   └── ingest
├── io.containerd.metadata.v1.bolt
│   └── meta.db
├── io.containerd.runtime.v1.linux
│   ├── default
│   └── example
├── io.containerd.snapshotter.v1.btrfs
└── io.containerd.snapshotter.v1.overlayfs
    ├── metadata.db
    └── snapshots
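
The per-plugin directories in the listing above appear to follow an `io.containerd.<type>.v1.<id>` pattern. The sketch below builds such a path; the pattern is inferred from this listing only, and the function name `plugin_dir` is hypothetical. Because these directories are an implementation detail, this is meant for understanding the layout, not for reading or modifying the data.

```python
import posixpath


def plugin_dir(root, plugin_type, plugin_id):
    """Return the directory a plugin would use under containerd's root.

    Assumption: names follow io.containerd.<type>.v1.<id>, as seen in the
    directory listing; other versions or layouts are not handled.
    """
    return posixpath.join(root, f"io.containerd.{plugin_type}.v1.{plugin_id}")
```

For example, `plugin_dir("/var/lib/containerd", "snapshotter", "overlayfs")` yields the overlayfs snapshotter directory shown in the tree.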

state is used to store any type of ephemeral data: sockets, pids, runtime state, mount points, and other plugin data that does not need to persist.

/run/containerd
├── containerd.sock
├── debug.sock
├── io.containerd.runtime.v1.linux
│   └── default
│       └── redis
│           ├── config.json
│           ├── init.pid
│           ├── log.json
│           └── rootfs
│               ├── bin
│               ├── data
│               ├── dev
│               ├── etc
│               ├── home
│               ├── lib
│               ├── media
│               ├── mnt
│               ├── proc
│               ├── root
│               ├── run
│               ├── sbin
│               ├── srv
│               ├── sys
│               ├── tmp
│               ├── usr
│               └── var
└── runc
    └── default
        └── redis
            └── state.json

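Both the root and state directories are namespaced for plugins. Both are implementation details of containerd and its plugins. They should not be tampered with, or corruption and bugs will follow. External applications that read or watch these directories have been known to cause EBUSY errors and stale file handles when containerd or its plugins try to clean up resources.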

# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"
# set containerd's OOM score
oom_score = -999
# grpc configuration
[grpc]
  address = "/run/containerd/containerd.sock"
  # socket uid
  uid = 0
  # socket gid
  gid = 0
# debug configuration
[debug]
  address = "/run/containerd/debug.sock"
  # socket uid
  uid = 0
  # socket gid
  gid = 0
  # debug level
  level = "info"
# metrics configuration
[metrics]
  # tcp address!
  address = "127.0.0.1:1234"

Plugin Configuration

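At the end of the day, containerd's core is very small; the real functionality comes from plugins. Snapshotters, runtimes, and content stores are all plugins that are registered at runtime. Because these plugins differ so much from one another, we need a way to give them type-safe configuration, and the only way to do that is through the configuration file rather than CLI flags.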

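In the configuration file you can set plugin-level options for the plugins you use in a [plugins.<name>] section. You will have to read each plugin's own documentation to find out which options it accepts.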

Linux Runtime Plugin

The linux runtime lets you set a few options to configure the shim and the runtime that you use.

[plugins.linux]
    # shim binary name/path
    shim = ""
    # runtime binary name/path
    runtime = "runc"
    # do not use a shim when starting containers, saves on memory but
    # live restore is not supported
    no_shim = false
    # display shim logs in the containerd daemon's log output
    shim_debug = true

Bolt Metadata Plugin

The bolt metadata plugin allows you to configure the content sharing policy between namespaces.

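The default mode, "shared", makes a blob available in all namespaces once it has been pulled into any namespace. The blob is pulled into a namespace if a writer is opened with an "Expected" digest that is already present in the backend.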

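The alternative mode, "isolated", requires clients to prove that they have access to the content by providing all of the content to the ingest before the blob is added to the namespace.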

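Both modes share the backing data. "shared" reduces total bandwidth across namespaces, at the cost of allowing access to any blob to anyone who knows its digest.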

The default is "shared". While this is largely the most desired policy, you can switch to "isolated" mode with the following configuration:

[plugins.bolt]
    content_sharing_policy = "isolated"
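
The difference between the two policies can be illustrated with a toy model. This is not containerd's implementation: the class `ToyContentStore` and its `ingest` method are hypothetical, and the model only captures the rule described above (a known digest is enough in "shared" mode, while "isolated" mode requires the full content).

```python
import hashlib


class ToyContentStore:
    """Toy model of content sharing between namespaces (illustration only)."""

    def __init__(self, policy="shared"):
        if policy not in ("shared", "isolated"):
            raise ValueError(f"unknown policy: {policy!r}")
        self.policy = policy
        self.backend = {}  # digest -> bytes, backing data shared by all namespaces
        self.visible = {}  # namespace -> set of digests visible in that namespace

    def ingest(self, namespace, expected_digest, data=None):
        """Return True if the blob is visible in the namespace afterwards."""
        if data is not None:
            # Full content supplied: verify it against the expected digest.
            digest = "sha256:" + hashlib.sha256(data).hexdigest()
            if digest != expected_digest:
                return False
            self.backend[digest] = data
        elif self.policy == "isolated" or expected_digest not in self.backend:
            # Digest alone is only enough in shared mode, and only when
            # the blob already exists in the backend.
            return False
        self.visible.setdefault(namespace, set()).add(expected_digest)
        return True
```

In this model, a second namespace can obtain an existing blob with only its digest under "shared", but must upload the full content again under "isolated". The backing data is stored once in both cases.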