Hello! I'm a reader of 《自己动手写Docker》 and a new-grad hire at Ant Group (alias Zhoumeng; I interned there for a while). I'm currently a college senior, and I've recently been trying to implement a container runtime similar to runC, but I've run into some problems while implementing cgroups.
- First, a description of the interaction between the parent process (the runc run process) and the child process (the runc init process):
After the parent process calls cmd.Start() and obtains the init process's PID, it sets up cgroups. Taking the memory subsystem as an example, it creates a directory such as /sys/fs/cgroup/memory/$container_id, writes the PID into the tasks file, and writes the memory limit into memory.limit_in_bytes (a minimal sketch of this step appears after the list below). It then performs the remaining initialization and finally sends the InitConfig to the init process through a pipe. Once the init process receives the config, it runs a series of initialization steps such as pivot_root, and at the end sends the parent process a signal announcing that it has finished initializing and is waiting to call syscall.Exec (i.e., run the actual command). Upon receiving that signal, if the operation is run (i.e., create + start), the parent process sends a signal back to the init process, which then performs the final syscall.Exec.
- The problem I'm hitting: the PID written into tasks is present right up until the init process calls syscall.Exec, but after syscall.Exec the cgroup directory still exists while the tasks file becomes empty, and the memory limit does not take effect.
- Relevant code:
- parent process:
https://github.com/songxinjianqwe/rune/blob/master/libcapsule/parent_process_init_impl.go
- init process:
https://github.com/songxinjianqwe/rune/blob/master/libcapsule/initializer_standard_impl.go
- I also ran this example program from your book: https://github.com/songxinjianqwe/go-practice/blob/master/os/docker/cgroup/cgroup.go
There the memory limit does take effect, and the tasks file is not empty after syscall.Exec.
My guess is that the difference comes from some of the other initialization steps my code performs.
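For context, here is a minimal sketch of the apply step described above, assuming a cgroup-v1 memory hierarchy mounted at the usual path; applyMemoryCgroup and its parameters are hypothetical names for illustration, not the actual code in my repo:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// applyMemoryCgroup mirrors the step described above: create
// /sys/fs/cgroup/memory/<containerId>, attach the given PID, and set the
// memory limit. Hypothetical helper for illustration only.
func applyMemoryCgroup(containerId string, pid int, limitInBytes int64) error {
	dir := filepath.Join("/sys/fs/cgroup/memory", containerId)
	if err := os.MkdirAll(dir, 0755); err != nil {
		return fmt.Errorf("create cgroup dir: %v", err)
	}
	// Attach the init process by writing its PID into the tasks file.
	// (This is the step that turns out to be the culprit; see the
	// follow-up at the end.)
	if err := os.WriteFile(filepath.Join(dir, "tasks"),
		[]byte(strconv.Itoa(pid)), 0644); err != nil {
		return fmt.Errorf("attach pid: %v", err)
	}
	// Set the memory limit in bytes.
	if err := os.WriteFile(filepath.Join(dir, "memory.limit_in_bytes"),
		[]byte(strconv.FormatInt(limitInBytes, 10)), 0644); err != nil {
		return fmt.Errorf("set memory limit: %v", err)
	}
	return nil
}

func main() {
	// Example: attach the current process and cap memory at 100 MB.
	// (Requires root and a cgroup-v1 memory hierarchy.)
	if err := applyMemoryCgroup("demo_container", os.Getpid(), 100*1024*1024); err != nil {
		fmt.Println("apply failed:", err)
	}
}
```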
My background is in Java development, so I'm not very familiar with Linux programming, but my internship exposed me to some cloud computing topics and got me interested; I plan to make a hand-written runC my undergraduate thesis project.
My DingTalk QR code is below; I hope we can discuss this further.

One more data point: after the stress process started (via syscall.Exec), with the tasks file empty, I manually wrote the PID into the tasks file, but the memory limit still did not take effect (even though the earlier write to memory.limit_in_bytes had succeeded).
If you'd like to reproduce this, you can use a runC-style bundle: create a directory containing a rootfs directory (the container's root filesystem) and a config.json file.
```
/mycontainer/
├── rootfs/
│   ├── bin -> usr/bin/
│   ├── dev/
│   ├── etc/
│   ├── home/
│   ├── lib -> usr/lib/
│   ├── lib64 -> usr/lib64/
│   ├── media/
│   ├── mnt/
│   ├── opt/
│   ├── proc/
│   ├── root/
│   ├── run/
│   ├── sbin -> usr/sbin/
│   ├── srv/
│   ├── sys/
│   ├── tmp/
│   ├── usr/
│   └── var/
└── config.json
```
{"ociVersion": "1.0.1-dev","process": {"user": {"uid": 0,"gid": 0},"args": ["stress","--vm-bytes","256m","--vm-keep","-m","1"],"env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin","TERM=xterm"],"cwd": "/"},"root": {"path": "rootfs","readonly": true},"hostname": "rune","mounts": [{"destination": "/proc","type": "proc","source": "proc"},{"destination": "/dev","type": "tmpfs","source": "tmpfs","options": ["nosuid","strictatime","mode=755","size=65536k"]},{"destination": "/dev/pts","type": "devpts","source": "devpts","options": ["nosuid","noexec","newinstance","ptmxmode=0666","mode=0620","gid=5"]},{"destination": "/dev/shm","type": "tmpfs","source": "shm","options": ["nosuid","noexec","nodev","mode=1777","size=65536k"]},{"destination": "/dev/mqueue","type": "mqueue","source": "mqueue","options": ["nosuid","noexec","nodev"]},{"destination": "/sys","type": "sysfs","source": "sysfs","options": ["nosuid","noexec","nodev","ro"]}],"linux": {"resources": {"devices": [{"allow": false,"access": "rwm"}],"memory": {"limit": 102400},"cpu": {"shares": 10}},"namespaces": [{"type": "pid"},{"type": "network"},{"type": "ipc"},{"type": "uts"},{"type": "mount"}]}}
Hello, while reading the runC code I noticed that the cgroup manager's Apply sometimes writes the PID into the tasks file and sometimes into the cgroup.procs file, so I tried changing my code to write into cgroup.procs instead, and the problem was solved.
The kernel's cgroup-v1 documentation explains the difference:

tasks: list of tasks (by PID) attached to that cgroup. This list is not guaranteed to be sorted. Writing a thread ID into this file moves the thread into this cgroup.

cgroup.procs: list of thread group IDs in the cgroup. This list is not guaranteed to be sorted or free of duplicate TGIDs, and userspace should sort/uniquify the list if this property is required. Writing a thread group ID into this file moves all threads in that group into this cgroup.
runC code: https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/fs/apply_raw.go
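My understanding of why this fixes it (a guess, not verified in the kernel source): the Go runtime is multi-threaded, and writing a single thread ID into tasks attaches only that one thread, so the thread that eventually calls syscall.Exec may never have been attached; writing the TGID into cgroup.procs moves every thread of the process. A minimal sketch of the corrected attach step, with attachToCgroup as a hypothetical name rather than my actual repo code:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// attachToCgroup writes the process's PID (its thread group ID) into
// cgroup.procs, which moves all threads of the process into the cgroup,
// unlike tasks, which moves only the single thread with that TID.
// Hypothetical helper for illustration only.
func attachToCgroup(cgroupDir string, pid int) error {
	return os.WriteFile(filepath.Join(cgroupDir, "cgroup.procs"),
		[]byte(strconv.Itoa(pid)), 0644)
}

func main() {
	// Example: move the current process (all of its threads) into a cgroup.
	if err := attachToCgroup("/sys/fs/cgroup/memory/demo_container", os.Getpid()); err != nil {
		fmt.Println("attach failed:", err)
	}
}
```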
Thanks for reading. In hindsight I was a bit hasty in asking; with a little more digging, most problems can be solved on one's own.
