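As we know, under the current dockerd architecture, starting a container involves three processes: containerd, containerd-shim, and the container process (that is, the container's main process). How do these three processes depend on one another? This analysis covers that question.
Note that the shell snippets below were not run in one continuous session, so process IDs may differ from one snippet to the next.
Overall relationship
First, let's look at the relationship between containerd, containerd-shim, and the container process: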
| root 2156 1733 0 13:17 pts/0 00:00:00 ./bin/containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --shim /home/fankang/docker/containerd-0.2.4/src/github.com/docker/containerd/bin/containerd-shim --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --runtime docker-runc
root 2198 2156 0 13:45 pts/0 00:00:00 /home/fankang/docker/containerd-0.2.4/src/github.com/docker/containerd/bin/containerd-shim nginx /home/fankang/mycontainer runc
root 2214 2198 0 13:45 ? 00:00:00 /usr/bin/python /usr/bin/supervisord
|
| —- |
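As shown, containerd is the parent of containerd-shim, and containerd-shim is the parent of the container process.
After the containerd process is killed, containerd-shim and the container process are still there; containerd-shim simply becomes an orphan and is adopted by process 1: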
| root 2301 1 0 13:50 pts/0 00:00:00 /home/fankang/docker/containerd-0.2.4/src/github.com/docker/containerd/bin/containerd-shim nginx /home/fankang/mycontainer runc
root 2317 2301 1 13:50 ? 00:00:00 /usr/bin/python /usr/bin/supervisord
|
| —- |
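So, to simplify the relationship among the three processes, we analyze the following four cases:
- containerd is running, and the containerd-shim process is killed;
- containerd is running, and the container process is killed;
- containerd is not running, the containerd-shim process is killed, and then containerd is started;
- containerd is not running, the container process is killed, and then containerd is started.
Case 1: containerd is running and the containerd-shim process is killed
With containerd running, containerd-shim and the container process look like this: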
| root 2414 2383 0 14:02 pts/0 00:00:00 /home/fankang/docker/containerd-0.2.4/src/github.com/docker/containerd/bin/containerd-shim nginx /home/fankang/mycontainer runc
root 2429 2414 1 14:02 ? 00:00:00 /usr/bin/python /usr/bin/supervisord
|
| —- |
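Now kill the containerd-shim process with kill -9 2414.
The result: the container process exits. In other words, with containerd running, killing containerd-shim causes the container process to exit.
So let's see why the container process exits.
As analyzed previously, creating a container calls the container's Start() method, defined in containerd/runtime/container.go: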
| func (c *container) Start(checkpointPath string, s Stdio) (Process, error) {
	// processRoot: /var/run/docker/libcontainerd/containerd/mynginx/init
	processRoot := filepath.Join(c.root, c.id, InitProcessID)
	if err := os.Mkdir(processRoot, 0755); err != nil {
		return nil, err
	}
	// Build the command; it invokes containerd-shim, for example:
	// docker-containerd-shim nginx /home/fankang/mycontainer runc
	cmd := exec.Command(c.shim,
		c.id, c.bundle, c.runtime,
	)
	cmd.Dir = processRoot
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Setpgid: true,
	}
	// Read config.json from the bundle directory
	spec, err := c.readSpec()
	if err != nil {
		return nil, err
	}
	// InitProcessID = "init"
	config := &processConfig{
		checkpoint:  checkpointPath,
		root:        processRoot,
		id:          InitProcessID,
		c:           c,
		stdio:       s,
		spec:        spec,
		processSpec: specs.ProcessSpec(spec.Process),
	}
	// Create the process
	p, err := newProcess(config)
	if err != nil {
		return nil, err
	}
	// Run cmd
	if err := c.createCmd(InitProcessID, cmd, p); err != nil {
		return nil, err
	}
	return p, nil
}
|
| —- |
Start() in turn calls createCmd() to run the command:
| func (c *container) createCmd(pid string, cmd *exec.Cmd, p *process) error {
	p.cmd = cmd
	// Run cmd
	if err := cmd.Start(); err != nil {
		close(p.cmdDoneCh)
		if exErr, ok := err.(*exec.Error); ok {
			if exErr.Err == exec.ErrNotFound || exErr.Err == os.ErrNotExist {
				return fmt.Errorf("%s not installed on system", c.shim)
			}
		}
		return err
	}
	// We need the pid file to have been written to run
	// Runs in a deferred function
	defer func() {
		// Start a goroutine that waits for the shim to exit
		go func() {
			// Wait for cmd to finish
			err := p.cmd.Wait()
			if err == nil {
				p.cmdSuccess = true
			}
			// Reached on ctr kill or when the shim is killed directly; this is the handling for shim exit.
			// The process start time read from the system is compared with the one recorded in memory to check whether it is the same process.
			// On a normal exit the process no longer exists on the system, so its start time is empty.
			// On an abnormal exit, such as kill -9 on the shim, the process still exists, so same is true.
			if same, err := p.isSameProcess(); same && p.pid > 0 {
				// The process changed its PR_SET_PDEATHSIG, so force
				// kill it
				logrus.Infof("containerd: %s:%s (pid %v) has become an orphan, killing it", p.container.id, p.id, p.pid)
				err = unix.Kill(p.pid, syscall.SIGKILL)
				if err != nil && err != syscall.ESRCH {
					logrus.Errorf("containerd: unable to SIGKILL %s:%s (pid %v): %v", p.container.id, p.id, p.pid, err)
				} else {
					for {
						err = unix.Kill(p.pid, 0)
						if err != nil {
							break
						}
						time.Sleep(5 * time.Millisecond)
					}
				}
			}
			close(p.cmdDoneCh)
		}()
	}()
	// Wait for creation to complete
	if err := c.waitForCreate(p, cmd); err != nil {
		return err
	}
	c.processes[pid] = p
	return nil
}
|
| —- |
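As shown, after starting the process, createCmd() spawns a goroutine from a deferred function. If containerd-shim exits abnormally, cmd.Wait() unblocks; if the container process still exists, the goroutine kills it with unix.Kill(p.pid, syscall.SIGKILL).
So, with containerd running, if containerd-shim is killed by hand, the container process is killed by the goroutine that containerd left behind when it created the container.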
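The polling loop in createCmd() relies on signal 0: nothing is delivered, but the kernel still reports whether the target process exists. The following is a minimal sketch of that probe on its own. The helper names processExists and waitGone are hypothetical, and the timeout is an addition that the excerpt above does not have.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

// processExists probes pid with signal 0. No signal is delivered; the
// kernel only checks whether the target exists and may be signalled.
func processExists(pid int) bool {
	err := syscall.Kill(pid, 0)
	// EPERM means the process exists but belongs to another user.
	return err == nil || err == syscall.EPERM
}

// waitGone polls until pid disappears or the timeout elapses.
// Hypothetical helper: the loop in createCmd() has no timeout.
func waitGone(pid int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for processExists(pid) {
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(5 * time.Millisecond)
	}
	return true
}

func main() {
	fmt.Println("self alive:", processExists(os.Getpid()))
	// 1<<30 is above the largest pid_max Linux allows, so it never exists.
	fmt.Println("bogus pid gone:", waitGone(1<<30, 100*time.Millisecond))
}
```

Note that a zombie still counts as existing for this probe, which is one reason a real implementation also has to reap or otherwise account for exited children.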
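Case 2: containerd is running and the container process is killed
On the one hand, when the container process exits, containerd-shim catches the signal and exits as well; this is analyzed in detail under case 4.
On the other hand, when the container process exits, the monitor in containerd catches the event and triggers the container exit flow; this is what this section analyzes in detail.
As analyzed previously, the monitor puts container exit events into its exits channel, in containerd/supervisor/monitor_linux.go: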
| func (m *Monitor) start() {
	var events [128]syscall.EpollEvent
	for {
		// EpollWait() collects the events that have already fired among those epoll is watching
		n, err := archutils.EpollWait(m.epollFd, events[:], -1)
		if err != nil {
			if err == syscall.EINTR {
				continue
			}
			logrus.WithField("error", err).Fatal("containerd: epoll wait")
		}
		// process events
		for i := 0; i < n; i++ {
			fd := int(events[i].Fd)
			m.m.Lock()
			r := m.receivers[fd]
			switch t := r.(type) {
			// Process type
			case runtime.Process:
				if events[i].Events == syscall.EPOLLHUP {
					delete(m.receivers, fd)
					if err = syscall.EpollCtl(m.epollFd, syscall.EPOLL_CTL_DEL, fd, &syscall.EpollEvent{
						Events: syscall.EPOLLHUP,
						Fd:     int32(fd),
					}); err != nil {
						logrus.WithField("error", err).Error("containerd: epoll remove fd")
					}
					if err := t.Close(); err != nil {
						logrus.WithField("error", err).Error("containerd: close process IO")
					}
					EpollFdCounter.Dec(1)
					// Put it on the exits channel
					m.exits <- t
				}
			// OOM-killed
			case runtime.OOM:
				// always flush the event fd
				t.Flush()
				if t.Removed() {
					delete(m.receivers, fd)
					// epoll will remove the fd from its set after it has been closed
					t.Close()
					EpollFdCounter.Dec(1)
				} else {
					// Put it on the ooms channel
					m.ooms <- t.ContainerID()
				}
			}
			m.m.Unlock()
		}
	}
}
|
| —- |
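The EPOLLHUP branch above can be reproduced in isolation. The sketch below is Linux-only and assumes the monitored descriptor behaves like the read end of a pipe whose writer goes away when the watched process exits; here the writer is simply closed by hand. waitHangup and isHangup are hypothetical names, not containerd functions.

```go
package main

import (
	"fmt"
	"syscall"
)

// waitHangup registers the read end of a pipe with epoll, closes the
// write end, and returns the event mask epoll reports for the read end.
func waitHangup() (uint32, error) {
	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		return 0, err
	}
	defer syscall.Close(p[0])

	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		syscall.Close(p[1])
		return 0, err
	}
	defer syscall.Close(epfd)

	ev := syscall.EpollEvent{Events: syscall.EPOLLHUP, Fd: int32(p[0])}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, p[0], &ev); err != nil {
		syscall.Close(p[1])
		return 0, err
	}

	// Closing the last writer stands in for the exiting process.
	syscall.Close(p[1])

	var events [1]syscall.EpollEvent
	for {
		n, err := syscall.EpollWait(epfd, events[:], 1000)
		if err == syscall.EINTR {
			continue // interrupted by a signal, retry as the monitor does
		}
		if err != nil {
			return 0, err
		}
		if n == 0 {
			return 0, fmt.Errorf("timed out waiting for hangup")
		}
		return events[0].Events, nil
	}
}

// isHangup reports whether the mask carries EPOLLHUP.
func isHangup(mask uint32) bool {
	return mask&syscall.EPOLLHUP != 0
}

func main() {
	mask, err := waitHangup()
	fmt.Println(isHangup(mask), err)
}
```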
When containerd's supervisor starts, it launches exitHandler(), in containerd/supervisor/supervisor.go:
| func New(stateDir string, runtimeName, shimName string, runtimeArgs []string, timeout time.Duration, retainCount int) (*Supervisor, error) {
	startTasks := make(chan *startTask, 10)
	if err := os.MkdirAll(stateDir, 0755); err != nil {
		return nil, err
	}
	machine, err := CollectMachineInformation()
	if err != nil {
		return nil, err
	}
	monitor, err := NewMonitor()
	if err != nil {
		return nil, err
	}
	s := &Supervisor{
		stateDir:    stateDir,
		containers:  make(map[string]*containerInfo),
		startTasks:  startTasks,
		machine:     machine,
		subscribers: make(map[chan Event]struct{}),
		tasks:       make(chan Task, defaultBufferSize),
		monitor:     monitor,
		runtime:     runtimeName,
		runtimeArgs: runtimeArgs,
		shim:        shimName,
		timeout:     timeout,
	}
	// Set up the event log
	if err := setupEventLog(s, retainCount); err != nil {
		return nil, err
	}
	go s.exitHandler()
	go s.oomHandler()
	if err := s.restore(); err != nil {
		return nil, err
	}
	return s, nil
}
func (s *Supervisor) exitHandler() {
	for p := range s.monitor.Exits() {
		e := &ExitTask{
			Process: p,
		}
		s.SendTask(e)
	}
}
|
| —- |
As shown, exitHandler() consumes events from the monitor's exits channel, wraps each one in an ExitTask, and sends it to the supervisor's tasks channel for further processing.
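The hand-off that exitHandler() performs is a plain channel-to-channel forwarding stage. Here is a minimal, self-contained sketch of that shape. The types proc and exitTask are hypothetical stand-ins for runtime.Process and ExitTask, and the explicit close of the tasks channel exists only so the example terminates.

```go
package main

import "fmt"

// Hypothetical stand-ins for runtime.Process and ExitTask.
type proc struct{ id string }
type exitTask struct{ p proc }

// forwardExits drains exits, wraps each process in an exitTask and
// sends it on tasks. It closes tasks once exits is closed.
func forwardExits(exits <-chan proc, tasks chan<- exitTask) {
	for p := range exits {
		tasks <- exitTask{p: p}
	}
	close(tasks)
}

// collect pushes ids through the pipeline and returns them in the
// order in which the tasks arrive.
func collect(ids []string) []string {
	exits := make(chan proc)
	tasks := make(chan exitTask, len(ids))
	go forwardExits(exits, tasks)
	go func() {
		for _, id := range ids {
			exits <- proc{id: id}
		}
		close(exits)
	}()
	var got []string
	for t := range tasks {
		got = append(got, t.p.id)
	}
	return got
}

func main() {
	fmt.Println(collect([]string{"nginx", "redis"}))
}
```

With a single producer and a single forwarding goroutine, tasks arrive in the same order as the exit events.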
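So the exit of a container process triggers containerd's exit handling for that container. The exit handling in turn calls the delete handling; we will not go into those details here.
So, with containerd running, if the container process is killed, containerd-shim exits on its own and containerd raises an exit event to clean up the container.
Case 3: containerd is not running, the containerd-shim process is killed, and then containerd is started
The container is running and containerd has been shut down; the processes look like this: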
| root 2522 1 0 15:33 pts/0 00:00:00 /home/fankang/docker/containerd-0.2.4/src/github.com/docker/containerd/bin/containerd-shim nginx /home/fankang/mycontainer runc
root 2537 2522 0 15:33 ? 00:00:00 /usr/bin/python /usr/bin/supervisord
|
| —- |
Now kill 2522 with kill -9 2522. The container process is still there; it has become an orphan and has been adopted by process 1.
| root 2537 1 0 15:33 ? 00:00:00 /usr/bin/python /usr/bin/supervisord
root 2571 2537 0 15:33 ? 00:00:00 /usr/sbin/sshd -D
|
| —- |
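The reparenting visible in the ps output can also be checked programmatically by reading the fourth field of /proc/<pid>/stat, which is the parent PID. Below is a minimal sketch; parsePPID is a hypothetical helper, not the readProcStatField() function from containerd. It splits after the last closing parenthesis because the command name in field 2 may itself contain spaces or parentheses.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parsePPID extracts the parent PID (field 4) from the contents of
// /proc/<pid>/stat. Fields 1 and 2 are "pid (comm)"; everything after
// the last ')' is space separated, starting with the state.
func parsePPID(stat string) (int, error) {
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return 0, fmt.Errorf("malformed stat line")
	}
	fields := strings.Fields(stat[end+1:])
	if len(fields) < 2 {
		return 0, fmt.Errorf("stat line too short")
	}
	// fields[0] is the state, fields[1] is the parent PID.
	return strconv.Atoi(fields[1])
}

func main() {
	data, err := os.ReadFile("/proc/self/stat")
	if err != nil {
		fmt.Println("read stat:", err)
		return
	}
	fmt.Println(parsePPID(string(data)))
}
```

A result of 1 for a container process indicates that its original parent is gone and it has been adopted by process 1.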
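Start containerd, and the container process disappears.
So on startup containerd cleans up leftover container processes (those whose containerd-shim no longer exists).
How does this cleanup work? When the supervisor starts, it calls restore(), defined in containerd/supervisor/supervisor.go: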
| func (s *Supervisor) restore() error {
	dirs, err := ioutil.ReadDir(s.stateDir)
	if err != nil {
		return err
	}
	for _, d := range dirs {
		if !d.IsDir() {
			continue
		}
		id := d.Name()
		container, err := runtime.Load(s.stateDir, id, s.shim, s.timeout)
		if err != nil {
			return err
		}
		processes, err := container.Processes()
		if err != nil {
			return err
		}
		ContainersCounter.Inc(1)
		s.containers[id] = &containerInfo{
			container: container,
		}
		if err := s.monitor.MonitorOOM(container); err != nil && err != runtime.ErrContainerExited {
			logrus.WithField("error", err).Error("containerd: notify OOM events")
		}
		logrus.WithField("id", id).Debug("containerd: container restored")
		var exitedProcesses []runtime.Process
		for _, p := range processes {
			if p.State() == runtime.Running {
				if err := s.monitorProcess(p); err != nil {
					return err
				}
			} else {
				exitedProcesses = append(exitedProcesses, p)
			}
		}
		if len(exitedProcesses) > 0 {
			// sort processes so that init is fired last because that is how the kernel sends the
			// exit events
			sortProcesses(exitedProcesses)
			for _, p := range exitedProcesses {
				e := &ExitTask{
					Process: p,
				}
				s.SendTask(e)
			}
		}
	}
	return nil
}
|
| —- |
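restore() reads each container directory under containerd's state directory and calls runtime.Load() to load the container. For every process that is not in the Running state, it raises an exit event.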
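The partition-and-sort step in restore() can be illustrated on its own. The sketch below is an assumption based only on the comment in the excerpt ("init is fired last"); the real sortProcesses() is not shown here, and procState and exitOrder are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// procState is a hypothetical, simplified view of a restored process.
type procState struct {
	ID      string
	Running bool
}

// exitOrder returns the IDs of the processes that are not running, in
// the order exit events would be sent: "init" always goes last, and the
// remaining processes keep their original relative order.
func exitOrder(procs []procState) []string {
	var exited []string
	for _, p := range procs {
		if !p.Running {
			exited = append(exited, p.ID)
		}
	}
	sort.SliceStable(exited, func(i, j int) bool {
		// Only "non-init before init" counts as strictly less.
		return exited[i] != "init" && exited[j] == "init"
	})
	return exited
}

func main() {
	procs := []procState{
		{ID: "init", Running: false},
		{ID: "exec1", Running: false},
		{ID: "exec2", Running: true},
		{ID: "exec3", Running: false},
	}
	fmt.Println(exitOrder(procs))
}
```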
So the key question is how a container is loaded. runtime.Load() is defined in containerd/runtime/container.go:
| // Load returns a new container from the matching state file on disk.
func Load(root, id, shimName string, timeout time.Duration) (Container, error) {
	var s state
	// StateFile = "state.json"
	f, err := os.Open(filepath.Join(root, id, StateFile))
	if err != nil {
		return nil, err
	}
	defer f.Close()
	if err := json.NewDecoder(f).Decode(&s); err != nil {
		return nil, err
	}
	c := &container{
		root:        root,
		id:          id,
		bundle:      s.Bundle,
		labels:      s.Labels,
		runtime:     s.Runtime,
		runtimeArgs: s.RuntimeArgs,
		shim:        s.Shim,
		noPivotRoot: s.NoPivotRoot,
		processes:   make(map[string]*process),
		timeout:     timeout,
	}
	if c.shim == "" {
		c.shim = shimName
	}
	dirs, err := ioutil.ReadDir(filepath.Join(root, id))
	if err != nil {
		return nil, err
	}
	// Each directory represents one process
	for _, d := range dirs {
		if !d.IsDir() {
			continue
		}
		pid := d.Name()
		s, err := readProcessState(filepath.Join(root, id, pid))
		if err != nil {
			return nil, err
		}
		p, err := loadProcess(filepath.Join(root, id, pid), pid, c, s)
		if err != nil {
			logrus.WithField("id", id).WithField("pid", pid).Debug("containerd: error loading process %s", err)
			continue
		}
		c.processes[pid] = p
	}
	return c, nil
}
|
| —- |
Load() loads each process under the container directory through loadProcess(), defined in containerd/runtime/process.go:
| // Restore the process from process.json
func loadProcess(root, id string, c *container, s *ProcessState) (*process, error) {
	p := &process{
		root:      root,
		id:        id,
		container: c,
		spec:      s.ProcessSpec,
		stdio: Stdio{
			Stdin:  s.Stdin,
			Stdout: s.Stdout,
			Stderr: s.Stderr,
		},
		state: Stopped,
	}
	startTime, err := ioutil.ReadFile(filepath.Join(p.root, StartTimeFile))
	if err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	p.startTime = string(startTime)
	if _, err := p.getPidFromFile(); err != nil {
		return nil, err
	}
	// The ExitStatus() call here reaches p.updateExitStatusFile(128 + uint32(syscall.SIGKILL)) in handleSigkilledShim(),
	// which writes data into exit.
	// When ExitStatus() is later called from exit.go, the data in exit can be read back.
	if _, err := p.ExitStatus(); err != nil {
		if err == ErrProcessNotExited {
			exit, err := getExitPipe(filepath.Join(root, ExitFile))
			if err != nil {
				return nil, err
			}
			p.exitPipe = exit
			control, err := getControlPipe(filepath.Join(root, ControlFile))
			if err != nil {
				return nil, err
			}
			p.controlPipe = control
			p.state = Running
			return p, nil
		}
		return nil, err
	}
	return p, nil
}
|
| —- |
The most important call in loadProcess() is p.ExitStatus(): if it returns ErrProcessNotExited, the process state is set to Running. So let's look at ExitStatus():
| // Use the exit status file to determine whether the process has exited
func (p *process) ExitStatus() (rst uint32, rerr error) {
data, err := ioutil.ReadFile(filepath.Join(p.root, ExitStatusFile))
defer func() {
if rerr != nil {
rst, rerr = p.handleSigkilledShim(rst, rerr)
}
}()
if err != nil {
if os.IsNotExist(err) {
return UnknownStatus, ErrProcessNotExited
}
return UnknownStatus, err
}
if len(data) == 0 {
return UnknownStatus, ErrProcessNotExited
}
p.stateLock.Lock()
p.state = Stopped
p.stateLock.Unlock()
i, err := strconv.ParseUint(string(data), 10, 32)
return uint32(i), err
}
|
| —- |
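One detail of ExitStatus() is worth isolating: it declares named return values, so the deferred function can inspect the error that the main body is about to return and replace both results. The following is a minimal sketch of that Go idiom. exitStatus, recoverStatus and errNotExited are hypothetical names, and recoverStatus is a much simplified stand-in for handleSigkilledShim().

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

var errNotExited = errors.New("process not exited")

// recoverStatus is a simplified stand-in for handleSigkilledShim():
// it turns the "not exited" error into a concrete status.
func recoverStatus(status uint32, err error) (uint32, error) {
	if errors.Is(err, errNotExited) {
		return 137, nil // 128 + SIGKILL (9)
	}
	return status, err
}

// exitStatus uses named results so that the deferred function can see
// the error chosen by the main body and overwrite both results.
func exitStatus(data string) (rst uint32, rerr error) {
	defer func() {
		if rerr != nil {
			rst, rerr = recoverStatus(rst, rerr)
		}
	}()
	if data == "" {
		return 255, errNotExited
	}
	n, err := strconv.ParseUint(data, 10, 32)
	return uint32(n), err
}

func main() {
	fmt.Println(exitStatus(""))  // error rewritten by the deferred function
	fmt.Println(exitStatus("0")) // no error, deferred function does nothing
}
```

The return statement assigns rst and rerr first; the deferred function runs afterwards and its assignments are what the caller finally sees.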
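ExitStatus() reads the exit status file (ExitStatusFile) in the process directory. At this point it holds no data, so the call fails. ExitStatus() is unusual in that it uses named return values: rerr first receives the error from the main body of ExitStatus(); the deferred function then hands rerr to handleSigkilledShim(), and the error that handleSigkilledShim() returns becomes the final rerr. Control now moves to handleSigkilledShim():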
| func (p *process) handleSigkilledShim(rst uint32, rerr error) (uint32, error) {
	if p.cmd == nil || p.cmd.Process == nil {
		// Send signal 0 to the container process
		e := unix.Kill(p.pid, 0)
		// On the second run the container process no longer exists; ESRCH means the process or process group given by pid does not exist
		if e == syscall.ESRCH {
			logrus.Warnf("containerd: %s:%s (pid %d) does not exist", p.container.id, p.id, p.pid)
			// The process died while containerd was down (probably of
			// SIGKILL, but no way to be sure)
			return p.updateExitStatusFile(UnknownStatus)
		}
		// If it's not the same process, just mark it stopped and set
		// the status to the UnknownStatus value (i.e. 255)
		if same, err := p.isSameProcess(); !same {
			logrus.Warnf("containerd: %s:%s (pid %d) is not the same process anymore (%v)", p.container.id, p.id, p.pid, err)
			// Create the file so we get the exit event generated once monitor kicks in
			// without having to go through all this process again
			return p.updateExitStatusFile(UnknownStatus)
		}
		ppid, err := readProcStatField(p.pid, 4)
		if err != nil {
			return rst, fmt.Errorf("could not check process ppid: %v (%v)", err, rerr)
		}
		// The container process's parent is process 1, which means its guardian shim exited unexpectedly
		if ppid == "1" {
			logrus.Warnf("containerd: %s:%s shim died, killing associated process", p.container.id, p.id)
			// This is where the container process is actually killed
			unix.Kill(p.pid, syscall.SIGKILL)
			if err != nil && err != syscall.ESRCH {
				return UnknownStatus, fmt.Errorf("containerd: unable to SIGKILL %s:%s (pid %v): %v", p.container.id, p.id, p.pid, err)
			}
			// wait for the process to die
			for {
				e := unix.Kill(p.pid, 0)
				if e == syscall.ESRCH {
					break
				}
				time.Sleep(5 * time.Millisecond)
			}
			// Create the file so we get the exit event generated once monitor kicks in
			// without having to go through all this process again
			return p.updateExitStatusFile(128 + uint32(syscall.SIGKILL))
		}
		return rst, rerr
	}
	// Possible that the shim was SIGKILLED
	e := unix.Kill(p.cmd.Process.Pid, 0)
	if e != syscall.ESRCH {
		return rst, rerr
	}
	// Ensure we got the shim ProcessState
	<-p.cmdDoneCh
	shimStatus := p.cmd.ProcessState.Sys().(syscall.WaitStatus)
	if shimStatus.Signaled() && shimStatus.Signal() == syscall.SIGKILL {
		logrus.Debugf("containerd: ExitStatus(container: %s, process: %s): shim was SIGKILL'ed reaping its child with pid %d", p.container.id, p.id, p.pid)
		rerr = nil
		rst = 128 + uint32(shimStatus.Signal())
		p.stateLock.Lock()
		p.state = Stopped
		p.stateLock.Unlock()
	}
	return rst, rerr
}
|
| —- |
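The branch if p.cmd == nil || p.cmd.Process == nil in handleSigkilledShim() works as follows:
- if the container process does not exist, return;
- if the container process has changed (it is no longer the same process), leave it to the monitor and return;
- if the container process's parent is process 1, the shim has exited: kill the container process, call updateExitStatusFile() to write the status to exit, and return;
- otherwise, return.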
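The four steps above amount to a small decision table. The sketch below restates them as a pure function so that the order of the checks is easy to see. The action type and the decide function are hypothetical; the real code interleaves these checks with system calls and file writes.

```go
package main

import "fmt"

// action is a hypothetical label for what the branch ends up doing.
type action int

const (
	markUnknown action = iota // steps 1 and 2: record UnknownStatus and return
	killOrphan                // step 3: shim is gone, SIGKILL the process
	keepRunning               // step 4: nothing to do, original error is returned
)

// decide mirrors the order of the checks: existence first, then
// identity, then the parent PID.
func decide(exists, same bool, ppid int) action {
	switch {
	case !exists:
		return markUnknown
	case !same:
		return markUnknown
	case ppid == 1:
		return killOrphan
	default:
		return keepRunning
	}
}

func main() {
	// A live, unchanged container process whose parent is process 1.
	fmt.Println(decide(true, true, 1) == killOrphan)
}
```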
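Following our flow, handleSigkilledShim() reaches step 3. Because rerr in ExitStatus() takes the value returned by handleSigkilledShim(), rerr is nil, so the process state is not Running.
The supervisor's restore() therefore performs the exit operation on the container.
The exit operation also calls ExitStatus(), but this time exit has content; it also enters handleSigkilledShim(), but returns at step 1 because the container process was already killed earlier in the flow.
If both containerd-shim and the container process exist for a container, handleSigkilledShim() returns from step 4.
Case 4: containerd is not running, the container process is killed, and then containerd is started
When the container process is killed, containerd-shim exits on its own. containerd then performs the exit operation on the container in restore().
Here is a demo that shows how a Go program starts processes with the exec package: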
| package main
import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)
func main() {
	signals := make(chan os.Signal, 2048)
	signal.Notify(signals)
	cmd1 := exec.Command("/bin/sh", "-c", "sleep 50")
	cmd1.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	cmd1.Start()
	cmd2 := exec.Command("/bin/sh", "-c", "sleep 50")
	cmd2.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	cmd2.Start()
	select {
	case <-signals:
		syscall.Kill(-cmd1.Process.Pid, syscall.SIGKILL)
		syscall.Kill(-cmd2.Process.Pid, syscall.SIGKILL)
	}
}
|
| —- |
Compiling and running it gives:
| root 5838 1733 0 17:16 pts/0 00:00:00 ./test
root 5843 5838 0 17:16 pts/0 00:00:00 /bin/sh -c sleep 50
root 5844 5838 0 17:16 pts/0 00:00:00 /bin/sh -c sleep 50
|
| —- |
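After running kill 5843, none of these processes remain. signal.Notify() is called without a signal list, so every incoming signal is relayed to the channel, including the SIGCHLD raised when a child exits; the select then fires and the program kills both process groups before exiting.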