
Overview
Flannel was originally developed by CoreOS. As the CNI concept took off, flannel became one of the first standard network plugins to implement CNI (the CNI standard itself was also proposed by CoreOS).
- At startup, flanneld reads its configuration from etcd and requests a subnet lease, valid for 24 hours, and it watches etcd for data updates. Once flanneld has obtained the subnet lease and configured its backend, it writes this information to the file /run/flannel/subnet.env.
- Flanneld watches etcd for data changes; whenever there is an update, it adds entries to the local routing table.
Note that the MTU here is not the 1500 bytes standard Ethernet specifies, because the outer VXLAN encapsulation takes up another 50 bytes.
~]# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.243.0.0/16
FLANNEL_SUBNET=10.243.32.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
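The 1450 value can be checked arithmetically: the standard 1500-byte Ethernet MTU minus the 50 bytes of outer headers VXLAN adds in the IPv4 case. A minimal sketch of the arithmetic:

```go
package main

import "fmt"

// Header sizes the outer VXLAN encapsulation adds on top of the inner frame.
const (
	outerIPv4     = 20 // outer IPv4 header
	outerUDP      = 8  // outer UDP header
	vxlanHdr      = 8  // VXLAN header (carries the VNI)
	innerEther    = 14 // inner Ethernet header carried inside the tunnel
	vxlanOverhead = outerIPv4 + outerUDP + vxlanHdr + innerEther // 50 bytes
)

func main() {
	fmt.Println(1500 - vxlanOverhead) // 1450, matching FLANNEL_MTU above
}
```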
Common parameters
- --ip-masq=true
This flag makes flannel, rather than Docker, perform IP masquerading. The reason: if Docker did the masquerading and the traffic then left through flannel, the source IP seen on other hosts would be the flannel gateway IP instead of the Docker container's IP.
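To illustrate what this means in practice: with --ip-masq=true, flannel installs NAT rules so that only traffic leaving the overlay network is masqueraded, while pod-to-pod traffic inside FLANNEL_NETWORK keeps its original source IP. The rule strings below are a sketch of this idea, not flannel's exact rule set:

```go
package main

import "fmt"

// masqRules sketches (illustratively) the kind of POSTROUTING rules
// --ip-masq=true results in: traffic staying inside the flannel network is
// left untouched, traffic leaving it is SNAT'ed to the node's address.
func masqRules(flannelNetwork string) []string {
	return []string{
		// don't masquerade traffic that stays inside the overlay
		fmt.Sprintf("-t nat -A POSTROUTING -s %s -d %s -j RETURN", flannelNetwork, flannelNetwork),
		// masquerade traffic that leaves the overlay
		fmt.Sprintf("-t nat -A POSTROUTING -s %s ! -d %s -j MASQUERADE", flannelNetwork, flannelNetwork),
	}
}

func main() {
	for _, r := range masqRules("10.243.0.0/16") {
		fmt.Println(r)
	}
}
```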
The addresses of flannel.1 and docker0 on Node1
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether 26:93:3d:21:b1:95 brd ff:ff:ff:ff:ff:ff
    inet 10.243.32.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::2493:3dff:fe21:b195/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:5a:d9:50:1c brd ff:ff:ff:ff:ff:ff
    inet 10.243.32.1/24 brd 10.243.32.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:5aff:fed9:501c/64 scope link
       valid_lft forever preferred_lft forever
Pod1's routes
~]# kubectl exec -it -n dev tst-864bddb8bd-6p56g -- ip route
default via 10.243.32.1 dev eth0
10.243.32.0/24 dev eth0 proto kernel scope link src 10.243.32.28
Outbound packets from Pod1 take the default route, whose gateway 10.243.32.1 is the address of the docker0 bridge.
Node1's routes
How Flannel integrates with Docker
The two files below show that Docker's systemd service file loads /run/flannel/docker at startup; that file defines the $DOCKER_NETWORK_OPTIONS variable, so dockerd ends up being started with an explicit --bip flag.
For example, --bip=10.243.32.1/24 restricts the IP range containers on this node can be assigned, ensuring that Docker on each node uses a different subnet.
The docker service file looks like this:
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
EnvironmentFile=-/run/flannel/docker
WorkingDirectory=/usr/local/bin
ExecStart=/usr/bin/dockerd --exec-opt native.cgroupdriver=systemd -H unix:///var/run/docker.sock -H tcp://0.0.0.0:2376 $DOCKER_NETWORK_OPTIONS
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=102400
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
#TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s
[Install]
WantedBy=multi-user.target
The contents of /run/flannel/docker are as follows:
~]# cat /run/flannel/docker
DOCKER_OPT_BIP="--bip=10.243.32.1/24"
DOCKER_OPT_IPMASQ="--ip-masq=false"
DOCKER_OPT_MTU="--mtu=1450"
DOCKER_NETWORK_OPTIONS=" --bip=10.243.32.1/24 --ip-masq=false --mtu=1450"
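flannel ships a small helper script (mk-docker-opts.sh) that generates this file from /run/flannel/subnet.env. A rough Go sketch of the same transformation (simplified, no file I/O or error handling; the function name is ours):

```go
package main

import (
	"fmt"
	"strings"
)

// dockerOpts sketches what flannel's mk-docker-opts.sh does: turn the
// key=value pairs of /run/flannel/subnet.env into Docker daemon flags.
func dockerOpts(subnetEnv string) string {
	env := map[string]string{}
	for _, line := range strings.Split(strings.TrimSpace(subnetEnv), "\n") {
		if k, v, ok := strings.Cut(line, "="); ok {
			env[k] = v
		}
	}
	// When flannel already masquerades (FLANNEL_IPMASQ=true), Docker must
	// NOT masquerade as well, hence the inversion here.
	dockerMasq := "true"
	if env["FLANNEL_IPMASQ"] == "true" {
		dockerMasq = "false"
	}
	return fmt.Sprintf("--bip=%s --ip-masq=%s --mtu=%s",
		env["FLANNEL_SUBNET"], dockerMasq, env["FLANNEL_MTU"])
}

func main() {
	subnetEnv := `FLANNEL_NETWORK=10.243.0.0/16
FLANNEL_SUBNET=10.243.32.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true`
	fmt.Println(dockerOpts(subnetEnv))
	// --bip=10.243.32.1/24 --ip-masq=false --mtu=1450
}
```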
Checking the flags dockerd was actually started with:
~]# ps -ef | grep dockerd
root 2054 1 5 Jun02 ? 8-11:44:58 /usr/bin/dockerd --exec-opt native.cgroupdriver=systemd -H unix:///var/run/docker.sock -H tcp://0.0.0.0:2376 --bip=10.243.32.1/24 --ip-masq=false --mtu=1450
Flannel then creates a VXLAN device named flannel.1 on each node:
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether 26:93:3d:21:b1:95 brd ff:ff:ff:ff:ff:ff promiscuity 0
vxlan id 1 local 10.143.143.52 dev eth0 srcport 0 0 dstport 8472 nolearning ageing 300 noudpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 g
The IP address of the flannel.1 interface is as follows:
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.243.32.0 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::2493:3dff:fe21:b195 prefixlen 64 scopeid 0x20<link>
ether 26:93:3d:21:b1:95 txqueuelen 0 (Ethernet)
RX packets 64343908 bytes 124967330413 (116.3 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 38028400 bytes 22204216947 (20.6 GiB)
TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0
flannel.1 is a VXLAN device; the Linux kernel recognizes it as such and performs VXLAN encapsulation on the packets sent to it. During encapsulation the kernel needs to know which node the packet should ultimately be sent to. It consults the node's FDB (forwarding database) to find the node address of the remote VTEP device, whose MAC address (d6:51:2e:80:5c:69) it has already resolved via the ARP table. If the FDB has no such entry, the kernel raises an "L2 miss" event to the userspace flanneld process. On receiving this event, flanneld queries etcd for the "Public IP" of the node owning that VTEP and registers it in the FDB.
What is the FDB table?
FDB (Forwarding DataBase): the Linux bridge maintains a layer-2 forwarding table (also called the MAC learning table, forwarding database, or simply FDB) that records which MAC address is reachable through which port.
~]# bridge fdb show dev flannel.1
8a:04:c4:9a:0c:9f dst 10.143.143.150 self permanent
f6:7d:27:8c:d3:ed dst 10.143.143.84 self permanent
fa:25:00:77:1c:9d dst 10.143.143.111 self permanent
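Putting the route, ARP table, and FDB together, the kernel's encapsulation decision can be modeled as two map lookups. A toy sketch (the table entries mirror the outputs above, but the specific IP-to-MAC pairing is illustrative):

```go
package main

import "fmt"

// ARP table on flannel.1: remote subnet gateway IP -> remote VTEP MAC.
// FDB on flannel.1: remote VTEP MAC -> remote node (outer destination) IP.
var (
	arp = map[string]string{"10.243.33.0": "f6:7d:27:8c:d3:ed"}
	fdb = map[string]string{"f6:7d:27:8c:d3:ed": "10.143.143.84"}
)

// encapTarget models the kernel's two lookups when encapsulating a packet
// whose route points at a flannel.1 gateway IP: ARP yields the remote VTEP
// MAC, and the FDB yields the node IP used as the outer destination. A miss
// in either table is what triggers the upcall to flanneld.
func encapTarget(gw string) (mac, nodeIP string, ok bool) {
	mac, ok = arp[gw]
	if !ok {
		return "", "", false // L3 miss: flanneld must fill in the ARP entry
	}
	nodeIP, ok = fdb[mac]
	if !ok {
		return mac, "", false // L2 miss: flanneld must fill in the FDB entry
	}
	return mac, nodeIP, true
}

func main() {
	mac, node, _ := encapTarget("10.243.33.0")
	fmt.Printf("inner dst MAC %s, outer dst IP %s (VXLAN over UDP 8472)\n", mac, node)
}
```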
Source code walkthrough
The Main function

Flannel's main function performs the following major steps:
- newSubnetManager — creates the SubnetManager, which is mainly responsible for managing subnets.
- NewManager — creates the backend Manager.
- RegisterNetwork — registers the subnet/network through the chosen Backend.
- WriteSubnetFile — writes the subnet file (/run/flannel/subnet.env).
- RunBackend — runs the backend.
GetBackend determines the backend type, i.e. whether it is VXLAN, UDP, host-gw, etc., after which RegisterNetwork proceeds.
The newSubnetManager code in flannel/main.go is as follows:
The default datastore is etcd; if flanneld is started with "-kube-subnet-mgr", it stores its data via the Kubernetes API instead.
func newSubnetManager() (subnet.Manager, error) {
    if opts.kubeSubnetMgr {
        return kube.NewSubnetManager(opts.kubeApiUrl, opts.kubeConfigFile, opts.kubeAnnotationPrefix, opts.netConfPath)
    }

    cfg := &etcdv2.EtcdConfig{
        Endpoints: strings.Split(opts.etcdEndpoints, ","),
        Keyfile:   opts.etcdKeyfile,
        Certfile:  opts.etcdCertfile,
        CAFile:    opts.etcdCAFile,
        Prefix:    opts.etcdPrefix,
        Username:  opts.etcdUsername,
        Password:  opts.etcdPassword,
    }

    // Attempt to renew the lease for the subnet specified in the subnetFile
    prevSubnet := ReadCIDRFromSubnetFile(opts.subnetFile, "FLANNEL_SUBNET")

    return etcdv2.NewLocalManager(cfg, prevSubnet)
}
Key steps in the Main function
NewManager
NewManager simply returns a manager object:
type manager struct {
    ctx      context.Context
    sm       subnet.Manager
    extIface *ExternalInterface
    mux      sync.Mutex
    active   map[string]Backend
    wg       sync.WaitGroup
}

// NewManager just returns a manager object
func NewManager(ctx context.Context, sm subnet.Manager, extIface *ExternalInterface) Manager {
    return &manager{
        ctx:      ctx,
        sm:       sm,
        extIface: extIface,
        active:   make(map[string]Backend),
    }
}
Via GetBackend the backend type is obtained, and RegisterNetwork is then called.
WriteSubnetFile
func WriteSubnetFile(path string, nw ip.IP4Net, ipMasq bool, bn backend.Network) error {
    dir, name := filepath.Split(path)
    os.MkdirAll(dir, 0755)

    tempFile := filepath.Join(dir, "."+name)
    f, err := os.Create(tempFile)
    if err != nil {
        return err
    }

    // Write out the first usable IP by incrementing
    // sn.IP by one
    sn := bn.Lease().Subnet
    sn.IP += 1

    fmt.Fprintf(f, "FLANNEL_NETWORK=%s\n", nw)
    fmt.Fprintf(f, "FLANNEL_SUBNET=%s\n", sn)
    fmt.Fprintf(f, "FLANNEL_MTU=%d\n", bn.MTU())
    _, err = fmt.Fprintf(f, "FLANNEL_IPMASQ=%v\n", ipMasq)
    f.Close()
    if err != nil {
        return err
    }

    // rename(2) the temporary file to the desired location so that it becomes
    // atomically visible with the contents
    return os.Rename(tempFile, path)
    //TODO - is this safe? What if it's not on the same FS?
}
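The sn.IP += 1 step above computes the first usable address of the lease, which is what ends up as Docker's --bip. A sketch of the same computation using only the standard net package (the helper name is ours; simplified for a /24 lease like flannel's default):

```go
package main

import (
	"fmt"
	"net"
)

// firstUsable mimics WriteSubnetFile's sn.IP += 1: given a node's lease
// subnet, return its first usable address with the same prefix length.
func firstUsable(cidr string) string {
	ip, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		panic(err)
	}
	ip4 := ip.To4()
	ip4[3]++ // increment the last octet; sufficient for a /24 lease
	ones, _ := ipnet.Mask.Size()
	return fmt.Sprintf("%s/%d", ip4, ones)
}

func main() {
	fmt.Println(firstUsable("10.243.32.0/24")) // 10.243.32.1/24
}
```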
The VXLAN network: HandleSubnet
switch event.Type {
case subnet.EventAdded:
    if directRoutingOK {
        log.V(2).Infof("Adding direct route to subnet: %s PublicIP: %s", sn, attrs.PublicIP)

        if err := netlink.RouteReplace(&directRoute); err != nil {
            log.Errorf("Error adding route to %v via %v: %v", sn, attrs.PublicIP, err)
            continue
        }
    } else {
        log.V(2).Infof("adding subnet: %s PublicIP: %s VtepMAC: %s", sn, attrs.PublicIP, net.HardwareAddr(vxlanAttrs.VtepMAC))

        // add the ARP entry
        if err := nw.dev.AddARP(neighbor{IP: sn.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
            log.Error("AddARP failed: ", err)
            continue
        }

        // add the FDB entry
        if err := nw.dev.AddFDB(neighbor{IP: attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
            log.Error("AddFDB failed: ", err)

            // Try to clean up the ARP entry then continue
            if err := nw.dev.DelARP(neighbor{IP: event.Lease.Subnet.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                log.Error("DelARP failed: ", err)
            }

            continue
        }

        // Set the route - the kernel would ARP for the Gw IP address if it hadn't already been set above so make sure
        // this is done last.
        if err := netlink.RouteReplace(&vxlanRoute); err != nil {
            log.Errorf("failed to add vxlanRoute (%s -> %s): %v", vxlanRoute.Dst, vxlanRoute.Gw, err)

            // Try to clean up both the ARP and FDB entries then continue
            if err := nw.dev.DelARP(neighbor{IP: event.Lease.Subnet.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                log.Error("DelARP failed: ", err)
            }

            if err := nw.dev.DelFDB(neighbor{IP: event.Lease.Attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
                log.Error("DelFDB failed: ", err)
            }

            continue
        }
    }

// removal event
case subnet.EventRemoved:
    if directRoutingOK {
        log.V(2).Infof("Removing direct route to subnet: %s PublicIP: %s", sn, attrs.PublicIP)
        if err := netlink.RouteDel(&directRoute); err != nil {
            log.Errorf("Error deleting route to %v via %v: %v", sn, attrs.PublicIP, err)
        }
    } else {
        log.V(2).Infof("removing subnet: %s PublicIP: %s VtepMAC: %s", sn, attrs.PublicIP, net.HardwareAddr(vxlanAttrs.VtepMAC))

        // Try to remove all entries - don't bail out if one of them fails.

        // delete the ARP entry
        if err := nw.dev.DelARP(neighbor{IP: sn.IP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
            log.Error("DelARP failed: ", err)
        }

        // delete the FDB entry
        if err := nw.dev.DelFDB(neighbor{IP: attrs.PublicIP, MAC: net.HardwareAddr(vxlanAttrs.VtepMAC)}); err != nil {
            log.Error("DelFDB failed: ", err)
        }

        // delete the route
        if err := netlink.RouteDel(&vxlanRoute); err != nil {
            log.Errorf("failed to delete vxlanRoute (%s -> %s): %v", vxlanRoute.Dst, vxlanRoute.Gw, err)
        }
    }

default:
    log.Error("internal error: unknown event type: ", int(event.Type))
}
Problems we ran into
Exposing Pod IPs outside the cluster
While migrating services onto the container platform, service providers deployed on Kubernetes registered their Pod IPs with the service registry. When a consumer looked up a provider's IP from the registry, however, that IP turned out to be unreachable, which seriously hindered teams from moving onto Kubernetes.
Solution:
On the layer-3 switches we added aggregate routes for the Flannel Pod CIDRs, pointing each Pod subnet's next hop at the corresponding host IP. This puts Pods on the same plane as physical machines and VMs, making Pod IPs reachable from both.
Cross-datacenter connectivity
To connect clusters across datacenters, the Flannel daemons of both clusters talk to the same etcd cluster, which keeps the network configuration consistent. Older Flannel versions had many problems here, including an excessive number of route entries and expiring ARP cache entries. We recommend switching to per-subnet routes and making the ARP entries permanent, so that an etcd outage or similar failure cannot take down the cluster network.
Adjusting the lease
Flannel also needs some configuration tuning. By default it renews its etcd lease daily, and if renewal fails the subnet entry is removed from etcd. To keep the subnets from changing, we set the TTL of the etcd subnet keys to 0 (never expire), with a command like this:
~]# etcdctl --endpoint=https://10.100.139.246:2379 \
    set --ttl 0 /coreos.com/network/subnets/10.254.50.0-24 \
    $(etcdctl --endpoint=https://10.100.139.246:2379 get /coreos.com/network/subnets/10.254.50.0-24)
