Background

We run a GPU cluster and a CPU cluster, and we want Pod IPs in the two clusters to be directly reachable from each other. The GPU cluster spans two network segments. The previous approach was to configure routing rules on the CPU cluster's gateway nodes, routing to the GPU machines on the same host network. Maintaining those routes by hand is cumbersome: because the GPU machines span two segments, the route configuration is concentrated on part of the machines and awkward to manage.

With ClusterMesh, the two cluster networks are interconnected directly: each node automatically adds and updates its local routing rules as the peer cluster's network segments change in etcd.

Configuring Cluster Mesh

ClusterMesh settings on the Cilium DaemonSet

Confirm that the following configuration is present on the Cilium DaemonSet. (In 1.2.5 it is on the DaemonSet; in 1.8/1.9 it moved to the cilium-operator Deployment.)

```yaml
- name: CILIUM_CLUSTERMESH_CONFIG
  value: "/var/lib/cilium/clustermesh/"
- name: CILIUM_CLUSTER_NAME
  valueFrom:
    configMapKeyRef:
      key: cluster-name
      name: cilium-config
      optional: true
- name: CILIUM_CLUSTER_ID
  valueFrom:
    configMapKeyRef:
      key: cluster-id
      name: cilium-config
      optional: true
volumeMounts:
- name: clustermesh-secrets
  mountPath: /var/lib/cilium/clustermesh
  readOnly: true
volumes:
# ....
- name: clustermesh-secrets
  secret:
    defaultMode: 420
    optional: true
    secretName: cilium-clustermesh
```
  • CILIUM_CLUSTERMESH_CONFIG: mount path of the etcd certificates
  • CILIUM_CLUSTER_ID: mesh ID; must be unique for every cluster in the mesh

    Note:

    1. Do not change the Cluster ID casually. Changing it breaks access to existing workload pods, and connectivity only recovers after a restart. Treat it with care.
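Because the cluster ID must stay fixed and unique across the mesh, it can help to sanity-check planned names and IDs before rollout. A minimal sketch; the helper name and the input shape are assumptions for illustration, not part of Cilium:

```python
def validate_cluster_ids(clusters):
    """Check that every cluster has a cluster-id in 1..255 and that
    all IDs are unique across the mesh. `clusters` maps
    cluster-name -> cluster-id, mirroring the cluster-name /
    cluster-id keys in each cluster's cilium-config ConfigMap."""
    for name, cid in clusters.items():
        if not 1 <= cid <= 255:
            raise ValueError(f"{name}: cluster-id {cid} outside 1..255")
    ids = list(clusters.values())
    if len(set(ids)) != len(ids):
        raise ValueError("cluster-id values must be unique across the mesh")

# Example: two clusters that can safely join the same mesh.
validate_cluster_ids({"cluster1": 1, "cluster2": 2})
```

Running this once per planned mesh catches ID collisions before any agent has to be restarted.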

The Cilium ConfigMap must contain:

```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # .....
  # add the two entries below
  cluster-name: "cluster<id>"   # name, must be unique across the mesh
  cluster-id: "<id>"            # id: 1 ~ 255, unique across the mesh
```

Configuring the Secret files

Each cluster needs four files from its peer. Take interconnecting cluster1 and cluster2 as an example: the names cluster1 and cluster2 must match the cluster-name values exactly, otherwise the interconnection fails.

The cluster description file records how to reach that cluster's etcd and the paths of its TLS files. For cluster1:

```yaml
endpoints:
- https://172.xx.xx.xx:2379   # use the actual etcd IP
ca-file: '/var/lib/cilium/clustermesh/cluster1.etcd-client-ca.crt'
key-file: '/var/lib/cilium/clustermesh/cluster1.etcd-client.key'
cert-file: '/var/lib/cilium/clustermesh/cluster1.etcd-client.crt'
```

Here cluster1 is the description file, and cluster1.etcd-client.key, cluster1.etcd-client.crt, and cluster1.etcd-client-ca.crt are the three files used to connect to cluster1's etcd.

Note: the etcd description file and the TLS files must follow this naming rule:

  • description file: <cluster-name>
  • TLS files: <cluster-name>.etcd-client.key, <cluster-name>.etcd-client.crt, <cluster-name>.etcd-client-ca.crt
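The naming rule can be captured in a small helper so the file set for a peer can be generated or checked programmatically. A sketch; the function name is illustrative:

```python
def clustermesh_files(cluster_name):
    """Return the four file names required for one peer cluster:
    the description file itself plus the three etcd TLS files,
    following the <cluster-name>.etcd-client* naming rule."""
    return [
        cluster_name,  # etcd endpoint / TLS description file
        f"{cluster_name}.etcd-client.key",
        f"{cluster_name}.etcd-client.crt",
        f"{cluster_name}.etcd-client-ca.crt",
    ]

print(clustermesh_files("cluster1"))
# → ['cluster1', 'cluster1.etcd-client.key', 'cluster1.etcd-client.crt', 'cluster1.etcd-client-ca.crt']
```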

Load the files into cluster2 as a Secret:

```shell
kubectl create secret generic cilium-clustermesh -n kube-system \
    --from-file=./cluster1 \
    --from-file=./cluster1.etcd-client-ca.crt \
    --from-file=./cluster1.etcd-client.key \
    --from-file=./cluster1.etcd-client.crt
```

Note: cluster1's files are configured on cluster2, and likewise cluster2's files on cluster1. Only after both sides are configured can the clusters reach each other.
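The description file that goes into the Secret can also be rendered from the naming rule. A sketch that emits the same layout as the example above; the function name and the endpoint used in the demo are placeholders:

```python
def render_peer_config(cluster_name, endpoints):
    """Render the per-peer description file that Cilium reads from
    /var/lib/cilium/clustermesh/<cluster-name>. TLS paths follow the
    <cluster-name>.etcd-client* naming rule."""
    base = f"/var/lib/cilium/clustermesh/{cluster_name}"
    lines = ["endpoints:"]
    lines += [f"- {ep}" for ep in endpoints]
    lines += [
        f"ca-file: '{base}.etcd-client-ca.crt'",
        f"key-file: '{base}.etcd-client.key'",
        f"cert-file: '{base}.etcd-client.crt'",
    ]
    return "\n".join(lines) + "\n"

# Placeholder endpoint; use the peer's real etcd address.
print(render_peer_config("cluster1", ["https://10.0.0.1:2379"]))
```

Generating the file this way keeps the paths consistent with the file names fed to `kubectl create secret`.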

After mounting, it looks like this:

```shell
(cluster2 cilium container xx) $ ls /var/lib/cilium/clustermesh/
cluster1
cluster1.etcd-client-ca.crt
cluster1.etcd-client.crt
cluster1.etcd-client.key
(cluster2 node1) $ cat /var/lib/cilium/clustermesh/cluster1
endpoints:
- https://172.xx.xx.xx:2379   # use the actual etcd IP
ca-file: '/var/lib/cilium/clustermesh/cluster1.etcd-client-ca.crt'
key-file: '/var/lib/cilium/clustermesh/cluster1.etcd-client.key'
cert-file: '/var/lib/cilium/clustermesh/cluster1.etcd-client.crt'
```

Testing

Verify clustermesh syncing

Check cluster status:

```shell
(cluster1 node1) $ cilium status
KVStore:      Ok   etcd: ...
Kubernetes:   Ok   1.17+ (v1.17.6-3) [linux/amd64]
...
ClusterMesh:  2/2 clusters ready, 0 global-services
```

More verbose:

```shell
(cluster1 node1) $ cilium status --verbose
KVStore:      Ok   etcd: ...
Kubernetes:   Ok   1.17+ (v1.17.6-3) [linux/amd64]
...
ClusterMesh:  1/1 clusters ready, 0 global-services
   cluster2: ready, xx nodes, xx identities, 0 services, 0 failures (last: never)
   etcd: 1/1 connected, ...
```
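For automated health checks, the ClusterMesh line of the status output can be parsed with a small helper. A sketch against the output format shown above; the function name is an assumption:

```python
import re

def clustermesh_ready(status_output):
    """Parse the 'ClusterMesh: x/y clusters ready' line from
    `cilium status` output; return (ready, total), or None if
    the line is absent."""
    m = re.search(r"ClusterMesh:\s+(\d+)/(\d+) clusters ready", status_output)
    return (int(m.group(1)), int(m.group(2))) if m else None

sample = "ClusterMesh:  1/1 clusters ready, 0 global-services"
print(clustermesh_ready(sample))  # → (1, 1)
```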

List all nodes of all clusters in the mesh:

```shell
(cluster1 node1) $ cilium node list
Name             IPv4 Address   Endpoint CIDR    IPv6 Address   Endpoint CIDR
cluster1/node1   10.xx.xx.xx    10.xx.xx.xx/24
cluster1/node2   10.xx.xx.xx    10.xx.xx.xx/24
...
cluster2/node1   10.xx.xx.xx    10.xx.xx.xx/24
cluster2/node2   10.xx.xx.xx    10.xx.xx.xx/24
...
```
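Since node names in a mesh are prefixed `<cluster>/<node>`, the node list is easy to group per cluster for a quick overview. A sketch over the tabular output shown above; the function name is illustrative:

```python
def nodes_by_cluster(node_list_output):
    """Group `cilium node list` rows by cluster, using the
    '<cluster>/<node>' name prefix; lines whose first column has
    no '/' (e.g. the header) are skipped."""
    clusters = {}
    for line in node_list_output.splitlines():
        fields = line.split()
        name = fields[0] if fields else ""
        if "/" in name:
            cluster, node = name.split("/", 1)
            clusters.setdefault(cluster, []).append(node)
    return clusters

sample = """Name             IPv4 Address   Endpoint CIDR
cluster1/node1   10.0.1.1       10.10.1.0/24
cluster2/node1   10.0.2.1       10.10.2.0/24
"""
print(nodes_by_cluster(sample))
```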

Tracing packets with Cilium

```shell
(cluster1 node1) $ cilium monitor
```

References

kubernetes multi-cluster
https://arthurchiao.art/blog/cilium-clustermesh/