External traffic policy
Services can be configured with externalTrafficPolicy Cluster (the default) or Local. With externalTrafficPolicy Cluster, the ECFE speaker on a given node announces the service External IP regardless of whether the node hosts a service endpoint. With externalTrafficPolicy Local, the ECFE speaker on a given node announces the service External IP only if a healthy service endpoint exists on that node.
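As a minimal sketch (illustrative Python, not ECFE code; the function name is made up), the announcement decision can be modelled as:

```python
def announces_vip(policy: str, has_healthy_local_endpoint: bool) -> bool:
    """Model of whether a node's ECFE speaker announces the service External IP.

    policy: the service's externalTrafficPolicy, "Cluster" or "Local".
    """
    if policy == "Cluster":
        # Every node announces; kube-proxy then forwards to an endpoint
        # anywhere in the cluster (possibly on another node).
        return True
    if policy == "Local":
        # Announce only from nodes hosting a healthy endpoint, which also
        # preserves the client source IP.
        return has_healthy_local_endpoint
    raise ValueError(f"unknown externalTrafficPolicy: {policy}")
```

With Cluster, every speaker advertises the VIP (hence many equal-cost next hops on the upstream router); with Local, only endpoint-hosting nodes would.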
Ingress controller service manifest.
eccd@control-plane-wfb88:~> kubectl get svc ingress-nginx -n ingress-nginx -o yaml
apiVersion: v1
kind: Service
metadata:
{…}
spec:
  clusterIP: 10.97.74.168
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 32151
    port: 80
    protocol: TCP
    targetPort: 80
  - name: https
    nodePort: 31255
    port: 443
    protocol: TCP
    targetPort: 443
  selector:
    app: ingress-nginx
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 10.33.151.160
Let's use dc173prometheus1.cloud.k2.ericsson.se as an example.

Source: seroius01242 (10.210.150.87/28)
Destination: dc173prometheus1.cloud.k2.ericsson.se
1. Look up the DNS record:

eccd@control-plane-wfb88:~> nslookup dc173prometheus1.cloud.k2.ericsson.se
Server:    10.96.0.10
Address:   10.96.0.10#53

Non-authoritative answer:
dc173prometheus1.cloud.k2.ericsson.se canonical name = dc173dashboard1.cloud.k2.ericsson.se.
Name:      dc173dashboard1.cloud.k2.ericsson.se
Address:   10.33.151.160
eccd@control-plane-wfb88:~>

2. Verify the FQDN is exposed via a Kubernetes ingress:

eccd@control-plane-wfb88:~> kubectl get ing -A
NAMESPACE    NAME                CLASS    HOSTS                                   ADDRESS                                           PORTS   AGE
monitoring   prometheus-server   <none>   dc173prometheus1.cloud.k2.ericsson.se   10.0.10.102,10.0.10.103,10.0.10.104,10.0.10.108   80      28d

Expose Prometheus via an ingress if not already done:

cat >> PM-ingress.yaml << EOF
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  labels:
    app: eric-pm-server
    component: server
  name: prometheus-server
  namespace: monitoring
spec:
  rules:
  - host: dc173prometheus1.cloud.k2.ericsson.se
    http:
      paths:
      - backend:
          serviceName: eric-pm-server
          servicePort: 9090
        path: /
EOF

Apply the ingress rule:

kubectl apply -f PM-ingress.yaml

3. Verify the external IP is exposed by the ingress-nginx service with type LoadBalancer:

eccd@control-plane-wfb88:~> kubectl get svc -A
NAMESPACE       NAME            TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                      AGE
ingress-nginx   ingress-nginx   LoadBalancer   10.97.74.168   10.33.151.160   80:32151/TCP,443:31255/TCP   28d

At this stage, the DNS record is consistent with the Kubernetes ingress configuration.

4. Check the ECFE configmap:

eccd@control-plane-wfb88:~> kubectl get cm ecfe-ccdadm -n kube-system -o yaml
apiVersion: v1
data:
  config: |
    bgp-bfd-peers:
    - peer-address: 10.33.151.66
      peer-asn: 4200000000
      my-asn: 4259841731
      hold-time: 3s
      min-rx: 300ms
      min-tx: 300ms
      multiplier: 3
      my-address-pools:
      - ingress
      - oam-pool
    - peer-address: 10.33.151.67
      peer-asn: 4200000000
      my-asn: 4259841731
      hold-time: 3s
      min-rx: 300ms
      min-tx: 300ms
      multiplier: 3
      my-address-pools:
      - ingress
      - oam-pool
    - peer-address: 10.33.152.66
      peer-asn: 4200000000
      my-asn: 4259841731
      hold-time: 3s
      min-rx: 300ms
      min-tx: 300ms
      multiplier: 3
      my-address-pools:
      - traffic-pool
    - peer-address: 10.33.152.67
      peer-asn: 4200000000
      my-asn: 4259841731
      hold-time: 3s
      min-rx: 300ms
      min-tx: 300ms
      multiplier: 3
      my-address-pools:
      - traffic-pool
    - peer-address: 172.80.11.2
      peer-asn: 4200000000
      my-asn: 4259841731
      hold-time: 3s
      min-rx: 300ms
      min-tx: 300ms
      multiplier: 3
      my-address-pools:
      - traffic2-pool
    - peer-address: 172.80.11.3
      peer-asn: 4200000000
      my-asn: 4259841731
      hold-time: 3s
      min-rx: 300ms
      min-tx: 300ms
      multiplier: 3
      my-address-pools:
      - traffic2-pool
    address-pools:
    - name: ingress
      protocol: bgp
      addresses:
      - 10.33.151.160/32
    - name: oam-pool
      protocol: bgp
      addresses:
      - 10.33.151.161-10.33.151.167
      auto-assign: false
    - name: traffic-pool
      protocol: bgp
      addresses:
      - 10.33.152.16/28
      auto-assign: false
    - name: traffic2-pool
      protocol: bgp
      addresses:
      - 10.33.152.32/28
      auto-assign: false
kind: ConfigMap

VIP 10.33.151.160/32 is configured in the ECFE ingress address pool. It is announced to BGP peers 10.33.151.66 and 10.33.151.67.

5. Verify the SLX configuration:

DC173-SLX-L1A# show run int ve
interface Ve 805
 vrf forwarding DC173_CCD1_om_vr
 ip anycast-address 10.33.151.65/27
 ip address 10.33.151.66/27
 no shutdown
!
DC173-SLX-L1B# show run int ve
interface Ve 805
 vrf forwarding DC173_CCD1_om_vr
 ip anycast-address 10.33.151.65/27
 ip address 10.33.151.67/27
 no shutdown
!
DC173-SLX-L1B#

DC173-SLX-L1A# show ip bgp vrf DC173_CCD1_om_vr
Total number of BGP Routes: 31
Status codes: s suppressed, d damped, h history, * valid, > best, i internal, S stale, x best-external
Origin codes: i - IGP, e - EGP, ? - incomplete
     Network             Next Hop        RD     MED  LocPrf  Weight  Path
*>x  0.0.0.0/0           10.90.0.2       none        100     0       100 i
*    0.0.0.0/0           10.90.0.6       none        100     0       100 i
*i   0.0.0.0/0           21.1.1.1        none        100     0       100 i
*>   10.33.151.64/27     0.0.0.0                0    100     32768   ?
*i   10.33.151.64/27     21.1.1.1               0    100     0       ?
*>   10.33.151.96/27     0.0.0.0                0    100     32768   ?
*i   10.33.151.96/27     21.1.1.1               0    100     0       ?
*>   10.33.151.128/27    0.0.0.0                0    100     32768   ?
*i   10.33.151.128/27    21.1.1.1               0    100     0       ?
*>x  10.33.151.160/32    10.33.151.68    none        100     0       4259841731 i
*    10.33.151.160/32    10.33.151.69    none        100     0       4259841731 i
*    10.33.151.160/32    10.33.151.70    none        100     0       4259841731 i
*    10.33.151.160/32    10.33.151.71    none        100     0       4259841731 i
*    10.33.151.160/32    10.33.151.72    none        100     0       4259841731 i
*    10.33.151.160/32    10.33.151.73    none        100     0       4259841731 i
*    10.33.151.160/32    10.33.151.74    none        100     0       4259841731 i
*    10.33.151.160/32    10.33.151.75    none        100     0       4259841731 i
*    10.33.151.160/32    10.33.151.76    none        100     0       4259841731 i
*    10.33.151.160/32    10.33.151.77    none        100     0       4259841731 i
*    10.33.151.160/32    10.33.151.78    none        100     0       4259841731 i
*i   10.33.151.160/32    21.1.1.1        none        100     0       4259841731 i
*>   10.33.151.176/28    0.0.0.0                0    100     32768   ?
*i   10.33.151.176/28    21.1.1.1               0    100     0       ?
*>   10.90.0.0/30        0.0.0.0                0    100     32768   ?
*>   10.90.0.4/30        0.0.0.0                0    100     32768   ?
*>i  10.90.1.0/30        21.1.1.1               0    100     0       ?
*>i  10.90.1.4/30        21.1.1.1               0    100     0       ?
*>   21.1.1.0/31         0.0.0.0                0    100     32768   ?
*i   21.1.1.0/31         21.1.1.1               0    100     0       ?
*>   172.80.70.0/24      0.0.0.0                0    100     32768   ?
*i   172.80.70.0/24      21.1.1.1               0    100     0       ?
DC173-SLX-L1A#

For dst=10.33.151.160 there are several next hops, 10.33.151.68 through 10.33.151.78. Traffic will be load balanced by ECMP across those 11 IPs.

<--------First load balancing by ECMP - Layer 4-------->
SLX supports ECMP for L3 forwarding using a modulo-N hash algorithm implemented in the forwarding ASICs. Traffic is forwarded to the specific next hop derived by hashing the 5-tuple frame data (IP SA, IP DA, IP proto, L4 SP, L4 DP) and computing the index modulo N, where N = pathCount in the ECMP list of next hops.

6. Verify who owns those BGP speaker IPs:

eccd@control-plane-wfb88:~> for i in `kubectl get no -o json | jq -r '.items[].status.addresses[] | select(.type=="InternalIP") | .address'` ; do { ssh $i -q hostname & ssh $i -q ip a show ccd_ecfe_om | grep "inet " | awk -F'[: ]+' '{ print $3 }'; } | paste -s; done
control-plane-4qjt9      10.33.151.77/27
control-plane-vxt5p      10.33.151.68/27 10.33.151.132/32
control-plane-wfb88      10.33.151.78/27
pool1-5db74894bb-tf54f   10.33.151.69/27
pool2-9976d6447-2gnmp    10.33.151.73/27
pool2-9976d6447-5bhnl    10.33.151.74/27
pool2-9976d6447-djhjx    10.33.151.75/27
pool2-9976d6447-ms9hg    10.33.151.71/27
10.33.151.70/27          pool2-9976d6447-nd4xl
pool2-9976d6447-r776t    10.33.151.72/27
pool2-9976d6447-xmbcz    10.33.151.76/27

7. Verify that src=10.210.150.87, dst=10.33.151.160 can be seen on one of the BGP speaker nodes.
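The modulo-N 5-tuple hashing described above can be sketched as follows (illustrative Python, not the ASIC implementation; any deterministic hash demonstrates the behaviour):

```python
import hashlib

# The 11 BGP speaker next hops advertised for the VIP (10.33.151.68 .. .78).
ECMP_NEXT_HOPS = [f"10.33.151.{i}" for i in range(68, 79)]

def ecmp_next_hop(src_ip, dst_ip, proto, sport, dport, next_hops=ECMP_NEXT_HOPS):
    """Hash the 5-tuple and index the next-hop list modulo N (= pathCount)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[h % len(next_hops)]
```

The point to take away: a given flow (fixed 5-tuple) is always pinned to the same speaker node; only a change in the tuple, e.g. a new source port, can select a different next hop.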
Each BGP speaker node has a 1/11 chance of receiving the ingress traffic.

eccd@control-plane-wfb88:~> sudo tcpdump -vvveni ccd_ecfe_om
Frame 1: 74 bytes on wire (592 bits), 74 bytes captured (592 bits)
Ethernet II, Src: ExtremeN_d6:8e:bc (00:04:96:d6:8e:bc), Dst: ea:a3:88:4e:f3:00 (ea:a3:88:4e:f3:00)
Internet Protocol Version 4, Src: 10.210.150.87, Dst: 10.33.151.160
Transmission Control Protocol, Src Port: 56200, Dst Port: 80, Seq: 0, Len: 0

Frame 4: 74 bytes on wire (592 bits), 74 bytes captured (592 bits)
Ethernet II, Src: ExtremeN_d6:88:76 (00:04:96:d6:88:76), Dst: ea:a3:88:4e:f3:00 (ea:a3:88:4e:f3:00)
Internet Protocol Version 4, Src: 10.210.150.87, Dst: 10.33.151.160
Transmission Control Protocol, Src Port: 56202, Dst Port: 80, Seq: 0, Len: 0

8. Verify the traffic is forwarded by the SLX.
00:04:96:d6:8e:bc belongs to SLX1.
00:04:96:d6:88:76 belongs to SLX2.

DC173-SLX-L1A# show interface | begin "Ve 805"
Ve 805 is up, line protocol is up
Address is 0004.96d6.8ebc, Current address is 0004.96d6.8ebc

DC173-SLX-L1B# show interface | begin "Ve 805"
Ve 805 is up, line protocol is up
Address is 0004.96d6.8876, Current address is 0004.96d6.8876
Interface index (ifindex) is 1207960357 (0x48000325)

iptables approach - a very important concept: user-defined chains
Because of externalTrafficPolicy: Cluster, every BGP speaker node will have the iptables rule:

a. All traffic destined to 10.33.151.160/32 with destination port 443 enters the user-defined chain KUBE-FW-4E7KSV2ABIFJRAUZ.

control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep 10.33.151.160
-A KUBE-SERVICES -d 10.33.151.160/32 -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-4E7KSV2ABIFJRAUZ   <- next step b
-A KUBE-SERVICES -d 10.33.151.160/32 -p tcp -m comment --comment "ingress-nginx/ingress-nginx:http loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-REQ4FPVT7WYF4VLA
control-plane-mjnp4:/home/eccd #

b. All traffic entering the user-defined chain KUBE-FW-4E7KSV2ABIFJRAUZ is first sent to the user-defined chain KUBE-MARK-MASQ, then to KUBE-SVC-4E7KSV2ABIFJRAUZ, and finally, if nothing matched, to KUBE-MARK-DROP.

control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-FW-4E7KSV2ABIFJRAUZ
-N KUBE-FW-4E7KSV2ABIFJRAUZ
-A KUBE-FW-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https loadbalancer IP" -j KUBE-MARK-MASQ   <- next step i
-A KUBE-FW-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https loadbalancer IP" -j KUBE-SVC-4E7KSV2ABIFJRAUZ   <- next step ii
-A KUBE-FW-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https loadbalancer IP" -j KUBE-MARK-DROP   <- next step iii
-A KUBE-SERVICES -d 10.33.151.160/32 -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-4E7KSV2ABIFJRAUZ
control-plane-mjnp4:/home/eccd #

i. All traffic sent to chain KUBE-MARK-MASQ is set with mark 0x4000/0x4000:

-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000

Chain KUBE-MARK-MASQ (252 references)
target     prot opt source               destination
MARK       all  --  anywhere             anywhere             MARK or 0x4000

ii. Traffic processed in chain KUBE-MARK-MASQ is then handled by the next rule, which jumps to chain KUBE-SVC-4E7KSV2ABIFJRAUZ.

control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-SVC-4E7KSV2ABIFJRAUZ | grep -v "j KUBE-SVC-4E7KSV2ABIFJRAUZ"
-N KUBE-SVC-4E7KSV2ABIFJRAUZ
-A KUBE-SVC-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-VNARWU5QWGSG2GUG   <- 25% of traffic is load balanced to chain KUBE-SEP-VNARWU5QWGSG2GUG; the rest falls through to the remaining three nginx ingress controller pods
-A KUBE-SVC-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https" -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-JRQUMAVVXJLRS3O3   <- 33.3% of the remaining traffic is load balanced to chain KUBE-SEP-JRQUMAVVXJLRS3O3; the rest falls through to the remaining two pods
-A KUBE-SVC-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-ILQWMWXVY6HNWZZT   <- 50% of the remaining traffic is load balanced to chain KUBE-SEP-ILQWMWXVY6HNWZZT; the rest falls through to the last pod
-A KUBE-SVC-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https" -j KUBE-SEP-IM3OJF4BZQHSU4FK   <- the remaining stream is always sent to chain KUBE-SEP-IM3OJF4BZQHSU4FK; overall, traffic is evenly load balanced across all four nginx ingress controller pods
control-plane-mjnp4:/home/eccd #

1) Traffic entering KUBE-SEP-VNARWU5QWGSG2GUG is DNAT'ed to 192.168.103.142:443; the other three chains (one per nginx pod) work the same way:

control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-SEP-VNARWU5QWGSG2GUG | grep -v "j KUBE-SEP-VNARWU5QWGSG2GUG"
-N KUBE-SEP-VNARWU5QWGSG2GUG
-A KUBE-SEP-VNARWU5QWGSG2GUG -s 192.168.103.142/32 -m comment --comment "ingress-nginx/ingress-nginx:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-VNARWU5QWGSG2GUG -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https" -m tcp -j DNAT --to-destination 192.168.103.142:443
control-plane-mjnp4:/home/eccd #
control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-SEP-JRQUMAVVXJLRS3O3 | grep -v "j KUBE-SEP-JRQUMAVVXJLRS3O3"
-N KUBE-SEP-JRQUMAVVXJLRS3O3
-A KUBE-SEP-JRQUMAVVXJLRS3O3 -s 192.168.107.128/32 -m comment --comment "ingress-nginx/ingress-nginx:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-JRQUMAVVXJLRS3O3 -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https" -m tcp -j DNAT --to-destination 192.168.107.128:443
control-plane-mjnp4:/home/eccd #
control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-SEP-ILQWMWXVY6HNWZZT | grep -v "j KUBE-SEP-ILQWMWXVY6HNWZZT"
-N KUBE-SEP-ILQWMWXVY6HNWZZT
-A KUBE-SEP-ILQWMWXVY6HNWZZT -s 192.168.176.202/32 -m comment --comment "ingress-nginx/ingress-nginx:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-ILQWMWXVY6HNWZZT -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https" -m tcp -j DNAT --to-destination 192.168.176.202:443
control-plane-mjnp4:/home/eccd #
control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-SEP-IM3OJF4BZQHSU4FK | grep -v "j KUBE-SEP-IM3OJF4BZQHSU4FK"
-N KUBE-SEP-IM3OJF4BZQHSU4FK
-A KUBE-SEP-IM3OJF4BZQHSU4FK -s 192.168.95.125/32 -m comment --comment "ingress-nginx/ingress-nginx:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-IM3OJF4BZQHSU4FK -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https" -m tcp -j DNAT --to-destination 192.168.95.125:443
control-plane-mjnp4:/home/eccd #

Question: who owns those IPs?

2) Traffic is sent to one of the nginx ingress controller pods.

<--------Second load balancing by NGINX - Layer 7-------->
nginx-ingress-controller load balances the HTTP traffic, e.g. based on the URL.

eccd@control-plane-mjnp4:~> kubectl get po -n ingress-nginx -l app=ingress-nginx -o=custom-columns=NODE:.spec.nodeName,Name:.metadata.name,hostIP:.status.hostIP,podIP:.status.podIP
NODE                     Name                                        hostIP        podIP
pool1-868fc5689d-9m7m5   nginx-ingress-controller-5dff4dcd48-dfbnr   10.0.10.102   192.168.107.128
control-plane-gwnnh      nginx-ingress-controller-5dff4dcd48-m69kc   10.0.10.109   192.168.95.125
control-plane-cr2zh      nginx-ingress-controller-5dff4dcd48-rcgv8   10.0.10.110   192.168.176.202
pool2-6859c57999-2t8hx   nginx-ingress-controller-5dff4dcd48-t8v2g   10.0.10.108   192.168.103.142
eccd@control-plane-mjnp4:~>

3) Enter the NGINX ingress controller pod and check the HTTP load balancing rules; traffic will be sent to service monitoring/eric-pm-server:9090:

eccd@control-plane-mjnp4:~> kubectl exec nginx-ingress-controller-5dff4dcd48-dfbnr -n ingress-nginx -ti -- bash
bash-5.0$ cat nginx.conf
# Configuration checksum: 5612144526815823684
upstream upstream_balancer {
    ### Attention!!!
    #
    # We no longer create "upstream" section for every backend.
    # Backends are handled dynamically using Lua. If you would like to debug
    # and see what backends ingress-nginx has in its memory you can
    # install our kubectl plugin https://kubernetes.github.io/ingress-nginx/kubectl-plugin.
    # Once you have the plugin you can use "kubectl ingress-nginx backends" command to
    # inspect current backends.
    #
    ###

## start server dc173prometheus1.cloud.k2.ericsson.se
server {
    server_name dc173prometheus1.cloud.k2.ericsson.se ;
    listen 80 ;
    listen 442 proxy_protocol ssl http2 ;
    set $proxy_upstream_name "-";
    ssl_certificate_by_lua_block {
        certificate.call()
    }
    location / {
        set $namespace "monitoring";
        set $ingress_name "prometheus-server";
        set $service_name "eric-pm-server";
        set $service_port "9090";
        set $location_path "/";
        rewrite_by_lua_block {
            lua_ingress.rewrite({
                force_ssl_redirect = false,
                ssl_redirect = true,
                force_no_ssl_redirect = false,
                use_port_in_redirects = false,
            })
            balancer.rewrite()
            plugins.run()
        }
        # be careful with `access_by_lua_block` and `satisfy any` directives as satisfy any
        # will always succeed when there's `access_by_lua_block` that does not have any lua code doing `ngx.exit(ngx.DECLINED)`
        # other authentication method such as basic auth or external auth useless - all requests will be allowed.
        #access_by_lua_block {
        #}
        header_filter_by_lua_block {
            lua_ingress.header()
            plugins.run()
        }
        body_filter_by_lua_block {
        }
        log_by_lua_block {
            balancer.log()
            monitor.call()
            plugins.run()
        }
        port_in_redirect off;
        set $balancer_ewma_score -1;
        set $proxy_upstream_name "monitoring-eric-pm-server-9090";
        set $proxy_host $proxy_upstream_name;
        set $pass_access_scheme $scheme;
        set $pass_server_port $server_port;
        set $best_http_host $http_host;
        set $pass_port $pass_server_port;
        set $proxy_alternative_upstream_name "";
        client_max_body_size 1m;
        proxy_set_header Host $best_http_host;
        # Pass the extracted client certificate to the backend
        # Allow websocket connections
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header X-Request-ID $req_id;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header X-Forwarded-Host $best_http_host;
        proxy_set_header X-Forwarded-Port $pass_port;
        proxy_set_header X-Forwarded-Proto $pass_access_scheme;
        proxy_set_header X-Scheme $pass_access_scheme;
        # Pass the original X-Forwarded-For
        proxy_set_header X-Original-Forwarded-For $http_x_forwarded_for;
        # mitigate HTTPoxy Vulnerability
        # https://www.nginx.com/blog/mitigating-the-httpoxy-vulnerability-with-nginx/
        proxy_set_header Proxy "";
        # Custom headers to proxied server
        proxy_connect_timeout 5s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
        proxy_buffering off;
        proxy_buffer_size 4k;
        proxy_buffers 4 4k;
        proxy_max_temp_file_size 1024m;
        proxy_request_buffering on;
        proxy_http_version 1.1;
        proxy_cookie_domain off;
        proxy_cookie_path off;
        # In case of errors try the next upstream server before returning an error
        proxy_next_upstream error timeout;
        proxy_next_upstream_timeout 0;
        proxy_next_upstream_tries 3;
        proxy_pass http://upstream_balancer;
        proxy_redirect off;
    }
}
## end server dc173prometheus1.cloud.k2.ericsson.se

4) Check the service eric-pm-server definition:

eccd@control-plane-wfb88:~> kubectl get ing prometheus-server -n monitoring -o yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
...
spec:
  rules:
  - host: dc173prometheus1.cloud.k2.ericsson.se
    http:
      paths:
      - backend:
          serviceName: eric-pm-server
          servicePort: 9090
        path: /
        pathType: ImplementationSpecific
status:
  loadBalancer:
    ingress:
    - ip: 10.0.10.102
    - ip: 10.0.10.103
    - ip: 10.0.10.104
    - ip: 10.0.10.108

5) Traffic is then forwarded to the service clusterIP 10.107.185.250.

eccd@control-plane-mjnp4:~> kubectl get svc eric-pm-server -n monitoring -o yaml
spec:
  clusterIP: 10.107.185.250
  ports:
  - name: http
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: eric-pm-server
    component: server
    release: pm-server
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

control-plane-mjnp4:/home/eccd # iptables -t nat -S | grep 10.107.185.250
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.107.185.250/32 -p tcp -m comment --comment "monitoring/eric-pm-server:http cluster IP" -m tcp --dport 9090 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.107.185.250/32 -p tcp -m comment --comment "monitoring/eric-pm-server:http cluster IP" -m tcp --dport 9090 -j KUBE-SVC-6K4R6JTKTVNBU5UP
control-plane-mjnp4:/home/eccd #

6) The first rule in chain KUBE-SERVICES sets the mark via KUBE-MARK-MASQ; the second rule forwards traffic to chain KUBE-SVC-6K4R6JTKTVNBU5UP.

control-plane-mjnp4:/home/eccd # iptables -t nat -S | grep KUBE-SVC-6K4R6JTKTVNBU5UP | grep -v "j KUBE-SVC-6K4R6JTKTVNBU5UP"
-N KUBE-SVC-6K4R6JTKTVNBU5UP
-A KUBE-SVC-6K4R6JTKTVNBU5UP -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-SEP-CNB2ICYJPHYJI3AY
control-plane-mjnp4:/home/eccd #

7) Traffic entering chain KUBE-SEP-CNB2ICYJPHYJI3AY is sent to 192.168.22.247, which is a pod IP.

control-plane-mjnp4:/home/eccd # iptables -t nat -S | grep KUBE-SEP-CNB2ICYJPHYJI3AY | grep -v "j KUBE-SEP-CNB2ICYJPHYJI3AY"
-N KUBE-SEP-CNB2ICYJPHYJI3AY
-A KUBE-SEP-CNB2ICYJPHYJI3AY -s 192.168.22.247/32 -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-CNB2ICYJPHYJI3AY -p tcp -m comment --comment "monitoring/eric-pm-server:http" -m tcp -j DNAT --to-destination 192.168.22.247:9090
control-plane-mjnp4:/home/eccd #

eccd@control-plane-mjnp4:~> kubectl get po -l app=eric-pm-server -A -o wide
NAMESPACE    NAME               READY   STATUS    RESTARTS   AGE     IP                NODE                     NOMINATED NODE   READINESS GATES
evnfm        eric-pm-server-0   2/2     Running   0          4h29m   192.168.188.101   pool2-6859c57999-mzv2d   <none>           <none>
monitoring   eric-pm-server-0   2/2     Running   0          2d1h    192.168.22.247    pool2-6859c57999-7hfs2   <none>           <none>
eccd@control-plane-mjnp4:~>

Since the server in question is in the monitoring namespace, the correct PM server is the second one.

<--------Third load balancing by K8s service - round-robin using iptables mode-------->
There is only one eric-pm-server pod, so there is no round-robin load balancing.

eccd@control-plane-mjnp4:~> kubectl get no -o wide | grep pool2-6859c57999-7hfs2
pool2-6859c57999-7hfs2   Ready   worker   19d   v1.18.8   10.0.10.105   <none>   SUSE Linux Enterprise Server 15 SP1   4.12.14-197.51-default   docker://19.3.11
eccd@control-plane-mjnp4:~> ssh 10.0.10.105 -q
Last login: Mon Sep 21 16:27:26 2020 from 10.0.10.101
eccd@pool2-6859c57999-7hfs2:~> sudo su
pool2-6859c57999-7hfs2:/home/eccd # iptables -t nat -S | grep 192.168.22.247
-A KUBE-SEP-CNB2ICYJPHYJI3AY -s 192.168.22.247/32 -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-CNB2ICYJPHYJI3AY -p tcp -m comment --comment "monitoring/eric-pm-server:http" -m tcp -j DNAT --to-destination 192.168.22.247:9090
pool2-6859c57999-7hfs2:/home/eccd #

iii. Traffic is set with mark 0x8000 if not matched by any of the above rules.

control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-MARK-DROP | grep -v "j KUBE-MARK-DROP"
-N KUBE-MARK-DROP
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
control-plane-mjnp4:/home/eccd #

Traffic enters the correct pod; we will now look at the return path.
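The probability cascade in step ii (0.25, then 0.33333…, then 0.5, then the unconditional last rule) gives each of the N=4 endpoints an equal 1/N share. A small sketch of kube-proxy's rule layout (illustrative Python, not kube-proxy source):

```python
import random

def pick_endpoint(endpoints, rng=None):
    """Walk the rules the way kube-proxy writes them: rule k (0-based)
    matches with probability 1/(N-k), so each endpoint ends up with
    exactly 1/N of the traffic overall."""
    rng = rng or random.random
    n = len(endpoints)
    for k in range(n - 1):
        if rng() < 1.0 / (n - k):   # 1/4, 1/3, 1/2 for N = 4
            return endpoints[k]
    return endpoints[-1]            # final rule has no -m statistic match
```

The arithmetic checks out: P(ep1) = 1/4; P(ep2) = 3/4 * 1/3 = 1/4; P(ep3) = 3/4 * 2/3 * 1/2 = 1/4; P(ep4) = the remaining 1/4.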
==============
Return path
The return path uses the default route, since there is no explicit or more specific route for dst=10.210.150.87.
1. Starting point: return traffic with source IP 192.168.22.247/32 is sent to the KUBE-MARK-MASQ chain:
pool2-6859c57999-7hfs2:/home/eccd # iptables -t nat -S | grep 192.168.22.247
-A KUBE-SEP-CNB2ICYJPHYJI3AY -s 192.168.22.247/32 -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-CNB2ICYJPHYJI3AY -p tcp -m comment --comment "monitoring/eric-pm-server:http" -m tcp -j DNAT --to-destination 192.168.22.247:9090
pool2-6859c57999-7hfs2:/home/eccd #
2. All return traffic is set with mark 0x4000.
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
Chain KUBE-MARK-MASQ (242 references)
target prot opt source destination
MARK all -- anywhere anywhere MARK or 0x4000
3. Traffic carrying the KUBE-MARK-MASQ mark returns to chain KUBE-SEP-CNB2ICYJPHYJI3AY and continues with the next rule; there are no more rules in chain KUBE-SEP-CNB2ICYJPHYJI3AY.
pool2-6859c57999-7hfs2:/home/eccd # iptables -t nat -S | grep KUBE-SEP-CNB2ICYJPHYJI3AY
-N KUBE-SEP-CNB2ICYJPHYJI3AY
-A KUBE-SEP-CNB2ICYJPHYJI3AY -s 192.168.22.247/32 -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-CNB2ICYJPHYJI3AY -p tcp -m comment --comment "monitoring/eric-pm-server:http" -m tcp -j DNAT --to-destination 192.168.22.247:9090
-A KUBE-SVC-6K4R6JTKTVNBU5UP -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-SEP-CNB2ICYJPHYJI3AY
pool2-6859c57999-7hfs2:/home/eccd #
4. KUBE-MARK-MASQ adds a Netfilter mark to packets originating from the eric-pm-server service that are destined outside the cluster's network. Packets with this mark are altered in a POSTROUTING rule to use source network address translation (SNAT), with the node's IP address as their source IP address.
From <https://www.stackrox.com/post/2020/01/kubernetes-networking-demystified/>
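A toy model of the mark-then-SNAT interplay described above (illustrative Python; plain dicts stand in for packets and Netfilter state):

```python
MASQ_MARK = 0x4000

def kube_mark_masq(pkt):
    """KUBE-MARK-MASQ: -j MARK --set-xmark 0x4000/0x4000 (OR the bit in)."""
    pkt["mark"] = pkt.get("mark", 0) | MASQ_MARK
    return pkt

def postrouting_masquerade(pkt, node_ip):
    """POSTROUTING: only packets carrying the mark are SNAT'ed to the
    node's own IP; unmarked packets keep their original source."""
    if pkt.get("mark", 0) & MASQ_MARK:
        pkt["src"] = node_ip
    return pkt
```

This is why the return traffic seen later leaves with src 10.33.152.74 (a node interface IP) rather than the pod IP 192.168.22.247.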
5. The destination IP does not match any explicit routing entry, so the default route is used:
pool2-6859c57999-7hfs2:/home/eccd # ip r | grep 10.210.150.87
pool2-6859c57999-7hfs2:/home/eccd #
pool2-6859c57999-7hfs2:/home/eccd # ip r | grep default
default via 10.33.152.65 dev ecfe_traf1 proto static metric 804
pool2-6859c57999-7hfs2:/home/eccd #
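The route selection in step 5 is a longest-prefix match in which the default route (0.0.0.0/0) matches anything. A sketch using Python's ipaddress module (the default route mirrors the `ip r` output above; the /27 connected route is illustrative):

```python
import ipaddress

def route_lookup(dst: str, routes):
    """Return the next hop of the longest-prefix route matching dst."""
    addr = ipaddress.ip_address(dst)
    matches = [(ipaddress.ip_network(net), hop) for net, hop in routes
               if addr in ipaddress.ip_network(net)]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

ROUTES = [
    ("0.0.0.0/0", "10.33.152.65"),                # default via ecfe_traf1
    ("10.33.152.64/27", "ecfe_traf1 (connected)") # illustrative connected route
]
```

For 10.210.150.87 only the default route matches, so the packet is handed to 10.33.152.65 on ecfe_traf1.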
pool2-6859c57999-7hfs2:/home/eccd # ip a show ecfe_traf1
12: ecfe_traf1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 36:1f:d1:91:3a:16 brd ff:ff:ff:ff:ff:ff
inet 10.33.152.74/27 brd 10.33.152.95 scope global noprefixroute ecfe_traf1
valid_lft forever preferred_lft forever
pool2-6859c57999-7hfs2:/home/eccd #
At this stage, IP-SA= 10.33.152.74, IP-DA=10.210.150.87.
Since the packet ingressed via DC173_CCD1_om_vr, the egress traffic must go back through the same VRF.
In CNIS 1.0 the default route belongs to the application traffic network, so the egress traffic would instead need to be forwarded via the ccd_ecfe_om interface.
Because ingress and egress traffic are routed via different VRFs, the TCP session won't be established.
Add a route (temporary workaround, not persistent):
for i in `kubectl get no -o json | jq -r '.items[].status.addresses[] | select(.type=="InternalIP") | .address'` ; do ssh $i -q sudo ip route add <web-browser-ip>/32 via <ccd_ecfe-SLX-om_vr-ve_anycast> ; done
The egress path will be:
Prometheus pod (src IP = Prometheus pod IP) -> service clusterIP (iptables + conntrack) -> nginx ingress controller pod (src IP rewritten to the ingress pod IP) -> iptables on the host running the nginx controller pod (src IP rewritten to the node IP) -> host OS routing table lookup -> ECFE gateway (anycast GW IP on the SLX) -> 10.210.150.87
