External traffic policy
A Service can be configured with externalTrafficPolicy Cluster (the default) or Local. With externalTrafficPolicy: Cluster, the ECFE speaker on a given node announces the service external IP regardless of whether the node hosts a service endpoint. With externalTrafficPolicy: Local, the ECFE speaker on a given node announces the service external IP only if a healthy service endpoint exists on that node.
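The configured policy can be checked, and changed if needed, with kubectl. The commands below use the ingress-nginx service from this example and are illustrative only:
kubectl get svc ingress-nginx -n ingress-nginx -o jsonpath='{.spec.externalTrafficPolicy}'
# switch to Local (example only - verify the impact before changing a live cluster)
kubectl patch svc ingress-nginx -n ingress-nginx -p '{"spec":{"externalTrafficPolicy":"Local"}}'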
Ingress controller service manifest.
eccd@control-plane-wfb88:~> kubectl get svc ingress-nginx -n ingress-nginx -o yaml
apiVersion: v1
kind: Service
metadata:
{…}
spec:
clusterIP: 10.97.74.168
externalTrafficPolicy: Cluster
ports:
- name: http
nodePort: 32151
port: 80
protocol: TCP
targetPort: 80
- name: https
nodePort: 31255
port: 443
protocol: TCP
targetPort: 443
selector:
app: ingress-nginx
sessionAffinity: None
type: LoadBalancer
status:
loadBalancer:
ingress:
- ip: 10.33.151.160
Let's use dc173prometheus1.cloud.k2.ericsson.se as an example.
Source: seroius01242 (10.210.150.87/28)
Destination: dc173prometheus1.cloud.k2.ericsson.se
1. Look up the DNS record:
eccd@control-plane-wfb88:~> nslookup dc173prometheus1.cloud.k2.ericsson.se
Server: 10.96.0.10
Address: 10.96.0.10#53
Non-authoritative answer:
dc173prometheus1.cloud.k2.ericsson.se canonical name = dc173dashboard1.cloud.k2.ericsson.se.
Name: dc173dashboard1.cloud.k2.ericsson.se
Address: 10.33.151.160
eccd@control-plane-wfb88:~>
2. Verify the FQDN is exposed via a Kubernetes ingress:
eccd@control-plane-wfb88:~> kubectl get ing -A
NAMESPACE NAME CLASS HOSTS ADDRESS PORTS AGE
monitoring prometheus-server <none> dc173prometheus1.cloud.k2.ericsson.se 10.0.10.102,10.0.10.103,10.0.10.104,10.0.10.108 80 28d
Expose Prometheus via an ingress if not already done:
cat > PM-ingress.yaml << EOF
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
labels:
app: eric-pm-server
component: server
name: prometheus-server
namespace: monitoring
spec:
rules:
- host: dc173prometheus1.cloud.k2.ericsson.se
http:
paths:
- backend:
serviceName: eric-pm-server
servicePort: 9090
path: /
EOF
Apply the ingress rule:
kubectl apply -f PM-ingress.yaml
3. Verify the external IP is exposed by the ingress-nginx service of type LoadBalancer.
eccd@control-plane-wfb88:~> kubectl get svc -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx ingress-nginx LoadBalancer 10.97.74.168 10.33.151.160 80:32151/TCP,443:31255/TCP 28d
At this stage, the DNS record is consistent with the Kubernetes ingress configuration.
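As an optional end-to-end check (assuming HTTP access to the VIP is allowed from where the command is run), the VIP can be queried directly with the expected Host header:
curl -s -o /dev/null -w "%{http_code}\n" -H "Host: dc173prometheus1.cloud.k2.ericsson.se" http://10.33.151.160/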
4. Check ECFE configmap
eccd@control-plane-wfb88:~> kubectl get cm ecfe-ccdadm -n kube-system -o yaml
apiVersion: v1
data:
config: |
bgp-bfd-peers:
- peer-address: 10.33.151.66
peer-asn: 4200000000
my-asn: 4259841731
hold-time: 3s
min-rx: 300ms
min-tx: 300ms
multiplier: 3
my-address-pools:
- ingress
- oam-pool
- peer-address: 10.33.151.67
peer-asn: 4200000000
my-asn: 4259841731
hold-time: 3s
min-rx: 300ms
min-tx: 300ms
multiplier: 3
my-address-pools:
- ingress
- oam-pool
- peer-address: 10.33.152.66
peer-asn: 4200000000
my-asn: 4259841731
hold-time: 3s
min-rx: 300ms
min-tx: 300ms
multiplier: 3
my-address-pools:
- traffic-pool
- peer-address: 10.33.152.67
peer-asn: 4200000000
my-asn: 4259841731
hold-time: 3s
min-rx: 300ms
min-tx: 300ms
multiplier: 3
my-address-pools:
- traffic-pool
- peer-address: 172.80.11.2
peer-asn: 4200000000
my-asn: 4259841731
hold-time: 3s
min-rx: 300ms
min-tx: 300ms
multiplier: 3
my-address-pools:
- traffic2-pool
- peer-address: 172.80.11.3
peer-asn: 4200000000
my-asn: 4259841731
hold-time: 3s
min-rx: 300ms
min-tx: 300ms
multiplier: 3
my-address-pools:
- traffic2-pool
address-pools:
- name: ingress
protocol: bgp
addresses:
- 10.33.151.160/32
- name: oam-pool
protocol: bgp
addresses:
- 10.33.151.161-10.33.151.167
auto-assign: false
- name: traffic-pool
protocol: bgp
addresses:
- 10.33.152.16/28
auto-assign: false
- name: traffic2-pool
protocol: bgp
addresses:
- 10.33.152.32/28
auto-assign: false
kind: ConfigMap
VIP 10.33.151.160/32 is configured in the ECFE ingress address pool. It is announced to BGP peers 10.33.151.66 and 10.33.151.67.
5. Verify SLX configuration
DC173-SLX-L1A# show run int ve
interface Ve 805
vrf forwarding DC173_CCD1_om_vr
ip anycast-address 10.33.151.65/27
ip address 10.33.151.66/27
no shutdown
!
DC173-SLX-L1B# show run int ve
interface Ve 805
vrf forwarding DC173_CCD1_om_vr
ip anycast-address 10.33.151.65/27
ip address 10.33.151.67/27
no shutdown
!
DC173-SLX-L1B#
DC173-SLX-L1A# show ip bgp vrf DC173_CCD1_om_vr
Total number of BGP Routes: 31
Status codes: s suppressed, d damped, h history, * valid, > best, i internal, S stale, x best-external
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop RD MED LocPrf Weight Path
*>x 0.0.0.0/0 10.90.0.2 none 100 0 100 i
* 0.0.0.0/0 10.90.0.6 none 100 0 100 i
*i 0.0.0.0/0 21.1.1.1 none 100 0 100 i
*> 10.33.151.64/27 0.0.0.0 0 100 32768 ?
*i 10.33.151.64/27 21.1.1.1 0 100 0 ?
*> 10.33.151.96/27 0.0.0.0 0 100 32768 ?
*i 10.33.151.96/27 21.1.1.1 0 100 0 ?
*> 10.33.151.128/27 0.0.0.0 0 100 32768 ?
*i 10.33.151.128/27 21.1.1.1 0 100 0 ?
*>x 10.33.151.160/32 10.33.151.68 none 100 0 4259841731 i
* 10.33.151.160/32 10.33.151.69 none 100 0 4259841731 i
* 10.33.151.160/32 10.33.151.70 none 100 0 4259841731 i
* 10.33.151.160/32 10.33.151.71 none 100 0 4259841731 i
* 10.33.151.160/32 10.33.151.72 none 100 0 4259841731 i
* 10.33.151.160/32 10.33.151.73 none 100 0 4259841731 i
* 10.33.151.160/32 10.33.151.74 none 100 0 4259841731 i
* 10.33.151.160/32 10.33.151.75 none 100 0 4259841731 i
* 10.33.151.160/32 10.33.151.76 none 100 0 4259841731 i
* 10.33.151.160/32 10.33.151.77 none 100 0 4259841731 i
* 10.33.151.160/32 10.33.151.78 none 100 0 4259841731 i
*i 10.33.151.160/32 21.1.1.1 none 100 0 4259841731 i
*> 10.33.151.176/28 0.0.0.0 0 100 32768 ?
*i 10.33.151.176/28 21.1.1.1 0 100 0 ?
*> 10.90.0.0/30 0.0.0.0 0 100 32768 ?
*> 10.90.0.4/30 0.0.0.0 0 100 32768 ?
*>i 10.90.1.0/30 21.1.1.1 0 100 0 ?
*>i 10.90.1.4/30 21.1.1.1 0 100 0 ?
*> 21.1.1.0/31 0.0.0.0 0 100 32768 ?
*i 21.1.1.0/31 21.1.1.1 0 100 0 ?
*> 172.80.70.0/24 0.0.0.0 0 100 32768 ?
*i 172.80.70.0/24 21.1.1.1 0 100 0 ?
DC173-SLX-L1A#
For dst=10.33.151.160 there are multiple next hops, from 10.33.151.68 through 10.33.151.78.
Traffic will be load balanced by ECMP across those 11 IPs.
<--------First load balancing by ECMP - Layer 4-------->
SLX supports ECMP for L3 forwarding using a modulo-N hash algorithm implemented in the forwarding ASICs.
Traffic is forwarded to a specific next hop derived by hashing the 5-tuple frame data (IPSA, IPDA, IPProto, L4 SP, L4 DP) and computing the index using a modulo-N method, where N = pathCount in the ECMP list of next hops.
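As an illustration: with the 11 next hops listed above (N = 11), a flow whose 5-tuple hashes to, say, 25 is forwarded to next-hop index 25 mod 11 = 3. All packets of a flow produce the same hash, so a given TCP session always takes the same path.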
6. Verify which nodes own those BGP speaker IPs:
eccd@control-plane-wfb88:~> for i in `kubectl get no -o json | jq -r '.items[].status.addresses[] | select(.type=="InternalIP") | .address'` ; do { ssh $i -q hostname & ssh $i -q ip a show ccd_ecfe_om | grep "inet " | awk -F'[: ]+' '{ print $3 }'; } | paste -s; done
control-plane-4qjt9 10.33.151.77/27
control-plane-vxt5p 10.33.151.68/27 10.33.151.132/32
control-plane-wfb88 10.33.151.78/27
pool1-5db74894bb-tf54f 10.33.151.69/27
pool2-9976d6447-2gnmp 10.33.151.73/27
pool2-9976d6447-5bhnl 10.33.151.74/27
pool2-9976d6447-djhjx 10.33.151.75/27
pool2-9976d6447-ms9hg 10.33.151.71/27
10.33.151.70/27 pool2-9976d6447-nd4xl
pool2-9976d6447-r776t 10.33.151.72/27
pool2-9976d6447-xmbcz 10.33.151.76/27
7. Verify that traffic with src=10.210.150.87, dst=10.33.151.160 can be seen on one of the BGP speaker nodes. Each BGP speaker node has a 1/11 chance of receiving the ingress traffic.
eccd@control-plane-wfb88:~> sudo tcpdump -vvveni ccd_ecfe_om
Frame 1: 74 bytes on wire (592 bits), 74 bytes captured (592 bits)
Ethernet II, Src: ExtremeN_d6:8e:bc (00:04:96:d6:8e:bc), Dst: ea:a3:88:4e:f3:00 (ea:a3:88:4e:f3:00)
Internet Protocol Version 4, Src: 10.210.150.87, Dst: 10.33.151.160
Transmission Control Protocol, Src Port: 56200, Dst Port: 80, Seq: 0, Len: 0
Frame 4: 74 bytes on wire (592 bits), 74 bytes captured (592 bits)
Ethernet II, Src: ExtremeN_d6:88:76 (00:04:96:d6:88:76), Dst: ea:a3:88:4e:f3:00 (ea:a3:88:4e:f3:00)
Internet Protocol Version 4, Src: 10.210.150.87, Dst: 10.33.151.160
Transmission Control Protocol, Src Port: 56202, Dst Port: 80, Seq: 0, Len: 0
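To reduce noise, the capture can be limited to the flow of interest with a standard tcpdump filter, e.g.:
sudo tcpdump -vvveni ccd_ecfe_om host 10.210.150.87 and tcp port 80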
8. Verify the traffic is forwarded by the SLX switches - the source MAC addresses seen in the capture belong to the SLX Ve interfaces.
00:04:96:d6:8e:bc belongs to SLX1.
00:04:96:d6:88:76 belongs to SLX2.
DC173-SLX-L1A# show interface | begin "Ve 805"
Ve 805 is up, line protocol is up
Address is 0004.96d6.8ebc, Current address is 0004.96d6.8ebc
DC173-SLX-L1B# show interface | begin "Ve 805"
Ve 805 is up, line protocol is up
Address is 0004.96d6.8876, Current address is 0004.96d6.8876
Interface index (ifindex) is 1207960357 (0x48000325)
iptables approach - a very important concept here is the user-defined chain.
Because of externalTrafficPolicy: Cluster, all BGP speaker nodes carry the same iptables rules:
a. All traffic destined to 10.33.151.160/32 with destination port 443 enters the user-defined chain KUBE-FW-4E7KSV2ABIFJRAUZ.
control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep 10.33.151.160
-A KUBE-SERVICES -d 10.33.151.160/32 -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-4E7KSV2ABIFJRAUZ <-next step b
-A KUBE-SERVICES -d 10.33.151.160/32 -p tcp -m comment --comment "ingress-nginx/ingress-nginx:http loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-REQ4FPVT7WYF4VLA
control-plane-mjnp4:/home/eccd #
b. All traffic entering the user-defined chain KUBE-FW-4E7KSV2ABIFJRAUZ is first sent to the user-defined chain KUBE-MARK-MASQ, then to KUBE-SVC-4E7KSV2ABIFJRAUZ, and finally, if nothing matched, to KUBE-MARK-DROP.
control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-FW-4E7KSV2ABIFJRAUZ
-N KUBE-FW-4E7KSV2ABIFJRAUZ
-A KUBE-FW-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https loadbalancer IP" -j KUBE-MARK-MASQ <-next step i
-A KUBE-FW-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https loadbalancer IP" -j KUBE-SVC-4E7KSV2ABIFJRAUZ <-next step ii
-A KUBE-FW-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https loadbalancer IP" -j KUBE-MARK-DROP <-next step iii
-A KUBE-SERVICES -d 10.33.151.160/32 -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-4E7KSV2ABIFJRAUZ
control-plane-mjnp4:/home/eccd #
i. All traffic sent to chain KUBE-MARK-MASQ is marked with 0x4000/0x4000.
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
Chain KUBE-MARK-MASQ (252 references)
target prot opt source destination
MARK all -- anywhere anywhere MARK or 0x4000
ii. All traffic processed in chain KUBE-MARK-MASQ continues with the next rule, which jumps to chain KUBE-SVC-4E7KSV2ABIFJRAUZ.
control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-SVC-4E7KSV2ABIFJRAUZ | grep -v "j KUBE-SVC-4E7KSV2ABIFJRAUZ"
-N KUBE-SVC-4E7KSV2ABIFJRAUZ
-A KUBE-SVC-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-VNARWU5QWGSG2GUG <25% of traffic is load balanced to chain KUBE-SEP-VNARWU5QWGSG2GUG; the rest falls through to the next three rules>
-A KUBE-SVC-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https" -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-JRQUMAVVXJLRS3O3 <33.3% of the remaining traffic is load balanced to chain KUBE-SEP-JRQUMAVVXJLRS3O3; the rest falls through to the next two rules>
-A KUBE-SVC-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-ILQWMWXVY6HNWZZT <50% of the remaining traffic is load balanced to chain KUBE-SEP-ILQWMWXVY6HNWZZT>
-A KUBE-SVC-4E7KSV2ABIFJRAUZ -m comment --comment "ingress-nginx/ingress-nginx:https" -j KUBE-SEP-IM3OJF4BZQHSU4FK
<all remaining traffic goes to chain KUBE-SEP-IM3OJF4BZQHSU4FK; overall, traffic is evenly load balanced across all four nginx ingress controller pods>
control-plane-mjnp4:/home/eccd #
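The probabilities look uneven but result in an even split across the four endpoints:
P(SEP 1) = 0.25
P(SEP 2) = 0.75 * 1/3 = 0.25
P(SEP 3) = 0.75 * 2/3 * 1/2 = 0.25
P(SEP 4) = remainder = 0.25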
1) Traffic that enters KUBE-SEP-VNARWU5QWGSG2GUG is DNAT'ed to 192.168.103.142:443; the same pattern applies to the other three chains (one per nginx pod).
control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-SEP-VNARWU5QWGSG2GUG | grep -v "j KUBE-SEP-VNARWU5QWGSG2GUG"
-N KUBE-SEP-VNARWU5QWGSG2GUG
-A KUBE-SEP-VNARWU5QWGSG2GUG -s 192.168.103.142/32 -m comment --comment "ingress-nginx/ingress-nginx:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-VNARWU5QWGSG2GUG -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https" -m tcp -j DNAT --to-destination 192.168.103.142:443
control-plane-mjnp4:/home/eccd #
control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-SEP-JRQUMAVVXJLRS3O3 | grep -v "j KUBE-SEP-JRQUMAVVXJLRS3O3"
-N KUBE-SEP-JRQUMAVVXJLRS3O3
-A KUBE-SEP-JRQUMAVVXJLRS3O3 -s 192.168.107.128/32 -m comment --comment "ingress-nginx/ingress-nginx:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-JRQUMAVVXJLRS3O3 -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https" -m tcp -j DNAT --to-destination 192.168.107.128:443
control-plane-mjnp4:/home/eccd #
control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-SEP-ILQWMWXVY6HNWZZT | grep -v "j KUBE-SEP-ILQWMWXVY6HNWZZT"
-N KUBE-SEP-ILQWMWXVY6HNWZZT
-A KUBE-SEP-ILQWMWXVY6HNWZZT -s 192.168.176.202/32 -m comment --comment "ingress-nginx/ingress-nginx:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-ILQWMWXVY6HNWZZT -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https" -m tcp -j DNAT --to-destination 192.168.176.202:443
control-plane-mjnp4:/home/eccd #
control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-SEP-IM3OJF4BZQHSU4FK | grep -v "j KUBE-SEP-IM3OJF4BZQHSU4FK"
-N KUBE-SEP-IM3OJF4BZQHSU4FK
-A KUBE-SEP-IM3OJF4BZQHSU4FK -s 192.168.95.125/32 -m comment --comment "ingress-nginx/ingress-nginx:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-IM3OJF4BZQHSU4FK -p tcp -m comment --comment "ingress-nginx/ingress-nginx:https" -m tcp -j DNAT --to-destination 192.168.95.125:443
control-plane-mjnp4:/home/eccd #
Question: which pods own those IPs? See the pod listing in step 2) below.
2) Traffic will be sent to one of the nginx ingress controller pods.
<--------Second load balancing by NGINX - Layer 7 -------->
The nginx ingress controller load balances the HTTP traffic, e.g. based on the requested host and URL path.
eccd@control-plane-mjnp4:~> kubectl get po -n ingress-nginx -l app=ingress-nginx -o=custom-columns=NODE:.spec.nodeName,Name:.metadata.name,hostIP:.status.hostIP,podIP:.status.podIP
NODE Name hostIP podIP
pool1-868fc5689d-9m7m5 nginx-ingress-controller-5dff4dcd48-dfbnr 10.0.10.102 192.168.107.128
control-plane-gwnnh nginx-ingress-controller-5dff4dcd48-m69kc 10.0.10.109 192.168.95.125
control-plane-cr2zh nginx-ingress-controller-5dff4dcd48-rcgv8 10.0.10.110 192.168.176.202
pool2-6859c57999-2t8hx nginx-ingress-controller-5dff4dcd48-t8v2g 10.0.10.108 192.168.103.142
eccd@control-plane-mjnp4:~>
3) Enter an NGINX ingress controller pod and check the HTTP load balancing rules - traffic will be sent to service monitoring/eric-pm-server:9090:
eccd@control-plane-mjnp4:~> kubectl exec nginx-ingress-controller-5dff4dcd48-dfbnr -n ingress-nginx -ti -- bash
bash-5.0$ cat nginx.conf
# Configuration checksum: 5612144526815823684
upstream upstream_balancer {
### Attention!!!
#
# We no longer create "upstream" section for every backend.
# Backends are handled dynamically using Lua. If you would like to debug
# and see what backends ingress-nginx has in its memory you can
# install our kubectl plugin https://kubernetes.github.io/ingress-nginx/kubectl-plugin.
# Once you have the plugin you can use "kubectl ingress-nginx backends" command to
# inspect current backends.
#
###
## start server dc173prometheus1.cloud.k2.ericsson.se
server {
server_name dc173prometheus1.cloud.k2.ericsson.se ;
listen 80 ;
listen 442 proxy_protocol ssl http2 ;
set $proxy_upstream_name "-";
ssl_certificate_by_lua_block {
certificate.call()
}
location / {
set $namespace "monitoring";
set $ingress_name "prometheus-server";
set $service_name "eric-pm-server";
set $service_port "9090";
set $location_path "/";
rewrite_by_lua_block {
lua_ingress.rewrite({
force_ssl_redirect = false,
ssl_redirect = true,
force_no_ssl_redirect = false,
use_port_in_redirects = false,
})
balancer.rewrite()
plugins.run()
}
# be careful with `access_by_lua_block` and `satisfy any` directives as satisfy any
# will always succeed when there's `access_by_lua_block` that does not have any lua code doing `ngx.exit(ngx.DECLINED)`
# other authentication method such as basic auth or external auth useless - all requests will be allowed.
#access_by_lua_block {
#}
header_filter_by_lua_block {
lua_ingress.header()
plugins.run()
}
body_filter_by_lua_block {
}
log_by_lua_block {
balancer.log()
monitor.call()
plugins.run()
}
port_in_redirect off;
set $balancer_ewma_score -1;
set $proxy_upstream_name "monitoring-eric-pm-server-9090";
set $proxy_host $proxy_upstream_name;
set $pass_access_scheme $scheme;
set $pass_server_port $server_port;
set $best_http_host $http_host;
set $pass_port $pass_server_port;
set $proxy_alternative_upstream_name "";
client_max_body_size 1m;
proxy_set_header Host $best_http_host;
# Pass the extracted client certificate to the backend
# Allow websocket connections
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_set_header X-Request-ID $req_id;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $remote_addr;
proxy_set_header X-Forwarded-Host $best_http_host;
proxy_set_header X-Forwarded-Port $pass_port;
proxy_set_header X-Forwarded-Proto $pass_access_scheme;
proxy_set_header X-Scheme $pass_access_scheme;
# Pass the original X-Forwarded-For
proxy_set_header X-Original-Forwarded-For $http_x_forwarded_for;
# mitigate HTTPoxy Vulnerability
# https://www.nginx.com/blog/mitigating-the-httpoxy-vulnerability-with-nginx/
proxy_set_header Proxy "";
# Custom headers to proxied server
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
proxy_buffering off;
proxy_buffer_size 4k;
proxy_buffers 4 4k;
proxy_max_temp_file_size 1024m;
proxy_request_buffering on;
proxy_http_version 1.1;
proxy_cookie_domain off;
proxy_cookie_path off;
# In case of errors try the next upstream server before returning an error
proxy_next_upstream error timeout;
proxy_next_upstream_timeout 0;
proxy_next_upstream_tries 3;
proxy_pass http://upstream_balancer;
proxy_redirect off;
}
}
## end server dc173prometheus1.cloud.k2.ericsson.se
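The backends the controller currently holds in memory can also be inspected with the ingress-nginx kubectl plugin mentioned in the configuration comment above (assuming the plugin is installed; depending on the plugin version the deployment name may need to be given explicitly):
kubectl ingress-nginx backends -n ingress-nginx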
4) Check the prometheus-server ingress definition, which points to service eric-pm-server on port 9090:
eccd@control-plane-wfb88:~> kubectl get ing prometheus-server -n monitoring -o yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
...
spec:
rules:
- host: dc173prometheus1.cloud.k2.ericsson.se
http:
paths:
- backend:
serviceName: eric-pm-server
servicePort: 9090
path: /
pathType: ImplementationSpecific
status:
loadBalancer:
ingress:
- ip: 10.0.10.102
- ip: 10.0.10.103
- ip: 10.0.10.104
- ip: 10.0.10.108
5) Traffic will then be forwarded to the service clusterIP 10.107.185.250.
eccd@control-plane-mjnp4:~> kubectl get svc eric-pm-server -n monitoring -o yaml
spec:
clusterIP: 10.107.185.250
ports:
- name: http
port: 9090
protocol: TCP
targetPort: 9090
selector:
app: eric-pm-server
component: server
release: pm-server
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
control-plane-mjnp4:/home/eccd # iptables -t nat -S | grep 10.107.185.250
-A KUBE-SERVICES ! -s 192.168.0.0/16 -d 10.107.185.250/32 -p tcp -m comment --comment "monitoring/eric-pm-server:http cluster IP" -m tcp --dport 9090 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.107.185.250/32 -p tcp -m comment --comment "monitoring/eric-pm-server:http cluster IP" -m tcp --dport 9090 -j KUBE-SVC-6K4R6JTKTVNBU5UP
control-plane-mjnp4:/home/eccd #
6) The first rule in chain KUBE-SERVICES marks traffic that does not originate from the pod network (KUBE-MARK-MASQ).
The second rule in chain KUBE-SERVICES forwards the traffic to chain KUBE-SVC-6K4R6JTKTVNBU5UP.
control-plane-mjnp4:/home/eccd # iptables -t nat -S | grep KUBE-SVC-6K4R6JTKTVNBU5UP | grep -v "j KUBE-SVC-6K4R6JTKTVNBU5UP"
-N KUBE-SVC-6K4R6JTKTVNBU5UP
-A KUBE-SVC-6K4R6JTKTVNBU5UP -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-SEP-CNB2ICYJPHYJI3AY
control-plane-mjnp4:/home/eccd #
7) Traffic entering chain KUBE-SEP-CNB2ICYJPHYJI3AY is DNAT'ed to 192.168.22.247, which is the pod IP.
control-plane-mjnp4:/home/eccd # iptables -t nat -S | grep KUBE-SEP-CNB2ICYJPHYJI3AY | grep -v "j KUBE-SEP-CNB2ICYJPHYJI3AY"
-N KUBE-SEP-CNB2ICYJPHYJI3AY
-A KUBE-SEP-CNB2ICYJPHYJI3AY -s 192.168.22.247/32 -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-CNB2ICYJPHYJI3AY -p tcp -m comment --comment "monitoring/eric-pm-server:http" -m tcp -j DNAT --to-destination 192.168.22.247:9090
control-plane-mjnp4:/home/eccd #
eccd@control-plane-mjnp4:~> kubectl get po -l app=eric-pm-server -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
evnfm eric-pm-server-0 2/2 Running 0 4h29m 192.168.188.101 pool2-6859c57999-mzv2d <none> <none>
monitoring eric-pm-server-0 2/2 Running 0 2d1h 192.168.22.247 pool2-6859c57999-7hfs2 <none> <none>
eccd@control-plane-mjnp4:~>
Since the server in question is in the monitoring namespace, the relevant PM server is the second one.
<--------Third load balancing by K8s service - random selection in iptables mode-------->
There is only one eric-pm-server pod, so there is no load balancing at this step.
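If the service had multiple endpoints, kube-proxy would add statistic rules to the KUBE-SVC chain just as in the ingress-nginx example above. A hypothetical two-endpoint chain would look roughly like this (the endpoint chain names are placeholders):
-A KUBE-SVC-6K4R6JTKTVNBU5UP -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-<first-endpoint>
-A KUBE-SVC-6K4R6JTKTVNBU5UP -j KUBE-SEP-<second-endpoint>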
eccd@control-plane-mjnp4:~> kubectl get no -o wide | grep pool2-6859c57999-7hfs2
pool2-6859c57999-7hfs2 Ready worker 19d v1.18.8 10.0.10.105 <none> SUSE Linux Enterprise Server 15 SP1 4.12.14-197.51-default docker://19.3.11
eccd@control-plane-mjnp4:~> ssh 10.0.10.105 -q
Last login: Mon Sep 21 16:27:26 2020 from 10.0.10.101
eccd@pool2-6859c57999-7hfs2:~> sudo su
pool2-6859c57999-7hfs2:/home/eccd # iptables -t nat -S | grep 192.168.22.247
-A KUBE-SEP-CNB2ICYJPHYJI3AY -s 192.168.22.247/32 -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-CNB2ICYJPHYJI3AY -p tcp -m comment --comment "monitoring/eric-pm-server:http" -m tcp -j DNAT --to-destination 192.168.22.247:9090
pool2-6859c57999-7hfs2:/home/eccd #
Traffic enters the correct pod and we will now look at the return path.
iii. Traffic not matched by any of the above rules is marked with 0x8000 (KUBE-MARK-DROP).
control-plane-mjnp4:/home/eccd # iptables -t nat -S|grep KUBE-MARK-DROP | grep -v "j KUBE-MARK-DROP"
-N KUBE-MARK-DROP
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
control-plane-mjnp4:/home/eccd #
==============
Return path
The return path uses the default route if there is no explicit or more specific route for dst=10.210.150.87.
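Which route the node actually selects for the client IP can be checked directly on the node, for example:
ip route get 10.210.150.87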
1. Starting point - return traffic with source IP 192.168.22.247/32 is sent to the KUBE-MARK-MASQ chain.
pool2-6859c57999-7hfs2:/home/eccd # iptables -t nat -S | grep 192.168.22.247
-A KUBE-SEP-CNB2ICYJPHYJI3AY -s 192.168.22.247/32 -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-CNB2ICYJPHYJI3AY -p tcp -m comment --comment "monitoring/eric-pm-server:http" -m tcp -j DNAT --to-destination 192.168.22.247:9090
pool2-6859c57999-7hfs2:/home/eccd #
2. All return traffic is marked with 0x4000.
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
Chain KUBE-MARK-MASQ (242 references)
target prot opt source destination
MARK all -- anywhere anywhere MARK or 0x4000
3. Traffic with the KUBE-MARK-MASQ mark returns to chain KUBE-SEP-CNB2ICYJPHYJI3AY to check the next rule - there are no further rules to apply in chain KUBE-SEP-CNB2ICYJPHYJI3AY.
pool2-6859c57999-7hfs2:/home/eccd # iptables -t nat -S | grep KUBE-SEP-CNB2ICYJPHYJI3AY
-N KUBE-SEP-CNB2ICYJPHYJI3AY
-A KUBE-SEP-CNB2ICYJPHYJI3AY -s 192.168.22.247/32 -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-CNB2ICYJPHYJI3AY -p tcp -m comment --comment "monitoring/eric-pm-server:http" -m tcp -j DNAT --to-destination 192.168.22.247:9090
-A KUBE-SVC-6K4R6JTKTVNBU5UP -m comment --comment "monitoring/eric-pm-server:http" -j KUBE-SEP-CNB2ICYJPHYJI3AY
pool2-6859c57999-7hfs2:/home/eccd #
4. KUBE-MARK-MASQ adds a Netfilter mark to packets originating from the eric-pm-server service that are destined outside the cluster's network. Packets with this mark will be altered in a POSTROUTING rule to use source network address translation (SNAT) with the node's IP address as their source IP address.
From <https://www.stackrox.com/post/2020/01/kubernetes-networking-demystified/>
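The corresponding masquerade rule lives in the KUBE-POSTROUTING chain; the exact rule text varies with the kube-proxy version, but it can be listed on the node with:
iptables -t nat -S KUBE-POSTROUTING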
5. The destination IP does not match any explicit routing entry, so the default route is used.
pool2-6859c57999-7hfs2:/home/eccd # ip r | grep 10.210.150.87
pool2-6859c57999-7hfs2:/home/eccd #
pool2-6859c57999-7hfs2:/home/eccd # ip r | grep default
default via 10.33.152.65 dev ecfe_traf1 proto static metric 804
pool2-6859c57999-7hfs2:/home/eccd #
pool2-6859c57999-7hfs2:/home/eccd # ip a show ecfe_traf1
12: ecfe_traf1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 36:1f:d1:91:3a:16 brd ff:ff:ff:ff:ff:ff
inet 10.33.152.74/27 brd 10.33.152.95 scope global noprefixroute ecfe_traf1
valid_lft forever preferred_lft forever
pool2-6859c57999-7hfs2:/home/eccd #
At this stage, IP-SA=10.33.152.74, IP-DA=10.210.150.87.
Since the packet ingressed via VRF DC173_CCD1_om_vr, the egress traffic must return via the same VRF.
In CNIS 1.0 the default route belongs to the application (traffic) network, so the egress traffic needs to be explicitly steered out via the ccd_ecfe_om interface instead.
If ingress and egress traffic are routed via different VRFs, the TCP session will not be established.
Add a route (temporary workaround, not persistent):
for i in `kubectl get no -o json | jq -r '.items[].status.addresses[] | select(.type=="InternalIP") | .address'` ; do ssh $i -q sudo ip route add <web-browser-ip>/32 via <ccd_ecfe-SLX-om_vr-ve_anycast> ; done
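For this specific example (client 10.210.150.87 and OM anycast gateway 10.33.151.65 from the SLX configuration above), the command run on each node would expand to:
sudo ip route add 10.210.150.87/32 via 10.33.151.65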
The egress traffic path will be:
Prometheus pod (src IP = Prometheus pod IP) -> service clusterIP (iptables + conntrack) -> nginx ingress controller pod (src IP becomes the ingress pod IP) -> host iptables on the node hosting the nginx controller pod (src IP becomes the ingress/node IP) -> host OS routing table lookup -> ECFE gateway (anycast GW IP on the SLX) -> 10.210.150.87