Introduction to TC

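TC stands for "Traffic Control". In this area you may be more familiar with Linux iptables or netfilter, both of which can do packet mangling. TC focuses instead on the packet scheduler: it controls the delay, loss, transmission order, and rate of network packets.
In testing, the Linux tc tool can shape traffic on a chosen interface, for example by adding latency or configuring a packet loss rate, which makes it a handy tool for simulating complex network conditions.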

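TC scheduling structure

TC has four main components:

  • Queuing disciplines (qdisc for short): a qdisc is essentially a queue with a scheduling algorithm attached. The default algorithm is FIFO, which gives the simplest possible traffic scheduler.
  • Classes: a class subdivides the traffic handled by a qdisc. In practice many qdiscs coexist, each with its own responsibility, and classes organize them according to those responsibilities.
  • Filters: a filter inspects packets and steers them to the qdisc of the matching class.
  • Policers: a policer accompanies a filter and usually appears right after it. It defines what happens to packets that hit the filter, such as dropping, delaying, or rate limiting.

For example, the following script simulates network delay and packet loss: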

#!/bin/bash
interface=lo
ip=127.0.0.1
# delay setting
delay=100ms
# packet loss setting
loss=50%
# root prio qdisc; its classes are 1:1, 1:2 and 1:3
tc qdisc add dev $interface root handle 1: prio
# steer packets destined for $ip into class 1:1
tc filter add dev $interface parent 1:0 protocol ip prio 1 u32 match ip dst $ip flowid 1:1
# attach netem (delay and loss) to class 1:1
tc qdisc add dev $interface parent 1:1 handle 2: netem delay $delay loss $loss

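Figure: TC-flow.png
So how does TC relate to BPF?
Since kernel 4.1 the TC classifier can run BPF programs, and kernel 4.5 introduced a special qdisc called clsact. It gives TC an entry point for loading BPF programs, making TC, like XDP, a network hook that BPF programs can attach to.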

TC vs XDP

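Both hooks can serve the same use cases, such as DDoS mitigation, tunneling, and processing link-layer information. However, because XDP runs before any socket buffer (SKB) is allocated, it can reach higher throughput than programs on TC. TC programs, in turn, benefit from the additional parsed data exposed through struct __sk_buff, and they can run on both inbound and outbound traffic. On the TX path, TC is the last layer at which packets can still be manipulated.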

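In container networking, the TC BPF program is attached to the host-side veth, the peer of the container's network interface. From this program's point of view, RX carries from-container traffic and TX carries to-container traffic: filtering on TX applies to packets sent to the container, and filtering on RX applies to packets sent by the container. An XDP BPF program can be attached to the same host-side interface, but it only sees packets the container sends out, whether to external networks or to other containers.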
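The direction mapping is easy to get backwards, so here is a minimal sketch of it in plain C. The enum and function names are hypothetical and exist only for illustration; the mapping itself follows the paragraph above.

```c
#include <assert.h>

/* Direction of a packet relative to the container (hypothetical names). */
enum flow { FLOW_TO_CONTAINER, FLOW_FROM_CONTAINER };

/*
 * Returns the tc direction on the HOST-side veth that sees the given flow.
 * A veth pair behaves like a pipe: whatever the container transmits is
 * received (ingress) on the host end, and whatever the host end transmits
 * (egress) is delivered to the container.
 */
static const char *host_veth_direction(enum flow f)
{
    assert(f == FLOW_TO_CONTAINER || f == FLOW_FROM_CONTAINER);
    return f == FLOW_TO_CONTAINER ? "egress" : "ingress";
}
```

So a to-container filter belongs on egress of the host-side veth, and a from-container filter on ingress.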

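TC kernel code structure

A TC BPF program takes a single input parameter of type struct __sk_buff. This structure is a UAPI (user space API of the kernel) that exposes selected fields of the kernel's internal socket buffer. It has the same two pointers as struct xdp_md, data and data_end, with the same meaning, and it offers much more information besides: by the time a packet reaches the TC layer, the kernel has already parsed it and extracted protocol-related metadata, so the context passed to the BPF program is richer. The full declaration of __sk_buff is in include/uapi/linux/bpf.h and is reproduced below. It carries far more fields than the XDP context, which is part of why TC throughput is lower than XDP: allocating the socket buffer and populating this metadata has a real cost.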

/* user accessible mirror of in-kernel sk_buff.
 * new fields can only be added to the end of this structure
 */
struct __sk_buff {
    __u32 len;
    __u32 pkt_type;
    __u32 mark;
    __u32 queue_mapping;
    __u32 protocol;
    __u32 vlan_present;
    __u32 vlan_tci;
    __u32 vlan_proto;
    __u32 priority;
    __u32 ingress_ifindex;
    __u32 ifindex;
    __u32 tc_index;
    __u32 cb[5];
    __u32 hash;
    __u32 tc_classid;
    __u32 data;
    __u32 data_end;
    __u32 napi_id;
    /* Accessed by BPF_PROG_TYPE_sk_skb types from here to ... */
    __u32 family;
    __u32 remote_ip4; /* Stored in network byte order */
    __u32 local_ip4; /* Stored in network byte order */
    __u32 remote_ip6[4]; /* Stored in network byte order */
    __u32 local_ip6[4]; /* Stored in network byte order */
    __u32 remote_port; /* Stored in network byte order */
    __u32 local_port; /* stored in host byte order */
    /* ... here. */
    __u32 data_meta;
    __bpf_md_ptr(struct bpf_flow_keys *, flow_keys);
    __u64 tstamp;
    __u32 wire_len;
    __u32 gso_segs;
    __bpf_md_ptr(struct bpf_sock *, sk);
};
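The data and data_end fields delimit the packet bytes, and every header access has to be checked against data_end first. The following user-space sketch shows that pattern on an ordinary byte buffer. It is an illustration under assumptions, not kernel code: parse_ipv4 and struct ipv4_view are hypothetical names, and the fixed offsets assume an untagged Ethernet frame carrying IPv4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_HDR_LEN   14      /* dst MAC + src MAC + EtherType */
#define SKETCH_IPV4_HDR_LEN  20      /* IPv4 header without options */
#define SKETCH_ETHERTYPE_IP  0x0800

/* Fields picked out of the packet; addresses keep their wire order. */
struct ipv4_view {
    uint8_t protocol;
    uint8_t saddr[4];
    uint8_t daddr[4];
};

/*
 * Parse Ethernet + IPv4 from the byte range [data, data_end).
 * Returns 0 on success, -1 if the packet is too short or is not IPv4.
 */
static int parse_ipv4(const uint8_t *data, const uint8_t *data_end,
                      struct ipv4_view *out)
{
    assert(data != NULL && data_end != NULL && out != NULL);
    assert(data_end >= data);

    size_t len = (size_t)(data_end - data);

    /* Bounds check before touching the Ethernet header. */
    if (len < SKETCH_ETH_HDR_LEN)
        return -1;
    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);
    if (ethertype != SKETCH_ETHERTYPE_IP)
        return -1;

    /* Bounds check before touching the IPv4 header. */
    if (len < SKETCH_ETH_HDR_LEN + SKETCH_IPV4_HDR_LEN)
        return -1;
    const uint8_t *ip = data + SKETCH_ETH_HDR_LEN;

    out->protocol = ip[9];
    for (int i = 0; i < 4; i++) {
        out->saddr[i] = ip[12 + i];
        out->daddr[i] = ip[16 + i];
    }
    return 0;
}
```

In a real TC program the same two checks are what allow the verifier to accept the header reads; without them the program is rejected at load time.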

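TC return values

As with XDP, the return value of a TC program is an action that tells the kernel how to dispose of the packet. The actions are defined in include/uapi/linux/pkt_cls.h. Recent kernel versions define nine of them, all plain int values. The five most common are: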

Value  Action             Description
 0     TC_ACT_OK          Let the packet continue through the TC queue
 2     TC_ACT_SHOT        Drop the packet
-1     TC_ACT_UNSPEC      Use the standard TC action
 3     TC_ACT_PIPE        Perform the next action, if it exists
 1     TC_ACT_RECLASSIFY  Restart classification from the beginning
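As a quick reference, the five codes above can be written out in plain C. This is only a sketch: the SKETCH_ prefix and the tc_action_name function are hypothetical, chosen so the block does not clash with the real macros in linux/pkt_cls.h.

```c
/* Numeric values mirror the table above. */
enum sketch_tc_action {
    SKETCH_TC_ACT_UNSPEC     = -1,
    SKETCH_TC_ACT_OK         = 0,
    SKETCH_TC_ACT_RECLASSIFY = 1,
    SKETCH_TC_ACT_SHOT       = 2,
    SKETCH_TC_ACT_PIPE       = 3,
};

/* Map a return code to its symbolic name, e.g. when decoding debug logs. */
static const char *tc_action_name(int action)
{
    switch (action) {
    case SKETCH_TC_ACT_UNSPEC:     return "TC_ACT_UNSPEC";
    case SKETCH_TC_ACT_OK:         return "TC_ACT_OK";
    case SKETCH_TC_ACT_RECLASSIFY: return "TC_ACT_RECLASSIFY";
    case SKETCH_TC_ACT_SHOT:       return "TC_ACT_SHOT";
    case SKETCH_TC_ACT_PIPE:       return "TC_ACT_PIPE";
    default:                       return "unknown";
    }
}
```

The programs in this article only ever return TC_ACT_OK (pass) or TC_ACT_SHOT (drop).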

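Designing your first TC program

To stay close to the original goal of this series, understanding how the Cilium container network works, this time we use a container as the traffic-control target. Start an Nginx service with docker run in the test environment:

$ docker run -d -p 80:80 --name=nginx-xdp nginx:alpine

Then find the container's host-side interface:

$ ip a | grep veth
...
9: vethf87805f@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default qlen 1000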

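Filtering packets sent to the container with TC

We design a TC BPF program that filters packets sent to the container, with the following rules:

  • Allow hosts on the Docker container subnet to access this container.
  • Forbid hosts outside it from accessing this container over TCP.

When traffic from outside reaches the container, the docker0 bridge forwards it to the host-side veth interface. Because the bridge acts as the router, the source address seen by the BPF program is the docker0 bridge address, so any TCP packet whose source is the bridge address is treated as external traffic, as the figure shows:
Figure: tc-program-egress.png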
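Stripped of packet parsing, the rule boils down to a two-field decision. Here is a minimal user-space sketch of it, assuming the default docker0 address 172.17.0.1; drop_to_container and the SKETCH_ constants are hypothetical names used only for illustration.

```c
#include <arpa/inet.h>   /* htonl */
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PROTO_TCP     6             /* value of IPPROTO_TCP */
#define SKETCH_DOCKER0_ADDR  0xac110001u   /* 172.17.0.1 in host byte order (assumed) */

/*
 * Decide whether a packet heading to the container should be dropped.
 * saddr_net is the source address in network byte order, exactly as it
 * would be read from the IP header.
 */
static bool drop_to_container(uint8_t protocol, uint32_t saddr_net)
{
    if (protocol != SKETCH_PROTO_TCP)
        return false;                          /* only TCP is filtered */
    return saddr_net == htonl(SKETCH_DOCKER0_ADDR);
}
```

TCP from 172.17.0.1 is dropped, TCP from another container such as 172.17.0.3 passes, and non-TCP traffic is never touched.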

tc-ban-outside-network-vist-container-debug.c

#include <stdbool.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/pkt_cls.h>
#include <stdio.h>
#include "bpf_endian.h"
#include "bpf_helpers.h"

typedef unsigned int u32;
#define bpfprint(fmt, ...) \
    ({ \
        char ____fmt[] = fmt; \
        bpf_trace_printk(____fmt, sizeof(____fmt), \
                         ##__VA_ARGS__); \
    })

/*
  Return true for TCP packets whose source address is the docker0 bridge,
  i.e. traffic routed in from outside the container network.
*/
static __inline bool ban_network_outside_to_visit(void *data_begin, void *data_end){
  bpfprint("Entering ban_network_outside_to_visit \n");
  struct ethhdr *eth = data_begin;

  // Check packet's size
  // the pointer arithmetic is based on the size of data type, current_address plus int(1) means:
  // new_address= current_address + size_of(data type)
  if ((void *)(eth + 1) > data_end)
    return false;

  // Check if Ethernet frame has IP packet
  if (eth->h_proto == bpf_htons(ETH_P_IP))
  {
    struct iphdr *iph = (struct iphdr *)(eth + 1); // or (struct iphdr *)( ((void*)eth) + ETH_HLEN );
    if ((void *)(iph + 1) > data_end)
      return false;

    // Check if IP packet contains a TCP segment
    if (iph->protocol != IPPROTO_TCP)
      return false;

    // extract src ip and destination ip
    u32 ip_src = iph->saddr;
    u32 ip_dst = iph->daddr;

    // bpf_trace_printk takes at most three format arguments,
    // so each address is printed in two parts
    bpfprint("src ip addr1: %d.%d.%d\n",(ip_src) & 0xFF,(ip_src >> 8) & 0xFF,(ip_src >> 16) & 0xFF);
    bpfprint("src ip addr2:.%d\n",(ip_src >> 24) & 0xFF);
    bpfprint("dest ip addr1: %d.%d.%d\n",(ip_dst) & 0xFF,(ip_dst >> 8) & 0xFF,(ip_dst >> 16) & 0xFF);
    bpfprint("dest ip addr2: .%d\n",(ip_dst >> 24) & 0xFF);

    // docker0 addr 172.17.0.1
    u32 route_from_docker0_addr = bpf_htonl(0xac110001);
    // outside networks reach the container network through docker0
    if (ip_src == route_from_docker0_addr) {
      return true;
    }
  }
  return false;
}

SEC("to-container")
int tc_to_container(struct __sk_buff *skb)
{
  bpfprint("Entering to-container sec\n");
  void *data = (void *)(long)skb->data;
  void *data_end = (void *)(long)skb->data_end;

  if (ban_network_outside_to_visit(data, data_end))
    return TC_ACT_SHOT;
  else
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

Header files:
headers/bpf_endian.h

/* SPDX-License-Identifier: GPL-2.0 */
/* Copied from $(LINUX)/tools/testing/selftests/bpf/bpf_endian.h */
#ifndef __BPF_ENDIAN__
#define __BPF_ENDIAN__

#include <linux/swab.h>

/* LLVM's BPF target selects the endianness of the CPU
 * it compiles on, or the user specifies (bpfel/bpfeb),
 * respectively. The used __BYTE_ORDER__ is defined by
 * the compiler, we cannot rely on __BYTE_ORDER from
 * libc headers, since it doesn't reflect the actual
 * requested byte order.
 *
 * Note, LLVM's BPF target has different __builtin_bswapX()
 * semantics. It does map to BPF_ALU | BPF_END | BPF_TO_BE
 * in bpfel and bpfeb case, which means below, that we map
 * to cpu_to_be16(). We could use it unconditionally in BPF
 * case, but better not rely on it, so that this header here
 * can be used from application and BPF program side, which
 * use different targets.
 */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
# define __bpf_ntohs(x)           __builtin_bswap16(x)
# define __bpf_htons(x)           __builtin_bswap16(x)
# define __bpf_constant_ntohs(x)  ___constant_swab16(x)
# define __bpf_constant_htons(x)  ___constant_swab16(x)
# define __bpf_ntohl(x)           __builtin_bswap32(x)
# define __bpf_htonl(x)           __builtin_bswap32(x)
# define __bpf_constant_ntohl(x)  ___constant_swab32(x)
# define __bpf_constant_htonl(x)  ___constant_swab32(x)
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
# define __bpf_ntohs(x)           (x)
# define __bpf_htons(x)           (x)
# define __bpf_constant_ntohs(x)  (x)
# define __bpf_constant_htons(x)  (x)
# define __bpf_ntohl(x)           (x)
# define __bpf_htonl(x)           (x)
# define __bpf_constant_ntohl(x)  (x)
# define __bpf_constant_htonl(x)  (x)
#else
# error "Fix your compiler's __BYTE_ORDER__?!"
#endif

#define bpf_htons(x) \
    (__builtin_constant_p(x) ? \
     __bpf_constant_htons(x) : __bpf_htons(x))
#define bpf_ntohs(x) \
    (__builtin_constant_p(x) ? \
     __bpf_constant_ntohs(x) : __bpf_ntohs(x))
#define bpf_htonl(x) \
    (__builtin_constant_p(x) ? \
     __bpf_constant_htonl(x) : __bpf_htonl(x))
#define bpf_ntohl(x) \
    (__builtin_constant_p(x) ? \
     __bpf_constant_ntohl(x) : __bpf_ntohl(x))

#endif /* __BPF_ENDIAN__ */
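To see why the filter wraps constants such as 0xac110001 in bpf_htonl before comparing them with header fields, it can help to reproduce the conversion in user space. The sketch below uses the standard htonl and ntohl from arpa/inet.h in place of the BPF macros; ipv4_net is a hypothetical helper.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <stdint.h>
#include <string.h>

/*
 * Build an IPv4 address in network byte order from its four octets.
 * The octets are laid out in memory in wire order (first octet first),
 * which is exactly how saddr and daddr sit in the IP header.
 */
static uint32_t ipv4_net(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    uint8_t bytes[4] = { a, b, c, d };
    uint32_t addr;
    memcpy(&addr, bytes, sizeof(addr));
    return addr;
}
```

On any host, ipv4_net(172, 17, 0, 1) equals htonl(0xac110001): the hexadecimal constant is written in host order, and the conversion rearranges it to match the bytes on the wire.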

headers/bpf_helpers.h

/* SPDX-License-Identifier: GPL-2.0 */
/* Copied from $(LINUX)/tools/testing/selftests/bpf/bpf_helpers.h */
#ifndef __BPF_HELPERS_H
#define __BPF_HELPERS_H

/* helper macro to place programs, maps, license in
 * different sections in elf_bpf file. Section names
 * are interpreted by elf_bpf loader
 */
#define SEC(NAME) __attribute__((section(NAME), used))

#ifndef __inline
# define __inline \
    inline __attribute__((always_inline))
#endif

/* helper functions called from eBPF programs written in C */
static void *(*bpf_map_lookup_elem)(void *map, void *key) =
    (void *) BPF_FUNC_map_lookup_elem;
static int (*bpf_map_update_elem)(void *map, void *key, void *value,
                                  unsigned long long flags) =
    (void *) BPF_FUNC_map_update_elem;
static int (*bpf_map_delete_elem)(void *map, void *key) =
    (void *) BPF_FUNC_map_delete_elem;
static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
    (void *) BPF_FUNC_probe_read;
static unsigned long long (*bpf_ktime_get_ns)(void) =
    (void *) BPF_FUNC_ktime_get_ns;
static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
    (void *) BPF_FUNC_trace_printk;
static void (*bpf_tail_call)(void *ctx, void *map, int index) =
    (void *) BPF_FUNC_tail_call;
static unsigned long long (*bpf_get_smp_processor_id)(void) =
    (void *) BPF_FUNC_get_smp_processor_id;
static unsigned long long (*bpf_get_current_pid_tgid)(void) =
    (void *) BPF_FUNC_get_current_pid_tgid;
static unsigned long long (*bpf_get_current_uid_gid)(void) =
    (void *) BPF_FUNC_get_current_uid_gid;
static int (*bpf_get_current_comm)(void *buf, int buf_size) =
    (void *) BPF_FUNC_get_current_comm;
static unsigned long long (*bpf_perf_event_read)(void *map,
                                                 unsigned long long flags) =
    (void *) BPF_FUNC_perf_event_read;
static int (*bpf_clone_redirect)(void *ctx, int ifindex, int flags) =
    (void *) BPF_FUNC_clone_redirect;
static int (*bpf_redirect)(int ifindex, int flags) =
    (void *) BPF_FUNC_redirect;
static int (*bpf_perf_event_output)(void *ctx, void *map,
                                    unsigned long long flags, void *data,
                                    int size) =
    (void *) BPF_FUNC_perf_event_output;
static int (*bpf_get_stackid)(void *ctx, void *map, int flags) =
    (void *) BPF_FUNC_get_stackid;
static int (*bpf_probe_write_user)(void *dst, void *src, int size) =
    (void *) BPF_FUNC_probe_write_user;
static int (*bpf_current_task_under_cgroup)(void *map, int index) =
    (void *) BPF_FUNC_current_task_under_cgroup;
static int (*bpf_skb_get_tunnel_key)(void *ctx, void *key, int size, int flags) =
    (void *) BPF_FUNC_skb_get_tunnel_key;
static int (*bpf_skb_set_tunnel_key)(void *ctx, void *key, int size, int flags) =
    (void *) BPF_FUNC_skb_set_tunnel_key;
static int (*bpf_skb_get_tunnel_opt)(void *ctx, void *md, int size) =
    (void *) BPF_FUNC_skb_get_tunnel_opt;
static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) =
    (void *) BPF_FUNC_skb_set_tunnel_opt;
static unsigned long long (*bpf_get_prandom_u32)(void) =
    (void *) BPF_FUNC_get_prandom_u32;
static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
    (void *) BPF_FUNC_xdp_adjust_head;

/* llvm builtin functions that eBPF C program may use to
 * emit BPF_LD_ABS and BPF_LD_IND instructions
 */
struct sk_buff;
unsigned long long load_byte(void *skb,
                             unsigned long long off) asm("llvm.bpf.load.byte");
unsigned long long load_half(void *skb,
                             unsigned long long off) asm("llvm.bpf.load.half");
unsigned long long load_word(void *skb,
                             unsigned long long off) asm("llvm.bpf.load.word");

/* a helper structure used by eBPF C program
 * to describe map attributes to elf_bpf loader
 */
struct bpf_map_def {
    unsigned int type;
    unsigned int key_size;
    unsigned int value_size;
    unsigned int max_entries;
    unsigned int map_flags;
    unsigned int inner_map_idx;
};

static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
    (void *) BPF_FUNC_skb_load_bytes;
static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) =
    (void *) BPF_FUNC_skb_store_bytes;
static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flags) =
    (void *) BPF_FUNC_l3_csum_replace;
static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flags) =
    (void *) BPF_FUNC_l4_csum_replace;
static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
    (void *) BPF_FUNC_skb_under_cgroup;
static int (*bpf_skb_change_head)(void *, int len, int flags) =
    (void *) BPF_FUNC_skb_change_head;

#if defined(__x86_64__)

#define PT_REGS_PARM1(x) ((x)->di)
#define PT_REGS_PARM2(x) ((x)->si)
#define PT_REGS_PARM3(x) ((x)->dx)
#define PT_REGS_PARM4(x) ((x)->cx)
#define PT_REGS_PARM5(x) ((x)->r8)
#define PT_REGS_RET(x) ((x)->sp)
#define PT_REGS_FP(x) ((x)->bp)
#define PT_REGS_RC(x) ((x)->ax)
#define PT_REGS_SP(x) ((x)->sp)
#define PT_REGS_IP(x) ((x)->ip)

#elif defined(__s390x__)

#define PT_REGS_PARM1(x) ((x)->gprs[2])
#define PT_REGS_PARM2(x) ((x)->gprs[3])
#define PT_REGS_PARM3(x) ((x)->gprs[4])
#define PT_REGS_PARM4(x) ((x)->gprs[5])
#define PT_REGS_PARM5(x) ((x)->gprs[6])
#define PT_REGS_RET(x) ((x)->gprs[14])
#define PT_REGS_FP(x) ((x)->gprs[11]) /* Works only with CONFIG_FRAME_POINTER */
#define PT_REGS_RC(x) ((x)->gprs[2])
#define PT_REGS_SP(x) ((x)->gprs[15])
#define PT_REGS_IP(x) ((x)->psw.addr)

#elif defined(__aarch64__)

#define PT_REGS_PARM1(x) ((x)->regs[0])
#define PT_REGS_PARM2(x) ((x)->regs[1])
#define PT_REGS_PARM3(x) ((x)->regs[2])
#define PT_REGS_PARM4(x) ((x)->regs[3])
#define PT_REGS_PARM5(x) ((x)->regs[4])
#define PT_REGS_RET(x) ((x)->regs[30])
#define PT_REGS_FP(x) ((x)->regs[29]) /* Works only with CONFIG_FRAME_POINTER */
#define PT_REGS_RC(x) ((x)->regs[0])
#define PT_REGS_SP(x) ((x)->sp)
#define PT_REGS_IP(x) ((x)->pc)

#elif defined(__powerpc__)

#define PT_REGS_PARM1(x) ((x)->gpr[3])
#define PT_REGS_PARM2(x) ((x)->gpr[4])
#define PT_REGS_PARM3(x) ((x)->gpr[5])
#define PT_REGS_PARM4(x) ((x)->gpr[6])
#define PT_REGS_PARM5(x) ((x)->gpr[7])
#define PT_REGS_RC(x) ((x)->gpr[3])
#define PT_REGS_SP(x) ((x)->sp)
#define PT_REGS_IP(x) ((x)->nip)

#elif defined(__sparc__)

#define PT_REGS_PARM1(x) ((x)->u_regs[UREG_I0])
#define PT_REGS_PARM2(x) ((x)->u_regs[UREG_I1])
#define PT_REGS_PARM3(x) ((x)->u_regs[UREG_I2])
#define PT_REGS_PARM4(x) ((x)->u_regs[UREG_I3])
#define PT_REGS_PARM5(x) ((x)->u_regs[UREG_I4])
#define PT_REGS_RET(x) ((x)->u_regs[UREG_I7])
#define PT_REGS_RC(x) ((x)->u_regs[UREG_I0])
#define PT_REGS_SP(x) ((x)->u_regs[UREG_FP])
#if defined(__arch64__)
#define PT_REGS_IP(x) ((x)->tpc)
#else
#define PT_REGS_IP(x) ((x)->pc)
#endif

#endif

#ifdef __powerpc__
#define BPF_KPROBE_READ_RET_IP(ip, ctx) ({ (ip) = (ctx)->link; })
#define BPF_KRETPROBE_READ_RET_IP BPF_KPROBE_READ_RET_IP
#elif defined(__sparc__)
#define BPF_KPROBE_READ_RET_IP(ip, ctx) ({ (ip) = PT_REGS_RET(ctx); })
#define BPF_KRETPROBE_READ_RET_IP BPF_KPROBE_READ_RET_IP
#else
#define BPF_KPROBE_READ_RET_IP(ip, ctx) ({ \
    bpf_probe_read(&(ip), sizeof(ip), (void *)PT_REGS_RET(ctx)); })
#define BPF_KRETPROBE_READ_RET_IP(ip, ctx) ({ \
    bpf_probe_read(&(ip), sizeof(ip), \
                   (void *)(PT_REGS_FP(ctx) + sizeof(ip))); })
#endif

#endif

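Compiling the code

As with XDP programs, the code is compiled with clang. The difference is that it includes local header files, so the -I option is needed to specify the header directory: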

clang -I ./headers/ -O2 -target bpf -c tc-ban-outside-network-vist-container-debug.c  -o tc-ban-outside-network-vist-container-debug.o

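Loading the BPF program onto TC egress

A BPF program can be loaded onto TC with the tc tool. As mentioned above, the unit TC works with is the qdisc, and BPF programs are loaded through a special qdisc, clsact. The general form of the commands is:

# create a clsact qdisc on the target device
tc qdisc add dev [network-device] clsact
# load the BPF program
tc filter add dev [network-device] <direction> bpf da obj [object-name] sec [section-name]
# show the attached filters
tc filter show dev [network-device] <direction>
# help
tc filter help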

A few notes:

  • The <direction> parameter in the example selects the network path the BPF program is attached to; its value is either ingress or egress.
  • There is also the inconspicuous parameter da, short for direct-action. From the help text:
    direct-action | da
    instructs eBPF classifier to not invoke external TC actions, instead use the TC actions return codes (TC_ACT_OK, TC_ACT_SHOT etc.) for classifiers.

The actual commands that put the egress path under TC BPF control are:

# create a clsact qdisc on the target device
$ tc qdisc add dev vethf87805f clsact

# load the BPF program onto egress
$ tc filter add dev vethf87805f egress bpf da obj tc-ban-outside-network-vist-container-debug.o sec to-container

# show the result
$ tc filter show dev vethf87805f egress
filter protocol all pref 49152 bpf chain 0
filter protocol all pref 49152 bpf chain 0 handle 0x1 tc-ban-outside-network-vist-container-debug.o:[to-container] direct-action not_in_hw id 32 tag f1ca072d3dd58af4 jited

Testing the result

Access the container's port from the host.

Before attaching the TC egress BPF program:

$ curl 172.17.0.2

$ curl 127.0.0.1
# both return the page normally

<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

After attaching the TC egress BPF program:

$ curl 172.17.0.2

$ curl 127.0.0.1
# both hang

From inside a container on the same network, the request still succeeds:

$ sudo docker run -it --rm busybox sh
$ container>  wget 172.17.0.2 
Connecting to 172.17.0.2 (172.17.0.2:80)
saving to 'index.html'
index.html           100% |*******************************************************************************************************************************************************************|   612  0:00:00 ETA
'index.html' saved

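This blocks access from outside the container network while still allowing hosts inside the container subnet to reach the nginx-xdp port.

Removing the egress BPF program from TC

$ tc filter del dev vethf87805f egress

Filtering container access to external networks with TC

The ingress side of the host-side veth receives what the container's interface sends out (from-container traffic), so it can be used to control the container's outbound access. Try the following restrictions:

  • The container can reach every node on the container subnet 172.17.0.0/16 over TCP.
  • The container cannot reach networks outside the container subnet over TCP.
  • External hosts can still access the container's ports.

Figure: tc-program-ingress.png

A natural question: if the container is not allowed to reach external networks, how do the replies get out when an external host accesses the container? External access to the container is routed through docker0: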

$ ip route 
...
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
...

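Every packet that comes from an external network into the container network has source IP 172.17.0.1 (the host has to route to the container network through the docker0 bridge), so the reply has destination IP 172.17.0.1. When the container itself initiates a connection to the outside, the destination IP is the external address. It is therefore enough to check whether the destination IP belongs to the container network. The code, tc-ban-container-visit-outsize-network-debug.c, is as follows: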

#include <stdbool.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/pkt_cls.h>
#include <stdio.h>

#include "bpf_endian.h"
#include "bpf_helpers.h"

typedef unsigned int        u32;
#define bpfprint(fmt, ...)                        \
    ({                                             \
        char ____fmt[] = fmt;                      \
        bpf_trace_printk(____fmt, sizeof(____fmt), \
                         ##__VA_ARGS__);           \
    })

/*
  Return true for TCP packets whose destination lies outside the
  container network.
*/
static __inline bool ban_container_visit_outsize_network(void *data_begin, void *data_end){
  bpfprint("Entering ban_container_visit_outsize_network \n");
  struct ethhdr *eth = data_begin;

  // Check packet's size
  // the pointer arithmetic is based on the size of data type, current_address plus int(1) means:
  // new_address= current_address + size_of(data type)
  if ((void *)(eth + 1) > data_end) //
    return false;

  // Check if Ethernet frame has IP packet
  if (eth->h_proto == bpf_htons(ETH_P_IP))
  {
    struct iphdr *iph = (struct iphdr *)(eth + 1); // or (struct iphdr *)( ((void*)eth) + ETH_HLEN );
    if ((void *)(iph + 1) > data_end)
      return false;


    // Check if IP packet contains a TCP segment
    if (iph->protocol != IPPROTO_TCP)
      return false;

    // extract src ip and destination ip
    u32 ip_src = iph->saddr;
    u32 ip_dst = iph->daddr;

    bpfprint("src ip addr1: %d.%d.%d\n",(ip_src) & 0xFF,(ip_src >> 8) & 0xFF,(ip_src >> 16) & 0xFF);
    bpfprint("src ip addr2:.%d\n",(ip_src >> 24) & 0xFF);

    bpfprint("dest ip addr1: %d.%d.%d\n",(ip_dst) & 0xFF,(ip_dst >> 8) & 0xFF,(ip_dst >> 16) & 0xFF);
    bpfprint("dest ip addr2: .%d\n",(ip_dst >> 24) & 0xFF);

    // net mask 255.255.0.0, matching the 172.17.0.0/16 container network
    u32 mask = bpf_htonl(0xffff0000);
    // check dst if is container network
    u32 dst_net_seg = ip_dst & mask;
    // docker container network: 172.17.0.0
    u32 container_net_seg = bpf_htonl(0xac110000);

    // dst net segment is not in container network 
    if (dst_net_seg != container_net_seg) {
      return true;
    }

  }
  return false;
}


SEC("from-container")
int tc_from_container(struct __sk_buff *skb)
{

  bpfprint("Entering from-container section\n");
  void *data = (void *)(long)skb->data;
  void *data_end = (void *)(long)skb->data_end;


  if (ban_container_visit_outsize_network(data, data_end))
    return TC_ACT_SHOT;
  else
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
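The subnet test at the heart of this filter can be tried out in user space. Below is a minimal sketch, assuming the default Docker network 172.17.0.0/16; in_subnet is a hypothetical helper, and it takes a prefix length instead of a hard-coded mask so that the effect of /16 versus /24 is easy to compare.

```c
#include <arpa/inet.h>   /* ntohl */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Check whether addr_net (network byte order, as read from the IP header)
 * lies inside network_host/prefix_len (network given in host byte order).
 */
static bool in_subnet(uint32_t addr_net, uint32_t network_host,
                      unsigned prefix_len)
{
    assert(prefix_len <= 32);
    /* A shift by 32 is undefined in C, so /0 is handled separately. */
    uint32_t mask_host = prefix_len == 0
                             ? 0
                             : 0xffffffffu << (32 - prefix_len);
    return (ntohl(addr_net) & mask_host) == (network_host & mask_host);
}
```

With a /16 prefix, 172.17.1.3 counts as part of the container network; with /24 it would be treated as external and dropped, which is why the mask has to match the subnet that docker0 actually uses.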

Compiling the code

$ clang -I ./headers/ -O2 -target bpf -c tc-  -o tc-ban-outside-network-vist-container-debug.o

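Loading the BPF program onto TC ingress

This attaches the BPF program to the tc ingress path:

# create a clsact qdisc on the target device (skip this if it was already added)
$ tc qdisc add dev vethf87805f clsact

# load the BPF program onto ingress
$ tc filter add dev vethf87805f ingress bpf da obj tc-ban-outside-network-vist-container-debug.o sec from-container

# show the result
$ tc filter show dev vethf87805f ingress

Testing the result

Access an external service from the nginx-xdp container: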

$ sudo docker exec -it nginx-xdp sh
container-nginx>$ curl baidu.com -vvv
*   Trying 220.181.38.148:80...
* connect to 220.181.38.148 port 80 failed: Operation timed out
*   Trying 39.156.69.79:80...
* After 83733ms connect time, move on!
* connect to 39.156.69.79 port 80 failed: Operation timed out
* Failed to connect to baidu.com port 80: Operation timed out
* Closing connection 0
curl: (28) Failed to connect to baidu.com port 80: Operation timed out


Start a new nginx container:

$ sudo docker run -d --name nginx-test nginx:alpine
$ sudo docker exec -it nginx-test ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.3/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

From inside the nginx-xdp container, access the nginx-test container:

container-nginx> / # curl 172.17.0.3
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

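The filter behaves as specified above.

Debugging BPF programs

Once a BPF program is attached with the ip or tc tool, it runs in the background. If it has a bug or does not behave as required, how do you debug it? gdb cannot trace into the kernel, so the practical approach is to print logs and inspect the program's intermediate state. The following explains how a BPF program prints logs and where the output goes.

Adding log output

The obvious idea is to call printf at suitable places, but the program runs in the kernel, where printf is not implemented, so it cannot be used. In fact, BPF programs can use only a small part of the C library and cannot call external libraries.

Using helper functions

The most common way around this limitation is to use BPF helper functions. The bpf_trace_printk() helper writes log messages, formatted as the program specifies, to the kernel tracing directory (/sys/kernel/debug/tracing/). These messages can then be used to analyze and find errors in the execution of the BPF program.

The following code wraps the helper in a print macro: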

#include <stdbool.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/pkt_cls.h>
#include <stdio.h>
#include "bpf_endian.h"
#include "bpf_helpers.h"
typedef unsigned int    u32;
#define bpfprint(fmt, ...)                        \
    ({                                             \
        char ____fmt[] = fmt;                      \
        bpf_trace_printk(____fmt, sizeof(____fmt), \
                         ##__VA_ARGS__);           \
    })
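bpf_trace_printk accepts at most three format arguments, which is why the programs above print each IP address in two messages. The user-space sketch below mirrors that split with snprintf. The function names are hypothetical; unlike the shift-based code above, which assumes a little-endian host, it reads the octets in memory order and so works on either endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copy the four octets of a network-byte-order address in wire order. */
static void ipv4_octets(uint32_t addr_net, uint8_t out[4])
{
    assert(out != NULL);
    memcpy(out, &addr_net, 4);
}

/*
 * Format addr_net as dotted decimal using two calls, each with no more
 * than three format arguments, as a BPF program would have to do.
 * Returns 0 on success, -1 if buf is too small.
 */
static int format_ipv4(uint32_t addr_net, char *buf, size_t len)
{
    uint8_t o[4];
    ipv4_octets(addr_net, o);

    /* First message: three arguments. */
    int n = snprintf(buf, len, "%u.%u.%u",
                     (unsigned)o[0], (unsigned)o[1], (unsigned)o[2]);
    if (n < 0 || (size_t)n >= len)
        return -1;

    /* Second message: the remaining octet. */
    int m = snprintf(buf + n, len - (size_t)n, ".%u", (unsigned)o[3]);
    if (m < 0 || (size_t)m >= len - (size_t)n)
        return -1;
    return 0;
}
```

In the trace output the two halves arrive as separate lines, so they have to be read together when reconstructing an address.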

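Viewing debug output

With the logging macro in place, how do we see the output? The tracing directory mentioned above holds the files for the different trace logs. The output of our BPF program can be read from /sys/kernel/debug/tracing/trace_pipe:

# trace_pipe is a stream; cat keeps reading as new messages arrive
$ cat /sys/kernel/debug/tracing/trace_pipe
# an alternative
$ tail -f /sys/kernel/debug/tracing/trace

References

Your first TC BPF program: https://davidlovezoe.club/wordpress/archives/952
Debugging your BPF program: https://davidlovezoe.club/wordpress/archives/963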