1. Product Classification

1.1 Overview

InfiniBand enabled

  • H-series
  • N-series

InfiniBand OFED 5.1 support

  • H-series: HB, HC, HBv2, HBv3
  • N-series: NDv2, NDv4

RDMA capable

  • HDR rate: HBv3, HBv2
  • EDR rate: HB, HC, NDv2
  • FDR rate: H16r, H16mr, and some other RDMA-capable N-series

SR-IOV support

  • Azure HPC VMs currently fall into two categories, depending on whether the VM supports SR-IOV for InfiniBand. Almost all newer sizes, whether RDMA-capable or InfiniBand-enabled, use SR-IOV, except for H16r, H16mr and NC24r (a quick check from inside the guest is sketched after this list).
  • RDMA is supported only over IB, not over Ethernet. (On non-SR-IOV VMs, RDMA requires the ND drivers - Network Direct.)
  • IPoIB is supported only on SR-IOV VMs.
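A quick way to tell which category a VM falls into is to check, from inside the guest, whether the InfiniBand NIC shows up as a Mellanox Virtual Function on the PCI bus (which is what SR-IOV passthrough looks like) and to read the link rate from sysfs. A minimal sketch, assuming lspci and the InfiniBand drivers are available in the guest:

# On an SR-IOV VM the IB NIC appears as a Mellanox Virtual Function
if lspci | grep -qi 'Mellanox.*Virtual Function'; then
    echo "SR-IOV InfiniBand VM"
else
    echo "non-SR-IOV VM (Network Direct) or no RDMA"
fi

# Link rate as seen by the guest (FDR ~ 56Gb/s, EDR ~ 100Gb/s, HDR ~ 200Gb/s)
cat /sys/class/infiniband/*/ports/*/rate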

Terminology

  • pNUMA: physical NUMA domain
  • vNUMA: virtualized NUMA domain
  • pCore: physical CPU core
  • vCore: virtualized CPU core

1.2 VM Sizes in Detail

1.2.1 HB-series

Hardware Specification

  • Cores: 60 (SMT disabled)
  • CPU: AMD EPYC 7551 (2*32core)
  • CPU Frequency (non-AVX): 2.55GHz
  • Memory: 4GB/core (240GB total)
  • Local Disk: 700 GB SSD
  • InfiniBand: 100Gb EDR Mellanox ConnectX-5
  • Network: 50Gb Ethernet (40 Gb usable) Azure second Gen SmartNIC

[Figure 1]
Notes:

  1. 16 NUMA nodes in total, 4 cores per CCX (this layout can be verified from inside the VM; see the sketch after this list)
  2. NUMA0 (CPUs 0-3) is reserved for the hypervisor; the VM uses the remaining 60 cores
  3. Hyper-threading is disabled
  4. Each pair of adjacent CCXs shares two DRAM channels (32GB per DRAM)
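This layout can be confirmed from inside the VM. A minimal sketch, assuming numactl is installed in the guest:

# Each of the 15 VM-visible NUMA nodes should list one 4-core CCX plus its memory share
numactl --hardware

# Cross-check core count, SMT setting and NUMA node count
lscpu | grep -E '^CPU\(s\)|Thread\(s\) per core|NUMA node'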

1.2.2 HBv2-series

Hardware Specification

  • Cores: 120 (SMT disabled)
  • CPU: AMD EPYC 7742 (2*64core)
  • CPU Frequency (non-AVX): 3.1GHz
  • Memory: 4GB/core (480GB total)
  • Local Disk: 960GB NVMe, 480GB SSD
  • InfiniBand: 200Gb HDR Mellanox ConnectX-6
  • Network: 50Gb Ethernet (40 Gb usable) Azure second Gen SmartNIC

Notes:

  1. 32 NUMA nodes in total, 4 cores per CCX
  2. NUMA0 and NUMA16 (the first CCX on each socket) are reserved for the hypervisor; the VM uses the remaining 120 cores
  3. Hyper-threading is disabled
  4. Every four adjacent CCXs share two DRAM channels (for MPI pinning on this layout, see the sketch after this list)
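Since every VM-visible NUMA node is a single 4-core CCX, MPI ranks are usually pinned one per NUMA node so that each rank keeps its own L3 slice. A minimal Open MPI sketch (the hostfile and binary names are made up; the flags follow standard Open MPI 4.x mapping syntax):

# One rank per VM NUMA node (i.e. per CCX), 4 cores each -> 30 ranks per HBv2 VM
mpirun -np 30 --hostfile ./hosts \
       --map-by ppr:1:numa --bind-to numa \
       --report-bindings ./my_mpi_app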

1.2.3 HBv3-series

Hardware Specification

  • Cores: 120, 96, 64, 32, 16 (SMT disabled)
  • CPU: AMD EPYC 7V13 (2*64core)
  • CPU Frequency (non-AVX): 3.1GHz (all cores), 3.675GHz (up to 10 cores)
  • Memory: 448GB
  • Local Disk: 2*960GB NVMe, 480GB SSD
  • InfiniBand: 200Gb HDR Mellanox ConnectX-6
  • Network: 50Gb Ethernet (40 Gb usable) Azure second Gen SmartNIC

[Figure 2]
Notes:

  1. NPS2
  2. Each NUMA node has 4 directly attached DRAM channels running at up to 3200MT/s
  3. 8 physical cores per server are reserved for the hypervisor: the first two cores of the first CCD in each NUMA domain
  4. The 8 physical cores of each CCD share 32MB of L3 cache
  5. The 120-, 96-, 64-, 32- and 16-core SKUs take 30, 24, 16, 8 and 4 cores respectively from each of the 4 NUMA domains (implemented by reducing the physical cores exposed to the VM; the other shared resources are unchanged); see the sketch after this list
  6. The NIC is passed through to the VM via SR-IOV
  7. Supports Adaptive Routing and Dynamic Connected Transport (DCT), in addition to the standard RC and UD transports
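The constrained-core SKUs can be verified from inside the VM: only the cores-per-NUMA-domain count changes, while memory, L3 and the InfiniBand NIC stay the same. A minimal sketch using standard tools:

# Cores per NUMA domain: expect 30 / 24 / 16 / 8 / 4 for the 120 / 96 / 64 / 32 / 16-core SKUs
lscpu | grep 'NUMA node.*CPU(s)'

# Memory and IB link rate should be identical across the SKUs
free -g
cat /sys/class/infiniband/*/ports/*/rate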

1.2.4 HC-series

Hardware Specification

  • Cores: 44 (HT disabled)
  • CPU: Intel Xeon Platinum 8168 (2*24core)
  • CPU Frequency (non-AVX): 2.7-3.4GHz (all cores), 3.7GHz (single core)
  • Memory: 8GB/core (352GB total)
  • Local Disk: 700GB SSD
  • InfiniBand: 100Gb EDR Mellanox ConnectX-5
  • Network: 50Gb Ethernet (40 Gb usable) Azure second Gen SmartNIC

[Figure 3]
Notes:

  1. pCores 0-1 and 24-25 (the first two pCores on each socket) are reserved for the hypervisor
  2. Each NUMA domain has 6 DRAM channels

1.3 NIC Configuration

RDMA-capable VM sizes are configured with two NICs: one running the Ethernet network and one running the RDMA network.
Configuring the IB network (a minimal example is sketched below):
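On the CentOS-HPC images the ib0 interface is brought up automatically; on a plain image the IPoIB module has to be loaded and an interface config created by hand. A minimal sketch for CentOS 7, assuming the InfiniBand drivers are already installed; the static address below is purely illustrative (on Azure the IPoIB address is normally assigned for you):

# Load the IPoIB module
sudo modprobe ib_ipoib

# Hypothetical static config for ib0 (address is illustrative only)
cat <<'EOF' | sudo tee /etc/sysconfig/network-scripts/ifcfg-ib0
TYPE=InfiniBand
DEVICE=ib0
ONBOOT=yes
BOOTPROTO=none
IPADDR=172.16.1.48
PREFIX=16
EOF

sudo ifup ib0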

1.4 Known Issues

1.4.1 qp0 Access Control

To prevent the security risks that would come with access to the underlying hardware, the guest cannot use Queue Pair 0. This should only affect management of the ConnectX NIC or running diagnostics tools such as ibdiagnet; it should not affect user applications (a quick sanity check is sketched below).
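User traffic runs over ordinary QPs (RC/UD/DCT), so the restriction is normally invisible to applications. A quick sanity check, assuming the perftest tools from the Mellanox stack are installed and using a made-up peer hostname:

# Server side
ib_write_bw -d mlx5_ib0

# Client side
ib_write_bw -d mlx5_ib0 server-vm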

1.4.2 Duplicate MAC with cloud-init with Ubuntu on H-series and N-series VMs

This is a known issue and will be fixed in newer kernels. It can appear after a VM reboot or after creating a VM image. See the link in the reference documentation for the workaround.

2. Application Scenarios

The InfiniBand network accelerates the scalability and performance of MPI (Message Passing Interface) applications.

Architecture diagram of common MPI libraries:
[Figure 4]

Some MPI programs need the partition key (pkey) to be specified in order to achieve tenant isolation, as shown below (a usage sketch follows the output):
pkey

cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/0
0x800b
cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/1
0x7fff
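How the pkey is consumed depends on the MPI stack; for a UCX-based MPI (e.g. Open MPI built against UCX) it can be handed over through UCX's UCX_IB_PKEY parameter. A minimal sketch, with made-up hostfile and binary names:

# The tenant-specific key is at index 0; index 1 holds the default 0x7fff key
PKEY=$(cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/0)

# Pass it to the ranks of a UCX-based MPI job
mpirun -np 2 --hostfile ./hosts -x UCX_IB_PKEY=$PKEY ./osu_latency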

3. Billing

Conclusion
The pricing shows that for the HB120-16rs_v3, HB120-32rs_v3, HB120-64rs_v3, HB120-96rs_v3 and HB120_v3 sizes, only one VM is sold per physical machine no matter how many CPUs the customer buys; the unused CPUs are simply not exposed to the VM.

4. Virtual Machine Configuration

Selected VM size: HBv3
Image: CentOS_HPC 7.9
ip a

[azureuser@hbv3 ~]$ ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:22:48:20:34:94 brd ff:ff:ff:ff:ff:ff
inet 10.1.0.4/24 brd 10.1.0.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet6 fe80::222:48ff:fe20:3494/64 scope link
valid_lft forever preferred_lft forever
3: eth1: mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:15:5d:33:ff:39 brd ff:ff:ff:ff:ff:ff
4: ib0: mtu 2044 qdisc mq state UP group default qlen 256
link/infiniband 00:00:09:28:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:39 brd 00:ff:ff:ff:ff:12:40:1b:80:52:00:00:00:00:00:00:ff:ff:ff:ff
inet 172.16.1.48/16 brd 172.16.255.255 scope global ib0
valid_lft forever preferred_lft forever
inet6 fe80::215:5dff:fd33:ff39/64 scope link
valid_lft forever preferred_lft forever

lspci

[azureuser@hbv3 ~]$ lspci
0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
52af:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
6118:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
f71d:00:02.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]

ethtool -i ethX

[azureuser@hbv3 ~]$ ethtool -i eth0
driver: hv_netvsc
version:
firmware-version: N/A
expansion-rom-version:
bus-info:
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[azureuser@hbv3 ~]$ ethtool -i eth1
driver: hv_netvsc
version:
firmware-version: N/A
expansion-rom-version:
bus-info:
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

lsmod

[azureuser@hbv3 ~]$ lsmod |grep ib
ib_ipoib 124872 0
ib_cm 53085 2 rdma_cm,ib_ipoib
ib_umad 27744 0
mlx5_ib 371350 0
ib_uverbs 132788 2 mlx5_ib,rdma_ucm
ib_core 357701 8 rdma_cm,ib_cm,iw_cm,mlx5_ib,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
libcrc32c 12644 2 xfs,nf_conntrack
mlx5_core 1330983 1 mlx5_ib
libata 243094 3 pata_acpi,ata_generic,ata_piix
mlx_compat 55063 10 rdma_cm,ib_cm,iw_cm,mlx5_ib,ib_core,ib_umad,ib_uverbs,mlx5_core,rdma_ucm,ib_ipoib

lscpu

[azureuser@hbv3 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 120
On-line CPU(s) list: 0-119
Thread(s) per core: 1
Core(s) per socket: 60
Socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7V13 64-Core Processor
Stepping: 0
CPU MHz: 2445.403
BogoMIPS: 4890.80
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-29
NUMA node1 CPU(s): 30-59
NUMA node2 CPU(s): 60-89
NUMA node3 CPU(s): 90-119
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single retpoline_amd vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat umip vaes vpclmulqdq

rpm

[azureuser@hbv3 ~]$ rpm -qa|grep ofed
ofed-scripts-5.2-OFED.5.2.2.2.3.x86_64
mlnxofed-docs-5.2-2.2.3.0.noarch
[azureuser@hbv3 ~]$ rpm -qa|grep mpi
mpi-selector-1.0.3-1.52223.x86_64
mpitests_openmpi-3.2.20-5d20b49.52223.x86_64
intel-mpi-doc-2018-2018.4-274.x86_64
intel-mpi-samples-2018.4-274-2018.4-274.x86_64
intel-mpi-installer-license-2018.4-274-2018.4-274.x86_64
intel-mpi-sdk-2018.4-274-2018.4-274.x86_64
openmpi-4.1.0rc5-1.52223.x86_64
intel-mpi-rt-2018.4-274-2018.4-274.x86_64
intel-mpi-psxe-2018.4-057-2018.4-057.x86_64
[azureuser@hbv3 ~]$ rpm -qa|grep sharp
sharp-2.4.5.MLNX20210302.7c3c223-1.52223.x86_64
[azureuser@hbv3 ~]$ rpm -qa|grep sm
opensm-libs-5.8.1.MLNX20210120.81574f7-0.1.52223.x86_64
libsmartcols-2.23.2-65.el7_9.1.x86_64
opensm-5.8.1.MLNX20210120.81574f7-0.1.52223.x86_64
opensm-static-5.8.1.MLNX20210120.81574f7-0.1.52223.x86_64
smartmontools-7.0-2.el7.x86_64
psmisc-22.20-17.el7.x86_64
opensm-devel-5.8.1.MLNX20210120.81574f7-0.1.52223.x86_64

pkey

[azureuser@hbv3 site-packages]$ cd /sys/class/infiniband/mlx5_ib0/ports/1
[azureuser@hbv3 1]$ ls
cap_mask cm_rx_msgs cm_tx_retries gid_attrs has_smi lid link_layer pkeys sm_lid state
cm_rx_duplicates cm_tx_msgs counters gids hw_counters lid_mask_count phys_state rate sm_sl
[azureuser@hbv3 1]$ cd pkeys/
[azureuser@hbv3 pkeys]$ ls
0 102 107 111 116 120 125 15 2 24 29 33 38 42 47 51 56 60 65 7 74 79 83 88 92 97
1 103 108 112 117 121 126 16 20 25 3 34 39 43 48 52 57 61 66 70 75 8 84 89 93 98
10 104 109 113 118 122 127 17 21 26 30 35 4 44 49 53 58 62 67 71 76 80 85 9 94 99
100 105 11 114 119 123 13 18 22 27 31 36 40 45 5 54 59 63 68 72 77 81 86 90 95
101 106 110 115 12 124 14 19 23 28 32 37 41 46 50 55 6 64 69 73 78 82 87 91 96
[azureuser@hbv3 pkeys]$
[azureuser@hbv3 pkeys]$ cat 120
0x0000
[azureuser@hbv3 pkeys]$ cat 119
0x0000
[azureuser@hbv3 pkeys]$ cat 0
0x8052
[azureuser@hbv3 pkeys]$ cat 1
0x7fff
[azureuser@hbv3 pkeys]$ cat 2
0x0000
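Only a few of the 128 pkey slots are populated; the rest read back as 0x0000. A minimal sketch to print just the non-empty entries:

cd /sys/class/infiniband/mlx5_ib0/ports/1/pkeys
for f in *; do
    v=$(cat "$f")
    # skip empty slots
    [ "$v" != "0x0000" ] && echo "index $f: $v"
done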

CentOS 7.9 non-HPC image
ip a

[azureuser@hbv3 ~]$ ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:0d:3a:11:81:b1 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.4/24 brd 10.0.0.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet6 fe80::20d:3aff:fe11:81b1/64 scope link
valid_lft forever preferred_lft forever
3: eth1: mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:15:5d:34:00:0a brd ff:ff:ff:ff:ff:ff
4: eth2: mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
link/ether 00:0d:3a:11:81:b1 brd ff:ff:ff:ff:ff:ff

lspci

[azureuser@hbv3 network-scripts]$ lspci
0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
037e:00:02.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] (rev 80)
7af8:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
93b8:00:02.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
a929:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111

ethtool -i ethX

[azureuser@hbv3 network-scripts]$ ethtool -i eth0
driver: hv_netvsc
version:
firmware-version: N/A
expansion-rom-version:
bus-info:
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[azureuser@hbv3 network-scripts]$ ethtool -i eth1
driver: hv_netvsc
version:
firmware-version: N/A
expansion-rom-version:
bus-info:
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[azureuser@hbv3 network-scripts]$ ethtool -i eth2
driver: mlx5_core
version: 5.0-0
firmware-version: 14.25.8368 (MSF0010110035)
expansion-rom-version:
bus-info: 037e:00:02.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

Virtual Machine Introduction