Related documents

SR-IOV (Single Root I/O Virtualization) - a key topic

Suggestion: study this after finishing the PCIe material, since SR-IOV is a PCIe feature.

References:

  • Section 6.2 of 《KVM实战:原理、进阶与性能调优》, or the 基础扫盲 primer (the latter copies that section from the book)

How do I enable SR-IOV on a PC?

Prerequisites:

  • On AMD machines, SVM and the IOMMU must be enabled (on Intel, the equivalents are VT-x and VT-d)
  • The device itself must support SR-IOV (consumer hardware rarely does, so confirm this first)

  # The CPU flag for Intel virtualization support is "vmx"; for AMD it is "svm"
  baiy@baiy-ThinkPad-E470c:~$ grep -E 'svm|vmx' /proc/cpuinfo
  flags : ..... vmx .....
  # Check that the device reports the SR-IOV capability
  [root@node1 ~]# lspci -nn | grep Eth
  08:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
  08:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
  [root@node1 ~]# lspci -s 08:00.0 -vvv | grep Capabilities
  Capabilities: [40] Power Management version 3
  Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
  Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
  Capabilities: [a0] Express (v2) Endpoint, MSI 00
  Capabilities: [100 v2] Advanced Error Reporting
  Capabilities: [140 v1] Device Serial Number f8-0f-41-ff-ff-f4-af-6c
  Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
  Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV) ******
  Capabilities: [1a0 v1] Transaction Processing Hints
  Capabilities: [1c0 v1] Latency Tolerance Reporting
  Capabilities: [1d0 v1] Access Control Services

Unfortunately, my device does not support it:
image.png

How do I check whether SR-IOV is enabled in the BIOS?
For device assignment to work, an Intel platform must support VT-x and VT-d, and an AMD platform must support AMD-V (SVM) and the AMD IOMMU (AMD-Vi).
Check whether VT-d / the IOMMU is enabled in the BIOS/UEFI; on Ubuntu you may need to install the sysfsutils package first.

  baiy@internal:baiy$ dmesg | grep "DMAR-IR: Enabled IRQ remapping"
  [ 0.004000] DMAR-IR: Enabled IRQ remapping in x2apic mode
  baiy@internal:baiy$ kvm-ok
  INFO: /dev/kvm exists
  KVM acceleration can be used

Note: section 6.2 of 《KVM实战:原理、进阶与性能调优》 checks whether the IOMMU is enabled in exactly the same way. It may look crude, but there is no better method.

Add one of the following to /etc/default/grub:

  GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on iommu=pt"
  # or, on AMD:
  GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt"

Then run sudo update-grub (or the grub2-mkconfig equivalent on RHEL-family systems) and reboot. A quick way to verify the result is sketched below.
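
After rebooting, a quick sanity check (a sketch; the exact dmesg strings vary by platform and kernel version) is to confirm the IOMMU came up and that IOMMU groups exist:

```bash
# Confirm the IOMMU was initialised (messages differ between Intel and AMD)
dmesg | grep -i -e DMAR -e IOMMU | head

# If IOMMU groups are populated, DMA remapping is usable for device assignment
ls /sys/kernel/iommu_groups/
```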

What are PCIe passthrough mode and SR-IOV?

In short:

  • Passthrough mode detaches a device from the host and attaches it directly to a virtual machine.
  • SR-IOV mode means the PCIe device itself exposes one or more PFs plus N VFs (each PF can back several VFs); the PF owns and manages the hardware resources, while each VF may only use the resources its PF assigns to it.

Passthrough and SR-IOV are operated in almost the same way: first bind the device's driver to vfio, then attach the device to the VM. A minimal sketch of that flow follows.
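
As a rough sketch of that flow (the BDF 0000:26:00.7 is this page's example VF; the full vfio-pci.sh script later on does the same job with more checks):

```bash
# 1. Detach the device from whatever host driver currently owns it
echo 0000:26:00.7 > /sys/bus/pci/devices/0000:26:00.7/driver/unbind

# 2. Hand it to vfio-pci, then attach it to the VM (qemu -device vfio-pci,host=...)
modprobe vfio-pci
echo vfio-pci > /sys/bus/pci/devices/0000:26:00.7/driver_override
echo 0000:26:00.7 > /sys/bus/pci/drivers_probe
```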

PF and VF

  • Physical Function (PF)
    A PCI function that supports SR-IOV, as defined in the SR-IOV specification. The PF contains the SR-IOV capability structure and manages the SR-IOV functionality. It is a full-featured PCIe function that can be discovered, managed and handled like any other PCIe device, and it owns a complete set of configuration resources that can be used to configure or control the PCIe device.
  • Virtual Function (VF)
    A function associated with a Physical Function. A VF is a lightweight PCIe function that can share one or more physical resources with its PF and with the other VFs of the same PF. A VF is only allowed the configuration resources needed for its own behavior.

Each SR-IOV device can have one Physical Function (PF), and each PF can have up to 64,000 Virtual Functions (VFs) associated with it.
The PF creates VFs through registers designed specifically for this purpose.

Once SR-IOV is enabled in the PF, the PCI configuration space of each VF can be reached through the PF's bus, device and function number (the routing ID). Each VF has a PCI memory space of its own into which its register set is mapped.
The VF device driver operates on that register set, and the VF appears as a real PCI device. After creation, a VF can be assigned directly to an I/O guest domain or to an individual application (for example Oracle Solaris Zones on bare metal). This lets virtual functions share the physical device and perform I/O without CPU or hypervisor software overhead.
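
A worked example of the routing-ID arithmetic (the offset/stride values are taken from the example device later on this page; treat them as illustrative):

```bash
# Routing ID of VF n = PF routing ID + First VF Offset + (n-1) * VF Stride.
# With First VF Offset = 4 and VF Stride = 1 on PF 26:00.0, VF1..VF4 land on
# functions 4..7 of the same bus/device.
offset=4; stride=1
for n in 1 2 3 4; do
  printf '26:00.%d\n' $(( 0 + offset + (n - 1) * stride ))
done
```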

Using SR-IOV

PCIe device information

Note: a device that supports SR-IOV must implement the extended configuration space (0x100-0xFFF), since the SR-IOV capability lives there. A way to dump it is sketched below the figure.
image.png
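
To see the extended configuration space in hex (not just the standard 256 bytes), lspci can dump the full 4 KiB region; the BDF below is this page's example device:

```bash
# -xxx dumps the first 256 bytes; -xxxx dumps the whole 4 KiB, including the
# extended capabilities where the SR-IOV structure lives (offset 0x140 here).
sudo lspci -s 26:00.0 -xxxx | less
```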

  1. ### PCIe details (lspci -vvv)
  2. pcie_13_test$ sudo lspci -s 26:00.0 -vvv
  3. 26:00.0 Memory controller: Xilinx Corporation Device 9032 (rev 03)
  4. Subsystem: Xilinx Corporation Device 9032
  5. Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
  6. Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
  7. Interrupt: pin A routed to IRQ 10
  8. Region 0: Memory at f0000000 (32-bit, non-prefetchable) [disabled] [size=16M]
  9. Region 2: Memory at e0000000 (32-bit, non-prefetchable) [disabled] [size=256M]
  10. Expansion ROM at f1000000 [disabled] [size=512K]
  11. Capabilities: [40] Power Management version 3
  12. Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
  13. Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
  14. Capabilities: [60] MSI-X: Enable- Count=2 Masked-
  15. Vector table: BAR=0 offset=00000040
  16. PBA: BAR=0 offset=00000050
  17. Capabilities: [70] Express (v2) Endpoint, MSI 00
  18. DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
  19. ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
  20. DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
  21. RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
  22. MaxPayload 512 bytes, MaxReadReq 512 bytes
  23. DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
  24. LnkCap: Port #0, Speed 8GT/s, Width x2, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
  25. ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
  26. LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
  27. ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
  28. LnkSta: Speed 8GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
  29. DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
  30. DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
  31. LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
  32. Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
  33. Compliance De-emphasis: -6dB
  34. LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
  35. EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
  36. Capabilities: [100 v1] Advanced Error Reporting
  37. UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
  38. UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
  39. UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
  40. CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
  41. CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
  42. AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
  43. Capabilities: [140 v1] Single Root I/O Virtualization (SR-IOV)
  44. IOVCap: Migration-, Interrupt Message Number: 000
  45. IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
  46. IOVSta: Migration-
  47. Initial VFs: 4, Total VFs: 4, Number of VFs: 0, Function Dependency Link: 00
  48. VF offset: 4, stride: 1, Device ID: 0000
  49. Supported Page Size: 00000553, System Page Size: 00000001
  50. Region 0: Memory at f1080000 (32-bit, non-prefetchable)
  51. VF Migration: offset: 00000000, BIR: 0
  52. Capabilities: [180 v1] Alternative Routing-ID Interpretation (ARI)
  53. ARICap: MFVC- ACS-, Next Function: 0
  54. ARICtl: MFVC- ACS-, Function Group: 0
  55. Capabilities: [1c0 v1] #19
  56. ### PCIe configuration space (lspci -xxx)
  57. pcie_13_test$ sudo lspci -s 26:00.0 -xxx
  58. 26:00.0 Memory controller: Xilinx Corporation Device 9032 (rev 03)
  59. 00: ee 10 32 90 00 00 10 00 03 00 80 05 10 00 00 00
  60. 10: 00 00 00 f0 00 00 00 00 00 00 00 e0 00 00 00 00
  61. 20: 00 00 00 00 00 00 00 00 00 00 00 00 ee 10 32 90
  62. 30: 00 00 00 f1 40 00 00 00 00 00 00 00 0a 01 00 00
  63. 40: 01 60 03 00 08 00 00 00 05 60 80 01 00 00 00 00
  64. 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  65. 60: 11 70 01 00 40 00 00 00 50 00 00 00 00 00 00 00
  66. 70: 10 00 02 00 02 80 00 00 50 28 09 00 23 f0 43 00
  67. 80: 40 00 23 10 00 00 00 00 00 00 00 00 00 00 00 00
  68. 90: 00 00 00 00 16 00 00 00 00 00 00 00 0e 00 00 00
  69. a0: 03 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
  70. b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  71. c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  72. d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  73. e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  74. f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  75. ### PCIe resources (driver probe log)
  76. [ 1179.857932] driver_pcie_test: loading out-of-tree module taints kernel.
  77. [ 1179.857975] driver_pcie_test: module verification failed: signature and/or required key missing - tainting kernel
  78. [ 1179.858697] driver_pcie_test: init [E]
  79. [ 1179.858806] driver_pcie_test: prbe [E]
  80. [ 1179.858816] demo-pcie 0000:26:00.0: enabling device (0000 -> 0002)
  81. [ 1179.858914] resource: start:0xf0000000,end:0xf0ffffff,name:0000:26:00.0,flags:0x40200,desc:0x0
  82. [ 1179.858916] resource: start:0x0,end:0x0,name:0000:26:00.0,flags:0x0,desc:0x0
  83. [ 1179.858916] resource: start:0xe0000000,end:0xefffffff,name:0000:26:00.0,flags:0x40200,desc:0x0
  84. [ 1179.858917] resource: start:0x0,end:0x0,name:0000:26:00.0,flags:0x0,desc:0x0
  85. [ 1179.858918] resource: start:0x0,end:0x0,name:0000:26:00.0,flags:0x0,desc:0x0
  86. [ 1179.858918] resource: start:0x0,end:0x0,name:0000:26:00.0,flags:0x0,desc:0x0
  87. [ 1179.858919] resource: start:0xf1000000,end:0xf107ffff,name:0000:26:00.0,flags:0x46200,desc:0x0
  88. [ 1179.858920] resource: start:0xf1080000,end:0xf1083fff,name:0000:26:00.0,flags:0x40200,desc:0x0
  89. [ 1179.858921] resource: start:0x0,end:0x0,name:0000:26:00.0,flags:0x0,desc:0x0
  90. [ 1179.858921] resource: start:0x0,end:0x0,name:0000:26:00.0,flags:0x0,desc:0x0
  91. [ 1179.858922] resource: start:0x0,end:0x0,name:0000:26:00.0,flags:0x0,desc:0x0
  92. [ 1179.858922] resource: start:0x0,end:0x0,name:0000:26:00.0,flags:0x0,desc:0x0
  93. [ 1179.858923] resource: start:0x0,end:0x0,name:0000:26:00.0,flags:0x0,desc:0x0
  94. [ 1179.858924] resource: start:0x0,end:0x0,name:(null),flags:0x0,desc:0x0
  95. [ 1179.858925] resource: start:0x0,end:0x0,name:(null),flags:0x0,desc:0x0
  96. [ 1179.858925] resource: start:0x0,end:0x0,name:(null),flags:0x0,desc:0x0
  97. [ 1179.858926] resource: start:0x0,end:0x0,name:(null),flags:0x0,desc:0x0
  98. [ 1179.858959] driver_pcie_test: prbe [X]

Enabling SR-IOV in the driver (software)

Place this in the PF driver's probe(), around the pci_enable_device() call.
See Documentation/PCI/pci-iov-howto.rst; a simplified example of a PCIe driver enabling SR-IOV follows.

  #define NUMBER_OF_VFS (4)  /* set this to the maximum number of VFs your device supports */

  /* This sketch is adapted from the probe/remove implementation in
   * linux-4.14.48/drivers/net/ethernet/cisco/enic/enic_main.c */
  static int dev_probe(struct pci_dev *pdev, const struct pci_device_id *id)
  {
  #ifdef CONFIG_PCI_IOV
      int pos, err;
      u16 num_vfs;

      /* Locate the SR-IOV extended capability and read Total VFs */
      pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
      if (pos) {
          pci_read_config_word(pdev, pos + PCI_SRIOV_TOTAL_VF, &num_vfs);
          if (num_vfs) {
              err = pci_enable_sriov(pdev, num_vfs);
              if (err) {
                  dev_err(&pdev->dev,
                          "SRIOV enable failed, aborting. pci_enable_sriov() returned %d\n",
                          err);
                  goto xxx;
              }
              /* enic-specific bookkeeping from the reference driver */
              enic->priv_flags |= ENIC_SRIOV_ENABLED;
              num_pps = enic->num_vfs;
          }
      }
  #endif
      ...
      return 0;
  }

  static void dev_remove(struct pci_dev *pdev)
  {
  #ifdef CONFIG_PCI_IOV
      if (enic_sriov_enabled(enic)) {
          pci_disable_sriov(pdev);
          enic->priv_flags &= ~ENIC_SRIOV_ENABLED;
      }
  #endif
      ...
  }

  /* Implement .sriov_configure so that SR-IOV can also be toggled through sysfs */
  static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
  {
      if (numvfs > 0) {
          ...
          pci_enable_sriov(dev, numvfs);
          ...
          return numvfs;
      }
      if (numvfs == 0) {
          ...
          pci_disable_sriov(dev);
          ...
          return 0;
      }
  }

  static struct pci_driver dev_driver = {
      .name = "SR-IOV Physical Function driver",
      .id_table = dev_id_table,
      ......
      .sriov_configure = dev_sriov_configure,
  };

In principle SR-IOV can also be enabled through sysfs; this is confirmed in the "Operating SR-IOV through sysfs" section below.

PCI device changes after SR-IOV is enabled

Note: 9032 here is the DEVICE_ID of the PF; the VFs report a different device ID (9031).

  1. lspci
  2. 37 26:00.0 Memory controller: Xilinx Corporation Device 9032 (rev 03)
  3. 38 26:00.4 Memory controller: Xilinx Corporation Device 9031 (rev 03)
  4. 39 26:00.5 Memory controller: Xilinx Corporation Device 9031 (rev 03)
  5. 40 26:00.6 Memory controller: Xilinx Corporation Device 9031 (rev 03)
  6. 41 26:00.7 Memory controller: Xilinx Corporation Device 9031 (rev 03)

You can see that the single device now exposes 1 PF plus 4 VFs (this can also be confirmed from sysfs, as sketched below).
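
The PF/VF relationship shows up in sysfs as well: the PF gains virtfnN symlinks and each VF gets a physfn link back to it (paths assume the example device above):

```bash
# VFs created by the PF at 26:00.0
ls -l /sys/bus/pci/devices/0000:26:00.0/virtfn*

# Each VF points back at its PF
ls -l /sys/bus/pci/devices/0000:26:00.4/physfn
```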

  1. qemu-system-centos$ virsh nodedev-list --tree
  2. computer
  3. ...
  4. | |
  5. | +- pci_0000_26_00_0
  6. | +- pci_0000_26_00_4
  7. | +- pci_0000_26_00_5
  8. | +- pci_0000_26_00_6
  9. | +- pci_0000_26_00_7
  10. |
  11. qemu-system-centos$ lshw
  12. *-pci:2
  13. description: PCI bridge
  14. product: Advanced Micro Devices, Inc. [AMD]
  15. vendor: Advanced Micro Devices, Inc. [AMD]
  16. physical id: 3.1
  17. bus info: pci@0000:00:03.1
  18. version: 00
  19. width: 32 bits
  20. clock: 33MHz
  21. capabilities: pci normal_decode bus_master cap_list
  22. configuration: driver=pcieport
  23. resources: irq:28 memory:e0000000-f10fffff
  24. *-memory:0
  25. description: Memory controller
  26. product: Xilinx Corporation
  27. vendor: Xilinx Corporation
  28. physical id: 0
  29. bus info: pci@0000:26:00.0
  30. version: 03
  31. width: 32 bits
  32. clock: 33MHz (30.3ns)
  33. capabilities: bus_master cap_list rom
  34. configuration: driver=demo-pcie latency=0
  35. resources: irq:96 memory:f0000000-f0ffffff memory:e0000000-efffffff memory:f1000000-f107ffff memory:f1080000-f1083fff
  36. *-memory:1 UNCLAIMED
  37. description: Memory controller
  38. product: Illegal Vendor ID
  39. vendor: Illegal Vendor ID
  40. physical id: 0.4
  41. bus info: pci@0000:26:00.4
  42. version: 03
  43. width: 32 bits
  44. clock: 33MHz (30.3ns)
  45. capabilities: cap_list
  46. configuration: latency=0
  47. resources: memory:f1080000-f1080fff
  48. *-memory:2 UNCLAIMED
  49. description: Memory controller
  50. product: Illegal Vendor ID
  51. vendor: Illegal Vendor ID
  52. physical id: 0.5
  53. bus info: pci@0000:26:00.5
  54. version: 03
  55. width: 32 bits
  56. clock: 33MHz (30.3ns)
  57. capabilities: cap_list
  58. configuration: latency=0
  59. resources: memory:f1081000-f1081fff
  60. *-memory:3 UNCLAIMED
  61. description: Memory controller
  62. product: Illegal Vendor ID
  63. vendor: Illegal Vendor ID
  64. physical id: 0.6
  65. bus info: pci@0000:26:00.6
  66. version: 03
  67. width: 32 bits
  68. clock: 33MHz (30.3ns)
  69. capabilities: cap_list
  70. configuration: latency=0
  71. resources: memory:f1082000-f1082fff
  72. *-memory:4 UNCLAIMED
  73. description: Memory controller
  74. product: Illegal Vendor ID
  75. vendor: Illegal Vendor ID
  76. physical id: 0.7
  77. bus info: pci@0000:26:00.7
  78. version: 03
  79. width: 32 bits
  80. clock: 33MHz (30.3ns)
  81. capabilities: cap_list
  82. configuration: latency=0
  83. resources: memory:f1083000-f1083fff

Using the SR-IOV VFs

virsh: detach a PCI device from the host (the usual way)
  • List the PCI devices
  1. qemu-system-centos$ virsh nodedev-list --tree
  2. computer
  3. ...
  4. | |
  5. | +- pci_0000_26_00_0
  6. | +- pci_0000_26_00_4
  7. | +- pci_0000_26_00_5
  8. | +- pci_0000_26_00_6
  9. | +- pci_0000_26_00_7
  10. |
  • Detach the PCI device from the host
    virsh nodedev-detach pci_0000_26_00_7
  • Attach the device to / detach it from the virtual machine

virsh can do this by editing the domain XML (see 在KVM中配置SR-IOV); a minimal hostdev sketch follows.
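
A minimal sketch of doing it from the command line instead (the domain name centos-vm is a placeholder; the address fields correspond to VF 0000:26:00.7):

```bash
# Describe the VF as a <hostdev> element and hot-plug it into the guest
cat > vf.xml <<'EOF'
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x26' slot='0x00' function='0x7'/>
  </source>
</hostdev>
EOF
virsh attach-device centos-vm vf.xml --live
```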

For convenience, you can also add the PCI device directly from virt-manager:
image.png

  • Return the VF to the host with virsh
    virsh nodedev-reattach pci_0000_26_00_7

Using a VF when launching a VM with qemu

Reference: follow the 基础扫盲 primer step by step.
Bind the chosen device to the vfio-pci driver:

  #!/bin/bash
  # A script to hide/unhide PCI/PCIe device for KVM (using 'vfio-pci')
  #set -x

  hide_dev=0
  unhide_dev=0
  driver=0

  # check if the device exists
  function dev_exist()
  {
      local line_num=$(lspci -s "$1" 2>/dev/null | wc -l)
      if [ $line_num = 0 ]; then
          echo "Device $1 doesn't exist. Please check your system or your command line."
          exit 1
      else
          return 0
      fi
  }

  # output a format "<domain>:<bus>:<slot>.<func>" (e.g. 0000:01:10.0) of device
  function canon()
  {
      f=`expr "$1" : '.*\.\(.\)'`
      d=`expr "$1" : ".*:\(.*\).$f"`
      b=`expr "$1" : "\(.*\):$d\.$f"`

      if [ `expr "$d" : '..'` == 0 ]
      then
          d=0$d
      fi
      if [ `expr "$b" : '.*:'` != 0 ]
      then
          p=`expr "$b" : '\(.*\):'`
          b=`expr "$b" : '.*:\(.*\)'`
      else
          p=0000
      fi
      if [ `expr "$b" : '..'` == 0 ]
      then
          b=0$b
      fi
      echo $p:$b:$d.$f
  }

  # output the device ID and vendor ID
  function show_id()
  {
      lspci -Dn -s "$1" | awk '{print $3}' | sed "s/:/ /" > /dev/null 2>&1
      if [ $? -eq 0 ]; then
          lspci -Dn -s "$1" | awk '{print $3}' | sed "s/:/ /"
      else
          echo "Can't find device id and vendor id for device $1"
          exit 1
      fi
  }

  # hide a device using 'vfio-pci' driver/module
  function hide_pci()
  {
      local pre_driver=NULL
      local pcidev=$(canon $1)
      local pciid=$(show_id $pcidev)

      dev_exist $pcidev

      if [ -e /sys/bus/pci/drivers/vfio-pci ]; then
          pre_driver=$(basename $(readlink /sys/bus/pci/devices/"$pcidev"/driver))
          echo "Unbinding $pcidev from $pre_driver"
          echo -n "$pciid" > /sys/bus/pci/drivers/vfio-pci/new_id
          echo -n "$pcidev" > /sys/bus/pci/devices/"$pcidev"/driver/unbind
      fi

      echo "Binding $pcidev to vfio-pci"
      echo -n "$pcidev" > /sys/bus/pci/drivers/vfio-pci/bind
      return $?
  }

  function unhide_pci() {
      local driver=$2
      local pcidev=`canon $1`
      local pciid=`show_id $pcidev`

      if [ $driver != 0 -a ! -d /sys/bus/pci/drivers/$driver ]; then
          echo "No $driver interface under sys, return fail"
          exit 1
      fi

      if [ -h /sys/bus/pci/devices/"$pcidev"/driver ]; then
          local tmpdriver=`basename $(readlink /sys/bus/pci/devices/"$pcidev"/driver)`
          if [ "$tmpdriver" = "$driver" ]; then
              echo "$1 has been already bind with $driver, no need to unhide"
              exit 1
          elif [ "$tmpdriver" != "vfio-pci" ]; then
              echo "$1 is not bind with vfio-pci, it is bind with $tmpdriver, no need to unhide"
              exit 1
          else
              echo "Unbinding $pcidev from" $(basename $(readlink /sys/bus/pci/devices/"$pcidev"/driver))
              echo -n "$pcidev" > /sys/bus/pci/drivers/vfio-pci/unbind
              if [ $? -ne 0 ]; then
                  return $?
              fi
          fi
      fi

      if [ $driver != 0 ]; then
          echo "Binding $pcidev to $driver"
          echo -n "$pcidev" > /sys/bus/pci/drivers/$driver/bind
      fi
      return $?
  }

  function usage()
  {
      echo "Usage: vfio-pci.sh -h pcidev "
      echo " -h pcidev: <pcidev> is BDF number of the device you want to hide"
      echo " -u pcidev: Optional. <pcidev> is BDF number of the device you want to unhide."
      echo " -d driver: Optional. When unhiding the device, bind the device with <driver>. The option should be used together with '-u' option"
      echo ""
      echo "Example1: sh vfio-pci.sh -h 06:10.0            Hide device 06:10.0 to 'vfio-pci' driver"
      echo "Example2: sh vfio-pci.sh -u 08:00.0 -d e1000e  Unhide device 08:00.0 and bind the device with 'e1000e' driver"
      exit 1
  }

  if [ $# -eq 0 ] ; then
      usage
  fi

  # parse the options in the command line
  OPTIND=1
  while getopts ":h:u:d:" Option
  do
      case $Option in
          h ) hide_dev=$OPTARG;;
          u ) unhide_dev=$OPTARG;;
          d ) driver=$OPTARG;;
          * ) usage ;;
      esac
  done

  if [ ! -d /sys/bus/pci/drivers/vfio-pci ]; then
      modprobe vfio_pci
      echo 0
      if [ ! -d /sys/bus/pci/drivers/vfio-pci ]; then
          echo "There's no 'vfio-pci' module? Please check your kernel config."
          exit 1
      fi
  fi

  if [ $hide_dev != 0 -a $unhide_dev != 0 ]; then
      echo "Do not use -h and -u option together."
      exit 1
  fi

  if [ $unhide_dev = 0 -a $driver != 0 ]; then
      echo "You should set -u option if you want to use -d option to unhide a device and bind it with a specific driver"
      exit 1
  fi

  if [ $hide_dev != 0 ]; then
      hide_pci $hide_dev
  elif [ $unhide_dev != 0 ]; then
      unhide_pci $unhide_dev $driver
  fi

  exit $?

For a VM launched directly with qemu, add the following parameter to the command line:

  -device vfio-pci,host=0000:26:00.7

  # typical base invocation
  /usr/local/qemu_x86/bin/qemu-system-x86_64 \
      -smp 2 \
      -cpu host \
      -enable-kvm \
      -m 512M \
      -kernel linux/arch/x86/boot/bzImage \
      -hda ./x86_64.img \
      -hdb ./Freeze.img \
      -nographic \
      -append "root=/dev/sda rw rootfstype=ext4 console=ttyS0 init=linuxrc loglevel=8"
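
Putting the two together, a sketch of the same base command with the VF passed through (paths, images and the BDF are taken from the examples above):

```bash
/usr/local/qemu_x86/bin/qemu-system-x86_64 \
    -smp 2 \
    -cpu host \
    -enable-kvm \
    -m 512M \
    -kernel linux/arch/x86/boot/bzImage \
    -hda ./x86_64.img \
    -hdb ./Freeze.img \
    -nographic \
    -device vfio-pci,host=0000:26:00.7 \
    -append "root=/dev/sda rw rootfstype=ext4 console=ttyS0 init=linuxrc loglevel=8"
```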

Operating SR-IOV through sysfs

With the approach above you have to write driver code to enable SR-IOV. Can it be configured through the PCI sysfs interface instead?
Yes, but the PF driver must implement the callback, i.e. the .sriov_configure hook shown in "Enabling SR-IOV in the driver" above. A short example follows the attribute listing below.

  1. /sys/bus/pci/devices/0000:26:00.0$ ls
  2. aer_dev_correctable broken_parity_status current_link_speed dma_mask_bits iommu_group max_link_speed numa_node reset revision sriov_offset subsystem vendor
  3. aer_dev_fatal class current_link_width driver_override irq max_link_width power resource rom sriov_stride subsystem_device
  4. aer_dev_nonfatal config d3cold_allowed enable local_cpulist modalias remove resource0 sriov_drivers_autoprobe sriov_totalvfs subsystem_vendor
  5. ari_enabled consistent_dma_mask_bits device iommu local_cpus msi_bus rescan resource2 sriov_numvfs sriov_vf_device uevent
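
Given those attributes, enabling and disabling VFs from sysfs is just a write to sriov_numvfs (it only succeeds when the PF driver implements .sriov_configure, as the code analysis below shows):

```bash
# How many VFs the device can expose
cat /sys/bus/pci/devices/0000:26:00.0/sriov_totalvfs

# Create 4 VFs; write 0 to remove them again
echo 4 > /sys/bus/pci/devices/0000:26:00.0/sriov_numvfs
echo 0 > /sys/bus/pci/devices/0000:26:00.0/sriov_numvfs
```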

First, a look at the relevant code: linux-5.7.14/drivers/pci/pci-sysfs.c

  const struct attribute_group sriov_dev_attr_group = {
      .attrs = sriov_dev_attrs,
      .is_visible = sriov_attrs_are_visible,
  };
  pci_alloc_dev
      dev->dev.type = &pci_dev_type;
  pci_sysfs_init
  drivers/pci/iov.c              # the sriov sysfs handlers live here
      sriov_numvfs_store         #
      sriov_numvfs_show          # these two back sriov_numvfs; the store path requires driver support
  sriov_numvfs_store
      /* is PF driver loaded w/callback */   // the PF driver must provide the SR-IOV configure callback
      if (!pdev->driver || !pdev->driver->sriov_configure) {
          pci_info(pdev, "Driver does not support SRIOV configuration via sysfs\n");
          ret = -ENOENT;
          goto exit;
      }

SR-IOV code analysis

linux-5.4.51/drivers/pci/iov.c

  1. static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
  2. struct pci_sriov *iov = dev->sriov;
  3. pci_read_config_word(dev, iov->pos + PCI_SRIOV_INITIAL_VF, &initial);

sriov_enable

  1. /* Single Root I/O Virtualization */
  2. struct pci_sriov {
  3. int pos; /* Capability position */
  4. int nres; /* Number of resources */
  5. u32 cap; /* SR-IOV Capabilities */
  6. u16 ctrl; /* SR-IOV Control */
  7. u16 total_VFs; /* Total VFs associated with the PF */
  8. u16 initial_VFs; /* Initial VFs associated with the PF */
  9. u16 num_VFs; /* Number of VFs available */
  10. u16 offset; /* First VF Routing ID offset */
  11. u16 stride; /* Following VF stride */
  12. u16 vf_device; /* VF device ID */
  13. u32 pgsz; /* Page size for BAR alignment */
  14. u8 link; /* Function Dependency Link */
  15. u8 max_VF_buses; /* Max buses consumed by VFs */
  16. u16 driver_max_VFs; /* Max num VFs driver supports */
  17. struct pci_dev *dev; /* Lowest numbered PF */
  18. struct pci_dev *self; /* This PF */
  19. u32 class; /* VF device */
  20. u8 hdr_type; /* VF header type */
  21. u16 subsystem_vendor; /* VF subsystem vendor */
  22. u16 subsystem_device; /* VF subsystem device */
  23. resource_size_t barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */
  24. bool drivers_autoprobe; /* Auto probing of VFs by driver */
  25. };
  26. int sriov_enable(struct pci_dev *dev, int nr_virtfn)

SR-IOV interrupt handling

KVM虚拟机代码揭秘——中断虚拟化
KVM中断虚拟化浅析

Common problems:

How to triage a problem

Environment problems

  • First confirm that the hardware supports virtualization and the IOMMU (a virt-host-validate sketch follows after this checklist)

    1. baiy@internal:baiy$ dmesg | grep "DMAR-IR: Enabled IRQ remapping"
    2. [ 0.004000] DMAR-IR: Enabled IRQ remapping in x2apic mode
    3. baiy@internal:baiy$ kvm-ok
    4. INFO: /dev/kvm exists
    5. KVM acceleration can be used
  • Check that the kernel configuration supports it, that the vfio modules are loaded, and that the driver loads correctly:

    ```bash
    CONFIG_GART_IOMMU=y                        # AMD platform
    # CONFIG_CALGARY_IOMMU is not set          # IBM platform
    CONFIG_IOMMU_HELPER=y
    CONFIG_VFIO_IOMMU_TYPE1=m
    CONFIG_VFIO_NOIOMMU=y
    CONFIG_IOMMU_API=y
    CONFIG_IOMMU_SUPPORT=y
    CONFIG_IOMMU_IOVA=y
    CONFIG_AMD_IOMMU=y                         # IOMMU settings for AMD platforms
    CONFIG_AMD_IOMMU_STATS=y
    CONFIG_AMD_IOMMU_V2=m
    CONFIG_INTEL_IOMMU=y                       # VT-d settings for Intel platforms
    # CONFIG_INTEL_IOMMU_DEFAULT_ON is not set # VT-d is not on by default here, so "intel_iommu=on" must be added to the kernel boot parameters
    CONFIG_INTEL_IOMMU_FLOPPY_WA=y
    # CONFIG_IOMMU_DEBUG is not set
    # CONFIG_IOMMU_STRESS is not set
    CONFIG_VFIO_IOMMU_TYPE1=m
    CONFIG_VFIO=m
    CONFIG_VFIO_NOIOMMU=y                      # user-space VFIO (no-IOMMU) support
    CONFIG_VFIO_PCI=m
    # CONFIG_VFIO_PCI_VGA is not set           # VT-d for graphics cards
    CONFIG_VFIO_PCI_MMAP=y
    CONFIG_VFIO_PCI_INTX=y
    CONFIG_KVM_VFIO=y
    ```

    Note: kernels older than 3.0 use slightly different options; they are not covered here.
    Note: many systems do not load the vfio driver by default (check whether /sys/bus/pci/drivers/vfio-pci/ exists), so load it manually:

    ```bash
    [root@gerrylee ~]# modprobe vfio-pci
    [root@gerrylee ~]# lsmod | grep vfio-pci
    vfio_pci 41268 0
    vfio 32657 2 vfio_iommu_type1,vfio_pci
    irqbypass 13503 2 kvm,vfio_pci
    [root@gerrylee ~]# ls /sys/bus/pci/drivers/vfio-pci/
    bind module new_id remove_id uevent unbind
    ```

Note: to have the module loaded automatically at boot, add it to the startup configuration:

  vi /etc/modules      # add the module names here (an example follows)
  # and place the driver under the matching directory:
  # /lib/modules/$(uname -r)/kernel/drivers/pci/
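
For example, /etc/modules might end up containing something like this (the module names assume the vfio stack discussed above):

```bash
# /etc/modules -- modules to load at boot time, one per line
vfio
vfio_iommu_type1
vfio_pci
```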
  • Confirm that the hardware supports SR-IOV
    1. [root@node1 ~]# lspci -s 08:00.0 -vvv | grep Capabilities
    2. Capabilities: [40] Power Management version 3
    3. Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
    4. Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
    5. Capabilities: [a0] Express (v2) Endpoint, MSI 00
    6. Capabilities: [100 v2] Advanced Error Reporting
    7. Capabilities: [140 v1] Device Serial Number f8-0f-41-ff-ff-f4-af-6c
    8. Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
    9. Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV) ******
    10. Capabilities: [1a0 v1] Transaction Processing Hints
    11. Capabilities: [1c0 v1] Latency Tolerance Reporting
    12. Capabilities: [1d0 v1] Access Control Services
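
libvirt also ships a one-shot checker that covers most of the checklist above; if libvirt-clients is installed (an assumption about your setup), it reports KVM and IOMMU/device-assignment readiness directly:

```bash
# Checks /dev/kvm, cgroup support, and IOMMU / device assignment readiness
virt-host-validate qemu
```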

device_add fails

  1. internal error: unable to execute QEMU command 'device_add': vfio error: 0000:26:00.7: failed to add PCI capability 0x11[0x10]@0x60: table & pba overlap, or they don't fit in BARs, or don't align
  2. Traceback (most recent call last):
  3. File "/usr/share/virt-manager/virtManager/addhardware.py", line 1318, in _add_device
  4. self.vm.attach_device(self._dev)
  5. File "/usr/share/virt-manager/virtManager/domain.py", line 1093, in attach_device
  6. self._backend.attachDevice(devxml)
  7. File "/usr/lib/python2.7/dist-packages/libvirt.py", line 563, in attachDevice
  8. if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
  9. libvirtError: internal error: unable to execute QEMU command 'device_add': vfio error: 0000:26:00.7:
  10. failed to add PCI capability 0x11[0x10]@0x60: table & pba overlap, or they don't fit in BARs, or don't align

Solution (from the bug record, BUG记录):

  1. This is because the msix table is overlapping with pba. According to below
  2. 'lspci -vv' from host, the distance between msix table offset and pba offset is
  3. only 0x100, although there are 22 entries supported (22 entries need 0x160).
  4. Looks qemu supports at most 0x800.
  5. # sudo lspci -vv
  6. ... ...
  7. 01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 (rev 03) (prog-if 02 [NVM Express])
  8. Subsystem: Intel Corporation Device 390b
  9. ... ...
  10. Capabilities: [b0] MSI-X: Enable- Count=22 Masked-
  11. Vector table: BAR=0 offset=00002000
  12. PBA: BAR=0 offset=00002100

My PCIe configuration space above has the same problem:

The bug shows MSI-X: Enable- Count=22, but the gap between the vector table and the PBA is only 0x100 bytes. Each MSI-X table entry occupies 16 bytes of BAR space, so 22 entries need 0x160 bytes; enough room must be reserved between the table and the PBA (worked out below).
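
The arithmetic, as a quick sanity check (22 is the entry count from the bug report above):

```bash
# Each MSI-X table entry is 16 bytes, so the table needs Count * 16 bytes.
# With Count=22 that is 0x160, which does not fit in the 0x100 gap before the PBA.
printf 'MSI-X table size for 22 vectors: 0x%x bytes\n' $((22 * 16))
```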

PCIE MSI-X

The VF IDs read from the PF side are 0xFFFF (this is normal)

image.png
In earlier tests, reading a VF's config space from the host side (right) with lspci -xxx or via sysfs/.../config showed both the VENDOR ID and the DEVICE ID as 0xFFFFFFFF, while inside the guest (left) they read correctly; plain lspci on the host still reported the correct PCI IDs.

Note: this is not a bug; others observe exactly the same behavior.

FLR errors when VFIO starts the device

image.png
This is a design issue: the PCIe spec requires an FLR to complete within 100 ms, but this design needs app_flr_vf_done to be updated by hand, which cannot be done without patching the kernel, so the kernel had to be modified. See FLR学习 for details.

Problems when adding the SR-IOV VF to a guest

image.png

  1. [38387.414287] driver_pcie_test: init [E]
  2. [38387.414392] driver_pcie_test: prbe [E]
  3. [38387.522112] pci 0000:25:00.4: [10ee:9232] type 00 class 0x058000
  4. [38387.522818] pci 0000:25:00.4: Adding to iommu group 17
  5. [38387.522923] pci 0000:25:00.5: [10ee:9232] type 00 class 0x058000
  6. [38387.523121] pci 0000:25:00.5: Adding to iommu group 17
  7. [38387.523162] pci 0000:25:00.6: [10ee:9232] type 00 class 0x058000
  8. [38387.523357] pci 0000:25:00.6: Adding to iommu group 17
  9. [38387.523395] pci 0000:25:00.7: [10ee:9232] type 00 class 0x058000
  10. [38387.523592] pci 0000:25:00.7: Adding to iommu group 17
  11. [38387.523706] resource: start:0xf7560000,end:0xf757ffff,name:0000:25:00.0,flags:0x40200,desc:0x0
  12. [38387.523708] resource: start:0x0,end:0x0,name:0000:25:00.0,flags:0x0,desc:0x0
  13. [38387.523709] resource: start:0xf7540000,end:0xf755ffff,name:0000:25:00.0,flags:0x40200,desc:0x0
  14. [38387.523710] resource: start:0x0,end:0x0,name:0000:25:00.0,flags:0x0,desc:0x0
  15. [38387.523711] resource: start:0x0,end:0x0,name:0000:25:00.0,flags:0x0,desc:0x0
  16. [38387.523712] resource: start:0x0,end:0x0,name:0000:25:00.0,flags:0x0,desc:0x0
  17. [38387.523713] resource: start:0xf75e0000,end:0xf75e07ff,name:0000:25:00.0,flags:0x46200,desc:0x0
  18. [38387.523714] resource: start:0xf7580000,end:0xf759ffff,name:0000:25:00.0,flags:0x40200,desc:0x0
  19. [38387.523715] resource: start:0x0,end:0x0,name:0000:25:00.0,flags:0x0,desc:0x0
  20. [38387.523716] resource: start:0xf75c0000,end:0xf75cffff,name:0000:25:00.0,flags:0x40200,desc:0x0
  21. [38387.523717] resource: start:0x0,end:0x0,name:0000:25:00.0,flags:0x0,desc:0x0
  22. [38387.523718] resource: start:0x0,end:0x0,name:0000:25:00.0,flags:0x0,desc:0x0
  23. [38387.523719] resource: start:0x0,end:0x0,name:0000:25:00.0,flags:0x0,desc:0x0
  24. [38387.523720] resource: start:0x0,end:0x0,name:(null),flags:0x0,desc:0x0
  25. [38387.523721] resource: start:0x0,end:0x0,name:(null),flags:0x0,desc:0x0
  26. [38387.523722] resource: start:0x0,end:0x0,name:(null),flags:0x0,desc:0x0
  27. [38387.523723] resource: start:0x0,end:0x0,name:(null),flags:0x0,desc:0x0
  28. [38387.523762] resource sanity check: requesting [mem 0xf7580000-0xf759ffff], which spans more than 0000:25:00.4 [mem 0xf7580000-0xf7587fff]
  29. [38387.523766] caller pci_iomap_range+0x63/0x80 mapping multiple BARs
  30. [38387.523774] resource sanity check: requesting [mem 0xf75c0000-0xf75cffff], which spans more than 0000:25:00.4 [mem 0xf75c0000-0xf75c3fff]
  31. [38387.523776] caller pci_iomap_range+0x63/0x80 mapping multiple BARs
  32. [38387.523777] driver_pcie_test: prbe [X]
  33. [38851.805101] kauditd_printk_skb: 14 callbacks suppressed
  34. [38851.805103] audit: type=1400 audit(1606393021.635:122): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-da077ee4-5562-46ed-8f43-020ddf3410e5" pid=14424 comm="apparmor_parser"
  35. [38851.923570] audit: type=1400 audit(1606393021.755:123): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-da077ee4-5562-46ed-8f43-020ddf3410e5" pid=14427 comm="apparmor_parser"
  36. [38852.038165] audit: type=1400 audit(1606393021.867:124): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-da077ee4-5562-46ed-8f43-020ddf3410e5" pid=14430 comm="apparmor_parser"
  37. [38852.153167] audit: type=1400 audit(1606393021.983:125): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-da077ee4-5562-46ed-8f43-020ddf3410e5" pid=14433 comm="apparmor_parser"
  38. [38852.167053] virbr0: port 2(vnet0) entered blocking state
  39. [38852.167057] virbr0: port 2(vnet0) entered disabled state
  40. [38852.167122] device vnet0 entered promiscuous mode
  41. [38852.167329] virbr0: port 2(vnet0) entered blocking state
  42. [38852.167332] virbr0: port 2(vnet0) entered listening state
  43. [38852.286598] audit: type=1400 audit(1606393022.119:126): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-da077ee4-5562-46ed-8f43-020ddf3410e5" pid=14465 comm="apparmor_parser"
  44. [38852.404461] audit: type=1400 audit(1606393022.235:127): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-da077ee4-5562-46ed-8f43-020ddf3410e5" pid=14514 comm="apparmor_parser"
  45. [38852.534855] virbr0: port 2(vnet0) entered disabled state
  46. [38852.537183] device vnet0 left promiscuous mode
  47. [38852.537188] virbr0: port 2(vnet0) entered disabled state
  48. [38852.895392] audit: type=1400 audit(1606393022.727:128): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="libvirt-da077ee4-5562-46ed-8f43-020ddf3410e5" pid=14573 comm="apparmor_parser"
  1. Error starting domain: internal error: qemu unexpectedly closed the monitor: 2020-11-26T12:17:02.307759Z qemu-system-x86_64: -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0xc,drive=drive-virtio-disk2,id=virtio-disk2: Failed to get "write" lock
  2. Is another process using the image?
  3. Traceback (most recent call last):
  4. File "/usr/share/virt-manager/virtManager/asyncjob.py", line 89, in cb_wrapper
  5. callback(asyncjob, *args, **kwargs)
  6. File "/usr/share/virt-manager/virtManager/asyncjob.py", line 125, in tmpcb
  7. callback(*args, **kwargs)
  8. File "/usr/share/virt-manager/virtManager/libvirtobject.py", line 82, in newfn
  9. ret = fn(self, *args, **kwargs)
  10. File "/usr/share/virt-manager/virtManager/domain.py", line 1508, in startup
  11. self._backend.create()
  12. File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1062, in create
  13. if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
  14. libvirtError: internal error: qemu unexpectedly closed the monitor: 2020-11-26T12:17:02.307759Z qemu-system-x86_64: -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0xc,drive=drive-virtio-disk2,id=virtio-disk2: Failed to get "write" lock
  15. Is another process using the image?