Quick Start Tutorial for Compiling Deep Learning Models

Authors: Yao Wang, Truman Tian
This example shows how to build a neural network with the Relay Python frontend and how to use TVM to generate a runtime library for an Nvidia GPU. Note that you need to build TVM with both cuda and llvm enabled.
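If you build TVM from source, both backends are switched on in config.cmake before running cmake. A minimal sketch of the relevant lines (the llvm-config path below is only an example; point it at your own LLVM installation):

# config.cmake -- enable the CUDA and LLVM backends (sketch)
set(USE_CUDA ON)
set(USE_LLVM /usr/bin/llvm-config)  # example path; plain ON also works if llvm-config is on PATH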

Overview of Hardware Backends Supported by TVM

The figure below shows the hardware backends currently supported by TVM:
(Figure 1: hardware backends currently supported by TVM)
In this tutorial, we choose cuda and llvm as the target backends.
First, let us import Relay and TVM.

import numpy as np
from tvm import relay
from tvm.relay import testing
import tvm
from tvm import te
from tvm.contrib import graph_runtime
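Since everything below assumes a working CUDA device, it can be worth checking for one right after the imports. A minimal sketch (the early exit is our own addition, not part of the original tutorial):

# Sketch: make sure TVM can see a CUDA GPU before going further.
if not tvm.gpu(0).exist:
    raise RuntimeError("No CUDA GPU found; this tutorial requires one.")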

Define a Neural Network in Relay

First, let us define a neural network with the Relay Python frontend. For simplicity, we use the pre-defined resnet-18 network in Relay and initialize its parameters with the Xavier initializer. Relay also supports other model formats, such as MXNet, CoreML, ONNX, and TensorFlow (a short ONNX import sketch appears after the printed network below).
In this tutorial, we assume inference on a GPU device, with the batch size set to 1. The input is an RGB color image of size 224 * 224. You can call tvm.relay.expr.TupleWrapper.astext() to display the network structure.

batch_size = 1
num_class = 1000
image_shape = (3, 224, 224)
data_shape = (batch_size,) + image_shape
out_shape = (batch_size, num_class)
mod, params = relay.testing.resnet.get_workload(
    num_layers=18, batch_size=batch_size, image_shape=image_shape)
# set show_meta_data=True if you want to show meta data
print(mod.astext(show_meta_data=False))

Output:

v0.0.4
def @main(%data: Tensor[(1, 3, 224, 224), float32], %bn_data_gamma: Tensor[(3), float32], %bn_data_beta: Tensor[(3), float32], %bn_data_moving_mean: Tensor[(3), float32], %bn_data_moving_var: Tensor[(3), float32], %conv0_weight: Tensor[(64, 3, 7, 7), float32], %bn0_gamma: Tensor[(64), float32], %bn0_beta: Tensor[(64), float32], %bn0_moving_mean: Tensor[(64), float32], %bn0_moving_var: Tensor[(64), float32], %stage1_unit1_bn1_gamma: Tensor[(64), float32], %stage1_unit1_bn1_beta: Tensor[(64), float32], %stage1_unit1_bn1_moving_mean: Tensor[(64), float32], %stage1_unit1_bn1_moving_var: Tensor[(64), float32], %stage1_unit1_conv1_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_bn2_gamma: Tensor[(64), float32], %stage1_unit1_bn2_beta: Tensor[(64), float32], %stage1_unit1_bn2_moving_mean: Tensor[(64), float32], %stage1_unit1_bn2_moving_var: Tensor[(64), float32], %stage1_unit1_conv2_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_sc_weight: Tensor[(64, 64, 1, 1), float32], %stage1_unit2_bn1_gamma: Tensor[(64), float32], %stage1_unit2_bn1_beta: Tensor[(64), float32], %stage1_unit2_bn1_moving_mean: Tensor[(64), float32], %stage1_unit2_bn1_moving_var: Tensor[(64), float32], %stage1_unit2_conv1_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit2_bn2_gamma: Tensor[(64), float32], %stage1_unit2_bn2_beta: Tensor[(64), float32], %stage1_unit2_bn2_moving_mean: Tensor[(64), float32], %stage1_unit2_bn2_moving_var: Tensor[(64), float32], %stage1_unit2_conv2_weight: Tensor[(64, 64, 3, 3), float32], %stage2_unit1_bn1_gamma: Tensor[(64), float32], %stage2_unit1_bn1_beta: Tensor[(64), float32], %stage2_unit1_bn1_moving_mean: Tensor[(64), float32], %stage2_unit1_bn1_moving_var: Tensor[(64), float32], %stage2_unit1_conv1_weight: Tensor[(128, 64, 3, 3), float32], %stage2_unit1_bn2_gamma: Tensor[(128), float32], %stage2_unit1_bn2_beta: Tensor[(128), float32], %stage2_unit1_bn2_moving_mean: Tensor[(128), float32], %stage2_unit1_bn2_moving_var: Tensor[(128), float32], %stage2_unit1_conv2_weight: Tensor[(128, 128, 3, 3), float32], %stage2_unit1_sc_weight: Tensor[(128, 64, 1, 1), float32], %stage2_unit2_bn1_gamma: Tensor[(128), float32], %stage2_unit2_bn1_beta: Tensor[(128), float32], %stage2_unit2_bn1_moving_mean: Tensor[(128), float32], %stage2_unit2_bn1_moving_var: Tensor[(128), float32], %stage2_unit2_conv1_weight: Tensor[(128, 128, 3, 3), float32], %stage2_unit2_bn2_gamma: Tensor[(128), float32], %stage2_unit2_bn2_beta: Tensor[(128), float32], %stage2_unit2_bn2_moving_mean: Tensor[(128), float32], %stage2_unit2_bn2_moving_var: Tensor[(128), float32], %stage2_unit2_conv2_weight: Tensor[(128, 128, 3, 3), float32], %stage3_unit1_bn1_gamma: Tensor[(128), float32], %stage3_unit1_bn1_beta: Tensor[(128), float32], %stage3_unit1_bn1_moving_mean: Tensor[(128), float32], %stage3_unit1_bn1_moving_var: Tensor[(128), float32], %stage3_unit1_conv1_weight: Tensor[(256, 128, 3, 3), float32], %stage3_unit1_bn2_gamma: Tensor[(256), float32], %stage3_unit1_bn2_beta: Tensor[(256), float32], %stage3_unit1_bn2_moving_mean: Tensor[(256), float32], %stage3_unit1_bn2_moving_var: Tensor[(256), float32], %stage3_unit1_conv2_weight: Tensor[(256, 256, 3, 3), float32], %stage3_unit1_sc_weight: Tensor[(256, 128, 1, 1), float32], %stage3_unit2_bn1_gamma: Tensor[(256), float32], %stage3_unit2_bn1_beta: Tensor[(256), float32], %stage3_unit2_bn1_moving_mean: Tensor[(256), float32], %stage3_unit2_bn1_moving_var: Tensor[(256), float32], %stage3_unit2_conv1_weight: Tensor[(256, 256, 3, 3), float32], %stage3_unit2_bn2_gamma: 
Tensor[(256), float32], %stage3_unit2_bn2_beta: Tensor[(256), float32], %stage3_unit2_bn2_moving_mean: Tensor[(256), float32], %stage3_unit2_bn2_moving_var: Tensor[(256), float32], %stage3_unit2_conv2_weight: Tensor[(256, 256, 3, 3), float32], %stage4_unit1_bn1_gamma: Tensor[(256), float32], %stage4_unit1_bn1_beta: Tensor[(256), float32], %stage4_unit1_bn1_moving_mean: Tensor[(256), float32], %stage4_unit1_bn1_moving_var: Tensor[(256), float32], %stage4_unit1_conv1_weight: Tensor[(512, 256, 3, 3), float32], %stage4_unit1_bn2_gamma: Tensor[(512), float32], %stage4_unit1_bn2_beta: Tensor[(512), float32], %stage4_unit1_bn2_moving_mean: Tensor[(512), float32], %stage4_unit1_bn2_moving_var: Tensor[(512), float32], %stage4_unit1_conv2_weight: Tensor[(512, 512, 3, 3), float32], %stage4_unit1_sc_weight: Tensor[(512, 256, 1, 1), float32], %stage4_unit2_bn1_gamma: Tensor[(512), float32], %stage4_unit2_bn1_beta: Tensor[(512), float32], %stage4_unit2_bn1_moving_mean: Tensor[(512), float32], %stage4_unit2_bn1_moving_var: Tensor[(512), float32], %stage4_unit2_conv1_weight: Tensor[(512, 512, 3, 3), float32], %stage4_unit2_bn2_gamma: Tensor[(512), float32], %stage4_unit2_bn2_beta: Tensor[(512), float32], %stage4_unit2_bn2_moving_mean: Tensor[(512), float32], %stage4_unit2_bn2_moving_var: Tensor[(512), float32], %stage4_unit2_conv2_weight: Tensor[(512, 512, 3, 3), float32], %bn1_gamma: Tensor[(512), float32], %bn1_beta: Tensor[(512), float32], %bn1_moving_mean: Tensor[(512), float32], %bn1_moving_var: Tensor[(512), float32], %fc1_weight: Tensor[(1000, 512), float32], %fc1_bias: Tensor[(1000), float32]) -> Tensor[(1, 1000), float32] {
  %0 = nn.batch_norm(%data, %bn_data_gamma, %bn_data_beta, %bn_data_moving_mean, %bn_data_moving_var, epsilon=2e-05f, scale=False) /* ty=(Tensor[(1, 3, 224, 224), float32], Tensor[(3), float32], Tensor[(3), float32]) */;
  %1 = %0.0;
  %2 = nn.conv2d(%1, %conv0_weight, strides=[2, 2], padding=[3, 3, 3, 3], channels=64, kernel_size=[7, 7]) /* ty=Tensor[(1, 64, 112, 112), float32] */;
  %3 = nn.batch_norm(%2, %bn0_gamma, %bn0_beta, %bn0_moving_mean, %bn0_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 112, 112), float32], Tensor[(64), float32], Tensor[(64), float32]) */;
  %4 = %3.0;
  %5 = nn.relu(%4) /* ty=Tensor[(1, 64, 112, 112), float32] */;
  %6 = nn.max_pool2d(%5, pool_size=[3, 3], strides=[2, 2], padding=[1, 1, 1, 1]) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %7 = nn.batch_norm(%6, %stage1_unit1_bn1_gamma, %stage1_unit1_bn1_beta, %stage1_unit1_bn1_moving_mean, %stage1_unit1_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 56, 56), float32], Tensor[(64), float32], Tensor[(64), float32]) */;
  %8 = %7.0;
  %9 = nn.relu(%8) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %10 = nn.conv2d(%9, %stage1_unit1_conv1_weight, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %11 = nn.batch_norm(%10, %stage1_unit1_bn2_gamma, %stage1_unit1_bn2_beta, %stage1_unit1_bn2_moving_mean, %stage1_unit1_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 56, 56), float32], Tensor[(64), float32], Tensor[(64), float32]) */;
  %12 = %11.0;
  %13 = nn.relu(%12) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %14 = nn.conv2d(%13, %stage1_unit1_conv2_weight, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %15 = nn.conv2d(%9, %stage1_unit1_sc_weight, padding=[0, 0, 0, 0], channels=64, kernel_size=[1, 1]) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %16 = add(%14, %15) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %17 = nn.batch_norm(%16, %stage1_unit2_bn1_gamma, %stage1_unit2_bn1_beta, %stage1_unit2_bn1_moving_mean, %stage1_unit2_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 56, 56), float32], Tensor[(64), float32], Tensor[(64), float32]) */;
  %18 = %17.0;
  %19 = nn.relu(%18) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %20 = nn.conv2d(%19, %stage1_unit2_conv1_weight, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %21 = nn.batch_norm(%20, %stage1_unit2_bn2_gamma, %stage1_unit2_bn2_beta, %stage1_unit2_bn2_moving_mean, %stage1_unit2_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 56, 56), float32], Tensor[(64), float32], Tensor[(64), float32]) */;
  %22 = %21.0;
  %23 = nn.relu(%22) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %24 = nn.conv2d(%23, %stage1_unit2_conv2_weight, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %25 = add(%24, %16) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %26 = nn.batch_norm(%25, %stage2_unit1_bn1_gamma, %stage2_unit1_bn1_beta, %stage2_unit1_bn1_moving_mean, %stage2_unit1_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 56, 56), float32], Tensor[(64), float32], Tensor[(64), float32]) */;
  %27 = %26.0;
  %28 = nn.relu(%27) /* ty=Tensor[(1, 64, 56, 56), float32] */;
  %29 = nn.conv2d(%28, %stage2_unit1_conv1_weight, strides=[2, 2], padding=[1, 1, 1, 1], channels=128, kernel_size=[3, 3]) /* ty=Tensor[(1, 128, 28, 28), float32] */;
  %30 = nn.batch_norm(%29, %stage2_unit1_bn2_gamma, %stage2_unit1_bn2_beta, %stage2_unit1_bn2_moving_mean, %stage2_unit1_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 128, 28, 28), float32], Tensor[(128), float32], Tensor[(128), float32]) */;
  %31 = %30.0;
  %32 = nn.relu(%31) /* ty=Tensor[(1, 128, 28, 28), float32] */;
  %33 = nn.conv2d(%32, %stage2_unit1_conv2_weight, padding=[1, 1, 1, 1], channels=128, kernel_size=[3, 3]) /* ty=Tensor[(1, 128, 28, 28), float32] */;
  %34 = nn.conv2d(%28, %stage2_unit1_sc_weight, strides=[2, 2], padding=[0, 0, 0, 0], channels=128, kernel_size=[1, 1]) /* ty=Tensor[(1, 128, 28, 28), float32] */;
  %35 = add(%33, %34) /* ty=Tensor[(1, 128, 28, 28), float32] */;
  %36 = nn.batch_norm(%35, %stage2_unit2_bn1_gamma, %stage2_unit2_bn1_beta, %stage2_unit2_bn1_moving_mean, %stage2_unit2_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 128, 28, 28), float32], Tensor[(128), float32], Tensor[(128), float32]) */;
  %37 = %36.0;
  %38 = nn.relu(%37) /* ty=Tensor[(1, 128, 28, 28), float32] */;
  %39 = nn.conv2d(%38, %stage2_unit2_conv1_weight, padding=[1, 1, 1, 1], channels=128, kernel_size=[3, 3]) /* ty=Tensor[(1, 128, 28, 28), float32] */;
  %40 = nn.batch_norm(%39, %stage2_unit2_bn2_gamma, %stage2_unit2_bn2_beta, %stage2_unit2_bn2_moving_mean, %stage2_unit2_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 128, 28, 28), float32], Tensor[(128), float32], Tensor[(128), float32]) */;
  %41 = %40.0;
  %42 = nn.relu(%41) /* ty=Tensor[(1, 128, 28, 28), float32] */;
  %43 = nn.conv2d(%42, %stage2_unit2_conv2_weight, padding=[1, 1, 1, 1], channels=128, kernel_size=[3, 3]) /* ty=Tensor[(1, 128, 28, 28), float32] */;
  %44 = add(%43, %35) /* ty=Tensor[(1, 128, 28, 28), float32] */;
  %45 = nn.batch_norm(%44, %stage3_unit1_bn1_gamma, %stage3_unit1_bn1_beta, %stage3_unit1_bn1_moving_mean, %stage3_unit1_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 128, 28, 28), float32], Tensor[(128), float32], Tensor[(128), float32]) */;
  %46 = %45.0;
  %47 = nn.relu(%46) /* ty=Tensor[(1, 128, 28, 28), float32] */;
  %48 = nn.conv2d(%47, %stage3_unit1_conv1_weight, strides=[2, 2], padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3]) /* ty=Tensor[(1, 256, 14, 14), float32] */;
  %49 = nn.batch_norm(%48, %stage3_unit1_bn2_gamma, %stage3_unit1_bn2_beta, %stage3_unit1_bn2_moving_mean, %stage3_unit1_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 256, 14, 14), float32], Tensor[(256), float32], Tensor[(256), float32]) */;
  %50 = %49.0;
  %51 = nn.relu(%50) /* ty=Tensor[(1, 256, 14, 14), float32] */;
  %52 = nn.conv2d(%51, %stage3_unit1_conv2_weight, padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3]) /* ty=Tensor[(1, 256, 14, 14), float32] */;
  %53 = nn.conv2d(%47, %stage3_unit1_sc_weight, strides=[2, 2], padding=[0, 0, 0, 0], channels=256, kernel_size=[1, 1]) /* ty=Tensor[(1, 256, 14, 14), float32] */;
  %54 = add(%52, %53) /* ty=Tensor[(1, 256, 14, 14), float32] */;
  %55 = nn.batch_norm(%54, %stage3_unit2_bn1_gamma, %stage3_unit2_bn1_beta, %stage3_unit2_bn1_moving_mean, %stage3_unit2_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 256, 14, 14), float32], Tensor[(256), float32], Tensor[(256), float32]) */;
  %56 = %55.0;
  %57 = nn.relu(%56) /* ty=Tensor[(1, 256, 14, 14), float32] */;
  %58 = nn.conv2d(%57, %stage3_unit2_conv1_weight, padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3]) /* ty=Tensor[(1, 256, 14, 14), float32] */;
  %59 = nn.batch_norm(%58, %stage3_unit2_bn2_gamma, %stage3_unit2_bn2_beta, %stage3_unit2_bn2_moving_mean, %stage3_unit2_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 256, 14, 14), float32], Tensor[(256), float32], Tensor[(256), float32]) */;
  %60 = %59.0;
  %61 = nn.relu(%60) /* ty=Tensor[(1, 256, 14, 14), float32] */;
  %62 = nn.conv2d(%61, %stage3_unit2_conv2_weight, padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3]) /* ty=Tensor[(1, 256, 14, 14), float32] */;
  %63 = add(%62, %54) /* ty=Tensor[(1, 256, 14, 14), float32] */;
  %64 = nn.batch_norm(%63, %stage4_unit1_bn1_gamma, %stage4_unit1_bn1_beta, %stage4_unit1_bn1_moving_mean, %stage4_unit1_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 256, 14, 14), float32], Tensor[(256), float32], Tensor[(256), float32]) */;
  %65 = %64.0;
  %66 = nn.relu(%65) /* ty=Tensor[(1, 256, 14, 14), float32] */;
  %67 = nn.conv2d(%66, %stage4_unit1_conv1_weight, strides=[2, 2], padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %68 = nn.batch_norm(%67, %stage4_unit1_bn2_gamma, %stage4_unit1_bn2_beta, %stage4_unit1_bn2_moving_mean, %stage4_unit1_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 512, 7, 7), float32], Tensor[(512), float32], Tensor[(512), float32]) */;
  %69 = %68.0;
  %70 = nn.relu(%69) /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %71 = nn.conv2d(%70, %stage4_unit1_conv2_weight, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %72 = nn.conv2d(%66, %stage4_unit1_sc_weight, strides=[2, 2], padding=[0, 0, 0, 0], channels=512, kernel_size=[1, 1]) /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %73 = add(%71, %72) /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %74 = nn.batch_norm(%73, %stage4_unit2_bn1_gamma, %stage4_unit2_bn1_beta, %stage4_unit2_bn1_moving_mean, %stage4_unit2_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 512, 7, 7), float32], Tensor[(512), float32], Tensor[(512), float32]) */;
  %75 = %74.0;
  %76 = nn.relu(%75) /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %77 = nn.conv2d(%76, %stage4_unit2_conv1_weight, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %78 = nn.batch_norm(%77, %stage4_unit2_bn2_gamma, %stage4_unit2_bn2_beta, %stage4_unit2_bn2_moving_mean, %stage4_unit2_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 512, 7, 7), float32], Tensor[(512), float32], Tensor[(512), float32]) */;
  %79 = %78.0;
  %80 = nn.relu(%79) /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %81 = nn.conv2d(%80, %stage4_unit2_conv2_weight, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %82 = add(%81, %73) /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %83 = nn.batch_norm(%82, %bn1_gamma, %bn1_beta, %bn1_moving_mean, %bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 512, 7, 7), float32], Tensor[(512), float32], Tensor[(512), float32]) */;
  %84 = %83.0;
  %85 = nn.relu(%84) /* ty=Tensor[(1, 512, 7, 7), float32] */;
  %86 = nn.global_avg_pool2d(%85) /* ty=Tensor[(1, 512, 1, 1), float32] */;
  %87 = nn.batch_flatten(%86) /* ty=Tensor[(1, 512), float32] */;
  %88 = nn.dense(%87, %fc1_weight, units=1000) /* ty=Tensor[(1, 1000), float32] */;
  %89 = nn.bias_add(%88, %fc1_bias, axis=-1) /* ty=Tensor[(1, 1000), float32] */;
  nn.softmax(%89) /* ty=Tensor[(1, 1000), float32] */
}
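As noted above, Relay can also import models from other frameworks instead of using the bundled resnet-18. For reference, a minimal sketch of the ONNX path (the file name resnet18.onnx and the input name "data" are assumptions for illustration, not files shipped with TVM):

# Sketch: import an ONNX model into Relay. Requires the `onnx` package
# and a model file on disk; both names below are hypothetical.
import onnx

onnx_model = onnx.load("resnet18.onnx")
shape_dict = {"data": (1, 3, 224, 224)}  # input name and shape are assumptions
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)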

Compilation

The next step is to compile the model using the Relay/TVM pipeline. Users can specify the optimization level of the compilation, from 0 to 3. The optimization passes include operator fusion, pre-computation, layout transformation, and so on.
relay.build returns three components: the execution graph in JSON format, the TVM module library of compiled functions for that graph on the target hardware, and the parameter blobs of the model. During compilation, Relay performs graph-level optimization while TVM performs tensor-level optimization, yielding an optimized runtime module for the model.

We first compile for an Nvidia GPU. Behind the scenes, relay.build first performs a number of graph-level optimizations, such as pruning and fusion, and then registers the operators (i.e. the nodes of the optimized graph) to TVM implementations to generate a tvm.Module. To produce the module library, TVM first lowers the high-level IR into the intrinsic IR of the specified target backend, CUDA in this example, and then generates the machine-code module library.

opt_level = 3
target = tvm.target.cuda()
with relay.build_config(opt_level=opt_level):
    graph, lib, params = relay.build(mod, target, params=params)

Output:

...1%, 0.01 MB, 35 KB/s, 0 seconds passed
...3%, 0.02 MB, 71 KB/s, 0 seconds passed
...5%, 0.02 MB, 107 KB/s, 0 seconds passed
...7%, 0.03 MB, 142 KB/s, 0 seconds passed
...9%, 0.04 MB, 178 KB/s, 0 seconds passed
...11%, 0.05 MB, 213 KB/s, 0 seconds passed
...13%, 0.05 MB, 248 KB/s, 0 seconds passed
...15%, 0.06 MB, 283 KB/s, 0 seconds passed
...17%, 0.07 MB, 318 KB/s, 0 seconds passed
...19%, 0.08 MB, 353 KB/s, 0 seconds passed
...21%, 0.09 MB, 387 KB/s, 0 seconds passed
...23%, 0.09 MB, 422 KB/s, 0 seconds passed
...25%, 0.10 MB, 457 KB/s, 0 seconds passed
...27%, 0.11 MB, 490 KB/s, 0 seconds passed
...29%, 0.12 MB, 525 KB/s, 0 seconds passed
...31%, 0.12 MB, 559 KB/s, 0 seconds passed
...33%, 0.13 MB, 593 KB/s, 0 seconds passed
...35%, 0.14 MB, 627 KB/s, 0 seconds passed
...37%, 0.15 MB, 661 KB/s, 0 seconds passed
...39%, 0.16 MB, 695 KB/s, 0 seconds passed
...41%, 0.16 MB, 729 KB/s, 0 seconds passed
...43%, 0.17 MB, 763 KB/s, 0 seconds passed
...45%, 0.18 MB, 797 KB/s, 0 seconds passed
...47%, 0.19 MB, 830 KB/s, 0 seconds passed
...49%, 0.20 MB, 863 KB/s, 0 seconds passed
...51%, 0.20 MB, 897 KB/s, 0 seconds passed
...53%, 0.21 MB, 929 KB/s, 0 seconds passed
...55%, 0.22 MB, 963 KB/s, 0 seconds passed
...57%, 0.23 MB, 997 KB/s, 0 seconds passed
...59%, 0.23 MB, 1031 KB/s, 0 seconds passed
...61%, 0.24 MB, 1062 KB/s, 0 seconds passed
...63%, 0.25 MB, 1096 KB/s, 0 seconds passed
...65%, 0.26 MB, 1129 KB/s, 0 seconds passed
...67%, 0.27 MB, 1163 KB/s, 0 seconds passed
...69%, 0.27 MB, 1193 KB/s, 0 seconds passed
...71%, 0.28 MB, 1227 KB/s, 0 seconds passed
...73%, 0.29 MB, 1261 KB/s, 0 seconds passed
...75%, 0.30 MB, 1294 KB/s, 0 seconds passed
...77%, 0.30 MB, 1324 KB/s, 0 seconds passed
...79%, 0.31 MB, 1357 KB/s, 0 seconds passed
...81%, 0.32 MB, 1391 KB/s, 0 seconds passed
...83%, 0.33 MB, 1424 KB/s, 0 seconds passed
...85%, 0.34 MB, 1456 KB/s, 0 seconds passed
...87%, 0.34 MB, 1490 KB/s, 0 seconds passed
...89%, 0.35 MB, 1522 KB/s, 0 seconds passed
...91%, 0.36 MB, 1555 KB/s, 0 seconds passed
...93%, 0.37 MB, 1587 KB/s, 0 seconds passed
...95%, 0.38 MB, 1620 KB/s, 0 seconds passed
...97%, 0.38 MB, 1652 KB/s, 0 seconds passed
...99%, 0.39 MB, 1685 KB/s, 0 seconds passed
...100%, 0.40 MB, 1717 KB/s, 0 seconds passed
Cannot find config for target=cuda -model=unknown, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 64, 56, 56), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=unknown, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 128, 28, 28), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=unknown, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 14, 14), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=unknown, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 7, 7), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda -model=unknown, workload=('dense_small_batch.cuda', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
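The warnings above only mean that no tuned schedules were found for these workloads, so TVM falls back to default configurations; auto-tuning can eliminate them. Incidentally, the same compile step works for other backends. For example, if your build only has llvm enabled, a minimal CPU variant could look like this (a sketch, not part of the original tutorial):

# Sketch: compile the same module for the CPU instead of CUDA.
cpu_target = "llvm"
with relay.build_config(opt_level=opt_level):
    graph_cpu, lib_cpu, params_cpu = relay.build(mod, cpu_target, params=params)
# At run time, pass tvm.cpu() instead of tvm.gpu() as the context.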

Run the Generated Library

Now we can create a graph runtime and run the module on the Nvidia GPU.

# create random input
ctx = tvm.gpu()
data = np.random.uniform(-1, 1, size=data_shape).astype("float32")
# create module
module = graph_runtime.create(graph, lib, ctx)
# set input and parameters
module.set_input("data", data)
module.set_input(**params)
# run
module.run()
# get output
out = module.get_output(0, tvm.nd.empty(out_shape)).asnumpy()
# Print first 10 elements of output
print(out.flatten()[0:10])

Output:

[0.00089283 0.00103331 0.0009094 0.00102275 0.00108751 0.00106737
0.00106262 0.00095838 0.00110792 0.00113151]
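To get a rough latency figure for the compiled module, the graph runtime exposes a time evaluator; a minimal sketch (the repetition counts are arbitrary choices):

# Sketch: benchmark the compiled module with the built-in time evaluator.
ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=3)
prof_res = np.array(ftimer().results) * 1000  # seconds -> milliseconds
print("Mean inference time: %.2f ms" % np.mean(prof_res))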

Save and Load the Compiled Module

We can also save the graph, the library, and the parameters into files and then load them back in a deployment environment.

# save the graph, lib and params into separate files
from tvm.contrib import util

temp = util.tempdir()
path_lib = temp.relpath("deploy_lib.tar")
lib.export_library(path_lib)
with open(temp.relpath("deploy_graph.json"), "w") as fo:
    fo.write(graph)
with open(temp.relpath("deploy_param.params"), "wb") as fo:
    fo.write(relay.save_param_dict(params))
print(temp.listdir())

Output:
['deploy_lib.tar', 'deploy_param.params', 'deploy_graph.json']

Load the compiled TVM module library, the JSON execution graph, and the model parameter file

The code is as follows:

# load the module back.
loaded_json = open(temp.relpath("deploy_graph.json")).read()
loaded_lib = tvm.runtime.load_module(path_lib)
loaded_params = bytearray(open(temp.relpath("deploy_param.params"), "rb").read())
input_data = tvm.nd.array(np.random.uniform(size=data_shape).astype("float32"))

module = graph_runtime.create(loaded_json, loaded_lib, ctx)
module.load_params(loaded_params)
module.run(data=input_data)
out_deploy = module.get_output(0).asnumpy()
# Print first 10 elements of output
print(out_deploy.flatten()[0:10])
# check whether the output from deployed module is consistent with original one
tvm.testing.assert_allclose(out_deploy, out, atol=1e-3)

Output:
[0.00090713 0.00105705 0.00094459 0.00103146 0.00110017 0.00105846
0.00104143 0.00095862 0.0010827 0.00111618]
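As an alternative to module.load_params, the serialized parameter blob can also be deserialized into a Python dict and fed through set_input; a minimal sketch:

# Sketch: deserialize the saved parameter blob back into a dict of NDArrays.
params_dict = relay.load_param_dict(loaded_params)
module.set_input(**params_dict)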

Download Python source code: relay_quick_start.py
Download Jupyter notebook: relay_quick_start.ipynb