Quick Start Tutorial for Compiling Deep Learning Models
Author: Yao Wang, Truman Tian
This example shows how to build a neural network with the Relay Python frontend and how to use TVM to generate a runtime library for an Nvidia GPU. Note that you need to build TVM with both CUDA and LLVM enabled.
Overview of Hardware Backends Supported by TVM
The figure below shows the hardware backends currently supported by TVM:

![Quick Start Tutorial for Compiling Deep Learning Models - Figure 1](/uploads/projects/ixxw@ai/5e5d525acc7e79880151e9fb1d688148.jpeg)
In this tutorial, we will choose cuda and llvm as the target backends.
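If you are not sure whether your TVM build has these backends enabled, a minimal check is shown below (a small sketch, assuming the tvm.runtime.enabled helper is available in your TVM version):

```python
import tvm

# Print whether the runtime was built with each backend we plan to use.
# tvm.runtime.enabled(...) returns True if the runtime supports the target.
for backend in ["cuda", "llvm"]:
    print(backend, "enabled:", tvm.runtime.enabled(backend))
```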
To begin with, let's import Relay and TVM.
```python
import numpy as np

from tvm import relay
from tvm.relay import testing
import tvm
from tvm import te
from tvm.contrib import graph_runtime
```
Define a Neural Network in Relay
First, we define a neural network with the Relay Python frontend. For simplicity, we use the predefined resnet-18 network in Relay and initialize its parameters with the Xavier initializer. Relay also supports other model formats such as MXNet, CoreML, ONNX and TensorFlow.
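For example, a model exported from another framework can be imported through the corresponding Relay frontend. The sketch below is only illustrative and assumes a local ONNX file named model.onnx with an input tensor called data; it is not part of this tutorial's workflow:

```python
import onnx
from tvm import relay

# Load the ONNX model and convert it into a Relay module plus its parameters.
onnx_model = onnx.load("model.onnx")        # hypothetical model file
shape_dict = {"data": (1, 3, 224, 224)}     # map input name -> input shape
onnx_mod, onnx_params = relay.frontend.from_onnx(onnx_model, shape_dict)
```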
This tutorial assumes that inference is done on a GPU device, the batch size is set to 1, and the input is an RGB color image of size 224 * 224. You can call tvm.relay.expr.TupleWrapper.astext() to show the network structure.
```python
batch_size = 1
num_class = 1000
image_shape = (3, 224, 224)
data_shape = (batch_size,) + image_shape
out_shape = (batch_size, num_class)

mod, params = relay.testing.resnet.get_workload(
    num_layers=18, batch_size=batch_size, image_shape=image_shape
)

# set show_meta_data=True if you want to show meta data
print(mod.astext(show_meta_data=False))
```
Out:
v0.0.4def @main(%data: Tensor[(1, 3, 224, 224), float32], %bn_data_gamma: Tensor[(3), float32], %bn_data_beta: Tensor[(3), float32], %bn_data_moving_mean: Tensor[(3), float32], %bn_data_moving_var: Tensor[(3), float32], %conv0_weight: Tensor[(64, 3, 7, 7), float32], %bn0_gamma: Tensor[(64), float32], %bn0_beta: Tensor[(64), float32], %bn0_moving_mean: Tensor[(64), float32], %bn0_moving_var: Tensor[(64), float32], %stage1_unit1_bn1_gamma: Tensor[(64), float32], %stage1_unit1_bn1_beta: Tensor[(64), float32], %stage1_unit1_bn1_moving_mean: Tensor[(64), float32], %stage1_unit1_bn1_moving_var: Tensor[(64), float32], %stage1_unit1_conv1_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_bn2_gamma: Tensor[(64), float32], %stage1_unit1_bn2_beta: Tensor[(64), float32], %stage1_unit1_bn2_moving_mean: Tensor[(64), float32], %stage1_unit1_bn2_moving_var: Tensor[(64), float32], %stage1_unit1_conv2_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit1_sc_weight: Tensor[(64, 64, 1, 1), float32], %stage1_unit2_bn1_gamma: Tensor[(64), float32], %stage1_unit2_bn1_beta: Tensor[(64), float32], %stage1_unit2_bn1_moving_mean: Tensor[(64), float32], %stage1_unit2_bn1_moving_var: Tensor[(64), float32], %stage1_unit2_conv1_weight: Tensor[(64, 64, 3, 3), float32], %stage1_unit2_bn2_gamma: Tensor[(64), float32], %stage1_unit2_bn2_beta: Tensor[(64), float32], %stage1_unit2_bn2_moving_mean: Tensor[(64), float32], %stage1_unit2_bn2_moving_var: Tensor[(64), float32], %stage1_unit2_conv2_weight: Tensor[(64, 64, 3, 3), float32], %stage2_unit1_bn1_gamma: Tensor[(64), float32], %stage2_unit1_bn1_beta: Tensor[(64), float32], %stage2_unit1_bn1_moving_mean: Tensor[(64), float32], %stage2_unit1_bn1_moving_var: Tensor[(64), float32], %stage2_unit1_conv1_weight: Tensor[(128, 64, 3, 3), float32], %stage2_unit1_bn2_gamma: Tensor[(128), float32], %stage2_unit1_bn2_beta: Tensor[(128), float32], %stage2_unit1_bn2_moving_mean: Tensor[(128), float32], %stage2_unit1_bn2_moving_var: Tensor[(128), float32], %stage2_unit1_conv2_weight: Tensor[(128, 128, 3, 3), float32], %stage2_unit1_sc_weight: Tensor[(128, 64, 1, 1), float32], %stage2_unit2_bn1_gamma: Tensor[(128), float32], %stage2_unit2_bn1_beta: Tensor[(128), float32], %stage2_unit2_bn1_moving_mean: Tensor[(128), float32], %stage2_unit2_bn1_moving_var: Tensor[(128), float32], %stage2_unit2_conv1_weight: Tensor[(128, 128, 3, 3), float32], %stage2_unit2_bn2_gamma: Tensor[(128), float32], %stage2_unit2_bn2_beta: Tensor[(128), float32], %stage2_unit2_bn2_moving_mean: Tensor[(128), float32], %stage2_unit2_bn2_moving_var: Tensor[(128), float32], %stage2_unit2_conv2_weight: Tensor[(128, 128, 3, 3), float32], %stage3_unit1_bn1_gamma: Tensor[(128), float32], %stage3_unit1_bn1_beta: Tensor[(128), float32], %stage3_unit1_bn1_moving_mean: Tensor[(128), float32], %stage3_unit1_bn1_moving_var: Tensor[(128), float32], %stage3_unit1_conv1_weight: Tensor[(256, 128, 3, 3), float32], %stage3_unit1_bn2_gamma: Tensor[(256), float32], %stage3_unit1_bn2_beta: Tensor[(256), float32], %stage3_unit1_bn2_moving_mean: Tensor[(256), float32], %stage3_unit1_bn2_moving_var: Tensor[(256), float32], %stage3_unit1_conv2_weight: Tensor[(256, 256, 3, 3), float32], %stage3_unit1_sc_weight: Tensor[(256, 128, 1, 1), float32], %stage3_unit2_bn1_gamma: Tensor[(256), float32], %stage3_unit2_bn1_beta: Tensor[(256), float32], %stage3_unit2_bn1_moving_mean: Tensor[(256), float32], %stage3_unit2_bn1_moving_var: Tensor[(256), float32], %stage3_unit2_conv1_weight: Tensor[(256, 256, 3, 3), float32], %stage3_unit2_bn2_gamma: 
Tensor[(256), float32], %stage3_unit2_bn2_beta: Tensor[(256), float32], %stage3_unit2_bn2_moving_mean: Tensor[(256), float32], %stage3_unit2_bn2_moving_var: Tensor[(256), float32], %stage3_unit2_conv2_weight: Tensor[(256, 256, 3, 3), float32], %stage4_unit1_bn1_gamma: Tensor[(256), float32], %stage4_unit1_bn1_beta: Tensor[(256), float32], %stage4_unit1_bn1_moving_mean: Tensor[(256), float32], %stage4_unit1_bn1_moving_var: Tensor[(256), float32], %stage4_unit1_conv1_weight: Tensor[(512, 256, 3, 3), float32], %stage4_unit1_bn2_gamma: Tensor[(512), float32], %stage4_unit1_bn2_beta: Tensor[(512), float32], %stage4_unit1_bn2_moving_mean: Tensor[(512), float32], %stage4_unit1_bn2_moving_var: Tensor[(512), float32], %stage4_unit1_conv2_weight: Tensor[(512, 512, 3, 3), float32], %stage4_unit1_sc_weight: Tensor[(512, 256, 1, 1), float32], %stage4_unit2_bn1_gamma: Tensor[(512), float32], %stage4_unit2_bn1_beta: Tensor[(512), float32], %stage4_unit2_bn1_moving_mean: Tensor[(512), float32], %stage4_unit2_bn1_moving_var: Tensor[(512), float32], %stage4_unit2_conv1_weight: Tensor[(512, 512, 3, 3), float32], %stage4_unit2_bn2_gamma: Tensor[(512), float32], %stage4_unit2_bn2_beta: Tensor[(512), float32], %stage4_unit2_bn2_moving_mean: Tensor[(512), float32], %stage4_unit2_bn2_moving_var: Tensor[(512), float32], %stage4_unit2_conv2_weight: Tensor[(512, 512, 3, 3), float32], %bn1_gamma: Tensor[(512), float32], %bn1_beta: Tensor[(512), float32], %bn1_moving_mean: Tensor[(512), float32], %bn1_moving_var: Tensor[(512), float32], %fc1_weight: Tensor[(1000, 512), float32], %fc1_bias: Tensor[(1000), float32]) -> Tensor[(1, 1000), float32] {%0 = nn.batch_norm(%data, %bn_data_gamma, %bn_data_beta, %bn_data_moving_mean, %bn_data_moving_var, epsilon=2e-05f, scale=False) /* ty=(Tensor[(1, 3, 224, 224), float32], Tensor[(3), float32], Tensor[(3), float32]) */;%1 = %0.0;%2 = nn.conv2d(%1, %conv0_weight, strides=[2, 2], padding=[3, 3, 3, 3], channels=64, kernel_size=[7, 7]) /* ty=Tensor[(1, 64, 112, 112), float32] */;%3 = nn.batch_norm(%2, %bn0_gamma, %bn0_beta, %bn0_moving_mean, %bn0_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 112, 112), float32], Tensor[(64), float32], Tensor[(64), float32]) */;%4 = %3.0;%5 = nn.relu(%4) /* ty=Tensor[(1, 64, 112, 112), float32] */;%6 = nn.max_pool2d(%5, pool_size=[3, 3], strides=[2, 2], padding=[1, 1, 1, 1]) /* ty=Tensor[(1, 64, 56, 56), float32] */;%7 = nn.batch_norm(%6, %stage1_unit1_bn1_gamma, %stage1_unit1_bn1_beta, %stage1_unit1_bn1_moving_mean, %stage1_unit1_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 56, 56), float32], Tensor[(64), float32], Tensor[(64), float32]) */;%8 = %7.0;%9 = nn.relu(%8) /* ty=Tensor[(1, 64, 56, 56), float32] */;%10 = nn.conv2d(%9, %stage1_unit1_conv1_weight, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 56, 56), float32] */;%11 = nn.batch_norm(%10, %stage1_unit1_bn2_gamma, %stage1_unit1_bn2_beta, %stage1_unit1_bn2_moving_mean, %stage1_unit1_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 56, 56), float32], Tensor[(64), float32], Tensor[(64), float32]) */;%12 = %11.0;%13 = nn.relu(%12) /* ty=Tensor[(1, 64, 56, 56), float32] */;%14 = nn.conv2d(%13, %stage1_unit1_conv2_weight, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 56, 56), float32] */;%15 = nn.conv2d(%9, %stage1_unit1_sc_weight, padding=[0, 0, 0, 0], channels=64, kernel_size=[1, 1]) /* ty=Tensor[(1, 64, 56, 56), float32] */;%16 = add(%14, %15) /* ty=Tensor[(1, 64, 56, 56), float32] */;%17 = nn.batch_norm(%16, 
%stage1_unit2_bn1_gamma, %stage1_unit2_bn1_beta, %stage1_unit2_bn1_moving_mean, %stage1_unit2_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 56, 56), float32], Tensor[(64), float32], Tensor[(64), float32]) */;%18 = %17.0;%19 = nn.relu(%18) /* ty=Tensor[(1, 64, 56, 56), float32] */;%20 = nn.conv2d(%19, %stage1_unit2_conv1_weight, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 56, 56), float32] */;%21 = nn.batch_norm(%20, %stage1_unit2_bn2_gamma, %stage1_unit2_bn2_beta, %stage1_unit2_bn2_moving_mean, %stage1_unit2_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 56, 56), float32], Tensor[(64), float32], Tensor[(64), float32]) */;%22 = %21.0;%23 = nn.relu(%22) /* ty=Tensor[(1, 64, 56, 56), float32] */;%24 = nn.conv2d(%23, %stage1_unit2_conv2_weight, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 56, 56), float32] */;%25 = add(%24, %16) /* ty=Tensor[(1, 64, 56, 56), float32] */;%26 = nn.batch_norm(%25, %stage2_unit1_bn1_gamma, %stage2_unit1_bn1_beta, %stage2_unit1_bn1_moving_mean, %stage2_unit1_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 64, 56, 56), float32], Tensor[(64), float32], Tensor[(64), float32]) */;%27 = %26.0;%28 = nn.relu(%27) /* ty=Tensor[(1, 64, 56, 56), float32] */;%29 = nn.conv2d(%28, %stage2_unit1_conv1_weight, strides=[2, 2], padding=[1, 1, 1, 1], channels=128, kernel_size=[3, 3]) /* ty=Tensor[(1, 128, 28, 28), float32] */;%30 = nn.batch_norm(%29, %stage2_unit1_bn2_gamma, %stage2_unit1_bn2_beta, %stage2_unit1_bn2_moving_mean, %stage2_unit1_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 128, 28, 28), float32], Tensor[(128), float32], Tensor[(128), float32]) */;%31 = %30.0;%32 = nn.relu(%31) /* ty=Tensor[(1, 128, 28, 28), float32] */;%33 = nn.conv2d(%32, %stage2_unit1_conv2_weight, padding=[1, 1, 1, 1], channels=128, kernel_size=[3, 3]) /* ty=Tensor[(1, 128, 28, 28), float32] */;%34 = nn.conv2d(%28, %stage2_unit1_sc_weight, strides=[2, 2], padding=[0, 0, 0, 0], channels=128, kernel_size=[1, 1]) /* ty=Tensor[(1, 128, 28, 28), float32] */;%35 = add(%33, %34) /* ty=Tensor[(1, 128, 28, 28), float32] */;%36 = nn.batch_norm(%35, %stage2_unit2_bn1_gamma, %stage2_unit2_bn1_beta, %stage2_unit2_bn1_moving_mean, %stage2_unit2_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 128, 28, 28), float32], Tensor[(128), float32], Tensor[(128), float32]) */;%37 = %36.0;%38 = nn.relu(%37) /* ty=Tensor[(1, 128, 28, 28), float32] */;%39 = nn.conv2d(%38, %stage2_unit2_conv1_weight, padding=[1, 1, 1, 1], channels=128, kernel_size=[3, 3]) /* ty=Tensor[(1, 128, 28, 28), float32] */;%40 = nn.batch_norm(%39, %stage2_unit2_bn2_gamma, %stage2_unit2_bn2_beta, %stage2_unit2_bn2_moving_mean, %stage2_unit2_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 128, 28, 28), float32], Tensor[(128), float32], Tensor[(128), float32]) */;%41 = %40.0;%42 = nn.relu(%41) /* ty=Tensor[(1, 128, 28, 28), float32] */;%43 = nn.conv2d(%42, %stage2_unit2_conv2_weight, padding=[1, 1, 1, 1], channels=128, kernel_size=[3, 3]) /* ty=Tensor[(1, 128, 28, 28), float32] */;%44 = add(%43, %35) /* ty=Tensor[(1, 128, 28, 28), float32] */;%45 = nn.batch_norm(%44, %stage3_unit1_bn1_gamma, %stage3_unit1_bn1_beta, %stage3_unit1_bn1_moving_mean, %stage3_unit1_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 128, 28, 28), float32], Tensor[(128), float32], Tensor[(128), float32]) */;%46 = %45.0;%47 = nn.relu(%46) /* ty=Tensor[(1, 128, 28, 28), float32] */;%48 = nn.conv2d(%47, %stage3_unit1_conv1_weight, strides=[2, 2], padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 
3]) /* ty=Tensor[(1, 256, 14, 14), float32] */;%49 = nn.batch_norm(%48, %stage3_unit1_bn2_gamma, %stage3_unit1_bn2_beta, %stage3_unit1_bn2_moving_mean, %stage3_unit1_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 256, 14, 14), float32], Tensor[(256), float32], Tensor[(256), float32]) */;%50 = %49.0;%51 = nn.relu(%50) /* ty=Tensor[(1, 256, 14, 14), float32] */;%52 = nn.conv2d(%51, %stage3_unit1_conv2_weight, padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3]) /* ty=Tensor[(1, 256, 14, 14), float32] */;%53 = nn.conv2d(%47, %stage3_unit1_sc_weight, strides=[2, 2], padding=[0, 0, 0, 0], channels=256, kernel_size=[1, 1]) /* ty=Tensor[(1, 256, 14, 14), float32] */;%54 = add(%52, %53) /* ty=Tensor[(1, 256, 14, 14), float32] */;%55 = nn.batch_norm(%54, %stage3_unit2_bn1_gamma, %stage3_unit2_bn1_beta, %stage3_unit2_bn1_moving_mean, %stage3_unit2_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 256, 14, 14), float32], Tensor[(256), float32], Tensor[(256), float32]) */;%56 = %55.0;%57 = nn.relu(%56) /* ty=Tensor[(1, 256, 14, 14), float32] */;%58 = nn.conv2d(%57, %stage3_unit2_conv1_weight, padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3]) /* ty=Tensor[(1, 256, 14, 14), float32] */;%59 = nn.batch_norm(%58, %stage3_unit2_bn2_gamma, %stage3_unit2_bn2_beta, %stage3_unit2_bn2_moving_mean, %stage3_unit2_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 256, 14, 14), float32], Tensor[(256), float32], Tensor[(256), float32]) */;%60 = %59.0;%61 = nn.relu(%60) /* ty=Tensor[(1, 256, 14, 14), float32] */;%62 = nn.conv2d(%61, %stage3_unit2_conv2_weight, padding=[1, 1, 1, 1], channels=256, kernel_size=[3, 3]) /* ty=Tensor[(1, 256, 14, 14), float32] */;%63 = add(%62, %54) /* ty=Tensor[(1, 256, 14, 14), float32] */;%64 = nn.batch_norm(%63, %stage4_unit1_bn1_gamma, %stage4_unit1_bn1_beta, %stage4_unit1_bn1_moving_mean, %stage4_unit1_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 256, 14, 14), float32], Tensor[(256), float32], Tensor[(256), float32]) */;%65 = %64.0;%66 = nn.relu(%65) /* ty=Tensor[(1, 256, 14, 14), float32] */;%67 = nn.conv2d(%66, %stage4_unit1_conv1_weight, strides=[2, 2], padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 7, 7), float32] */;%68 = nn.batch_norm(%67, %stage4_unit1_bn2_gamma, %stage4_unit1_bn2_beta, %stage4_unit1_bn2_moving_mean, %stage4_unit1_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 512, 7, 7), float32], Tensor[(512), float32], Tensor[(512), float32]) */;%69 = %68.0;%70 = nn.relu(%69) /* ty=Tensor[(1, 512, 7, 7), float32] */;%71 = nn.conv2d(%70, %stage4_unit1_conv2_weight, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 7, 7), float32] */;%72 = nn.conv2d(%66, %stage4_unit1_sc_weight, strides=[2, 2], padding=[0, 0, 0, 0], channels=512, kernel_size=[1, 1]) /* ty=Tensor[(1, 512, 7, 7), float32] */;%73 = add(%71, %72) /* ty=Tensor[(1, 512, 7, 7), float32] */;%74 = nn.batch_norm(%73, %stage4_unit2_bn1_gamma, %stage4_unit2_bn1_beta, %stage4_unit2_bn1_moving_mean, %stage4_unit2_bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 512, 7, 7), float32], Tensor[(512), float32], Tensor[(512), float32]) */;%75 = %74.0;%76 = nn.relu(%75) /* ty=Tensor[(1, 512, 7, 7), float32] */;%77 = nn.conv2d(%76, %stage4_unit2_conv1_weight, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 7, 7), float32] */;%78 = nn.batch_norm(%77, %stage4_unit2_bn2_gamma, %stage4_unit2_bn2_beta, %stage4_unit2_bn2_moving_mean, %stage4_unit2_bn2_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 512, 7, 7), float32], 
Tensor[(512), float32], Tensor[(512), float32]) */;%79 = %78.0;%80 = nn.relu(%79) /* ty=Tensor[(1, 512, 7, 7), float32] */;%81 = nn.conv2d(%80, %stage4_unit2_conv2_weight, padding=[1, 1, 1, 1], channels=512, kernel_size=[3, 3]) /* ty=Tensor[(1, 512, 7, 7), float32] */;%82 = add(%81, %73) /* ty=Tensor[(1, 512, 7, 7), float32] */;%83 = nn.batch_norm(%82, %bn1_gamma, %bn1_beta, %bn1_moving_mean, %bn1_moving_var, epsilon=2e-05f) /* ty=(Tensor[(1, 512, 7, 7), float32], Tensor[(512), float32], Tensor[(512), float32]) */;%84 = %83.0;%85 = nn.relu(%84) /* ty=Tensor[(1, 512, 7, 7), float32] */;%86 = nn.global_avg_pool2d(%85) /* ty=Tensor[(1, 512, 1, 1), float32] */;%87 = nn.batch_flatten(%86) /* ty=Tensor[(1, 512), float32] */;%88 = nn.dense(%87, %fc1_weight, units=1000) /* ty=Tensor[(1, 1000), float32] */;%89 = nn.bias_add(%88, %fc1_bias, axis=-1) /* ty=Tensor[(1, 1000), float32] */;nn.softmax(%89) /* ty=Tensor[(1, 1000), float32] */}
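The params object returned by get_workload is a dictionary that maps parameter names to tvm.nd.NDArray values. A quick way to inspect it (a small sketch, not part of the original script) is:

```python
# Print a few parameter names and shapes, then count the total number of weights.
for name in list(params.keys())[:3]:
    print(name, params[name].shape)

total = sum(np.prod(v.shape) for v in params.values())
print("total number of parameters:", int(total))
```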
Compilation
The next step is to compile the model using the Relay/TVM pipeline. Users can specify the optimization level of the compilation, which currently ranges from 0 to 3. The optimization passes include operator fusion, pre-computation, layout transformation and so on.
relay.build returns three components: the execution graph in json format, the TVM module library of compiled functions for this graph on the target hardware, and the parameter blobs of the model. During compilation, Relay performs graph-level optimizations while TVM performs tensor-level optimizations, resulting in an optimized runtime module for model inference.
We first compile for an Nvidia GPU. Behind the scenes, relay.build first performs a number of graph-level optimizations, e.g. branch pruning, fusion, etc., then registers the operators (i.e. the nodes of the optimized graph) to TVM implementations to generate a tvm.module. To produce the module library, TVM first lowers the high-level IR into the intrinsic IR of the specified target backend, which is CUDA in this example, and then generates the machine-code module library.
```python
opt_level = 3
target = tvm.target.cuda()
with relay.build_config(opt_level=opt_level):
    graph, lib, params = relay.build(mod, target, params=params)
```
Out:
...1%, 0.01 MB, 35 KB/s, 0 seconds passed...3%, 0.02 MB, 71 KB/s, 0 seconds passed...5%, 0.02 MB, 107 KB/s, 0 seconds passed...7%, 0.03 MB, 142 KB/s, 0 seconds passed...9%, 0.04 MB, 178 KB/s, 0 seconds passed...11%, 0.05 MB, 213 KB/s, 0 seconds passed...13%, 0.05 MB, 248 KB/s, 0 seconds passed...15%, 0.06 MB, 283 KB/s, 0 seconds passed...17%, 0.07 MB, 318 KB/s, 0 seconds passed...19%, 0.08 MB, 353 KB/s, 0 seconds passed...21%, 0.09 MB, 387 KB/s, 0 seconds passed...23%, 0.09 MB, 422 KB/s, 0 seconds passed...25%, 0.10 MB, 457 KB/s, 0 seconds passed...27%, 0.11 MB, 490 KB/s, 0 seconds passed...29%, 0.12 MB, 525 KB/s, 0 seconds passed...31%, 0.12 MB, 559 KB/s, 0 seconds passed...33%, 0.13 MB, 593 KB/s, 0 seconds passed...35%, 0.14 MB, 627 KB/s, 0 seconds passed...37%, 0.15 MB, 661 KB/s, 0 seconds passed...39%, 0.16 MB, 695 KB/s, 0 seconds passed...41%, 0.16 MB, 729 KB/s, 0 seconds passed...43%, 0.17 MB, 763 KB/s, 0 seconds passed...45%, 0.18 MB, 797 KB/s, 0 seconds passed...47%, 0.19 MB, 830 KB/s, 0 seconds passed...49%, 0.20 MB, 863 KB/s, 0 seconds passed...51%, 0.20 MB, 897 KB/s, 0 seconds passed...53%, 0.21 MB, 929 KB/s, 0 seconds passed...55%, 0.22 MB, 963 KB/s, 0 seconds passed...57%, 0.23 MB, 997 KB/s, 0 seconds passed...59%, 0.23 MB, 1031 KB/s, 0 seconds passed...61%, 0.24 MB, 1062 KB/s, 0 seconds passed...63%, 0.25 MB, 1096 KB/s, 0 seconds passed...65%, 0.26 MB, 1129 KB/s, 0 seconds passed...67%, 0.27 MB, 1163 KB/s, 0 seconds passed...69%, 0.27 MB, 1193 KB/s, 0 seconds passed...71%, 0.28 MB, 1227 KB/s, 0 seconds passed...73%, 0.29 MB, 1261 KB/s, 0 seconds passed...75%, 0.30 MB, 1294 KB/s, 0 seconds passed...77%, 0.30 MB, 1324 KB/s, 0 seconds passed...79%, 0.31 MB, 1357 KB/s, 0 seconds passed...81%, 0.32 MB, 1391 KB/s, 0 seconds passed...83%, 0.33 MB, 1424 KB/s, 0 seconds passed...85%, 0.34 MB, 1456 KB/s, 0 seconds passed...87%, 0.34 MB, 1490 KB/s, 0 seconds passed...89%, 0.35 MB, 1522 KB/s, 0 seconds passed...91%, 0.36 MB, 1555 KB/s, 0 seconds passed...93%, 0.37 MB, 1587 KB/s, 0 seconds passed...95%, 0.38 MB, 1620 KB/s, 0 seconds passed...97%, 0.38 MB, 1652 KB/s, 0 seconds passed...99%, 0.39 MB, 1685 KB/s, 0 seconds passed...100%, 0.40 MB, 1717 KB/s, 0 seconds passedCannot find config for target=cuda -model=unknown, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 64, 56, 56), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.Cannot find config for target=cuda -model=unknown, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 128, 28, 28), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.Cannot find config for target=cuda -model=unknown, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 256, 14, 14), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.Cannot find config for target=cuda -model=unknown, workload=('conv2d_nchw.cuda', ('TENSOR', (1, 512, 7, 7), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.Cannot find config for target=cuda -model=unknown, workload=('dense_small_batch.cuda', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32'). 
A fallback configuration is used, which may bring great performance regression.
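Since llvm was also listed as a target backend, the same model can be compiled for a CPU simply by changing the target. The following is a brief sketch (assuming an LLVM-enabled build), not something the original tutorial runs:

```python
# Compile the same network for a CPU target. We reload the workload so that
# the parameters are the original, untransformed ones.
cpu_mod, cpu_params = relay.testing.resnet.get_workload(
    num_layers=18, batch_size=batch_size, image_shape=image_shape
)
with relay.build_config(opt_level=opt_level):
    cpu_graph, cpu_lib, cpu_params = relay.build(cpu_mod, "llvm", params=cpu_params)

# The CPU module is created with a CPU context instead of a GPU one.
cpu_module = graph_runtime.create(cpu_graph, cpu_lib, tvm.cpu(0))
```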
Run the Generated Library
Now we can create a graph runtime and run the module on the Nvidia GPU.
```python
# create random input
ctx = tvm.gpu()
data = np.random.uniform(-1, 1, size=data_shape).astype("float32")
# create module
module = graph_runtime.create(graph, lib, ctx)
# set input and parameters
module.set_input("data", data)
module.set_input(**params)
# run
module.run()
# get output
out = module.get_output(0, tvm.nd.empty(out_shape)).asnumpy()

# Print first 10 elements of output
print(out.flatten()[0:10])
```
Out:
[0.00089283 0.00103331 0.0009094  0.00102275 0.00108751 0.00106737
 0.00106262 0.00095838 0.00110792 0.00113151]
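If you want a rough idea of how long inference takes on the GPU, the underlying runtime module exposes a time evaluator. The snippet below is a small sketch (the numbers will of course depend on your hardware):

```python
# Run the "run" function several times and report the mean latency in milliseconds.
ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=3)
prof_res = np.array(ftimer().results) * 1000  # seconds -> milliseconds
print("Mean inference time: %.2f ms (std %.2f ms)" % (np.mean(prof_res), np.std(prof_res)))
```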
Save and Load the Compiled Module
We can also save the graph, library and parameters to files and load them back in a deployment environment.
```python
# save the graph, lib and params into separate files
from tvm.contrib import util

temp = util.tempdir()
path_lib = temp.relpath("deploy_lib.tar")
lib.export_library(path_lib)
with open(temp.relpath("deploy_graph.json"), "w") as fo:
    fo.write(graph)
with open(temp.relpath("deploy_param.params"), "wb") as fo:
    fo.write(relay.save_param_dict(params))
print(temp.listdir())
```
Out:
['deploy_lib.tar', 'deploy_param.params', 'deploy_graph.json']
Load the compiled TVM module library, the json execution graph and the model parameters back.
The code is as follows:
```python
# load the module back.
loaded_json = open(temp.relpath("deploy_graph.json")).read()
loaded_lib = tvm.runtime.load_module(path_lib)
loaded_params = bytearray(open(temp.relpath("deploy_param.params"), "rb").read())
input_data = tvm.nd.array(np.random.uniform(size=data_shape).astype("float32"))

module = graph_runtime.create(loaded_json, loaded_lib, ctx)
module.load_params(loaded_params)
module.run(data=input_data)
out_deploy = module.get_output(0).asnumpy()

# Print first 10 elements of output
print(out_deploy.flatten()[0:10])

# check whether the output from deployed module is consistent with original one
tvm.testing.assert_allclose(out_deploy, out, atol=1e-3)
```
Out:
[0.00090713 0.00105705 0.00094459 0.00103146 0.00110017 0.00105846
0.00104143 0.00095862 0.0010827 0.00111618]
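The serialized parameter blob can also be parsed back into a dictionary with relay.load_param_dict, which is convenient for checking that nothing was lost during serialization. A short sketch:

```python
# Parse the saved parameter bytes back into a name -> NDArray dictionary.
reloaded_params = relay.load_param_dict(loaded_params)
print("number of saved parameters:", len(reloaded_params))
```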
Download Python source code: relay_quick_start.py
Download Jupyter notebook: relay_quick_start.ipynb
