TVM Notes (1)

Created: 2020-06-05 23:03
Updated: 2020-06-06 22:24
Tags: TVM
Source: https://zhuanlan.zhihu.com/p/144488522

Next, let's try out the workflow from the previous part on a real example. I picked a color-space conversion, RGB to YUV, which is common in vision pipelines.
The environment used by the code below, and a few constant definitions:

    w = 1920
    h = 1080
    c = 3
    target = 'llvm'

Hardware: Jetson Xavier
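
For completeness, the snippets below assume roughly the following imports. This is my reconstruction rather than an excerpt from the original post; the complete file appears at the end.

    # Imports assumed by the snippets below (reconstructed, not shown in the original excerpts).
    import timeit
    import numpy as np
    import tvm
    import tvm.testing
    from tvm import te, autotvm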

1. Performance with the default schedule

First, describe the computation with TE (Tensor Expression):

    def rgb_yuv_schedule(w, h, c=3):
        rgb = te.placeholder((w, h, c), name='rgb')

        # The per-channel formulas are nested here so they can close over the
        # rgb placeholder defined above.
        def rgb2y(i, j, k):
            return rgb[i, j, 0]*0.257 + rgb[i, j, 1]*0.504 + rgb[i, j, 2]*0.098

        def rgb2u(i, j, k):
            return rgb[i, j, 0]*-0.148 + rgb[i, j, 1]*-0.291 + rgb[i, j, 2]*0.439

        def rgb2v(i, j, k):
            return rgb[i, j, 0]*0.439 + rgb[i, j, 1]*-0.368 + rgb[i, j, 2]*-0.071

        rgb2yuv = lambda i, j, k: te.if_then_else(
            k == 0, rgb2y(i, j, k),
            te.if_then_else(k == 1, rgb2u(i, j, k), rgb2v(i, j, k)))
        yuv = te.compute((w, h, c), rgb2yuv, name='yuv')
        s = te.create_schedule(yuv.op)
        return s, [rgb, yuv]

  • te.placeholder defines the input
  • te.compute defines the computation rule; its two key parameters are:
    • shape, which defines the dimensions of the output
    • fcompute, a compute function that is called once for every point of the output described by shape, as in the pseudocode below:

    # Pseudocode for te.compute(shape, fcompute), corresponding to the rgb2yuv example above
    for i, j, k in shape:
        result[i, j, k] = fcompute(i, j, k)

The operations used inside fcompute must all be ones TVM supports, such as if_then_else. The rgb2yuv lambda above is admittedly clumsy; if there is a more elegant way to write it, suggestions are welcome. Clumsy as it looks, after the compiler's constant-related optimizations it should have no real impact on performance.
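
One possible alternative, sketched below purely as an untested idea (the name rgb_yuv_matmul_schedule and the coef placeholder are mine, not from the original code), is to pass the 3x3 coefficient matrix in as a second placeholder and write the conversion as a tiny matrix multiply over a reduce axis. Note the function signature changes, so the benchmark below would have to pass the coefficient array as an extra argument.

    # Sketch (my own, untested here): RGB -> YUV as a matrix multiply with a reduce axis.
    def rgb_yuv_matmul_schedule(w, h, c=3):
        rgb = te.placeholder((w, h, c), name='rgb')
        coef = te.placeholder((c, c), name='coef')   # the same 3x3 matrix as b_np below
        r = te.reduce_axis((0, c), name='r')
        yuv = te.compute((w, h, c),
                         lambda i, j, k: te.sum(rgb[i, j, r] * coef[r, k], axis=r),
                         name='yuv')
        s = te.create_schedule(yuv.op)
        return s, [rgb, coef, yuv]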

First, the benchmarking code, adapted from Dive into Deep Learning Compiler:

    def bench_workload(workload):
        workload(1)  # warmup
        time = workload(1)  # the time to run once
        if time > 1:
            return time
        # The number of repeats to measure at least 1 second
        num_repeats = max(int(1.0 / time), 5)
        return workload(num_repeats) / num_repeats

Now let's actually run it:

    def bench_tvm_func(func, name='rgb2yuv'):
        a_np = np.random.uniform(size=(w, h, c)).astype(np.float32)
        b_np = np.array([[0.257, -0.148, 0.439],
                         [0.504, -0.291, -0.368],
                         [0.098, 0.439, -0.071]]).astype(np.float32)
        c_np = a_np.dot(b_np)
        ctx = tvm.context(target, 0)
        a_tvm = tvm.nd.array(a_np, ctx=ctx)
        c_tvm = tvm.nd.empty(c_np.shape, ctx=ctx)
        func(a_tvm, c_tvm)
        tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-3)

        def workload(nrepeats):
            evaluator = func.time_evaluator(func.entry_name, ctx, number=400)
            return evaluator(a_tvm, c_tvm).mean * nrepeats

        print('Time cost of this operator: %f' % bench_workload(workload))
        func.save('/tmp/%s.o' % name)

    def default_schedule():
        s, (rgb, yuv) = rgb_yuv_schedule(w, h)
        if target == 'opencl' or target == 'cuda':
            bx, tx = s[yuv].split(yuv.op.axis[1], factor=64)
            s[yuv].bind(bx, te.thread_axis("blockIdx.x"))
            s[yuv].bind(tx, te.thread_axis("threadIdx.x"))
        func = tvm.build(s, [rgb, yuv], target=target, name='rgb2yuv')
        bench_tvm_func(func, name='rgb2yuv_default')

    default_schedule()

Notes:

  • tvm.build compiles the TE-defined computation into code that actually runs on the given target (a quick way to inspect what it generates follows this list)
  • the compiled func has the prototype func(*args_buf), i.e. func(rgb, yuv)
  • tvm.nd.array is used much like a NumPy array; the context specifies whether the memory lives on the CPU or on the GPU
  • when the target is a GPU, a split is applied because CUDA limits the number of threads per block
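
To see what a schedule actually produces, tvm.lower can print the lowered IR. A quick sketch, my addition rather than part of the original benchmark:

    # Sketch (my addition): print the lowered IR of the default schedule.
    s, (rgb, yuv) = rgb_yuv_schedule(w, h)
    print(tvm.lower(s, [rgb, yuv], simple_mode=True))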

The result on the Jetson Xavier is 2.7 ms:

    Time cost of this operator: 0.002705

For comparison, the same conversion implemented with NumPy and benchmarked:

    import timeit
    import numpy as np

    timer = timeit.Timer(
        setup='import numpy as np\n'
              'a_np = np.random.uniform(size=(%d, %d, %d)).astype(np.float32)\n'
              'b_np = np.array([[0.257,-0.148,0.439],[0.504,-0.291,-0.368],[0.098,0.439,-0.071]]).astype(np.float32)' % (w, h, c),
        stmt='c_np = a_np.dot(b_np)')
    print('time for np.dot((%d, %d)):' % (w, h), bench_workload(timer.timeit))

In my environment this comes out at about 120 ms, which is excessively large; I suspect some Python-side processing overhead, but I did not dig into it.
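
If I were to dig in, one thing worth checking (an untested assumption on my part) is whether the 3-D np.dot path is to blame; reshaping to 2-D first, reusing a_np and b_np as defined in the benchmark above, routes the work through a plain GEMM:

    # Sketch (my addition, not timed on the Xavier): the same product via a 2-D reshape.
    a2 = a_np.reshape(-1, c)                # (w*h, 3)
    c2 = a2.dot(b_np).reshape(w, h, c)      # equivalent result to a_np.dot(b_np)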

2. Applying autotuning on the CPU

    @autotvm.template("multimedia/rgb_yuv_trans")
    def rgb_yuv_trans(w, h, c=3):
        s, (rgb, yuv) = rgb_yuv_schedule(w, h, c)
        x, y = yuv.op.axis[0], yuv.op.axis[1]
        cfg = autotvm.get_config()
        if target == 'opencl' or target == 'cuda':
            xy = s[yuv].fuse(x, y)
            #### define space begin ####
            cfg.define_split("xy_split",
                             cfg.axis(x.dom.extent.value * y.dom.extent.value),
                             num_outputs=2)
            #### define space end ####
            bx, tx = cfg["xy_split"].apply(s, yuv, xy)
            s[yuv].bind(bx, te.thread_axis("blockIdx.x"))
            s[yuv].bind(tx, te.thread_axis("threadIdx.x"))
            s[yuv].unroll(yuv.op.axis[2])
        else:
            #### define space begin ####
            cfg.define_split("tile_y", y, num_outputs=2)
            #### define space end ####
            yo, yi = cfg["tile_y"].apply(s, yuv, y)
            xy = s[yuv].fuse(x, yo)
            s[yuv].parallel(xy)
            s[yuv].vectorize(yi)
            s[yuv].unroll(yuv.op.axis[2])
        return s, [rgb, yuv]

Notes:

  • the autotvm.template decorator marks this as a tunable template
  • xy_split is defined via cfg.axis directly; because of a TVM limitation, passing the fused xy itself raises an error (a quick check of the size of the resulting search space follows this list)
  • the CPU and GPU cases get separate schedules here
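
As a quick sanity check of my own: with num_outputs=2 and the default factors policy, define_split enumerates the divisors of the axis extent, so for y with extent 1080 there are 32 candidates, matching the "len=32" in the ConfigSpace printout below.

    # 1080 = 2^3 * 3^3 * 5 has (3+1)*(3+1)*(1+1) = 32 divisors,
    # hence len=32 for tile_y in the tuning output.
    print(sum(1 for f in range(1, 1081) if 1080 % f == 0))   # 32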

Tune, then evaluate:

    def tune_task():
        task = autotvm.task.create("multimedia/rgb_yuv_trans", args=(w, h), target=target)
        print(task.config_space)
        measure_option = autotvm.measure_option(builder='local',
                                                runner=autotvm.LocalRunner(number=5))
        tuner = autotvm.tuner.RandomTuner(task)
        if 0:
            n_trial = len(task.config_space)
        else:
            n_trial = 10
        tuner.tune(n_trial=n_trial, measure_option=measure_option,
                   callbacks=[autotvm.callback.log_to_file('rgb2yuv.log')])

    def evaluate_task():
        # apply history best from log file
        with autotvm.apply_history_best('rgb2yuv.log'):
            with tvm.target.create(target):
                s, arg_bufs = rgb_yuv_trans(w, h)
                func = tvm.build(s, arg_bufs, name='rgb2yuv')
        bench_tvm_func(func)

    import logging
    import sys
    logging.getLogger('autotvm').setLevel(logging.DEBUG)
    logging.getLogger('autotvm').addHandler(logging.StreamHandler(sys.stdout))

    tune_task()
    evaluate_task()
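
The log file accumulates records across runs, which is why the output below reports "Finish loading 90 records". If it grows large, autotvm.record.pick_best can distill just the best entries; a sketch of my own, not used in this post (the output filename is mine):

    # Sketch (my addition): keep only the best records from the accumulated log.
    autotvm.record.pick_best('rgb2yuv.log', 'rgb2yuv_best.log')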

To keep the run short, n_trial is set to 10; the output:

    ConfigSpace (len=32, space_map=
    0 tile_y: Split(policy=factors, product=1080, num_outputs=2) len=32
    )
    Get devices for measurement successfully!
    No: 1 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735018.7626786) [('tile_y', [-1, 540])],None,30
    No: 2 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735018.7632723) [('tile_y', [-1, 216])],None,27
    No: 3 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735018.787355) [('tile_y', [-1, 270])],None,28
    No: 4 GFLOPS: 25.26/25.26 result: MeasureResult(costs=(0.004187388,), error_no=0, all_cost=8.747061729431152, timestamp=1590735019.2084217) [('tile_y', [-1, 180])],None,26
    No: 5 GFLOPS: 74.88/74.88 result: MeasureResult(costs=(0.001412358,), error_no=0, all_cost=0.41734933853149414, timestamp=1590735019.6017284) [('tile_y', [-1, 4])],None,3
    No: 6 GFLOPS: 66.83/74.88 result: MeasureResult(costs=(0.0015824143999999998,), error_no=0, all_cost=0.4230306148529053, timestamp=1590735019.9837952) [('tile_y', [-1, 5])],None,4
    No: 7 GFLOPS: 91.06/91.06 result: MeasureResult(costs=(0.0011614144,), error_no=0, all_cost=0.5114855766296387, timestamp=1590735020.364343) [('tile_y', [-1, 12])],None,9
    No: 8 GFLOPS: 89.31/91.06 result: MeasureResult(costs=(0.0011841097999999999,), error_no=0, all_cost=0.38067150115966797, timestamp=1590735020.7485013) [('tile_y', [-1, 3])],None,2
    No: 9 GFLOPS: 0.00/91.06 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735030.9819949) [('tile_y', [-1, 360])],None,29
    No: 10 GFLOPS: 71.80/91.06 result: MeasureResult(costs=(0.0014729818,), error_no=0, all_cost=0.6799407005310059, timestamp=1590735031.3066556) [('tile_y', [-1, 24])],None,13
    Finish loading 90 records
    Time cost of this operator: 0.000682

After tuning, the runtime is 0.68 ms, a sizable improvement.
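
With only 10 random trials this is unlikely to be the optimum; autotvm also provides model-based tuners such as XGBTuner, which could be swapped in inside tune_task. A sketch, not benchmarked here:

    # Sketch (my addition, not benchmarked): a model-based tuner; requires xgboost.
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=min(32, len(task.config_space)),
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('rgb2yuv.log')])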

3. Results on the GPU

    Time cost of this operator: 0.009534
    Get devices for measurement successfully!
    No: 1 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n [bt] (4) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f7a3d0160]\n [bt] (3) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0x522468) [0x7f79bb9468]\n [bt] (2) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x308) [0x7f79bb80c8]\n [bt] (1) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x850) [0x7f79d9f290]\n [bt] (0) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0xd354c8) [0x7f7a3cc4c8]\n File "tvm/_ffi/_cython/./packed_func.pxi", line 55, in tvm._ffi._cy3.core.tvm_callback\n File "/home/administrator/workspace/incubator-tvm/python/tvm/autotvm/measure/measure_methods.py", line 622, in verify_pass\n raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.0365450382232666, timestamp=1590743689.5894103) [('xy_split', [-1, 3072])],None,97
    No: 2 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n [bt] (4) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f7a3d0160]\n [bt] (3) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0x522468) [0x7f79bb9468]\n [bt] (2) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x308) [0x7f79bb80c8]\n [bt] (1) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x850) [0x7f79d9f290]\n [bt] (0) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0xd354c8) [0x7f7a3cc4c8]\n File "tvm/_ffi/_cython/./packed_func.pxi", line 55, in tvm._ffi._cy3.core.tvm_callback\n File "/home/administrator/workspace/incubator-tvm/python/tvm/autotvm/measure/measure_methods.py", line 622, in verify_pass\n raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.03839516639709473, timestamp=1590743689.5899727) [('xy_split', [-1, 64800])],None,146
    No: 3 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n [bt] (4) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f7a3d0160]\n [bt] (3) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0x522468) [0x7f79bb9468]\n [bt] (2) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x308) [0x7f79bb80c8]\n [bt] (1) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x850) [0x7f79d9f290]\n [bt] (0) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0xd354c8) [0x7f7a3cc4c8]\n File "tvm/_ffi/_cython/./packed_func.pxi", line 55, in tvm._ffi._cy3.core.tvm_callback\n File "/home/administrator/workspace/incubator-tvm/python/tvm/autotvm/measure/measure_methods.py", line 622, in verify_pass\n raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.039269208908081055, timestamp=1590743689.618169) [('xy_split', [-1, 57600])],None,145
    No: 4 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n [bt] (4) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f7a3d0160]\n [bt] (3) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0x522468) [0x7f79bb9468]\n [bt] (2) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x308) [0x7f79bb80c8]\n [bt] (1) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x850) [0x7f79d9f290]\n [bt] (0) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0xd354c8) [0x7f7a3cc4c8]\n File "tvm/_ffi/_cython/./packed_func.pxi", line 55, in tvm._ffi._cy3.core.tvm_callback\n File "/home/administrator/workspace/incubator-tvm/python/tvm/autotvm/measure/measure_methods.py", line 622, in verify_pass\n raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.03884124755859375, timestamp=1590743689.6184874) [('xy_split', [-1, 103680])],None,151
    No: 5 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n [bt] (4) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f7a3d0160]\n [bt] (3) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0x522468) [0x7f79bb9468]\n [bt] (2) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x308) [0x7f79bb80c8]\n [bt] (1) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x850) [0x7f79d9f290]\n [bt] (0) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0xd354c8) [0x7f7a3cc4c8]\n File "tvm/_ffi/_cython/./packed_func.pxi", line 55, in tvm._ffi._cy3.core.tvm_callback\n File "/home/administrator/workspace/incubator-tvm/python/tvm/autotvm/measure/measure_methods.py", line 622, in verify_pass\n raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.035482168197631836, timestamp=1590743689.6187103) [('xy_split', [-1, 2160])],None,90
    No: 6 GFLOPS: 160.59/160.59 result: MeasureResult(costs=(0.000658552,), error_no=0, all_cost=1.1336748600006104, timestamp=1590743690.9209) [('xy_split', [-1, 80])],None,29
    No: 7 GFLOPS: 0.00/160.59 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n [bt] (4) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f7a3d0160]\n [bt] (3) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0x522468) [0x7f79bb9468]\n [bt] (2) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x308) [0x7f79bb80c8]\n [bt] (1) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x850) [0x7f79d9f290]\n [bt] (0) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0xd354c8) [0x7f7a3cc4c8]\n File "tvm/_ffi/_cython/./packed_func.pxi", line 55, in tvm._ffi._cy3.core.tvm_callback\n File "/home/administrator/workspace/incubator-tvm/python/tvm/autotvm/measure/measure_methods.py", line 622, in verify_pass\n raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.03384804725646973, timestamp=1590743690.3713021) [('xy_split', [-1, 10800])],None,121
    No: 8 GFLOPS: 155.19/160.59 result: MeasureResult(costs=(0.0006814266,), error_no=0, all_cost=1.0592899322509766, timestamp=1590743691.4001758) [('xy_split', [-1, 240])],None,47
    No: 9 GFLOPS: 0.00/160.59 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n [bt] (4) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f7a3d0160]\n [bt] (3) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0x522468) [0x7f79bb9468]\n [bt] (2) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x308) [0x7f79bb80c8]\n [bt] (1) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(tvm::tir::transform::PrimFuncPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x850) [0x7f79d9f290]\n [bt] (0) /media/2afa5f4f-ac3c-4df5-ac77-8b743727d671/workspace/incubator-tvm/build/libtvm.so(+0xd354c8) [0x7f7a3cc4c8]\n File "tvm/_ffi/_cython/./packed_func.pxi", line 55, in tvm._ffi._cy3.core.tvm_callback\n File "/home/administrator/workspace/incubator-tvm/python/tvm/autotvm/measure/measure_methods.py", line 622, in verify_pass\n raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.02752852439880371, timestamp=1590743691.6169024) [('xy_split', [-1, 1280])],None,79
    No: 10 GFLOPS: 141.10/160.59 result: MeasureResult(costs=(0.0007495002,), error_no=0, all_cost=1.0822956562042236, timestamp=1590743692.7331798) [('xy_split', [-1, 405])],None,57
    Finish loading 100 records
    Time cost of this operator: 0.000554

Before autotuning the GPU time is 9.5 ms; afterwards it is 0.55 ms.

4. Summary

In the final numbers, the GPU is not dramatically faster than the CPU, which I find a bit puzzling; maybe my optimization falls short, or there is some other reason. You are welcome to explore it with me.
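
One plausible explanation, though only a guess on my part: this operator does very little arithmetic per byte, so both backends may simply be limited by the Xavier's shared LPDDR4x memory, whose peak bandwidth is on the order of 137 GB/s. A rough back-of-the-envelope check:

    # Rough estimate (my numbers): memory traffic per call divided by the measured times.
    bytes_moved = w * h * c * 4 * 2        # read float32 rgb + write float32 yuv
    print(bytes_moved / 0.000554 / 1e9)    # ~90 GB/s at the tuned GPU time
    print(bytes_moved / 0.000682 / 1e9)    # ~73 GB/s at the tuned CPU time
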
Finally, the complete Python file is attached as an image (Figure 1) in the original post.