TVM Notes (1)
| Created: | 2020-06-05 23:03 |
|---|---|
| Updated: | 2020-06-06 22:24 |
| Tags: | TVM |
| Source: | https://zhuanlan.zhihu.com/p/144488522 |
Next, let's walk through the workflow described above with a concrete example. I picked a color space conversion, RGB to YUV, which is a common stage in vision pipelines.
The environment used by the code below, plus a few constant definitions:
```python
import timeit

import numpy as np
import tvm
from tvm import te, autotvm

w = 1920
h = 1080
c = 3
target = 'llvm'   # switch to 'cuda' for the GPU experiments in section 3
```
Hardware: Jetson Xavier
1. Performance with the default schedule
First, describe the computation with TE (Tensor Expression):
```python
def rgb_yuv_schedule(w, h, c=3):
    rgb = te.placeholder((w, h, c), name='rgb')

    def rgb2y(i, j, k):
        return rgb[i, j, 0]*0.257 + rgb[i, j, 1]*0.504 + rgb[i, j, 2]*0.098

    def rgb2u(i, j, k):
        return rgb[i, j, 0]*-0.148 + rgb[i, j, 1]*-0.291 + rgb[i, j, 2]*0.439

    def rgb2v(i, j, k):
        return rgb[i, j, 0]*0.439 + rgb[i, j, 1]*-0.368 + rgb[i, j, 2]*-0.071

    rgb2yuv = lambda i, j, k: te.if_then_else(
        k == 0, rgb2y(i, j, k),
        te.if_then_else(k == 1, rgb2u(i, j, k), rgb2v(i, j, k)))

    yuv = te.compute((w, h, c), rgb2yuv, name='yuv')
    s = te.create_schedule(yuv.op)
    return s, [rgb, yuv]
```
- te.placeholder defines the input.
- te.compute defines the computation rule. Its two key arguments are:
  - shape, the dimensions of the output;
  - fcompute, a compute function that is called once for every point of the output described by shape. The operations used inside fcompute must all be supported by TVM, such as if_then_else. For the rgb2yuv example above, te.compute(shape, fcompute) corresponds to the pseudocode below.

```python
# Pseudocode for te.compute(shape, fcompute) in the rgb2yuv example
for i, j, k in shape:
    result[i, j, k] = fcompute(i, j, k)
```

The rgb2yuv lambda above is written rather clumsily; if there is a more elegant way to express it, suggestions are welcome. Clumsy as it is, after the compiler's constant folding it makes no real difference to performance.
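As a quick sanity check on what te.compute produces (a minimal sketch of my own, not part of the original article), the lowered IR can be printed to see the generated loop nest:

```python
# Sketch: print the lowered loop nest for the schedule above to see the
# i/j/k loops and the if_then_else select that te.compute generated.
s, (rgb, yuv) = rgb_yuv_schedule(w, h)
print(tvm.lower(s, [rgb, yuv], simple_mode=True))
```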
First, the benchmarking code, adapted from Dive into Deep Learning Compiler:

```python
def bench_workload(workload):
    workload(1)          # warmup
    time = workload(1)   # the time to run once
    if time > 1:
        return time
    # The number of repeats needed to measure for at least 1 second
    num_repeats = max(int(1.0 / time), 5)
    return workload(num_repeats) / num_repeats
```

Next, actually run it:

```python
def bench_tvm_func(func, name='rgb2yuv'):
    a_np = np.random.uniform(size=(w, h, c)).astype(np.float32)
    b_np = np.array([[0.257, -0.148,  0.439],
                     [0.504, -0.291, -0.368],
                     [0.098,  0.439, -0.071]]).astype(np.float32)
    c_np = a_np.dot(b_np)   # NumPy reference used as the correctness check

    ctx = tvm.context(target, 0)
    a_tvm = tvm.nd.array(a_np, ctx=ctx)
    c_tvm = tvm.nd.empty(c_np.shape, ctx=ctx)
    func(a_tvm, c_tvm)
    tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-3)

    def workload(nrepeats):
        evaluator = func.time_evaluator(func.entry_name, ctx, number=400)
        return evaluator(a_tvm, c_tvm).mean * nrepeats

    print('Time cost of this operator: %f' % bench_workload(workload))
    func.save('/tmp/%s.o' % name)


def default_schedule():
    s, (rgb, yuv) = rgb_yuv_schedule(w, h)
    if target == 'opencl' or target == 'cuda':
        bx, tx = s[yuv].split(yuv.op.axis[1], factor=64)
        s[yuv].bind(bx, te.thread_axis("blockIdx.x"))
        s[yuv].bind(tx, te.thread_axis("threadIdx.x"))
    func = tvm.build(s, [rgb, yuv], target=target, name='rgb2yuv')
    bench_tvm_func(func, name='rgb2yuv_default')


default_schedule()
```

Notes:
- tvm.build compiles the computation described in TE into code that actually runs on the target.
- The compiled func has the prototype func(*args_buf), i.e. func(rgb, yuv).
- tvm.nd.array is used much like a NumPy array; the context specifies whether the memory lives on the CPU or the GPU.
- When the target is a GPU, a split is applied first, because CUDA limits the number of threads per block.
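The correctness check c_np = a_np.dot(b_np) works because the three per-channel formulas are just one 3x3 matrix applied to every pixel. A small self-contained check (plain NumPy; the variable names are mine) makes that equivalence explicit:

```python
import numpy as np

# [Y U V] = [R G B] @ M, where the columns of M hold the Y, U and V coefficients.
M = np.array([[0.257, -0.148,  0.439],
              [0.504, -0.291, -0.368],
              [0.098,  0.439, -0.071]], dtype=np.float32)

pixel = np.array([0.5, 0.25, 0.75], dtype=np.float32)   # an arbitrary RGB value
y = pixel[0]*0.257 + pixel[1]*0.504 + pixel[2]*0.098
u = pixel[0]*-0.148 + pixel[1]*-0.291 + pixel[2]*0.439
v = pixel[0]*0.439 + pixel[1]*-0.368 + pixel[2]*-0.071
assert np.allclose(pixel.dot(M), [y, u, v])
```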
The result on Jetson Xavier is about 2.7 ms:

```
Time cost of this operator: 0.002705
```
For comparison, implement the same conversion with NumPy and benchmark it:
```python
timer = timeit.Timer(
    setup='import numpy as np\n'
          'a_np = np.random.uniform(size=(%d, %d, %d)).astype(np.float32)\n'
          'b_np = np.array([[0.257,-0.148,0.439],[0.504,-0.291,-0.368],'
          '[0.098,0.439,-0.071]]).astype(np.float32)' % (w, h, c),
    stmt='c_np = a_np.dot(b_np)')
print('time for np.dot((%d, %d)):' % (w, h), bench_workload(timer.timeit))
```
In my environment this comes out at roughly 120 ms, which is excessively large; I suspect some Python-side processing overhead, but I have not dug into it.
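One cheap way to narrow down where that time goes (a sketch of my own, not from the original article) is to time the mathematically equivalent 2-D matmul, which removes the handling of the (w, h, 3) layout from the picture:

```python
# Sketch: compare the 3-D np.dot against the equivalent reshaped 2-D form.
# If the reshaped version is much faster, the gap is layout/dispatch overhead
# rather than the arithmetic itself.
import timeit
import numpy as np

a_np = np.random.uniform(size=(1920, 1080, 3)).astype(np.float32)
b_np = np.array([[0.257, -0.148,  0.439],
                 [0.504, -0.291, -0.368],
                 [0.098,  0.439, -0.071]], dtype=np.float32)

t3d = timeit.timeit(lambda: a_np.dot(b_np), number=10) / 10
t2d = timeit.timeit(lambda: a_np.reshape(-1, 3).dot(b_np).reshape(a_np.shape),
                    number=10) / 10
print('3-D dot: %.4f s, reshaped 2-D dot: %.4f s' % (t3d, t2d))
```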
2. Applying autotuning on the CPU
```python
@autotvm.template("multimedia/rgb_yuv_trans")
def rgb_yuv_trans(w, h, c=3):
    s, (rgb, yuv) = rgb_yuv_schedule(w, h, c)
    x, y = yuv.op.axis[0], yuv.op.axis[1]
    cfg = autotvm.get_config()
    if target == 'opencl' or target == 'cuda':
        xy = s[yuv].fuse(x, y)
        #### define space begin ####
        cfg.define_split("xy_split",
                         cfg.axis(x.dom.extent.value * y.dom.extent.value),
                         num_outputs=2)
        #### define space end ####
        bx, tx = cfg["xy_split"].apply(s, yuv, xy)
        s[yuv].bind(bx, te.thread_axis("blockIdx.x"))
        s[yuv].bind(tx, te.thread_axis("threadIdx.x"))
        s[yuv].unroll(yuv.op.axis[2])
    else:
        #### define space begin ####
        cfg.define_split("tile_y", y, num_outputs=2)
        #### define space end ####
        yo, yi = cfg["tile_y"].apply(s, yuv, y)
        xy = s[yuv].fuse(x, yo)
        s[yuv].parallel(xy)
        s[yuv].vectorize(yi)
        s[yuv].unroll(yuv.op.axis[2])
    return s, [rgb, yuv]
```
Notes:
- The autotvm.template decorator marks this as a tunable function.
- The xy_split knob is defined directly on cfg.axis(...); due to a TVM limitation, passing the fused xy axis itself raises an error.
- The CPU and GPU cases are handled by separate branches.
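On the CPU branch, the tile_y search space is simply the set of ways to factor the y extent into two parts. A quick back-of-the-envelope check in plain Python explains the `len=32` that shows up in the tuning log below:

```python
# tile_y with num_outputs=2 enumerates the factor pairs (1080 // f, f) of the
# y extent, so the size of the space equals the number of divisors of 1080.
divisors = [f for f in range(1, 1081) if 1080 % f == 0]
print(len(divisors))     # 32, matching "ConfigSpace (len=32, ...)" in the log
print(divisors[:8])      # [1, 2, 3, 4, 5, 6, 8, 9]
```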
Tune and evaluate:
```python
import logging
import sys

def tune_task():
    task = autotvm.task.create("multimedia/rgb_yuv_trans", args=(w, h), target=target)
    print(task.config_space)
    measure_option = autotvm.measure_option(builder='local',
                                            runner=autotvm.LocalRunner(number=5))
    tuner = autotvm.tuner.RandomTuner(task)
    if 0:
        n_trial = len(task.config_space)   # exhaust the whole space
    else:
        n_trial = 10                       # keep the run short
    tuner.tune(n_trial=n_trial,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('rgb2yuv.log')])


def evaluate_task():
    # apply history best from log file
    with autotvm.apply_history_best('rgb2yuv.log'):
        with tvm.target.create(target):
            s, arg_bufs = rgb_yuv_trans(w, h)
            func = tvm.build(s, arg_bufs, name='rgb2yuv')
    bench_tvm_func(func)


logging.getLogger('autotvm').setLevel(logging.DEBUG)
logging.getLogger('autotvm').addHandler(logging.StreamHandler(sys.stdout))

tune_task()
evaluate_task()
```
To keep the run short, n_trial is set to 10; the output:
```
ConfigSpace (len=32, space_map=
   0 tile_y: Split(policy=factors, product=1080, num_outputs=2) len=32
)
Get devices for measurement successfully!
No: 1   GFLOPS: 0.00/0.00   result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735018.7626786) [('tile_y', [-1, 540])],None,30
No: 2   GFLOPS: 0.00/0.00   result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735018.7632723) [('tile_y', [-1, 216])],None,27
No: 3   GFLOPS: 0.00/0.00   result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735018.787355) [('tile_y', [-1, 270])],None,28
No: 4   GFLOPS: 25.26/25.26 result: MeasureResult(costs=(0.004187388,), error_no=0, all_cost=8.747061729431152, timestamp=1590735019.2084217) [('tile_y', [-1, 180])],None,26
No: 5   GFLOPS: 74.88/74.88 result: MeasureResult(costs=(0.001412358,), error_no=0, all_cost=0.41734933853149414, timestamp=1590735019.6017284) [('tile_y', [-1, 4])],None,3
No: 6   GFLOPS: 66.83/74.88 result: MeasureResult(costs=(0.0015824143999999998,), error_no=0, all_cost=0.4230306148529053, timestamp=1590735019.9837952) [('tile_y', [-1, 5])],None,4
No: 7   GFLOPS: 91.06/91.06 result: MeasureResult(costs=(0.0011614144,), error_no=0, all_cost=0.5114855766296387, timestamp=1590735020.364343) [('tile_y', [-1, 12])],None,9
No: 8   GFLOPS: 89.31/91.06 result: MeasureResult(costs=(0.0011841097999999999,), error_no=0, all_cost=0.38067150115966797, timestamp=1590735020.7485013) [('tile_y', [-1, 3])],None,2
No: 9   GFLOPS: 0.00/91.06  result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735030.9819949) [('tile_y', [-1, 360])],None,29
No: 10  GFLOPS: 71.80/91.06 result: MeasureResult(costs=(0.0014729818,), error_no=0, all_cost=0.6799407005310059, timestamp=1590735031.3066556) [('tile_y', [-1, 24])],None,13
Finish loading 90 records
Time cost of this operator: 0.000682
```
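Even with only 10 random trials, the tuned operator drops from 2.7 ms to roughly 0.68 ms. Since the CPU space has just 32 candidates, exhausting it costs little more; a hypothetical variant of tune_task (GridSearchTuner is a standard autotvm tuner, but this function is not in the original script) would be:

```python
# Hypothetical variant: sweep all 32 CPU candidates instead of sampling 10.
def tune_task_full():
    task = autotvm.task.create("multimedia/rgb_yuv_trans", args=(w, h), target=target)
    measure_option = autotvm.measure_option(builder='local',
                                            runner=autotvm.LocalRunner(number=5))
    tuner = autotvm.tuner.GridSearchTuner(task)
    tuner.tune(n_trial=len(task.config_space),
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('rgb2yuv.log')])
```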
3. Results on the GPU
Running the same script with target set to 'cuda': the default schedule's time is printed first, followed by the autotuning log. Trials whose launch configuration is not valid on the GPU fail with InstantiationError: "Skipped because of invalid gpu kernel".

```
Time cost of this operator: 0.009534
Get devices for measurement successfully!
No: 1   GFLOPS: 0.00/0.00     result: MeasureResult(costs=(InstantiationError('... Skipped because of invalid gpu kernel'),), error_no=1, all_cost=0.0365450382232666, timestamp=1590743689.5894103) [('xy_split', [-1, 3072])],None,97
No: 2   GFLOPS: 0.00/0.00     result: MeasureResult(costs=(InstantiationError('... Skipped because of invalid gpu kernel'),), error_no=1, all_cost=0.03839516639709473, timestamp=1590743689.5899727) [('xy_split', [-1, 64800])],None,146
No: 3   GFLOPS: 0.00/0.00     result: MeasureResult(costs=(InstantiationError('... Skipped because of invalid gpu kernel'),), error_no=1, all_cost=0.039269208908081055, timestamp=1590743689.618169) [('xy_split', [-1, 57600])],None,145
No: 4   GFLOPS: 0.00/0.00     result: MeasureResult(costs=(InstantiationError('... Skipped because of invalid gpu kernel'),), error_no=1, all_cost=0.03884124755859375, timestamp=1590743689.6184874) [('xy_split', [-1, 103680])],None,151
No: 5   GFLOPS: 0.00/0.00     result: MeasureResult(costs=(InstantiationError('... Skipped because of invalid gpu kernel'),), error_no=1, all_cost=0.035482168197631836, timestamp=1590743689.6187103) [('xy_split', [-1, 2160])],None,90
No: 6   GFLOPS: 160.59/160.59 result: MeasureResult(costs=(0.000658552,), error_no=0, all_cost=1.1336748600006104, timestamp=1590743690.9209) [('xy_split', [-1, 80])],None,29
No: 7   GFLOPS: 0.00/160.59   result: MeasureResult(costs=(InstantiationError('... Skipped because of invalid gpu kernel'),), error_no=1, all_cost=0.03384804725646973, timestamp=1590743690.3713021) [('xy_split', [-1, 10800])],None,121
No: 8   GFLOPS: 155.19/160.59 result: MeasureResult(costs=(0.0006814266,), error_no=0, all_cost=1.0592899322509766, timestamp=1590743691.4001758) [('xy_split', [-1, 240])],None,47
No: 9   GFLOPS: 0.00/160.59   result: MeasureResult(costs=(InstantiationError('... Skipped because of invalid gpu kernel'),), error_no=1, all_cost=0.02752852439880371, timestamp=1590743691.6169024) [('xy_split', [-1, 1280])],None,79
No: 10  GFLOPS: 141.10/160.59 result: MeasureResult(costs=(0.0007495002,), error_no=0, all_cost=1.0822956562042236, timestamp=1590743692.7331798) [('xy_split', [-1, 405])],None,57
Finish loading 100 records
Time cost of this operator: 0.000554
```
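The failed trials are configurations whose inner xy_split factor (3072, 64800, ...) exceeds what CUDA allows for threadIdx.x, which is at most 1024 threads per block, so autotvm rejects the kernel before it ever runs. One way to avoid wasting trials on them, shown here only as a hypothetical tweak to the GPU branch of rgb_yuv_trans above, is to attach a filter when defining the split:

```python
# Hypothetical tweak (not in the original template): only keep xy_split
# candidates whose inner factor fits within CUDA's 1024-threads-per-block limit.
cfg.define_split("xy_split",
                 cfg.axis(x.dom.extent.value * y.dom.extent.value),
                 num_outputs=2,
                 filter=lambda entity: entity.size[-1] <= 1024)
```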
4. Summary
Looking at the final numbers, the GPU is not much faster than the CPU, which I find a bit suspicious; perhaps my optimization falls short, or there is some other reason. Suggestions and further exploration are welcome.
Finally, the complete Python file is attached:
