TVM Notes (1)

Created: 2020-06-05 23:03
Updated: 2020-06-06 22:24
Tags: TVM
Source: https://zhuanlan.zhihu.com/p/144488522
Next, let's try out the flow described above with a real example. I picked a color-space conversion, RGB to YUV, which is common in vision pipelines.
The environment used by the code below, and a few constant definitions:

```python
import timeit

import numpy as np
import tvm
import tvm.testing
from tvm import te, autotvm

w = 1920
h = 1080
c = 3
target = 'llvm'
```

Hardware: Jetson Xavier
1. Performance with the default schedule
First, describe the computation with TE (Tensor Expression):
```python
def rgb_yuv_schedule(w, h, c=3):
    rgb = te.placeholder((w, h, c), name='rgb')

    def rgb2y(i, j, k):
        return rgb[i, j, 0]*0.257 + rgb[i, j, 1]*0.504 + rgb[i, j, 2]*0.098

    def rgb2u(i, j, k):
        return rgb[i, j, 0]*-0.148 + rgb[i, j, 1]*-0.291 + rgb[i, j, 2]*0.439

    def rgb2v(i, j, k):
        return rgb[i, j, 0]*0.439 + rgb[i, j, 1]*-0.368 + rgb[i, j, 2]*-0.071

    # pick the Y/U/V formula depending on the output channel k
    rgb2yuv = lambda i, j, k: te.if_then_else(
        k == 0, rgb2y(i, j, k),
        te.if_then_else(k == 1, rgb2u(i, j, k), rgb2v(i, j, k)))

    yuv = te.compute((w, h, c), rgb2yuv, name='yuv')
    s = te.create_schedule(yuv.op)
    return s, [rgb, yuv]
```
- te.placeholder defines the input.
- te.compute defines the computation rule; its two key parameters are:
  - shape, which gives the dimensions of the computed result;
  - fcompute, a computation function that is called once for every point of the output, as in the pseudocode below.

```
# te.compute(shape, fcompute) corresponds to this pseudocode for the rgb2yuv example above
for i, j, k in shape:
    result[i, j, k] = fcompute(i, j, k)
```

The operations used inside fcompute must all be ones TVM supports, such as if_then_else. The rgb2yuv lambda above is written rather clumsily; if there is a more elegant way, I'd be glad to hear it (one possibility is sketched below). Clumsy as the code is, after the compiler's constant optimizations it makes no real difference to performance.
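One possible cleaner formulation (a sketch only, not something run on the setup above): fold the three per-channel formulas into a 3x3 coefficient matrix passed in as an extra placeholder and accumulate over a reduction axis. Note this changes the function signature and turns the coefficients into a runtime input, so the constant-folding benefit mentioned above no longer applies.

```python
def rgb_yuv_schedule_reduce(w, h, c=3):
    # Sketch of an alternative: the coefficients become a (c, c) input tensor
    # instead of literals baked into nested if_then_else expressions.
    rgb = te.placeholder((w, h, c), name='rgb')
    coef = te.placeholder((c, c), name='coef')   # same layout as b_np used later in the benchmark
    r = te.reduce_axis((0, c), name='r')
    yuv = te.compute((w, h, c),
                     lambda i, j, k: te.sum(rgb[i, j, r] * coef[r, k], axis=r),
                     name='yuv')
    s = te.create_schedule(yuv.op)
    return s, [rgb, coef, yuv]
```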
First, the benchmarking code, adapted from Dive into Deep Learning Compiler:
```python
def bench_workload(workload):
    workload(1)  # warmup
    time = workload(1)  # the time to run once
    if time > 1:
        return time
    # The number of repeats to measure at least 1 second
    num_repeats = max(int(1.0 / time), 5)
    return workload(num_repeats) / num_repeats
```
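The contract here is that a workload takes a repeat count and returns the total wall time for that many runs; bench_workload then returns the average time of a single run. A minimal illustration (dummy_workload is a made-up stand-in, not from the post):

```python
import time

def dummy_workload(nrepeats):
    start = time.time()
    for _ in range(nrepeats):
        sum(range(100000))          # stand-in for the operator being timed
    return time.time() - start

print(bench_workload(dummy_workload))   # seconds per single run
```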
```python
def bench_tvm_func(func, name='rgb2yuv'):
    a_np = np.random.uniform(size=(w, h, c)).astype(np.float32)
    b_np = np.array([[0.257, -0.148, 0.439],
                     [0.504, -0.291, -0.368],
                     [0.098, 0.439, -0.071]]).astype(np.float32)
    c_np = a_np.dot(b_np)
    ctx = tvm.context(target, 0)
    a_tvm = tvm.nd.array(a_np, ctx=ctx)
    c_tvm = tvm.nd.empty(c_np.shape, ctx=ctx)
    # check correctness against the NumPy reference
    func(a_tvm, c_tvm)
    tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-3)

    def workload(nrepeats):
        evaluator = func.time_evaluator(func.entry_name, ctx, number=400)
        return evaluator(a_tvm, c_tvm).mean * nrepeats

    print('Time cost of this operator: %f' % bench_workload(workload))
    func.save('/tmp/%s.o' % name)
```
Now let's actually run it:

```python
def default_schedule():
    s, (rgb, yuv) = rgb_yuv_schedule(w, h)
    if target == 'opencl' or target == 'cuda':
        bx, tx = s[yuv].split(yuv.op.axis[1], factor=64)
        s[yuv].bind(bx, te.thread_axis("blockIdx.x"))
        s[yuv].bind(tx, te.thread_axis("threadIdx.x"))
    func = tvm.build(s, [rgb, yuv], target=target, name='rgb2yuv')
    bench_tvm_func(func, name='rgb2yuv_default')

default_schedule()
```
Notes:

- tvm.build compiles the computation described with TE into code that actually runs on the target (the lowered form can be inspected, as sketched right after this list).
- The compiled func has the prototype func(*args_buf), i.e. func(rgb, yuv).
- tvm.nd.array is used in much the same way as NumPy arrays; the context specifies whether the memory lives on the CPU or on the GPU.
- When the target is a GPU, CUDA limits the number of threads, so the axis is split first.
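As a quick way to see what a schedule will generate before building it, the lowered TIR can be printed (a standard TVM call, shown here as a sketch rather than output from the run above):

```python
s, (rgb, yuv) = rgb_yuv_schedule(w, h)
print(tvm.lower(s, [rgb, yuv], simple_mode=True))   # prints the loop nest TVM will compile
```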
The result on Jetson Xavier is 2.7 ms:

```
Time cost of this operator: 0.002705
```
For comparison, here is a NumPy implementation with the same benchmark:
```python
timer = timeit.Timer(
    setup='import numpy as np\n'
          'a_np = np.random.uniform(size=(%d, %d, %d)).astype(np.float32)\n'
          'b_np = np.array([[0.257,-0.148,0.439],[0.504,-0.291,-0.368],[0.098,0.439,-0.071]]).astype(np.float32)' % (w, h, c),
    stmt='c_np = a_np.dot(b_np)')
print('time for np.dot((%d, %d)):' % (w, h), bench_workload(timer.timeit))
```
In my environment this comes out at about 120 ms, which is excessively large; I suspect some Python-side overhead but have not dug into it.
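One quick cross-check that the 120 ms invites (a sketch, not something run on the Jetson in this post): express the same conversion as three broadcast weighted sums instead of routing a (w*h, 3) x (3, 3) matrix product through np.dot, which often times very differently for this shape.

```python
a_np = np.random.uniform(size=(w, h, c)).astype(np.float32)
b_np = np.array([[0.257, -0.148, 0.439],
                 [0.504, -0.291, -0.368],
                 [0.098, 0.439, -0.071]], dtype=np.float32)

# Three broadcast multiply-adds over the channel axis, same math as a_np.dot(b_np).
c_np = (a_np[..., 0:1] * b_np[0] +
        a_np[..., 1:2] * b_np[1] +
        a_np[..., 2:3] * b_np[2])
np.testing.assert_allclose(c_np, a_np.dot(b_np), rtol=1e-4)
```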
2. Applying autotuning on the CPU
```python
@autotvm.template("multimedia/rgb_yuv_trans")
def rgb_yuv_trans(w, h, c=3):
    s, (rgb, yuv) = rgb_yuv_schedule(w, h, c)
    x, y = yuv.op.axis[0], yuv.op.axis[1]
    cfg = autotvm.get_config()
    if target == 'opencl' or target == 'cuda':
        xy = s[yuv].fuse(x, y)
        #### define space begin ####
        cfg.define_split("xy_split", cfg.axis(x.dom.extent.value * y.dom.extent.value), num_outputs=2)
        #### define space end ####
        bx, tx = cfg["xy_split"].apply(s, yuv, xy)
        s[yuv].bind(bx, te.thread_axis("blockIdx.x"))
        s[yuv].bind(tx, te.thread_axis("threadIdx.x"))
        s[yuv].unroll(yuv.op.axis[2])
    else:
        #### define space begin ####
        cfg.define_split("tile_y", y, num_outputs=2)
        #### define space end ####
        yo, yi = cfg["tile_y"].apply(s, yuv, y)
        xy = s[yuv].fuse(x, yo)
        s[yuv].parallel(xy)
        s[yuv].vectorize(yi)
        s[yuv].unroll(yuv.op.axis[2])
    return s, [rgb, yuv]
```
Notes:

- The autotvm.template decorator marks this as a tunable function.
- xy_split is defined via cfg.axis directly; passing xy itself errors out because of a TVM limitation.
- The CPU and GPU cases are handled by separate branches here.
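For reference, cfg.define_split with num_outputs=2 enumerates every (outer, inner) factor pair of the axis it is given, which is why the CPU search space printed below has 32 entries: 1080 has exactly 32 divisors. A small illustration (plain Python, not TVM API):

```python
# define_split("tile_y", y, num_outputs=2) enumerates (yo, yi) factor pairs of len(y) = 1080.
divisors = [d for d in range(1, 1081) if 1080 % d == 0]
print(len(divisors))                              # 32, matching "len=32" in the output below
print([(1080 // d, d) for d in divisors][:5])     # the first few (yo, yi) candidates
```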
Tune and evaluate:
```python
def tune_task():
    task = autotvm.task.create("multimedia/rgb_yuv_trans", args=(w, h), target=target)
    print(task.config_space)
    measure_option = autotvm.measure_option(builder='local', runner=autotvm.LocalRunner(number=5))
    tuner = autotvm.tuner.RandomTuner(task)
    if 0:  # flip to 1 to search the whole space
        n_trial = len(task.config_space)
    else:
        n_trial = 10
    tuner.tune(n_trial=n_trial,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('rgb2yuv.log')])

def evaluate_task():
    # apply history best from log file
    with autotvm.apply_history_best('rgb2yuv.log'):
        with tvm.target.create(target):
            s, arg_bufs = rgb_yuv_trans(w, h)
            func = tvm.build(s, arg_bufs, name='rgb2yuv')
            bench_tvm_func(func)

import logging
import sys
logging.getLogger('autotvm').setLevel(logging.DEBUG)
logging.getLogger('autotvm').addHandler(logging.StreamHandler(sys.stdout))

tune_task()
evaluate_task()
```
To keep the run short, n_trial is set to 10; the output:

```
ConfigSpace (len=32, space_map=
0 tile_y: Split(policy=factors, product=1080, num_outputs=2) len=32
)
Get devices for measurement successfully!
No: 1 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735018.7626786) [('tile_y', [-1, 540])],None,30
No: 2 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735018.7632723) [('tile_y', [-1, 216])],None,27
No: 3 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735018.787355) [('tile_y', [-1, 270])],None,28
No: 4 GFLOPS: 25.26/25.26 result: MeasureResult(costs=(0.004187388,), error_no=0, all_cost=8.747061729431152, timestamp=1590735019.2084217) [('tile_y', [-1, 180])],None,26
No: 5 GFLOPS: 74.88/74.88 result: MeasureResult(costs=(0.001412358,), error_no=0, all_cost=0.41734933853149414, timestamp=1590735019.6017284) [('tile_y', [-1, 4])],None,3
No: 6 GFLOPS: 66.83/74.88 result: MeasureResult(costs=(0.0015824143999999998,), error_no=0, all_cost=0.4230306148529053, timestamp=1590735019.9837952) [('tile_y', [-1, 5])],None,4
No: 7 GFLOPS: 91.06/91.06 result: MeasureResult(costs=(0.0011614144,), error_no=0, all_cost=0.5114855766296387, timestamp=1590735020.364343) [('tile_y', [-1, 12])],None,9
No: 8 GFLOPS: 89.31/91.06 result: MeasureResult(costs=(0.0011841097999999999,), error_no=0, all_cost=0.38067150115966797, timestamp=1590735020.7485013) [('tile_y', [-1, 3])],None,2
No: 9 GFLOPS: 0.00/91.06 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1590735030.9819949) [('tile_y', [-1, 360])],None,29
No: 10 GFLOPS: 71.80/91.06 result: MeasureResult(costs=(0.0014729818,), error_no=0, all_cost=0.6799407005310059, timestamp=1590735031.3066556) [('tile_y', [-1, 24])],None,13
Finish loading 90 records
Time cost of this operator: 0.000682
```
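Ten random trials only scratch the 32-point space. As a hedged variant (not what was actually run here), the tuner lines inside tune_task() could be swapped for the model-based tuner to search more of it:

```python
# Hypothetical swap inside tune_task(): model-based tuner over (up to) the whole space.
tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(n_trial=min(32, len(task.config_space)),
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file('rgb2yuv.log')])
```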
3. Results on the GPU

With the GPU target, the default schedule gives:

```
Time cost of this operator: 0.009534
```

The autotuning output, again with n_trial = 10:

```
Get devices for measurement successfully!
No: 1 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last): ... Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.0365450382232666, timestamp=1590743689.5894103) [('xy_split', [-1, 3072])],None,97
No: 2 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last): ... Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.03839516639709473, timestamp=1590743689.5899727) [('xy_split', [-1, 64800])],None,146
No: 3 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last): ... Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.039269208908081055, timestamp=1590743689.618169) [('xy_split', [-1, 57600])],None,145
No: 4 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last): ... Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.03884124755859375, timestamp=1590743689.6184874) [('xy_split', [-1, 103680])],None,151
No: 5 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last): ... Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.035482168197631836, timestamp=1590743689.6187103) [('xy_split', [-1, 2160])],None,90
No: 6 GFLOPS: 160.59/160.59 result: MeasureResult(costs=(0.000658552,), error_no=0, all_cost=1.1336748600006104, timestamp=1590743690.9209) [('xy_split', [-1, 80])],None,29
No: 7 GFLOPS: 0.00/160.59 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last): ... Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.03384804725646973, timestamp=1590743690.3713021) [('xy_split', [-1, 10800])],None,121
No: 8 GFLOPS: 155.19/160.59 result: MeasureResult(costs=(0.0006814266,), error_no=0, all_cost=1.0592899322509766, timestamp=1590743691.4001758) [('xy_split', [-1, 240])],None,47
No: 9 GFLOPS: 0.00/160.59 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last): ... Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.02752852439880371, timestamp=1590743691.6169024) [('xy_split', [-1, 1280])],None,79
No: 10 GFLOPS: 141.10/160.59 result: MeasureResult(costs=(0.0007495002,), error_no=0, all_cost=1.0822956562042236, timestamp=1590743692.7331798) [('xy_split', [-1, 405])],None,57
Finish loading 100 records
Time cost of this operator: 0.000554
```
4. Summary

In the final results the GPU is not dramatically faster than the CPU, which leaves me somewhat doubtful; it may be that my optimization falls short, or there may be some other cause (a rough back-of-envelope check is sketched below). Suggestions and further exploration are welcome.
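A rough back-of-envelope estimate of my own (assuming a Jetson AGX Xavier with roughly 137 GB/s of LPDDR4x bandwidth shared between CPU and GPU): the operator reads about 25 MB and writes about 25 MB per call, and at the measured times both versions already sit near that bandwidth, which would make the kernel memory-bound and could explain why the GPU cannot pull far ahead of the CPU.

```python
# Back-of-envelope: is rgb2yuv memory-bandwidth-bound on Xavier?
bytes_moved = 1920 * 1080 * 3 * 4 * 2            # read rgb + write yuv, float32 (~49.8 MB)
for label, t in [('CPU tuned', 0.000682), ('GPU tuned', 0.000554)]:
    print(label, bytes_moved / t / 1e9, 'GB/s')  # ~73 GB/s and ~90 GB/s vs ~137 GB/s peak (assumed)
```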
Finally, the complete Python file is attached: