5.7. ROC Ufuncs and Generalized Ufuncs
This page describes the ROC ufunc-like object.
To support the programming pattern of ROC programs, ROC Vectorize and GUVectorize cannot produce a conventional ufunc. Instead, a ufunc-like object is returned. This object is a close analog of, but not fully compatible with, a regular NumPy ufunc. The ROC ufunc adds support for passing intra-device arrays (already on the GPU device) to reduce traffic over the PCI-express bus. It also accepts a stream keyword for launching in asynchronous mode.
5.7.1. Basic ROC UFunc Example
import math

from numba import vectorize
import numpy as np

@vectorize(['float32(float32, float32, float32)',
            'float64(float64, float64, float64)'],
           target='roc')
def roc_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)

N = 10000
dtype = np.float32

# prepare the input
A = np.array(np.random.sample(N), dtype=dtype)
B = np.array(np.random.sample(N) + 10, dtype=dtype)
C = np.array(np.random.sample(N), dtype=dtype)

D = roc_discriminant(A, B, C)

print(D)  # print result
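The returned ufunc-like object also accepts device-resident arrays directly, as mentioned above. The following is a minimal sketch that reuses A, B and C from the example and only the roc calls that also appear in the asynchronous example later on this page (roc.to_device with and without copy, copy_to_host, the out keyword); it is illustrative rather than authoritative, and the stream keyword is demonstrated in section 5.7.4:

from numba import roc

# copy the inputs to the GPU once; the returned device arrays can be
# passed straight to the ufunc-like object, avoiding extra PCI-express traffic
dA = roc.to_device(A)
dB = roc.to_device(B)
dC = roc.to_device(C)

# allocate the output on the device without copying any host data
D = np.zeros_like(A)
dD = roc.to_device(D, copy=False)

# the kernel reads from and writes to device memory only
roc_discriminant(dA, dB, dC, out=dD)

# bring the result back to the host explicitly
dD.copy_to_host(D)
print(D)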
5.7.2. Calling Device Functions from ROC UFuncs
All ROC ufunc kernels have the ability to call other ROC device functions:
from numba import vectorize, roc

# define a device function
@roc.jit('float32(float32, float32, float32)', device=True)
def roc_device_fn(x, y, z):
    return x ** y / z

# define a ufunc that calls our device function
@vectorize(['float32(float32, float32, float32)'], target='roc')
def roc_ufunc(x, y, z):
    return roc_device_fn(x, y, z)
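The resulting ufunc is then invoked like any other ROC ufunc, just as in the basic example above. The input arrays below are hypothetical and used only for illustration:

import numpy as np

# hypothetical inputs for illustration
x = np.arange(1, 10, dtype=np.float32)
y = np.full_like(x, 2.0)
z = np.full_like(x, 3.0)

print(roc_ufunc(x, y, z))  # elementwise x ** y / z, computed via the device function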
5.7.3. Generalized ROC ufuncs
Generalized ufuncs may be executed on the GPU via ROC, analogous to the ROC ufunc functionality. This can be done as follows:
from numba import guvectorize

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
             '(m,n),(n,p)->(m,p)', target='roc')
def matmulcore(A, B, C):
    ...
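The kernel body is elided above. For completeness, a self-contained sketch with a conventional triple-loop matrix-multiplication core and an example call on host arrays is given below; the loop body, the array shapes, and the assumption that the output C is allocated automatically when not passed in are illustrative and not taken from this page:

import numpy as np
from numba import guvectorize

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
             '(m,n),(n,p)->(m,p)', target='roc')
def matmulcore(A, B, C):
    m, n = A.shape
    n, p = B.shape
    # plain triple-loop matrix multiplication, writing into the output C
    for i in range(m):
        for j in range(p):
            acc = 0.0
            for k in range(n):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc

# hypothetical host inputs for illustration
A = np.arange(2 * 3, dtype=np.float32).reshape(2, 3)
B = np.arange(3 * 4, dtype=np.float32).reshape(3, 4)
C = matmulcore(A, B)  # output left to be allocated by the ufunc machinery
print(C)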
5.7.4. Asynchronous execution: A Chunk at a Time
Partitioning your data into chunks allows computation and memory transfer to overlap. This can increase the throughput of your ufunc and enables your ufunc to operate on data that is larger than the memory capacity of your GPU. For example:
import math
from numba import vectorize, roc
import numpy as np

# the ufunc kernel
def discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)

roc_discriminant = vectorize(['float32(float32, float32, float32)'],
                             target='roc')(discriminant)

N = int(1e+8)
dtype = np.float32

# prepare the input
A = np.array(np.random.sample(N), dtype=dtype)
B = np.array(np.random.sample(N) + 10, dtype=dtype)
C = np.array(np.random.sample(N), dtype=dtype)
D = np.zeros(A.shape, dtype=A.dtype)

# create a ROC stream
stream = roc.stream()

chunksize = 1e+6
chunkcount = N // chunksize

# partition numpy arrays into chunks
# no copying is performed
sA = np.split(A, chunkcount)
sB = np.split(B, chunkcount)
sC = np.split(C, chunkcount)
sD = np.split(D, chunkcount)

device_ptrs = []

# helper function, async requires operation on coarsegrain memory regions
def async_array(arr):
    coarse_arr = roc.coarsegrain_array(shape=arr.shape, dtype=arr.dtype)
    coarse_arr[:] = arr
    return coarse_arr

with stream.auto_synchronize():
    # every operation in this context will be launched asynchronously
    # by using the ROC stream

    dchunks = []  # holds the result chunks

    # for each chunk
    for a, b, c, d in zip(sA, sB, sC, sD):
        # create coarse grain arrays
        asyncA = async_array(a)
        asyncB = async_array(b)
        asyncC = async_array(c)
        asyncD = async_array(d)

        # transfer to device
        dA = roc.to_device(asyncA, stream=stream)
        dB = roc.to_device(asyncB, stream=stream)
        dC = roc.to_device(asyncC, stream=stream)
        dD = roc.to_device(asyncD, stream=stream, copy=False)  # no copying

        # launch kernel
        roc_discriminant(dA, dB, dC, out=dD, stream=stream)

        # retrieve result
        dD.copy_to_host(asyncD, stream=stream)

        # store device pointers to prevent them from freeing before
        # the kernel is scheduled
        device_ptrs.extend([dA, dB, dC, dD])

        # store result reference
        dchunks.append(asyncD)

# put result chunks into the output array 'D'
for i, result in enumerate(dchunks):
    sD[i][:] = result[:]

# data is ready at this point inside D
print(D)
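When the stream.auto_synchronize() block exits, the stream is synchronized, so all queued transfers and kernel launches have completed before the result chunks are copied back into D. Keeping references to the device arrays in device_ptrs until that point prevents their memory from being released while asynchronous operations that use them are still pending, as the comments in the example note.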
