5.7. ROC Ufuncs and Generalized Ufuncs

Source: http://numba.pydata.org/numba-doc/latest/roc/ufunc.html

This page describes the ROC ufunc-like object.

In order to support the programming patterns of ROC programs, ROC Vectorize and GUVectorize cannot produce a conventional ufunc. Instead, a ufunc-like object is returned. This object is a close analog of, but not fully compatible with, a regular NumPy ufunc. The ROC ufunc adds support for passing intra-device arrays (already on the GPU device) to reduce traffic over the PCI-express bus. It also accepts a stream keyword for launching in asynchronous mode.
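For instance, a device array can be passed where a host array is expected, and a stream can be supplied at call time. Below is a minimal sketch of both features; the roc_add ufunc is a hypothetical example for illustration, and roc.device_array_like is assumed to allocate an uninitialized device-side result buffer as in Numba's device-array API:

    from numba import vectorize, roc
    import numpy as np

    @vectorize(['float32(float32, float32)'], target='roc')
    def roc_add(x, y):
        return x + y

    a = np.arange(16, dtype=np.float32)
    d_a = roc.to_device(a)                # already on the GPU; no copy per call
    d_out = roc.device_array_like(d_a)    # assumed: device-side result buffer
    stream = roc.stream()                 # launch asynchronously on this stream
    with stream.auto_synchronize():
        roc_add(d_a, d_a, out=d_out, stream=stream)
    result = d_out.copy_to_host()         # fetch the result after the stream syncs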

5.7.1. Basic ROC UFunc Example

    import math
    from numba import vectorize
    import numpy as np

    @vectorize(['float32(float32, float32, float32)',
                'float64(float64, float64, float64)'],
               target='roc')
    def roc_discriminant(a, b, c):
        return math.sqrt(b ** 2 - 4 * a * c)

    N = 10000
    dtype = np.float32

    # prepare the input
    A = np.array(np.random.sample(N), dtype=dtype)
    B = np.array(np.random.sample(N) + 10, dtype=dtype)
    C = np.array(np.random.sample(N), dtype=dtype)

    D = roc_discriminant(A, B, C)
    print(D)  # print result

5.7.2. Calling Device Functions from ROC UFuncs

All ROC ufunc kernels have the ability to call other ROC device functions:

    from numba import vectorize, roc

    # define a device function
    @roc.jit('float32(float32, float32, float32)', device=True)
    def roc_device_fn(x, y, z):
        return x ** y / z

    # define a ufunc that calls our device function
    @vectorize(['float32(float32, float32, float32)'], target='roc')
    def roc_ufunc(x, y, z):
        return roc_device_fn(x, y, z)
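As a quick usage sketch (hypothetical host data, assuming the definitions above), the returned ufunc-like object is called just like a NumPy ufunc:

    import numpy as np

    x = np.random.sample(64).astype(np.float32)
    y = np.full_like(x, 2.0)
    z = np.full_like(x, 3.0)
    print(roc_ufunc(x, y, z))  # elementwise x ** y / z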

5.7.3. Generalized ROC ufuncs

Generalized ufuncs may be executed on the GPU via ROC, analogous to the ROC ufunc functionality. This may be accomplished as follows:

    from numba import guvectorize

    @guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                 '(m,n),(n,p)->(m,p)', target='roc')
    def matmulcore(A, B, C):
        ...

See also

Matrix multiplication example
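The linked example contains the complete kernel. As a rough sketch of what the elided body might look like, a naive triple loop over the dimensions declared in the layout signature would suffice (illustrative only; the shapes m, n, p come from the '(m,n),(n,p)->(m,p)' signature):

    from numba import guvectorize

    @guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                 '(m,n),(n,p)->(m,p)', target='roc')
    def matmulcore(A, B, C):
        m, n = A.shape
        n, p = B.shape
        for i in range(m):
            for j in range(p):
                C[i, j] = 0.0  # initialize the output cell
                for k in range(n):
                    C[i, j] += A[i, k] * B[k, j]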

5.7.4. Async execution: A Chunk at a Time

Partitioning your data into chunks allows computation and memory transfer to be overlapped. This can increase the throughput of your ufunc and enables your ufunc to operate on data that is larger than the memory capacity of your GPU. For example:

    import math
    from numba import vectorize, roc
    import numpy as np

    # the ufunc kernel
    def discriminant(a, b, c):
        return math.sqrt(b ** 2 - 4 * a * c)

    roc_discriminant = vectorize(['float32(float32, float32, float32)'],
                                 target='roc')(discriminant)

    N = int(1e+8)
    dtype = np.float32

    # prepare the input
    A = np.array(np.random.sample(N), dtype=dtype)
    B = np.array(np.random.sample(N) + 10, dtype=dtype)
    C = np.array(np.random.sample(N), dtype=dtype)
    D = np.zeros(A.shape, dtype=A.dtype)

    # create a ROC stream
    stream = roc.stream()

    chunksize = int(1e+6)
    chunkcount = N // chunksize

    # partition numpy arrays into chunks
    # no copying is performed
    sA = np.split(A, chunkcount)
    sB = np.split(B, chunkcount)
    sC = np.split(C, chunkcount)
    sD = np.split(D, chunkcount)

    device_ptrs = []

    # helper function; async requires operating on coarse-grain memory regions
    def async_array(arr):
        coarse_arr = roc.coarsegrain_array(shape=arr.shape, dtype=arr.dtype)
        coarse_arr[:] = arr
        return coarse_arr

    with stream.auto_synchronize():
        # every operation in this context will be launched asynchronously
        # by using the ROC stream
        dchunks = []  # holds the result chunks

        # for each chunk
        for a, b, c, d in zip(sA, sB, sC, sD):
            # create coarse grain arrays
            asyncA = async_array(a)
            asyncB = async_array(b)
            asyncC = async_array(c)
            asyncD = async_array(d)

            # transfer to device
            dA = roc.to_device(asyncA, stream=stream)
            dB = roc.to_device(asyncB, stream=stream)
            dC = roc.to_device(asyncC, stream=stream)
            dD = roc.to_device(asyncD, stream=stream, copy=False)  # no copying

            # launch kernel
            roc_discriminant(dA, dB, dC, out=dD, stream=stream)

            # retrieve result
            dD.copy_to_host(asyncD, stream=stream)

            # store device pointers to prevent them from freeing before
            # the kernel is scheduled
            device_ptrs.extend([dA, dB, dC, dD])

            # store result reference
            dchunks.append(asyncD)

    # put result chunks into the output array 'D'
    for i, result in enumerate(dchunks):
        sD[i][:] = result[:]

    # data is ready at this point inside D
    print(D)
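Two details in this example are worth noting. The host-side staging buffers are allocated with roc.coarsegrain_array() because, as the helper's comment says, asynchronous operation requires coarse-grain memory regions. The device arrays are also appended to device_ptrs so that Python does not garbage-collect them before the asynchronously scheduled transfers and kernels have run; the stream.auto_synchronize() context manager then waits for all work queued on the stream when the with block exits, which is why D is only guaranteed to hold valid results after that point.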