https://www.bilibili.com/video/BV1B5411W7TW?from=search&seid=52486649454542281

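1 CUDA Python: programming basics and image processing.pdf

2 CUDA Python: memory management and convolution.py
2 CUDA-python: parallel computing basics, convolution and shared memory.pdf

3 CUDA-python: multi-stream execution and cuBLAS.ipynb
3 CUDA-python: multi-stream execution and cuBLAS.pdf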

day_1

```python
import math
import time

import cv2
import numpy as np
from numba import cuda


# GPU kernel: each thread processes one pixel (all of its channels)
@cuda.jit
def process_gpu(img, channels):
    # Global 2D thread index: x maps to image rows, y to image columns
    tx = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    ty = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    # The grid is rounded up, so some threads fall outside the image
    if tx >= img.shape[0] or ty >= img.shape[1]:
        return
    # all channels in one loop
    for c in range(channels):
        color = img[tx, ty, c] * 2.0 + 30
        if color > 255:
            img[tx, ty, c] = 255
        elif color < 0:
            img[tx, ty, c] = 0
        else:
            img[tx, ty, c] = color


# CPU reference: the same transform with plain Python loops
def process_cpu(img, dst):
    rows, cols, channels = img.shape
    for i in range(rows):
        for j in range(cols):
            for c in range(channels):
                color = img[i, j, c] * 2.0 + 30
                if color > 255:
                    dst[i, j, c] = 255
                elif color < 0:
                    dst[i, j, c] = 0
                else:
                    dst[i, j, c] = color


if __name__ == "__main__":
    # read the input image (BGR, uint8)
    img = cv2.imread("dog_test_101.jpg")
    if img is None:
        raise FileNotFoundError("dog_test_101.jpg not found")
    rows, cols, channels = img.shape
    dst_cpu = img.copy()

    start_cpu = time.time()
    process_cpu(img, dst_cpu)
    end_cpu = time.time()
    print("cpu process time: ", end_cpu - start_cpu)

    # gpu function: copy the image to device memory
    dImg = cuda.to_device(img)
    threadsperblock = (16, 16)
    blockspergrid_x = int(math.ceil(rows / threadsperblock[0]))
    blockspergrid_y = int(math.ceil(cols / threadsperblock[1]))
    blockspergrid = (blockspergrid_x, blockspergrid_y)

    # Warm-up launch on a throwaway copy so JIT compilation is not timed
    process_gpu[blockspergrid, threadsperblock](cuda.to_device(img), channels)
    # synchronize: wait for all GPU work to finish before starting the timer
    cuda.synchronize()
    start_gpu = time.time()
    process_gpu[blockspergrid, threadsperblock](dImg, channels)
    cuda.synchronize()  # kernel launches are asynchronous
    end_gpu = time.time()
    dst_gpu = dImg.copy_to_host()
    print("gpu process time: ", end_gpu - start_gpu)

    # save
    cv2.imwrite("result_cpu.jpg", dst_cpu)
    cv2.imwrite("result_gpu.jpg", dst_gpu)
    print("#### done")
```
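The kernel above can only run on a machine with an NVIDIA GPU, so it can help to check its indexing logic on the CPU first. The sketch below is not from the lecture, and `launch_emulated` and `process_numpy` are hypothetical helper names. It assumes a uint8 HxWx3 image. It loops over every (block, thread) pair using the same `blockIdx * blockDim + threadIdx` formula as `process_gpu`, including the out-of-range guard, and then compares the result with a vectorized NumPy version of the same "multiply by 2, add 30, clamp to [0, 255]" transform.

```python
import math

import numpy as np


def launch_emulated(img, threads_per_block=(16, 16)):
    """Run the day_1 kernel body on the CPU, one emulated thread at a time."""
    out = img.copy()
    rows, cols, channels = out.shape
    # Same rounded-up grid as blockspergrid in day_1
    grid = (math.ceil(rows / threads_per_block[0]),
            math.ceil(cols / threads_per_block[1]))
    for bx in range(grid[0]):
        for by in range(grid[1]):
            for lx in range(threads_per_block[0]):
                for ly in range(threads_per_block[1]):
                    # Global index, as in process_gpu
                    tx = bx * threads_per_block[0] + lx
                    ty = by * threads_per_block[1] + ly
                    if tx >= rows or ty >= cols:
                        continue  # threads past the image edge do nothing
                    for c in range(channels):
                        color = out[tx, ty, c] * 2.0 + 30
                        out[tx, ty, c] = min(max(color, 0), 255)
    return out


def process_numpy(img):
    """Vectorized equivalent: scale by 2, add 30, clamp to [0, 255]."""
    return np.clip(img.astype(np.float32) * 2.0 + 30, 0, 255).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 37x50 is not a multiple of 16, so the bounds check is exercised
    test_img = rng.integers(0, 256, size=(37, 50, 3)).astype(np.uint8)
    print(np.array_equal(launch_emulated(test_img), process_numpy(test_img)))
```

The NumPy version is also a convenient reference for checking `dst_gpu` against `dst_cpu` without waiting for the slow pure-Python loop.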

what_is_cuda


Heterogeneous computing

SM (streaming multiprocessor):
Every 16 cores share one instruction decoder.
Every 32 cores share one context and one memory.
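A related scheduling detail that these notes do not spell out is that the hardware runs the threads of a block in groups of 32 called warps. Following the CUDA programming guide, a 2D thread index (x, y) in a block of width blockDim.x has the linear ID x + y * blockDim.x, and each run of 32 consecutive IDs forms one warp. The pure-Python sketch below (the function name `warp_id` is hypothetical) applies this rule to the 16x16 blocks used in day_1.

```python
WARP_SIZE = 32  # warpSize on NVIDIA GPUs to date


def warp_id(tx: int, ty: int, block_dim_x: int) -> int:
    """Return the warp index of thread (tx, ty) within its block.

    The thread's linear ID inside the block is tx + ty * block_dim_x;
    consecutive groups of WARP_SIZE linear IDs form one warp.
    """
    return (tx + ty * block_dim_x) // WARP_SIZE


if __name__ == "__main__":
    block = (16, 16)  # threadsperblock from day_1
    warps = {warp_id(x, y, block[0]) for x in range(block[0]) for y in range(block[1])}
    print("warps per 16x16 block:", len(warps))  # 256 threads / 32 = 8
```

With 16-wide blocks, each warp therefore covers two consecutive values of threadIdx.y.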
Read-only (from kernels): constant memory, texture memory
All other memory spaces (global, shared, local, registers) are read-write.

cuda python


install cuda



end