GPU Arrays

Vector Types

The GPUArray Array Class

Constructing GPUArray Instances

Conditionals

Reductions

Elementwise Functions on GPUArray Instances

The pycuda.cumath module contains elementwise workalikes for the functions in Python's standard math module, applied across all entries of a GPUArray.
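For example, a minimal sketch using pycuda.cumath.exp, the GPU counterpart of math.exp:

    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    import pycuda.cumath as cumath
    import numpy as np

    # apply exp to every entry of the GPUArray, analogous to math.exp
    a_gpu = gpuarray.to_gpu(np.linspace(0, 1, 100).astype(np.float32))
    b_gpu = cumath.exp(a_gpu)
    assert np.allclose(b_gpu.get(), np.exp(a_gpu.get()))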

Rounding and Absolute Value

Exponentials, Logarithms and Roots

Trigonometric Functions

Hyperbolic Functions

Floating Point Decomposition and Assembly

Generating Arrays of Random Numbers

Warning

The following classes use random number generators that run on the GPU. Each thread uses its own generator, and creating these generators requires more resources than subsequently generating random numbers with them. After experimentation, the maximum number of active generators on Tesla devices (with compute capability 1.x) appears to be 256; Fermi devices allow for creating 1024 generators without any problems. If you have trouble creating objects of class PseudoRandomNumberGenerator or QuasiRandomNumberGenerator, decrease the number of created generators (and therefore the number of active threads).

A pseudorandom sequence of numbers satisfies most of the statistical properties of a truly random sequence but is generated by a deterministic algorithm. A quasirandom sequence of n-dimensional points is generated by a deterministic algorithm designed to fill an n-dimensional space evenly.

Quasirandom numbers are more expensive to generate than pseudorandom ones.
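For a quick way to obtain random data without managing generators yourself, pycuda.curandom.rand (also used in the examples below) fills a new GPUArray with uniformly distributed pseudorandom numbers; a minimal sketch:

    import pycuda.autoinit
    from pycuda.curandom import rand as curand
    import numpy as np

    # 10000 uniform pseudorandom floats in [0, 1), generated on the GPU
    a_gpu = curand((10000,), dtype=np.float32)
    print(a_gpu.get().mean())  # should be close to 0.5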

Single-pass Custom Expression Evaluation

Evaluating involved expressions on GPUArray instances can be somewhat inefficient, because a new temporary is created for each intermediate result. The functionality in the module pycuda.elementwise contains tools to help generate kernels that evaluate multi-stage expressions on one or several operands in a single pass.

Here’s a usage example:

    import pycuda.gpuarray as gpuarray
    import pycuda.driver as cuda
    import pycuda.autoinit
    import numpy
    from pycuda.curandom import rand as curand

    # two random input vectors on the GPU
    a_gpu = curand((50,))
    b_gpu = curand((50,))

    from pycuda.elementwise import ElementwiseKernel
    lin_comb = ElementwiseKernel(
            "float a, float *x, float b, float *y, float *z",
            "z[i] = a*x[i] + b*y[i]",
            "linear_combination")

    # compute z = 5*x + 6*y in a single pass, with no temporaries
    c_gpu = gpuarray.empty_like(a_gpu)
    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

    import numpy.linalg as la
    assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5

(You can find this example as examples/demo_elementwise.py in the PyCUDA distribution.)

Custom Reductions

Here’s a usage example, computing a dot product with a ReductionKernel (from pycuda.reduction):

    import numpy
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.reduction import ReductionKernel

    a = gpuarray.arange(400, dtype=numpy.float32)
    b = gpuarray.arange(400, dtype=numpy.float32)

    # dot product: map x[i]*y[i] over both inputs, then reduce with a+b
    krnl = ReductionKernel(numpy.float32, neutral="0",
            reduce_expr="a+b", map_expr="x[i]*y[i]",
            arguments="float *x, float *y")

    my_dot_prod = krnl(a, b).get()

Parallel Scan / Prefix Sum

Here’s a usage example, computing a cumulative sum with an InclusiveScanKernel (from pycuda.scan):

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.scan import InclusiveScanKernel

    knl = InclusiveScanKernel(np.int32, "a+b")
    n = 2**20 - 2**18 + 5
    host_data = np.random.randint(0, 10, n).astype(np.int32)
    dev_data = gpuarray.to_gpu(host_data)

    knl(dev_data)  # the scan is performed in place
    assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()

Custom data types in Reduction and Scan

If you would like to use your own (struct/union/whatever) data types in scan and reduction, define those types in the preamble and let PyCUDA know about them using the registration helper pycuda.tools.register_dtype(dtype, name).
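A minimal sketch of the registration step (the struct name and fields here are hypothetical, and a matching C definition of the struct must appear in the preamble passed to the kernel):

    import numpy as np
    from pycuda.tools import register_dtype

    # hypothetical record type for, e.g., an argmin-style reduction;
    # the kernel preamble must define the matching C type:
    #     struct min_pair { float value; int index; };
    min_pair = np.dtype([("value", np.float32), ("index", np.int32)])
    register_dtype(min_pair, "min_pair")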

GPGPU Algorithms

Bogdan Opanchuk’s reikna offers a variety of GPU-based algorithms (FFT, RNG, matrix multiplication) designed to work with pycuda.gpuarray.GPUArray objects.