GPU Arrays
Vector Types
The GPUArray Array Class
Constructing GPUArray Instances
Conditionals
Reductions
Elementwise Functions on GPUArray Instances
The pycuda.cumath module contains elementwise workalikes for the functions contained in math.
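For example, a minimal sketch (assuming a CUDA context set up by pycuda.autoinit):

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
import numpy as np

# cumath.sin applies elementwise on the device, mirroring numpy.sin on the host
x = gpuarray.to_gpu(np.linspace(0, np.pi, 8).astype(np.float32))
assert np.allclose(cumath.sin(x).get(), np.sin(x.get()), atol=1e-6)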
Rounding and Absolute Value
Exponentials, Logarithms and Roots
Trigonometric Functions
Hyperbolic Functions
Floating Point Decomposition and Assembly
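For instance, frexp splits each entry into a significand and an exponent, and ldexp reassembles them. A small sketch, continuing the imports above:

x = gpuarray.to_gpu(np.array([0.5, 3.0, 20.0], dtype=np.float32))
# frexp returns a (significands, exponents) pair with x == significand * 2**exponent
sig, expt = cumath.frexp(x)
# ldexp performs the inverse assembly
assert np.allclose(cumath.ldexp(sig, expt).get(), x.get())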
Generating Arrays of Random Numbers
Warning
The following classes use random number generators that run on the GPU. Each thread uses its own generator, and creating these generators requires more resources than the subsequent generation of random numbers. Experiments suggest that the maximum number of active generators on Tesla devices (compute capability 1.x) is 256; Fermi devices allow creating 1024 generators without any problems. If you have trouble creating instances of PseudoRandomNumberGenerator or QuasiRandomNumberGenerator, decrease the number of created generators (and therefore the number of active threads).
A pseudorandom sequence of numbers satisfies most of the statistical properties of a truly random sequence but is generated by a deterministic algorithm. A quasirandom sequence of n-dimensional points is generated by a deterministic algorithm designed to fill an n-dimensional space evenly.
Quasirandom numbers are more expensive to generate.
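For simple needs, uniformly distributed arrays can be obtained with pycuda.curandom.rand. A minimal sketch, assuming pycuda.autoinit provides the context:

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
from pycuda.curandom import rand as curand

# fill a GPUArray with uniformly distributed pseudorandom floats in [0, 1)
a_gpu = curand((200,), dtype=np.float32)
h = a_gpu.get()
assert (h >= 0).all() and (h < 1).all()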
Single-pass Custom Expression Evaluation
Evaluating involved expressions on GPUArray instances can be somewhat inefficient, because a new temporary is created for each intermediate result. The module pycuda.elementwise contains tools to help generate kernels that evaluate multi-stage expressions on one or several operands in a single pass.
Here’s a usage example:
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.curandom import rand as curand

# two random input vectors on the GPU
a_gpu = curand((50,))
b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]",
        "linear_combination")

# compute z = a*x + b*y in a single pass, with no intermediate temporaries
c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

# verify against the overloaded GPUArray operators, which do create temporaries
import numpy.linalg as la
assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
(You can find this example as examples/demo_elementwise.py in the PyCUDA distribution.)
Custom Reductions
Here’s a usage example:
from pycuda.reduction import ReductionKernel

a = gpuarray.arange(400, dtype=numpy.float32)
b = gpuarray.arange(400, dtype=numpy.float32)
# dot product: map to x[i]*y[i], then reduce with "+" (neutral element 0)
krnl = ReductionKernel(numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="float *x, float *y")
my_dot_prod = krnl(a, b).get()
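The reduction result lives on the device until .get() fetches it. As a quick sanity check, reusing the arrays above, it can be compared against numpy's dot product:

assert numpy.allclose(my_dot_prod, numpy.dot(a.get(), b.get()))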
Parallel Scan / Prefix Sum
Here’s a usage example:
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.scan import InclusiveScanKernel

knl = InclusiveScanKernel(np.int32, "a+b")
n = 2**20-2**18+5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = gpuarray.to_gpu(host_data)  # PyCUDA's to_gpu takes only the host array
knl(dev_data)  # the scan runs in place
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
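pycuda.scan also provides ExclusiveScanKernel, whose output at index i omits the input at i. A minimal sketch, reusing host_data from above (for "+" the neutral element is "0"):

from pycuda.scan import ExclusiveScanKernel

eknl = ExclusiveScanKernel(np.int32, "a+b", neutral="0")
dev_data = gpuarray.to_gpu(host_data)  # fresh copy; the scan runs in place
eknl(dev_data)
result = dev_data.get()
assert result[0] == 0
assert (result[1:] == np.cumsum(host_data)[:-1]).all()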
Custom data types in Reduction and Scan
If you would like to use your own (struct/union/whatever) data types in scan and reduction, define those types in the preamble and let PyCUDA know about them using pycuda.tools.register_dtype(dtype, name).
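For illustration, here is a hedged sketch of a combined min/max reduction over a two-float struct, modeled on the analogous PyOpenCL example. The struct name minmax_pair and the helper functions are invented for this sketch, and newer PyCUDA versions expose the registration as pycuda.tools.get_or_register_dtype:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.tools import register_dtype
from pycuda.reduction import ReductionKernel

# host-side dtype matching the C struct defined in the preamble below
minmax_dtype = np.dtype([("cur_min", np.float32), ("cur_max", np.float32)])
register_dtype(minmax_dtype, "minmax_pair")

preamble = """
struct minmax_pair { float cur_min; float cur_max; };

__device__ minmax_pair mmp_neutral()
{ minmax_pair r; r.cur_min = 1e38f; r.cur_max = -1e38f; return r; }

__device__ minmax_pair mmp_from_scalar(float x)
{ minmax_pair r; r.cur_min = x; r.cur_max = x; return r; }

__device__ minmax_pair agg_mmp(minmax_pair a, minmax_pair b)
{
    minmax_pair r;
    r.cur_min = fminf(a.cur_min, b.cur_min);
    r.cur_max = fmaxf(a.cur_max, b.cur_max);
    return r;
}
"""

krnl = ReductionKernel(minmax_dtype, neutral="mmp_neutral()",
        reduce_expr="agg_mmp(a, b)", map_expr="mmp_from_scalar(x[i])",
        arguments="float *x", preamble=preamble)

a = gpuarray.to_gpu(np.random.randn(1000).astype(np.float32))
minmax = krnl(a).get()
assert minmax["cur_min"] == a.get().min()
assert minmax["cur_max"] == a.get().max()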
GPGPU Algorithms
Bogdan Opanchuk’s reikna offers a variety of GPU-based algorithms (FFT, RNG, matrix multiplication) designed to work with pycuda.gpuarray.GPUArray objects.