Lecture 2: Three forms of parallelism - Programming Parallel Computers

Three forms of parallelism

Multicore parallelism: multiple threads
• the CPU has multiple streams of instructions to process (“threads”)
• each core can do useful work
Instruction-level parallelism: independent operations
• each CPU core processes its instruction stream as fast as possible
• all arithmetic units can do useful work in every clock cycle
Vector operations: vector instructions
• each instruction does multiple similar operations in parallel
• all “lanes” of arithmetic units do useful work

Different scales

Multicore parallelism:
• very coarse-grained
• executing e.g. entire subroutines in parallel
• amount of work per independent unit: e.g. 1 million multiplications
Instruction-level parallelism:
• very fine-grained
• executing machine language instructions in parallel
• amount of work per independent unit: e.g. 1 multiplication
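
To make the fine-grained end of the scale concrete, here is a minimal sketch (not from the lecture; the function names are hypothetical) of the same sum written with one dependency chain and with four independent chains. In the second version the four additions of one iteration do not depend on each other, so a core has the chance to work on them in the same clock cycles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One dependency chain: each addition has to wait for the previous one.
double sum_one_chain(const std::vector<double>& x) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        s += x[i];
    }
    return s;
}

// Four independent chains: the four additions inside one iteration
// do not depend on each other.
// Assumption made for brevity: x.size() is a multiple of 4.
double sum_four_chains(const std::vector<double>& x) {
    assert(x.size() % 4 == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (std::size_t i = 0; i < x.size(); i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    // Combine the partial sums at the end.
    return (s0 + s1) + (s2 + s3);
}
```

Note that the second version adds the numbers in a different order, so with general floating point data the result may differ slightly in rounding.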

Loop scheduling in OpenMP

OpenMP: multithreading made easy

Whenever you ask OpenMP to parallelize a for loop, it is your responsibility to make sure it is safe.

For example, parallelizing the following loop is safe if you can safely execute the operations c(0), c(1), …, c(9) simultaneously in parallel:
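
A minimal sketch of such a loop is given here. The body of c and its extra output parameter are assumptions made for illustration: this stand-in writes only to its own array element, so no two iterations touch the same data and the calls can safely run at the same time.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the operation c(i): it writes only to
// element i of the output, so different values of i never share data.
void c(std::vector<int>& out, int i) {
    assert(0 <= i && i < static_cast<int>(out.size()));
    out[i] = i * i;
}

std::vector<int> run_all() {
    std::vector<int> out(10, 0);
    // OpenMP divides the iterations 0..9 among the threads.
    // If OpenMP is not enabled at compile time (e.g. no -fopenmp with GCC),
    // the pragma is ignored and the loop runs sequentially.
    #pragma omp parallel for
    for (int i = 0; i < 10; ++i) {
        c(out, i);
    }
    return out;
}
```

If c instead updated a shared variable (say, a common running total), the same pragma would introduce a data race, and the loop would no longer be safe to parallelize in this way.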

Vector instructions

When we do not have any bottlenecks, a modern CPU is typically able to execute 3–4 machine language instructions per clock cycle per core. Nevertheless, the CPU that we use in our examples is able to perform up to 32 floating point operations per clock cycle per core. To achieve this, we will clearly need to do a lot of work with one instruction.

Hence there is a lot of computing power available, but this does not come for free:

• We have to reorganize our algorithm so that we can exploit elementwise operations. This naturally requires that there are lots of similar, independent operations that we need to run at each step.
• We have to instruct the compiler to generate code that makes use of the vector registers and vector instructions.
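
As a rough sketch of what an elementwise operation looks like in code, the example below uses the vector extensions of GCC and Clang. The type name float8_t and the function names are assumptions chosen for this illustration, and whether actual vector instructions are generated depends on the compiler flags (for example -march=native or -mavx with GCC).

```cpp
#include <cassert>

// GCC/Clang vector extension: 8 floats packed into one 256-bit value.
typedef float float8_t __attribute__ ((vector_size (8 * sizeof(float))));

// Scalar version: 8 * n separate additions, one float at a time.
void add_scalar(const float* a, const float* b, float* r, int n) {
    assert(n >= 0);
    for (int i = 0; i < 8 * n; ++i) {
        r[i] = a[i] + b[i];
    }
}

// Vector version: n elementwise additions, each of which adds
// all 8 lanes of a[i] and b[i] in one operation.
void add_vector(const float8_t* a, const float8_t* b, float8_t* r, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        r[i] = a[i] + b[i];
    }
}
```

The reorganization step is visible in the data layout: the vector version expects the input to be grouped into blocks of 8 independent values that all need the same operation.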

Some terminology

In AVX, there are 16 vector registers, each 256 bits wide, and there are instructions that can use such a register e.g. as a vector of 8 floats or as a vector of 4 doubles.

Warning: proper memory alignment needed
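
A 256-bit vector type expects its data at an address that is a multiple of 32 bytes, and a general-purpose allocator does not necessarily guarantee this. The following is a minimal sketch of one way to obtain suitably aligned memory, assuming a POSIX system where posix_memalign is available; the names float8_t, float8_alloc and is_aligned are hypothetical helpers chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>

// GCC/Clang vector extension: 8 floats, 32 bytes in total.
typedef float float8_t __attribute__ ((vector_size (8 * sizeof(float))));

// Allocate room for n vectors at an address that is a multiple of
// sizeof(float8_t) = 32 bytes. Returns nullptr on failure.
// The caller releases the memory with free().
float8_t* float8_alloc(std::size_t n) {
    assert(n > 0);
    void* p = nullptr;
    // posix_memalign needs an alignment that is a power of two and
    // a multiple of sizeof(void*); 32 satisfies both.
    if (posix_memalign(&p, sizeof(float8_t), n * sizeof(float8_t)) != 0) {
        return nullptr;
    }
    return static_cast<float8_t*>(p);
}

// Check whether a pointer is aligned for float8_t.
bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % sizeof(float8_t) == 0;
}
```

A typical use would be float8_t* v = float8_alloc(n); followed by free(v) when the data is no longer needed.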

Vectors of doubles

Instead of using vector registers to store 8 floats, we can equally well use them to store 4 doubles. The correct definition for such a type is:

typedef double double4_t __attribute__ ((vector_size (4 * sizeof(double))));
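
A short sketch of how such a type might be used, again assuming GCC or Clang; the helper functions scale and hsum4 are hypothetical names introduced for this example only.

```cpp
#include <cassert>

typedef double double4_t __attribute__ ((vector_size (4 * sizeof(double))));

// 4 doubles of 8 bytes each fill one 256-bit register exactly.
static_assert(sizeof(double4_t) == 32, "double4_t should be 256 bits");

// Multiply every element of an array of n vectors by the scalar f.
void scale(double4_t* v, int n, double f) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        v[i] = v[i] * f;  // the scalar f is applied to all 4 lanes
    }
}

// Sum of the 4 lanes of one vector; individual lanes are read with [].
double hsum4(const double4_t& v) {
    double s = 0.0;
    for (int i = 0; i < 4; ++i) {
        s += v[i];
    }
    return s;
}
```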