Graphics Processors

GPUs

Graphics processing units use their own microarchitecture to orchestrate and perform operations over data. There are very specific ways to use these processors, since they are highly parallel.

The most common library for GPUs is NVIDIA’s CUDA.

The Graphics Processing Cluster

Structure of a Streaming Multiprocessor (SM)

Type of Memory	Functionality
Instruction cache	Similar to an i-cache
L1 cache	Stores regular data
Texture cache	Stores texture information
Constant cache	Stores constants
Shared memory	A shared memory that all the threads in the SM can access. Typically, the shared identifier is used to store arrays in this region.

Warps

On a given unit, we can process a TON of threads at once:

$6 GPC \times 7 \frac{TPC}{GPC} \times 2 \frac{SM}{TPC} \times_{4} \frac{PB}{SM} \times 16 \frac{threads}{PB} = 5367 threads .$

We can’t afford to put a Program Counter (PC) on every thread, that would take up too much memory bandwidth and require a rework of memory systems. Instead, we can create a warp:

Groups of 32 threads
Give them all the same PC
Make them execute in lockstep — this is important, if one thread stalls, the others do, too.

This is analogous to a vector instruction with $V L = 32, 16$ hardware lanes and $chime = 2$ clock cycles.

Code with Conditional Branches

Deadlock

Deadlock is possible because of the concept of lockstep. Assume we have 2 threads. If we have one wait on a variable $x$ to be updated to 1, the else part of the conditional won’t be executed because of lockstep. All the threads will be stuck waiting at the same PC.

Pipeline Structure

Register File

The register file is massive. We access 32 registers in a single go (warps are 32 threads). As a result, we need to read $32 \times 32 = 1024$ bits (128B). This means that we need a very very high register reading bandwidth.

The simple solution is to make each entry in a register file bank 1024 bits wide, but this is impractical.

Instead, the better solution would be to distribute the bytes across the banks in the register file. Then, we can use a dedicated set of operand collectors to collect the values of all the registers from different banks. We can have a small on-chip network to route register values between the banks and the operand collector.

Notes

Explorer