Google’s DNN ASIC, designed specifically to excel at dense matrix-matrix multiplication. Its core is a 256x256 array of 8-bit MAC units.
Large software-managed data buffer (scratchpad).
Coprocessor on the PCIe bus.
If you can’t overlap computation with cache accesses, use a systolic algorithm instead: a massively parallel pipeline/array of processing elements with a regular connectivity pattern, all driven by the same clock. Every cycle, each element computes on its inputs and pushes results to its neighbors: compute, communicate, synchronize, in lockstep.
Systolic Operation of TPU
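The systolic data flow described above can be sketched as a cycle-level simulation. This is a scaled-down, illustrative model of a weight-stationary array (the style the TPU uses), not TPU microcode: each PE holds one weight, activations stream in from the left with a one-cycle skew per row, and partial sums flow down each column. The real array is 256x256 with int8 multipliers and 32-bit accumulators; this sketch uses plain Python ints.

```python
def systolic_matmul(A, W):
    """Compute A @ W on a simulated weight-stationary systolic array."""
    n, m = len(W), len(W[0])            # PE grid: n rows x m cols; PE(k, j) holds W[k][j]
    rows = len(A)                       # activations A: rows x n, streamed from the left
    C = [[0] * m for _ in range(rows)]
    act = [[0] * m for _ in range(n)]   # per-PE activation register
    psum = [[0] * m for _ in range(n)]  # per-PE partial-sum register
    for t in range(rows + n + m):       # run until the pipeline drains
        new_act = [[0] * m for _ in range(n)]
        new_psum = [[0] * m for _ in range(n)]
        for k in range(n):
            for j in range(m):
                if j == 0:
                    # left edge: row k of the array is fed A[.][k], skewed by k cycles
                    i = t - k
                    a = A[i][k] if 0 <= i < rows else 0
                else:
                    a = act[k][j - 1]                    # activation shifts right
                above = psum[k - 1][j] if k > 0 else 0   # partial sum flows down
                new_act[k][j] = a
                new_psum[k][j] = above + a * W[k][j]     # the MAC, every PE, every cycle
                if k == n - 1:                           # results emerge at the bottom
                    i_out = t - (n - 1) - j
                    if 0 <= i_out < rows:
                        C[i_out][j] = new_psum[k][j]
        act, psum = new_act, new_psum
    return C
```

Note that no PE ever touches a shared memory during the computation: operands arrive from neighbors on a fixed schedule, which is exactly what removes the need for a cache in the inner loop.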
TPU Guidelines
Use dedicated memories: a 24 MiB software-managed buffer plus 4 MiB of accumulator buffers
Invest resources in arithmetic units and dedicated memories: the TPU has 60% of the memory and 250x the arithmetic units of a server-class CPU
Use the easiest form of parallelism that matches the domain: the TPU exploits the 2D SIMD parallelism of matrix multiplication
Reduce the data size: 8-bit integer arithmetic instead of 32-bit floating point, so each operation moves less data and more arithmetic units fit on the chip