L1 Instruction Memory System
Fetches instructions from the Instruction cache (I$) and delivers the instruction stream to the instruction decode unit.
It includes:
- a 64 KiB (kibibyte), 4-way set-associative L1 I$ with 64 B lines (see the geometry sketch after this list)
- a fully-associative L1 ITLB with native support for 4 KiB, 16 KiB, 64 KiB, and 2 MiB page sizes
- a 1536-entry, 4-way skewed-associative L0 Macro-Op (MOP) cache, which contains decoded and optimized instructions for higher performance.
- a dynamic branch predictor: TAGE, branch target buffers, return address stacks, etc.
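As a quick sanity check on the I$ parameters above (the arithmetic uses only the numbers quoted in the list; nothing else is assumed about the design), the address split works out as follows:

```python
# Derive the L1 I$ address split from the parameters above:
# 64 KiB capacity, 4 ways, 64 B lines.
capacity_bytes = 64 * 1024
ways = 4
line_bytes = 64

sets = capacity_bytes // (ways * line_bytes)   # 65536 / (4 * 64) = 256 sets
offset_bits = (line_bytes - 1).bit_length()    # log2(64)  = 6 byte-offset bits
index_bits = (sets - 1).bit_length()           # log2(256) = 8 set-index bits

print(sets, offset_bits, index_bits)           # 256 6 8
```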
Macro-Ops (MOPs) and Micro-Ops (µops)
Macro-Op (MOP): An instruction held in a format close to the one described by the ISA; there may be slight differences, but it is essentially the same. This representation is independent of the µ-architecture.
Micro-Op (µop): An internal, processor-specific encoding of the operations used to control execution resources. This can vary widely between different implementations of the same ISA.
The correspondence between MOPs and µops used by a processor may be 1:1, 1:N, or N:1: A single MOP can be cracked/split into one or more internal µops, or multiple MOPs can be fused into a single µop.
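As a toy illustration of the 1:1, 1:N, and N:1 possibilities (all opcode names below are invented for the sketch and not taken from any real core), a decoder can be modeled as a cracking function plus a fusion pass over adjacent MOPs:

```python
# Toy model of MOP -> µop cracking (1:N) and MOP fusion (N:1).
# All opcode names are illustrative only.

def crack(mop: str) -> list[str]:
    """Split one MOP into one or more internal µops."""
    if mop.startswith("str_postidx"):          # e.g. str x1, [x2], #24
        return ["agu", "alu_addrupdate", "store_data"]
    return [mop]                               # 1:1 by default

def fuse(mops: list[str]) -> list[str]:
    """Merge adjacent MOPs into a single µop when a known pattern matches."""
    out, i = [], 0
    while i < len(mops):
        if i + 1 < len(mops) and mops[i] == "add" and mops[i + 1] == "add":
            out.append("add3")                 # N:1 fusion into one µop
            i += 2
        else:
            out.extend(crack(mops[i]))         # 1:1 or 1:N cracking
            i += 1
    return out
```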
Pre-Decoding
(Figure: pre-decode pipeline diagrams, from https://patents.google.com/patent/US10176104B2/en)
The intuition is to identify and signal opportunities to split or fuse instructions into more efficient primitives natively supported by the µarch.
There is, however, a caveat: we need to preserve the precise exception model while doing this.
Examples
Splits
We can split the instruction
str x1, [x2], #24
into the MOP sequence
AGU x2
ALU x2, #24
STR x1
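To see why this split is legal, note that the post-index form stores x1 at the old value of x2 and only afterwards advances x2 by 24; the AGU MOP captures the address before the ALU update, so the store still uses the pre-update address. A minimal sketch of that equivalence, with registers and memory modeled as plain dicts (the modeling is an assumption, not how the hardware represents state):

```python
# Sketch: the post-indexed store and its cracked sequence have the same effect.

def str_postidx(regs, mem):
    """str x1, [x2], #24 as a single architectural operation."""
    mem[regs["x2"]] = regs["x1"]   # store at the *old* address
    regs["x2"] += 24               # then update the base register

def cracked(regs, mem):
    """AGU x2 ; ALU x2, #24 ; STR x1 -- address captured before the update."""
    addr = regs["x2"]              # AGU: compute the store address
    regs["x2"] += 24               # ALU: base-register update
    mem[addr] = regs["x1"]         # STR: store uses the pre-update address

r1, m1 = {"x1": 7, "x2": 0x1000}, {}
r2, m2 = dict(r1), {}
str_postidx(r1, m1); cracked(r2, m2)
assert (r1, m1) == (r2, m2)        # both leave x2 = 0x1018, mem[0x1000] = 7
```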
We can split the instruction
stp x29, x30, [sp, #-32]!
into
ALU sp, #-32
AGU sp
STR x29
AGU sp, #8
STR x30
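The same kind of check works for the pre-indexed store pair: the writeback of sp happens first, and both stores are addressed relative to the updated sp, at offsets 0 and 8. A short sketch under the same assumed state model:

```python
# Sketch: the pre-indexed store pair and its cracked sequence match.

def stp_preidx(regs, mem):
    """stp x29, x30, [sp, #-32]! as one architectural operation."""
    regs["sp"] -= 32               # writeback happens first (pre-index)
    mem[regs["sp"]]     = regs["x29"]
    mem[regs["sp"] + 8] = regs["x30"]

def cracked(regs, mem):
    """ALU sp, #-32 ; AGU sp ; STR x29 ; AGU sp, #8 ; STR x30"""
    regs["sp"] -= 32               # ALU: sp update
    a0 = regs["sp"]                # AGU: address of the first store
    mem[a0] = regs["x29"]          # STR x29
    a1 = regs["sp"] + 8            # AGU: address of the second store
    mem[a1] = regs["x30"]          # STR x30

r1, m1 = {"x29": 1, "x30": 2, "sp": 0x8000}, {}
r2, m2 = dict(r1), {}
stp_preidx(r1, m1); cracked(r2, m2)
assert (r1, m1) == (r2, m2)        # identical registers and memory either way
```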
Fuses
We can fuse the instruction sequence
add x0, x1, x2
add x3, x0, x4
into the single MOP
add r3, r1, r2, r4
(assuming the µarch supports 3-register adds)
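A peephole pass can recognize this pattern by checking that the second add reads the first add's destination. The sketch below (instruction tuples and the add3 µop are illustrative assumptions) only detects the pattern; a real implementation would additionally have to produce, or prove dead, the intermediate result in x0.

```python
# Sketch: detect the dependent-add pattern that a 3-source add µop could absorb.
# Instructions are (opcode, dest, src1, src2) tuples; names are illustrative.

def try_fuse_adds(i0, i1):
    """Return a fused 3-source add if i1 reads i0's result, else None."""
    if i0[0] == "add" and i1[0] == "add" and i0[1] in (i1[2], i1[3]):
        other = i1[3] if i1[2] == i0[1] else i1[2]
        return ("add3", i1[1], i0[2], i0[3], other)   # dest = src1 + src2 + other
    return None

print(try_fuse_adds(("add", "x0", "x1", "x2"),
                    ("add", "x3", "x0", "x4")))
# -> ('add3', 'x3', 'x1', 'x2', 'x4')
```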
We can fuse the instruction sequence
b tgt
add x3, x3, 4
tgt: ...
into
nop_pc+8
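Here the unconditional branch jumps over exactly one 4-byte instruction, so the taken branch and the skipped add together behave like a single operation that just advances the PC by 8 bytes. A sketch of the detection (the nop_pc+8 name comes from the example above; everything else is assumed):

```python
# Sketch: an unconditional branch that skips exactly one instruction can be
# collapsed, together with the skipped instruction, into a PC+8 "nop".

INSN_BYTES = 4   # fixed-width AArch64 encoding

def fuse_branch_over(pc, insn, target):
    """If `insn` is a direct branch to pc + 8, the pair is an effective nop."""
    if insn == "b" and target == pc + 2 * INSN_BYTES:
        return "nop_pc+8"          # the skipped instruction never executes
    return None

print(fuse_branch_over(0x1000, "b", 0x1008))   # -> 'nop_pc+8'
```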
We can fuse the instruction sequence
cbnz x1, tgt
add x3, x3, 1
tgt: ...
into
ifeqz_add x3, x3, 1, x1, xzr
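cbnz x1, tgt skips the add whenever x1 is non-zero, so the add should execute only when x1 == 0; that is exactly what the fused predicated µop expresses (comparing x1 against xzr, the zero register). A small sketch checking that the two forms agree, under an assumed dict-based state model:

```python
# Sketch: the fused ifeqz_add matches the cbnz-over-add sequence -- the add
# executes only when x1 == 0 (i.e. x1 compares equal to xzr).

def cbnz_over_add(regs):
    """cbnz x1, tgt ; add x3, x3, 1 ; tgt: ..."""
    if regs["x1"] != 0:
        return                       # branch taken: the add is skipped
    regs["x3"] += 1

def ifeqz_add(regs):
    """ifeqz_add x3, x3, 1, x1, xzr (predicated form of the same thing)."""
    if regs["x1"] == 0:              # x1 == xzr, i.e. x1 == 0
        regs["x3"] += 1

for x1 in (0, 5):
    a, b = {"x1": x1, "x3": 10}, {"x1": x1, "x3": 10}
    cbnz_over_add(a); ifeqz_add(b)
    assert a == b                    # identical final state either way
```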
Instruction Decoding in ARM Cortex-A78
A MOP can be split into up to two µops after the decode stage. These µops are dispatched to one of 13 issue pipelines, each of which can accept 1 µop/cycle.
Dispatch can process up to 6 MOPs/cycle and dispatch up to 12 µops/cycle; since each MOP cracks into at most two µops, the two limits are consistent. There are per-type limits on how many µops of each kind may be dispatched simultaneously. If more µops are available than can be dispatched, they are dispatched in oldest-to-youngest age order.
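A simplified way to picture the dispatch constraints (the 6 MOP and 12 µop widths come from the text above; the per-pipeline type limits are deliberately omitted, and the model itself is an assumption): pick MOPs oldest-first until one of the per-cycle caps is hit.

```python
# Simplified model of per-cycle dispatch: select MOPs oldest-first, stopping at
# 6 MOPs or 12 µops per cycle (per-pipeline type limits are omitted here).

MAX_MOPS, MAX_UOPS = 6, 12

def dispatch_one_cycle(queue):
    """`queue` is an age-ordered list of µop counts (1 or 2) per MOP."""
    mops = uops = 0
    while queue and mops < MAX_MOPS and uops + queue[0] <= MAX_UOPS:
        uops += queue.pop(0)        # oldest MOP first
        mops += 1
    return mops, uops

print(dispatch_one_cycle([2, 2, 2, 2, 2, 2, 1]))   # -> (6, 12)
```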