Data Hazards
Recall our stalling policy for Data Hazards. This works as a general strategy, but it’s not great for performance.
Why isn’t it best for performance?
- The values needed by the dependent instruction are often available in pipeline registers before they are written to the locations the dependent instruction reads them from
What if?
- We bypass waiting for the register updates and hand off the correct bit values to allow the dependent instruction to move forward
The cost of this?
- More wires, logic, etc. (hardware)
Benefits?
- Correctness with a lower performance hit
Enter forwarding
Value Forwarding
Policy: Value forwarding (aka “forwarding” or “register forwarding” or “bypassing”)
Mechanism: Logic, wires
- Focus on data dependences carried through architectural registers
- Forwarding happens completely in the data path, without any involvement of the pipeline control
The idea is to transmit the bits of the value over wires during a single cycle
- from the producing instruction: from the end of the combinational logic block in the X stage (equivalent to M_in), or from M_out, or from W->in, or from W->out
- to intercept the dependent instruction: at the end of the combinational logic block in the D stage, just after incorrect values have been read for the registers
- so that the correct bits are presented at the clock’s edge to be latched into the X pipeline register, ready for the next cycle
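The selection step described above can be sketched in software. This is only an illustration, not the actual hardware or any course-provided code: the `Source` type, the `forward` function, and the rule that the youngest producer wins are assumptions introduced here, and the source labels simply reuse the names from the list above.

```python
from typing import List, NamedTuple, Optional


class Source(NamedTuple):
    """A candidate forwarding source (hypothetical type for this sketch)."""
    name: str            # label of the pipeline location, e.g. "M_in"
    dst: Optional[int]   # destination register number, or None if no write
    val: int             # value headed for that register


def forward(src_reg: int, regfile_val: int, sources: List[Source]) -> int:
    """Pick the value the D stage should present for src_reg.

    `sources` is assumed to be ordered youngest producer first, so the
    most recent pending write to src_reg wins. If no source matches,
    the value read from the register file is already correct.
    """
    for s in sources:
        if s.dst is not None and s.dst == src_reg:
            return s.val
    return regfile_val


# Hypothetical snapshot of the four forwarding sources in one cycle.
sources = [
    Source("M_in", dst=3, val=42),     # result leaving the X stage
    Source("M_out", dst=None, val=0),  # this instruction writes no register
    Source("W->in", dst=3, val=7),     # older write to the same register
    Source("W->out", dst=5, val=9),
]
print(forward(3, regfile_val=0, sources=sources))  # the youngest match is used
```

Ordering the sources youngest first matters when two in-flight instructions write the same register: the dependent instruction must see the later write in program order.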
Def-Use Hazard With Forwarding
Cycle | Instruction in | F | D | X | M | W |
---|---|---|---|---|---|---|
1 | I1 | I1 | | | | |
2 | I2 | I2 | I1 | | | |
3 | | | I2 | I1 | | |
4 | | | | I2 | I1 | |
5 | | | | | I2 | I1 |
6 | | | | | | I2 |
Without forwarding, we would have lost 3 cycles in this case (the dependent instruction is separated from its producer by 0 instructions, so 3 - 0 = 3).
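A stall-free timeline like the one in the table can be generated mechanically: with no stalls, the i-th instruction (counting from 0) occupies stage number s in cycle i + s + 1. The sketch below is illustrative only; the `timeline` function and its output layout are assumptions made here, not part of any course tooling.

```python
STAGES = ["F", "D", "X", "M", "W"]


def timeline(instrs):
    """Return one dict per cycle mapping each stage to the instruction in it.

    Assumes an ideal pipeline with no stalls or bubbles, which is the
    situation forwarding achieves for the def-use hazards shown here.
    """
    n_cycles = len(instrs) + len(STAGES) - 1
    rows = []
    for cycle in range(1, n_cycles + 1):
        row = {}
        for s, stage in enumerate(STAGES):
            i = cycle - 1 - s  # index of the instruction in this stage
            row[stage] = instrs[i] if 0 <= i < len(instrs) else ""
        rows.append(row)
    return rows


# Print the two-instruction case, one line per cycle.
for cycle, row in enumerate(timeline(["I1", "I2"]), start=1):
    print(cycle, " | ".join(row[s] or "--" for s in STAGES))
```

Calling `timeline(["I1", "N1", "I2"])` or `timeline(["I1", "N1", "N2", "I2"])` gives the timelines for the cases with one or two intervening instructions.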
Cycle | Instruction in | F | D | X | M | W |
---|---|---|---|---|---|---|
1 | I1 | I1 | | | | |
2 | N1 | N1 | I1 | | | |
3 | I2 | I2 | N1 | I1 | | |
4 | | | I2 | N1 | I1 | |
5 | | | | I2 | N1 | I1 |
6 | | | | | I2 | N1 |
7 | | | | | | I2 |
Without forwarding, we would have lost 2 cycles in this case (the dependent instruction is separated from its producer by 1 instruction, so 3 - 1 = 2).
Cycle | Instruction in | F | D | X | M | W |
---|---|---|---|---|---|---|
1 | I1 | I1 | | | | |
2 | N1 | N1 | I1 | | | |
3 | N2 | N2 | N1 | I1 | | |
4 | I2 | I2 | N2 | N1 | I1 | |
5 | | | I2 | N2 | N1 | I1 |
6 | | | | I2 | N2 | N1 |
7 | | | | | I2 | N2 |
8 | | | | | | I2 |
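With forwarding, we still lose 0 cycles, even with 2 instructions between the producer and the dependent instruction. Without forwarding, we would have lost 1 cycle in this case (3 - 2 = 1).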
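The "3 - k" arithmetic quoted with the examples above can be written as a small helper. This is a sketch under assumptions: the function name is made up for illustration, and the clamp at zero (no cycles lost once the instructions are far enough apart) is inferred from the pattern rather than stated in the notes.

```python
def cycles_lost_without_forwarding(gap: int, max_penalty: int = 3) -> int:
    """Stall cycles for a def-use hazard when forwarding is not available.

    gap is the number of instructions between the producer and the
    dependent instruction; max_penalty is the loss when gap is 0.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    return max(0, max_penalty - gap)


for gap in range(4):
    print(gap, cycles_lost_without_forwarding(gap))
```

With forwarding, the corresponding loss for these def-use cases is 0 regardless of the gap.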
Back-to-Back Load-Use Hazard with Forwarding
Pipeline Performance
Suppose we are processing a long sequence of $N$ instructions on the PIPE implementation.
Ignoring pipeline startup and draining cycles, how many cycles $C$ will this take?
Clearly, $C \ge N$. Let $C = N + B$ with $B \ge 0$, where $B$ counts the extra cycles lost to stalls and bubbles. Then, cycles per instruction (CPI) is $C/N = 1 + B/N$.
The penalty term is $B/N$.
Approximating the Penalty Term
Approximate the penalty term based on causes.
- Load penalty $lp$: 1 cycle for every (back-to-back) load-use hazard
- Misprediction penalty $mp$: 2 cycles for every mispredicted branch
- Return penalty $rp$: 1 cycle for every `ret` instruction
- Estimate the penalty terms based on the instruction frequencies and condition frequencies in the execution trace
CPI $\approx 1 + lp + mp + rp$.
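As a worked illustration of the estimate, the sketch below multiplies each cause's frequency by its cycle cost from the list above and adds the results to the base CPI of 1. The function name, parameter names, and the frequencies in the example call are all hypothetical values chosen for illustration; real numbers would come from an execution trace.

```python
def estimate_cpi(load_freq, load_use_frac, branch_freq, mispredict_frac,
                 ret_freq, load_cycles=1, mispredict_cycles=2, ret_cycles=1):
    """Estimate CPI as 1 plus the per-instruction penalty from each cause.

    load_freq       fraction of instructions that are loads
    load_use_frac   fraction of loads followed by a back-to-back use
    branch_freq     fraction of instructions that are conditional branches
    mispredict_frac fraction of those branches that are mispredicted
    ret_freq        fraction of instructions that are returns
    The *_cycles defaults are the per-event costs listed in the notes.
    """
    load_penalty = load_freq * load_use_frac * load_cycles
    mispredict_penalty = branch_freq * mispredict_frac * mispredict_cycles
    return_penalty = ret_freq * ret_cycles
    return 1.0 + load_penalty + mispredict_penalty + return_penalty


# Hypothetical frequencies, not measured data.
cpi = estimate_cpi(load_freq=0.25, load_use_frac=0.20,
                   branch_freq=0.20, mispredict_frac=0.40,
                   ret_freq=0.02)
print(f"{cpi:.2f}")
```

With these made-up inputs the three terms are 0.05, 0.16, and 0.02, so branch misprediction dominates the penalty.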