Data Hazards

Recall our stalling policy for data hazards. It works as a general strategy, but it costs performance.

Why isn’t it best for performance?

  • The values needed by the dependent instruction are often available in pipeline registers before they are written to the registers from which the dependent instruction reads them

What if?

  • We bypass waiting for the register updates and hand off the correct bit values to allow the dependent instruction to move forward

The cost of this?

  • More wires, logic, etc. (hardware)

Benefits?

  • Correctness with a lower performance hit

Enter forwarding

Value Forwarding

Policy: Value forwarding (aka “forwarding” or “register forwarding” or “bypassing”)

Mechanism: Logic, wires

  • Focus on data dependences carried through architectural registers
  • Forwarding happens completely in the data path, without any involvement of the pipeline control

The idea is to transmit the bits of the needed value over wires during a single cycle

  • from the producing instruction: from the end of the combinational logic block in the X stage (equivalent to M_in), or from M_out, W_in, or W_out
  • to intercept the dependent instruction: at the end of the combinational logic block in the D stage, just after stale values have been read from the registers
  • so that the correct bits are presented at the clock’s edge to be latched into the X pipeline register, ready for the next cycle
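The selection step described above can be sketched in software. This is only an illustrative model, not the actual hardware logic: the `Source` record, the `forward_operand` function, and the register numbers and values are hypothetical, and the youngest-producer-first ordering is an assumption about how conflicts between several pending writes would be resolved. Real hardware makes this choice with parallel comparators and multiplexers rather than a loop.

```python
from typing import List, NamedTuple, Optional, Tuple


class Source(NamedTuple):
    """One candidate forwarding source (hypothetical model of a wire)."""
    name: str            # e.g. "M_in", "M_out", "W_in", "W_out"
    dst: Optional[int]   # destination register number, or None if no write
    value: int           # bits currently present on that wire


def forward_operand(src_reg: int, regfile_value: int,
                    sources: List[Source]) -> Tuple[int, str]:
    """Choose the operand value for src_reg at the end of the D stage.

    `sources` is assumed to be ordered youngest producer first, so the
    most recent pending write to src_reg wins. If no source matches,
    the value read from the register file is already correct.
    """
    for s in sources:
        if s.dst is not None and s.dst == src_reg:
            return s.value, s.name
    return regfile_value, "regfile"


# Hypothetical snapshot: the instruction in X is about to write register 3,
# and an older instruction in W also writes register 3.
sources = [
    Source("M_in", 3, 42),      # result at the end of the X stage logic
    Source("M_out", None, 0),   # instruction in M writes no register
    Source("W_in", None, 0),
    Source("W_out", 3, 7),      # older write to the same register
]
value, origin = forward_operand(3, regfile_value=99, sources=sources)
print(value, origin)
```

In this snapshot the stale register-file value 99 is replaced by the value on M_in, because the instruction in X is the youngest producer of register 3.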

Def-Use Hazard With Forwarding

Cycle  Instruction in  F   D   X   M   W
1      I1              I1  -   -   -   -
2      I2              I2  I1  -   -   -
3      -               -   I2  I1  -   -
4      -               -   -   I2  I1  -
5      -               -   -   -   I2  I1
6      -               -   -   -   -   I2

Without forwarding, this case would have lost 3 cycles (the dependent instruction is separated by 0 instructions, so 3 - 0 = 3). With forwarding, no cycles are lost.

Cycle  Instruction in  F   D   X   M   W
1      I1              I1  -   -   -   -
2      N1              N1  I1  -   -   -
3      I2              I2  N1  I1  -   -
4      -               -   I2  N1  I1  -
5      -               -   -   I2  N1  I1
6      -               -   -   -   I2  N1
7      -               -   -   -   -   I2

Without forwarding, this case would have lost 2 cycles (the dependent instruction is separated by 1 instruction, so 3 - 1 = 2). With forwarding, no cycles are lost.

Cycle  Instruction in  F   D   X   M   W
1      I1              I1  -   -   -   -
2      N1              N1  I1  -   -   -
3      N2              N2  N1  I1  -   -
4      I2              I2  N2  N1  I1  -
5      -               -   I2  N2  N1  I1
6      -               -   -   I2  N2  N1
7      -               -   -   -   I2  N2
8      -               -   -   -   -   I2

With forwarding, we still lose 0 cycles, even though I2 depends on I1. (Without forwarding, the separation of 2 instructions would have cost 3 - 2 = 1 cycle.)
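The stall-free timing in the three diagrams, and the "3 minus separation" rule quoted for the no-forwarding case, can be reproduced with a short sketch. The function names are hypothetical, and clamping the stall count at zero for separations of 3 or more is an assumption that extends the rule beyond the cases shown.

```python
STAGES = ["F", "D", "X", "M", "W"]


def pipeline_rows(instrs):
    """Return one row per cycle for a stall-free 5-stage pipeline.

    Each row maps a stage name to the instruction occupying it in that
    cycle, or "" if the stage is empty.
    """
    n_cycles = len(instrs) + len(STAGES) - 1
    rows = []
    for cycle in range(1, n_cycles + 1):
        row = {}
        for depth, stage in enumerate(STAGES):
            idx = cycle - 1 - depth  # instruction fetched `depth` cycles earlier
            row[stage] = instrs[idx] if 0 <= idx < len(instrs) else ""
        rows.append(row)
    return rows


def stalls_without_forwarding(separation):
    """Cycles lost without forwarding: 3 - separation, never below 0 (assumed)."""
    return max(0, 3 - separation)


# Reproduce the second diagram: I1, one independent instruction N1, then I2.
for cycle, row in enumerate(pipeline_rows(["I1", "N1", "I2"]), start=1):
    print(cycle, " ".join(f"{row[s]:<3}" for s in STAGES))

for k in range(4):
    print(f"separation {k}: {stalls_without_forwarding(k)} cycle(s) lost without forwarding")
```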

Back-to-Back Load-Use Hazard with Forwarding

Pipeline Performance

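Suppose we are processing a long sequence of $N$ instructions on the PIPE implementation.

Ignoring pipeline startup and draining cycles, how many cycles $C$ will this take?

Clearly, $C \ge N$. Let $C = N + B$ with $B \ge 0$, where $B$ counts the extra cycles lost to stalls and bubbles. Then, cycles per instruction (CPI) is

$$\mathrm{CPI} = \frac{C}{N} = 1 + \frac{B}{N}$$

The penalty term is $B/N$.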
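As a quick numeric illustration of this relationship, the sketch below computes CPI from an instruction count and a count of extra penalty cycles. The function name and the counts are hypothetical, chosen only to make the arithmetic concrete.

```python
def cpi_from_counts(instructions: int, penalty_cycles: int) -> float:
    """CPI when every instruction takes one cycle plus some extra penalty cycles.

    Startup and draining cycles are ignored, as in the text.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    if penalty_cycles < 0:
        raise ValueError("penalty_cycles must be non-negative")
    total_cycles = instructions + penalty_cycles
    return total_cycles / instructions


# Hypothetical run: one million instructions, 230,000 extra cycles.
cpi = cpi_from_counts(1_000_000, 230_000)
print(f"CPI = {cpi:.2f}, penalty term = {cpi - 1:.2f}")
```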

Approximating the Penalty Term

Approximate the penalty term based on causes.

  • Load penalty: 1 cycle for every (back-to-back) load-use hazard
  • Misprediction penalty: 2 cycles for every mispredicted branch
  • Return penalty: 1 cycle for every ret instruction

  • Estimate the penalty terms based on the instruction frequencies and condition frequencies in the execution trace

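CPI $= 1 + lp + mp + rp$, where $lp$, $mp$, and $rp$ are the load, misprediction, and return penalty terms, each measured in lost cycles per instruction.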
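The estimate can be sketched as follows, using the per-event penalties listed above (1, 2, and 1 cycles). The function name, parameter names, and all frequencies in the example are hypothetical; real values would come from an execution trace.

```python
LOAD_USE_CYCLES = 1     # per back-to-back load-use hazard (from the list above)
MISPREDICT_CYCLES = 2   # per mispredicted branch
RETURN_CYCLES = 1       # per ret instruction


def estimate_cpi(load_freq, load_use_frac, branch_freq, mispredict_frac, ret_freq):
    """Estimate CPI from instruction and condition frequencies.

    load_freq       : fraction of instructions that are loads
    load_use_frac   : fraction of loads followed immediately by a use
    branch_freq     : fraction of instructions that are conditional branches
    mispredict_frac : fraction of those branches that are mispredicted
    ret_freq        : fraction of instructions that are ret
    """
    lp = load_freq * load_use_frac * LOAD_USE_CYCLES
    mp = branch_freq * mispredict_frac * MISPREDICT_CYCLES
    rp = ret_freq * RETURN_CYCLES
    return 1.0 + lp + mp + rp


# Hypothetical trace statistics, for illustration only.
cpi = estimate_cpi(load_freq=0.25, load_use_frac=0.20,
                   branch_freq=0.20, mispredict_frac=0.40,
                   ret_freq=0.02)
print(f"Estimated CPI: {cpi:.2f}")
```

With these made-up frequencies the three penalty terms are 0.05, 0.16, and 0.02, so mispredicted branches dominate the estimate.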