Simultaneous Multithreading (SMT)

Definition

A technique that lets multiple independent threads issue instructions to a superscalar processor’s functional units in the same cycle.

Traditional processors suffer from two types of waste:

Horizontal Waste: Empty issue slots in a non-idle cycle
Vertical Waste: Completely idle cycles
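
The two kinds of waste can be counted directly from a per-cycle issue trace. The following is a minimal sketch, not taken from the original paper; the function name, the trace values, and the issue width of 4 are made up for illustration.

```python
def issue_waste(issued_per_cycle, issue_width):
    """Return (horizontal, vertical) wasted issue slots for a trace.

    issued_per_cycle: number of instructions issued in each cycle.
    issue_width: number of issue slots available per cycle.
    """
    horizontal = 0
    vertical = 0
    for issued in issued_per_cycle:
        if issued == 0:
            # Completely idle cycle: every slot is vertical waste.
            vertical += issue_width
        else:
            # Busy cycle: only the unused slots are horizontal waste.
            horizontal += issue_width - issued
    return horizontal, vertical


# Hypothetical trace for a 4-wide processor over 6 cycles.
trace = [3, 0, 1, 4, 0, 2]
h, v = issue_waste(trace, issue_width=4)
print(h, v)  # 6 8
```

Of the 24 slots in this trace, 10 are used, 6 are lost horizontally, and 8 are lost vertically.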

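The key innovation is that SMT lets different threads share processor resources within the same cycle, so one thread’s instructions can fill issue slots that another thread leaves empty. This attacks both horizontal and vertical waste.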

Performance

Tullsen et al. (1995) evaluated three SMT models:

Full Simultaneous Issue: All threads compete for all issue slots. This had the highest performance but was also the most complex.
Single/Dual/Four Issue: limited per-thread issue bandwidth. This had a simpler hardware implementation.
Limited Connection: Each hardware context connects to specific functional units. This was the simplest, but also the slowest.
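
The difference between full simultaneous issue and the per-thread limited models can be sketched for a single cycle as below. This is a simplified illustration under stated assumptions: functional-unit types are ignored, threads are served in ID order, the Limited Connection model is not covered, and the function name and numbers are hypothetical.

```python
def issue_slots_filled(ready, issue_width, per_thread_cap=None):
    """Count issue slots filled in one cycle.

    ready: ready-to-issue instruction count for each thread.
    per_thread_cap: None models full simultaneous issue; an integer k
    models the single/dual/four-issue variants (k = 1, 2, 4).
    """
    filled = 0
    for r in ready:
        # Apply the per-thread limit, if the model has one.
        take = r if per_thread_cap is None else min(r, per_thread_cap)
        # Never exceed the slots still free this cycle.
        take = min(take, issue_width - filled)
        filled += take
    return filled


ready = [5, 1, 0, 2]  # hypothetical ready instructions for 4 threads
print(issue_slots_filled(ready, issue_width=8))                    # 8
print(issue_slots_filled(ready, issue_width=8, per_thread_cap=2))  # 5
print(issue_slots_filled(ready, issue_width=8, per_thread_cap=1))  # 3
```

With an uneven mix of ready instructions, the per-thread cap leaves slots empty that full simultaneous issue would fill, which is the throughput cost of the simpler hardware.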

Even the simpler SMT models achieve higher throughput than the alternatives evaluated, such as wide superscalar and fine-grain multithreaded processors.

Architectural Modifications

Multiple program counters
Per-thread return stacks for branch prediction
Per-thread instruction retirement and trap mechanisms
Thread IDs in branch target buffer entries
Large register file (most significant change)
- Requires pipelined access, extra cycles
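
To illustrate why branch target buffer entries carry thread IDs, here is a minimal sketch of a direct-mapped BTB whose lookup only hits when both the PC and the thread ID match. The class name, the direct-mapped organization, and the 4-byte instruction alignment are assumptions for the example, not details from these notes.

```python
class ThreadTaggedBTB:
    """Direct-mapped branch target buffer with a thread ID in each entry."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        # Each entry is None or a (thread_id, pc, target) tuple.
        self.entries = [None] * num_entries

    def _index(self, pc):
        # Assumes 4-byte aligned instructions.
        return (pc >> 2) % self.num_entries

    def update(self, thread_id, pc, target):
        self.entries[self._index(pc)] = (thread_id, pc, target)

    def lookup(self, thread_id, pc):
        entry = self.entries[self._index(pc)]
        # A hit needs both the PC and the thread ID to match, so one
        # thread never follows a target recorded by another thread.
        if entry is not None and entry[0] == thread_id and entry[1] == pc:
            return entry[2]
        return None


btb = ThreadTaggedBTB(num_entries=64)
btb.update(thread_id=0, pc=0x1000, target=0x2000)
print(btb.lookup(0, 0x1000))  # 8192
print(btb.lookup(1, 0x1000))  # None
```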

Critical Innovation: ICOUNT Fetch Policy:

ICOUNT: Prioritize threads with fewest instructions in decode, rename, and queues
23% gain over round-robin thread selection
Significantly reduced instruction queue congestion
Dynamically favors threads using resources efficiently
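
The selection rule can be sketched as follows, next to a round-robin baseline. This is an illustrative sketch only: the function names, the tie-break by lowest thread ID, and the example counts are assumptions, and a real fetch unit would also handle cache misses and fetch bandwidth.

```python
def icount_pick(front_end_counts, n=1):
    """Pick the n threads with the fewest instructions in decode,
    rename, and the instruction queues.

    front_end_counts[t] is that instruction count for thread t.
    Ties are broken by lowest thread ID (an assumption of this sketch).
    """
    order = sorted(range(len(front_end_counts)),
                   key=lambda t: (front_end_counts[t], t))
    return order[:n]


def round_robin_pick(last, num_threads, n=1):
    """Pick the next n threads after `last`, ignoring occupancy."""
    return [(last + 1 + i) % num_threads for i in range(n)]


counts = [12, 3, 7, 3]  # hypothetical per-thread front-end occupancy
print(icount_pick(counts, n=2))                      # [1, 3]
print(round_robin_pick(last=0, num_threads=4, n=2))  # [1, 2]
```

In this example round-robin fetches for thread 2, which already has 7 instructions waiting, while ICOUNT picks the two threads that are draining the queues fastest.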

Fetch Partitioning Strategies

Key Technical Challenges

Instruction fetch:

Thread selection policies: ICOUNT performance improvement up to 37% vs. round-robin

Register File:

Challenges:
- Must support logical registers for all threads
- Additional physical registers for renaming
- Access time becomes critical with large register files
Solutions:
- Pipeline register access (multi-cycle read/write)
- Optimize thread count based on physical register count
- For 200 physical registers, 4 threads was optimal
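
The trade-off between thread count and renaming capacity is simple arithmetic. The sketch below assumes 32 logical registers per thread in one register class, which is typical of a RISC ISA but is not stated in these notes; the function name is hypothetical.

```python
def renaming_registers(physical, threads, logical_per_thread=32):
    """Registers left for renaming once every thread's logical
    registers are backed by a physical register."""
    left = physical - threads * logical_per_thread
    if left < 0:
        raise ValueError("not enough physical registers for this many threads")
    return left


for threads in (1, 2, 4, 6):
    print(threads, renaming_registers(200, threads))
# 1 168
# 2 136
# 4 72
# 6 8
```

Under this assumption, each added thread removes 32 registers from the renaming pool, so at 6 threads only 8 remain. The numbers show the pressure, not the optimum itself: where throughput peaks depends on how much each thread benefits from renaming.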

Cache Hierarchy:

Private vs. Shared Caches:
- Shared caches: Better for few threads
- Private caches: Better for many threads
Best configuration:
- Private instruction caches
- Shared data cache
- This balances isolation and efficient sharing

Branch Prediction:

Challenges:
- Multiple threads compete for branch prediction resources
- Misprediction rates increase with thread count
Observations:
- SMT is less sensitive to branch mispredictions
- Perfect branch prediction: 25% improvement on 1 thread vs 9% improvement on 8 threads
- SMT reduces wrong-path instructions from 16% to 9%.

Commercial Implementations

Intel Hyper-Threading

First appeared in 2002, in Xeon server processors and then in the Pentium 4 (Northwood). It is a 2-way SMT design (two logical processors per core): architectural state registers are duplicated, while execution resources, caches, and TLBs are shared. Some resources are statically partitioned.

Available in many modern Intel CPUs (up to 2-way SMT), though not all: E-cores and some recent client processors omit it.

AMD SMT

Introduced with the Zen architecture in 2017. It is also a 2-way SMT design, similar to Intel’s, and aims for balanced resource sharing and efficient instruction scheduling between the two threads.

Available in most Ryzen, EPYC, and Threadripper products.

Modern SMT Considerations

Power and Efficiency

Advantages:

Improves performance/watt by better utilizing existing resources
Higher throughput without proportional power increase

Challenges:

Dynamic power increases from more active transistors
Complex power management is required

Solutions:

Power-aware thread scheduling
Dynamic resource partitioning

Security

Vulnerabilities:

Side-channel attacks (e.g., PortSmash, TLBleed, and some Spectre/Meltdown-class variants such as MDS)
Resource sharing enables cross-thread information leakage

Mitigations:

OS-level scheduling constraints
Hardware partitioning of critical resources
Option to disable SMT for security-critical applications

OS Scheduling

Thread placement matters for performance

Modern OSes are SMT-aware:

Linux: CPU affinity, thread grouping
Windows: logical processor groups
macOS: thread affinity APIs

Some workloads benefit from disabling SMT and using physical cores only.
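
On Linux, one way to approximate "physical cores only" without disabling SMT is to pin a process to one logical CPU per core. The sketch below reads the sibling lists that Linux exposes in sysfs and uses `os.sched_setaffinity`; the helper names are made up for this example, and choosing the lowest-numbered sibling of each core is an arbitrary convention.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,4" or "0-1" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Given {cpu: sibling-list string}, keep the lowest CPU of each core."""
    return sorted({parse_cpu_list(text)[0] for text in sibling_lists.values()})


def read_sibling_lists():
    """Read SMT sibling lists from Linux sysfs; {} if unavailable."""
    result = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            result[cpu] = f.read()
    return result


def pin_to_physical_cores():
    """Restrict this process to one logical CPU per core (Linux only)."""
    cpus = set(one_cpu_per_core(read_sibling_lists()))
    cpus &= os.sched_getaffinity(0)  # respect any existing restriction
    if cpus:
        os.sched_setaffinity(0, cpus)
    return sorted(cpus)


# Hypothetical 2-core, 4-thread machine where CPUs 0/2 and 1/3 are siblings.
example = {0: "0,2\n", 1: "1,3\n", 2: "0,2\n", 3: "1,3\n"}
print(one_cpu_per_core(example))  # [0, 1]
```

Calling `pin_to_physical_cores()` early in a program leaves each core's second hardware thread idle for that process, which is useful for comparing a workload with and without SMT sharing.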

Key Takeaways

SMT significantly improves processor resource utilization with modest hardware changes. The register file is the most challenging resource to scale with SMT. Intelligent fetch policies like ICOUNT provide substantial performance improvements.

Commercial x86 implementations use 2-way SMT for both clients and servers; some server processors, such as IBM POWER, support up to 8-way SMT. SMT complements multicore designs, improving both throughput and utilization.

Best cache organization: Private instruction caches with shared data caches.

Fetch throughput remains a primary bottleneck.
