Branch-and-Pass Algorithm (BOSS)
- Branch-and-Pass Algorithm (BOSS) is a coordinated mechanism that precomputes branch outcomes using a compiler-driven back-slice to address hard-to-predict loop branches.
- It employs a static-to-dynamic branch mapping scheme where software pre-execution communicates precise Taken/Not-Taken hints to the hardware, reducing misfetches and pipeline stalls.
- BOSS integrates seamlessly with existing branch predictors and uses loop unrolling and vectorization to amortize pre-execution overhead, yielding significant performance improvements and lower MPKI.
The Branch-and-Pass algorithm, also referred to as BOSS, is a compiler- and microarchitecture-coordinated mechanism designed to eliminate or drastically reduce mispredictions for hard-to-predict branches, particularly load-dependent branches (LDBs) occurring within software loops. BOSS achieves this by enabling software to pre-execute the minimal instruction back-slice computing the branch condition and passing the corresponding Taken/Not-Taken outcome to the processor frontend before the branch’s dynamic instance is fetched. This side-channel communication enables the hardware to steer instruction fetch along the correct path, thereby obviating misfetches and pipeline squashes associated with conventional dynamic branch prediction. The mechanism integrates seamlessly with extant branch prediction units (BPUs), supplying hint information that supersedes the prediction only when available, with fallback to the baseline predictor otherwise (Goudarzi et al., 2023).
1. Motivations and Architectural Premises
Conventional branch predictors, particularly those built upon local and global history correlations (e.g., TAGE-SC-L), frequently struggle with LDBs, where branch outcomes depend on unpredictable memory loads or irregular input data. These difficulties are magnified in deeply pipelined, wide-issue out-of-order cores where branch mispredictions incur considerable fetch redirection penalties. BOSS specifically targets these LDBs inside loops, leveraging regular loop iteration structure to facilitate clear mapping between static branches and their dynamic instances. By calculating and conveying exact Taken/Not-Taken bits via software, BOSS sidesteps the fundamental limits of pattern-based dynamic prediction for these branches and enables fetch redirection to occur with near-perfect accuracy (Goudarzi et al., 2023).
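As a concrete illustration (hypothetical C code, not drawn from the paper), the kind of load-dependent branch BOSS targets looks like the following: the branch outcome depends on a value loaded through a data-dependent index, so history-based predictors see a near-random pattern.

```c
/* Hypothetical example of a load-dependent branch (LDB) inside a loop.
 * The if-condition depends on a value loaded through an indirection (idx[i]),
 * so history-based predictors such as TAGE-SC-L see a near-random outcome stream. */
long count_above(const int *data, const int *idx, long n, int threshold)
{
    long hits = 0;
    for (long i = 0; i < n; i++) {
        if (data[idx[i]] > threshold)   /* hard-to-predict load-dependent branch */
            hits++;
    }
    return hits;
}
```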
2. Compiler Analysis and Code Transformation
BOSS requires compiler or profile-guided analysis to instrument target loops and their constituent branches. The transformation includes the following:
- Back-Slice Identification: For each target branch within a loop, the compiler traces the sequence of instructions (the branch back-slice) leading to the branch, proceeding backward along data and control dependencies until reaching sources that are either induction variables or induction-derived memory accesses.
- Iteration Strip-Mining: If the loop trip count exceeds the capacity of the BOSS side-channel buffer, the compiler rewrites the loop into strip-mined blocks of size N (the side-channel capacity) to maintain one-to-one correspondence between dynamic branch instances and buffer indices.
- Pre-Execute Loop Generation: A pre-execute loop is generated immediately prior to the main loop. For every iteration index i, it executes the minimal back-slice computation and emits the outcome via BOSS_write(channel, i, outcome).
- Main Loop Instrumentation: A one-time BOSS_open(channel, BranchPC, LoopEndPC) call declares the scope, and the conditional branch is left in situ for correctness, exception handling, and adherence to the architectural contract (Goudarzi et al., 2023). A sketch of the full transformation follows this list.
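The following is a minimal sketch of the transformation applied to the hypothetical loop from Section 1, assuming BOSS_open and BOSS_write are exposed to C as compiler intrinsics with the argument order described above; the intrinsic prototypes, the strip-size constant, and the placeholder PC arguments are illustrative assumptions, not a published API.

```c
/* Sketch of the BOSS transformation of the hypothetical loop above (assumed intrinsic API). */
#define BOSS_N 512   /* assumed side-channel capacity per strip */

void BOSS_open(int channel, const void *branch_pc, const void *loop_end_pc); /* declares the scope    */
void BOSS_write(int channel, long i, int outcome);                           /* emits one outcome bit */

long count_above_boss(const int *data, const int *idx, long n, int threshold)
{
    long hits = 0;
    BOSS_open(0, 0, 0);   /* channel 0; BranchPC/LoopEndPC placeholders, filled in by the compiler */

    for (long base = 0; base < n; base += BOSS_N) {                 /* strip-mining */
        long strip = (n - base < BOSS_N) ? (n - base) : BOSS_N;

        /* Pre-execute loop: minimal back-slice of the branch condition. */
        for (long i = 0; i < strip; i++)
            BOSS_write(0, i, data[idx[base + i]] > threshold);

        /* Main loop: original logic retained; the branch stays in place. */
        for (long i = 0; i < strip; i++) {
            if (data[idx[base + i]] > threshold)                     /* outcome now hinted via the LUT */
                hits++;
        }
    }
    return hits;
}
```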
3. Static-to-Dynamic Branch Mapping
Dynamic branch instance mapping in BOSS relies on software/hardware-level coordination, parameterized by:
- Channel Capacity (N): Number of concurrent branch outcomes tracked per channel (e.g., N = 512, matching the LUT sizing in Section 4).
- Generation IDs (g): A single bit that alternates on each complete trip through the loop, ensuring fresh allocation and preventing cross-iteration overwrite conflicts.
- Iteration Indexing: Each dynamic instance is keyed by the pair (g, i mod N), where i = 0, 1, 2, … is the iteration number, so slots stay within 0 … N−1.
The microarchitecture maintains a per-channel LUT, indexed as LUT[channel][g][i mod N], enabling correct and collision-free dynamic lookup as the loop executes (Goudarzi et al., 2023); a minimal software model of this indexing is sketched below.
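The sketch below models the mapping in plain C, under the assumption of 4 channels, 2 generations, and N = 512 one-bit entries (consistent with the sizing in Section 4); the structure names and helper function are illustrative, not part of the hardware specification.

```c
/* Software model of the per-channel LUT, indexed as LUT[channel][g][i mod N].
 * Sizes are assumptions consistent with the text: 4 channels, 2 generations,
 * N = 512 one-bit outcomes (4 x 2 x 512 bits = 4 kbits = 512 B). */
#include <stdbool.h>

#define CHANNELS 4
#define N        512

static bool lut_outcome[CHANNELS][2][N];   /* Taken / Not-Taken hints               */
static bool lut_valid[CHANNELS][2][N];     /* set by BOSS_write, consumed at fetch  */
static unsigned producer_gen[CHANNELS];    /* toggles after each complete trip (producer side) */
static unsigned consumer_gen[CHANNELS];    /* toggles when LoopEndPC retires (consumer side)   */

/* Producer side: a committed BOSS_write(channel, i, outcome) lands here. */
static void lut_write(int ch, long i, bool taken)
{
    unsigned g = producer_gen[ch] & 1u;
    lut_outcome[ch][g][i % N] = taken;
    lut_valid[ch][g][i % N]   = true;
}
```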
4. Microarchitectural Integration
BOSS necessitates minimal extensions to the processor frontend, notably:
- State Tables: Including BranchPC, LoopEndPC, per-channel producer and consumer generation bits, consumer iteration indices, and a simple iterator stack to track control flow squashes.
- LUT for Outcomes: For 4 channels, 2 generations, and N = 512 one-bit outcome entries, the LUT requires 4 × 2 × 512 = 4096 bits, totaling 4 kbits (512 B).
- Pipeline Hooks: On BOSS_write, outcome bits are committed to the LUT. Fetching a registered branch looks up the corresponding outcome, with fallback to the BPU if no valid entry exists. Bookkeeping updates generation toggles, entry validation, and stack rollbacks as dictated by commit, fetch, and squash events at the branch and loop endpoints (Goudarzi et al., 2023). A sketch of this fetch-time decision follows this list.
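The following sketch expresses that fetch-time decision in C, with bpu_predict standing in for the baseline predictor; the function names and signatures are assumptions for illustration only.

```c
/* Fetch-time decision for one dynamic branch instance: a valid software hint
 * from the BOSS LUT supersedes the prediction; otherwise the baseline BPU is used. */
#include <stdbool.h>

#define N 512

bool bpu_predict(unsigned long branch_pc);   /* stand-in for the baseline predictor (e.g., TAGE-SC-L) */

bool fetch_direction(const bool valid[N], const bool outcome[N],
                     long iter, unsigned long branch_pc)
{
    long idx = iter % N;
    if (valid[idx])
        return outcome[idx];        /* software-provided outcome: fetch steered with no misprediction */
    return bpu_predict(branch_pc);  /* no valid entry: fall back to the BPU */
}
```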
5. Loop-Level Optimizations: Unrolling and Vectorization
Execution overhead for pre-executing the branch back-slice per loop iteration can be substantial, especially for large trip counts. BOSS enables two orthogonal optimizations:
- Unrolling: If the pre-execute loop is unrolled by a factor U, U outcomes are calculated per pre-execute iteration, reducing the amortized store/fence cost of writing outcomes to the side-channel.
- Vectorization: For side-band-friendly back-slices (i.e., those lacking loop-carried dependencies), SIMD-style vectorization computes multiple outcomes in parallel, allowing their emission as a single wide write into the BOSS buffer; a sketch of this batched emission follows this list. As U increases, the effective overhead per iteration from store/fence operations diminishes.
- Break-Even Analysis: A net speedup materializes only when the misprediction cycles eliminated over the loop exceed the pre-execution cost of roughly n × (S + W) cycles, where n is the iteration count, S is the instruction count per back-slice, and W is the per-write cost of emitting an outcome (Goudarzi et al., 2023).
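The sketch below illustrates batched emission, assuming a hypothetical wide-write intrinsic BOSS_write8 that packs eight outcomes into a single side-channel store; both the intrinsic and the factor of eight are illustrative, not taken from the paper.

```c
/* Pre-execute loop unrolled by U = 8: eight comparison results are packed into
 * one byte and emitted with a single (hypothetical) wide write, so the
 * store/fence cost W is paid once per eight iterations instead of once each. */
#include <stdint.h>

void BOSS_write8(int channel, long base_index, uint8_t outcomes);   /* assumed wide-write intrinsic */

void preexecute_unrolled(const int *data, const int *idx, long n, int threshold)
{
    long i = 0;
    for (; i + 8 <= n; i += 8) {
        uint8_t packed = 0;
        for (int u = 0; u < 8; u++)                                   /* no loop-carried dependence: vectorizable */
            packed |= (uint8_t)((data[idx[i + u]] > threshold) << u);
        BOSS_write8(0, i, packed);                                    /* one side-channel write per 8 outcomes */
    }
    /* a remainder loop of single per-outcome writes would handle the last n % 8 iterations */
}
```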
6. Example Transformation and Dynamic Behavior
In a canonical case (e.g., Leela’s kill_neighbours loop from SPEC 2017, a short fixed-trip-count inner loop):
- BOSS Initialization: BOSS_open is invoked with loop-specific parameters prior to loop entry.
- Pre-Execute Loop: Computes the address and comparison required for the branch outcome and invokes BOSS_write for each loop index.
- Main Loop: Retains the original logic; the branch condition is evaluated as usual. On fetch, the frontend resolves the outcome via the LUT, guaranteeing a correct prediction wherever pre-computed data is available.
The hardware’s branch fetch logic first consults the LUT for a software-provided outcome; if valid, the fetch path is set with zero misprediction. Absent a valid software hint, the system falls back to conventional branch prediction (Goudarzi et al., 2023).
7. Performance Evaluation, Limitations, and Extensions
- Experimental Results: On full-system gem5 simulations (ARM-like 8-issue OoO core, TAGE-SC-L-64KB predictor), BOSS achieves:
- Up to 95% reduction in MPKI at LDB sites (21% on average).
- Up to 3× improvement in IPC (23% on average).
- End-to-end speedups of up to 39% in tight hot spots (7% on average).
- Manual unrolling and vectorization reduce pre-execute overhead by up to 40%.
- Trade-Offs: BOSS is most effective when the branch’s misprediction rate exceeds 10 MPKI; for low-mispredict or small trip count cases, overhead amortization is less favorable.
- Constraints:
- Insufficient pre-execution lead time in loops with extremely small trip counts.
- Compiler support currently covers only branches in single (non-nested) inner loops.
- Back-slice pre-execution overhead is always incurred; cold branches can be skipped with profile guidance, but there is no dynamic gating on observed misprediction rate.
- Extensions: The framework supports partial iteration-space coverage, reuse of branch bits across iterations, and cross-loop branch correlation. All extensions are safe due to the hint-only, non-invasive software/hardware contract (Goudarzi et al., 2023).