
Branch-and-Pass Algorithm (BOSS)

Updated 12 December 2025
  • Branch-and-Pass Algorithm (BOSS) is a coordinated mechanism that precomputes branch outcomes using a compiler-driven back-slice to address hard-to-predict loop branches.
  • It employs a static-to-dynamic branch mapping scheme where software pre-execution communicates precise Taken/Not-Taken hints to the hardware, reducing misfetches and pipeline stalls.
  • BOSS integrates seamlessly with existing branch predictors and leverages optimizations like loop unrolling and vectorization to achieve significant performance improvements and lowered MPKI.

The Branch-and-Pass algorithm, also referred to as BOSS, is a compiler- and microarchitecture-coordinated mechanism designed to eliminate or drastically reduce mispredictions for hard-to-predict branches, particularly load-dependent branches (LDBs) occurring within software loops. BOSS achieves this by enabling software to pre-execute the minimal instruction back-slice computing the branch condition and passing the corresponding Taken/Not-Taken outcome to the processor frontend before the branch’s dynamic instance is fetched. This side-channel communication enables the hardware to steer instruction fetch along the correct path, thereby obviating misfetches and pipeline squashes associated with conventional dynamic branch prediction. The mechanism integrates seamlessly with extant branch prediction units (BPUs), supplying hint information that supersedes the prediction only when available, with fallback to the baseline predictor otherwise (Goudarzi et al., 2023).

1. Motivations and Architectural Premises

Conventional branch predictors, particularly those built upon local and global history correlations (e.g., TAGE-SC-L), frequently struggle with LDBs, where branch outcomes depend on unpredictable memory loads or irregular input data. These difficulties are magnified in deeply pipelined, wide-issue out-of-order cores where branch mispredictions incur considerable fetch redirection penalties. BOSS specifically targets these LDBs inside loops, leveraging regular loop iteration structure to facilitate clear mapping between static branches and their dynamic instances. By calculating and conveying exact Taken/Not-Taken bits via software, BOSS sidesteps the fundamental limits of pattern-based dynamic prediction for these branches and enables fetch redirection to occur with near-perfect accuracy (Goudarzi et al., 2023).

2. Compiler Analysis and Code Transformation

BOSS requires compiler or profile-guided analysis to instrument target loops and their constituent branches. The transformation includes the following:

  • Back-Slice Identification: For each target branch within a loop, the compiler traces the sequence of instructions (the branch back-slice) leading to the branch, proceeding backward along data and control dependencies until reaching sources that are either induction variables or induction-derived memory accesses.
  • Iteration Strip-Mining: If the loop trip count N exceeds the capacity C of the BOSS side-channel buffer, the compiler rewrites the loop into strip-mined blocks of size C to maintain one-to-one correspondence between dynamic branch instances and buffer indices.
  • Pre-Execute Loop Generation: A pre-execute loop is generated immediately prior to the main loop. For every iteration index i, it executes the minimal back-slice computation and emits the outcome via BOSS_write(channel, i, outcome).
  • Main Loop Instrumentation: A one-time BOSS_open(channel, BranchPC, LoopEndPC) call declares the scope, and the conditional branch is left in situ for correctness, exception handling, and adherence to the architectural contract (Goudarzi et al., 2023).
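The transformation above can be sketched in C. Since BOSS_write/BOSS_open are ISA extensions, the side-channel is modeled here as a plain software array, and the loop body, function names, and branch shape are illustrative assumptions rather than the paper's actual code:

```c
#include <stdint.h>

/* Hypothetical sketch of the BOSS transformation: the hardware side-channel is
 * modeled as an array so the code runs anywhere. In real BOSS, boss_write would
 * be the BOSS_write instruction targeting the frontend LUT, and BOSS_open would
 * declare the (BranchPC, LoopEndPC) scope once before the loop. */

#define BOSS_CAPACITY 256               /* channel capacity C (example value) */

static uint8_t boss_buf[BOSS_CAPACITY]; /* software stand-in for the LUT */

static void boss_write(int i, int outcome) { boss_buf[i] = (uint8_t)outcome; }

/* Original loop: a load-dependent branch on data[idx[i]] (assumed shape). */
int sum_selected(const int *data, const int *idx, int n, int threshold) {
    int sum = 0;

    /* Pre-execute loop: runs only the branch back-slice (load + compare)
     * and emits each Taken/Not-Taken bit before the main loop is fetched. */
    for (int i = 0; i < n && i < BOSS_CAPACITY; i++)
        boss_write(i, data[idx[i]] > threshold);

    /* Main loop: the branch is left in situ for correctness; the frontend
     * would consult the LUT instead of the BPU for its direction. */
    for (int i = 0; i < n; i++) {
        if (data[idx[i]] > threshold)   /* the hard-to-predict LDB */
            sum += data[idx[i]];
    }
    return sum;
}
```

Note that the pre-execute loop duplicates only the back-slice (one load and one compare per iteration), not the full loop body.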

3. Static-to-Dynamic Branch Mapping

Dynamic branch instance mapping in BOSS relies on software/hardware-level coordination, parameterized by:

  • Channel Capacity (C): Number of concurrent branch outcomes tracked (e.g., C = 256).
  • Generation IDs ({0, 1}): Alternating for each complete trip through a loop, ensuring fresh allocation and preventing cross-iteration overwrite conflicts.
  • Iteration Indexing: Each dynamic instance is keyed by (static_channel_id, generation_id, iteration_index), where iteration_index = k mod C for k ∈ [0, N).

The microarchitecture maintains a per-channel LUT, indexed as LUT[channel][generation_id][iteration_index] → {valid_bit, outcome_bit}, enabling correct and collision-free dynamic lookup as the loop executes (Goudarzi et al., 2023).
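The keying scheme can be sketched as follows, using the example sizes (4 channels, 2 generations, C = 256); the structure and names are illustrative, not the paper's RTL:

```c
#include <stdint.h>

#define CHANNELS    4
#define GENERATIONS 2
#define CAPACITY    256   /* channel capacity C */

typedef struct { uint8_t valid; uint8_t outcome; } lut_entry;

static lut_entry lut[CHANNELS][GENERATIONS][CAPACITY];

/* Producer side: dynamic instance k of a branch lands in slot k mod C under
 * the producer's current generation bit. */
static void lut_write(int channel, int gen, long k, int outcome) {
    lut_entry *e = &lut[channel][gen][k % CAPACITY];
    e->valid   = 1;
    e->outcome = (uint8_t)outcome;
}

/* The generation bit flips after each complete trip through the loop, so a new
 * trip's writes cannot collide with unconsumed entries from the previous one. */
static int next_generation(int gen) { return gen ^ 1; }
```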

4. Microarchitectural Integration

BOSS necessitates minimal extensions to the processor frontend, notably:

  • State Tables: Including BranchPC, LoopEndPC, per-channel producer and consumer generation bits, consumer iteration indices, and a simple iterator stack to track control flow squashes.
  • LUT for Outcomes: For 4 channels, 2 generations, and C = 256, the LUT requires 4 × 2 × 256 × 2 bits, totaling 4 kbits (512 B).
  • Pipeline Hooks: On BOSS_write, outcome bits are committed to the LUT. Fetching a branch looks up the corresponding outcome, and fallback to the BPU occurs if no valid entry exists. Bookkeeping updates generation toggles, entry validation, and stack rollbacks as dictated by commit, fetch, and squash events at the branch and loop endpoints (Goudarzi et al., 2023).
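The fetch-time hook can be sketched as a LUT lookup with BPU fallback; bpu_predict below is only a placeholder for the baseline predictor (e.g., TAGE-SC-L), not a predictor model:

```c
#include <stdint.h>

/* One LUT slot, as in Section 3. */
typedef struct { uint8_t valid; uint8_t outcome; } lut_entry;

static int bpu_predict(uint64_t branch_pc) {
    (void)branch_pc;
    return 0;                      /* placeholder: statically predict Not-Taken */
}

/* On fetching the branch, a valid software hint supersedes the BPU; otherwise
 * the frontend falls back to its normal prediction. */
static int predict_branch(lut_entry *e, uint64_t branch_pc) {
    if (e->valid) {
        e->valid = 0;              /* consume the entry for this dynamic instance */
        return e->outcome;         /* exact Taken/Not-Taken from pre-execution */
    }
    return bpu_predict(branch_pc); /* no hint available: baseline predictor */
}
```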

5. Loop-Level Optimizations: Unrolling and Vectorization

Execution overhead for pre-executing the branch back-slice per loop iteration can be substantial, especially for large trip counts. BOSS enables two orthogonal optimizations:

  • Unrolling: If the pre-execute loop is unrolled by a factor U, U outcomes are calculated per unrolled iteration, reducing the amortized store/fence cost of writing outcomes to the side-channel.
  • Vectorization: For side-band-friendly back-slices (i.e., those lacking loop-carried dependencies), SIMD-style vectorization computes multiple outcomes in parallel, allowing their emission as a single wide write into the BOSS buffer. As U increases, the effective per-iteration overhead from store/fence operations diminishes.
  • Break-Even Analysis: Net speedup materializes if eliminated_mispredictions × Misprediction_Penalty > T × O_back + (T/U) × O_write, where T is the iteration count, O_back is the cost per back-slice, and O_write is the per-write cost (Goudarzi et al., 2023).
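The break-even condition can be evaluated directly; all cycle counts in the example below are hypothetical, not measured values:

```c
/* Net benefit of BOSS per the break-even inequality in Section 5.
 * All inputs are hypothetical costs in cycles. */
static long boss_net_cycles(long eliminated_mispredictions, long mispred_penalty,
                            long T, long o_back, long o_write, long U) {
    long saved = eliminated_mispredictions * mispred_penalty;
    long cost  = T * o_back + (T / U) * o_write;   /* back-slice + write cost */
    return saved - cost;                           /* > 0 means net speedup */
}
```

For instance, with T = 1000 iterations, a back-slice cost of 6, a write cost of 8 amortized over U = 4, and 1000 eliminated mispredictions at 20 cycles each, the sketch yields 20000 − (6000 + 2000) = 12000 cycles saved.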

6. Example Transformation and Dynamic Behavior

In a canonical case (e.g., Leela’s kill_neighbours loop from SPEC 2017, inner loop size K = 4):

  1. BOSS Initialization: BOSS_open invoked with loop-specific parameters prior to entry.
  2. Pre-Execute Loop: Computes the address and comparison required for branch outcome and invokes BOSS_write for each loop index.
  3. Main Loop: Retains original logic; the branch condition is evaluated as usual. On fetch, the frontend resolves the outcome via the LUT, guaranteeing correct prediction where pre-computed data is available.

The hardware’s branch fetch logic first consults the LUT for a software-provided outcome; if valid, the fetch path is set with zero misprediction. Absent a valid software hint, the system falls back to conventional branch prediction (Goudarzi et al., 2023).
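The three steps above can be sketched end to end for a K = 4 neighbour loop of the kill_neighbours style. The exact SPEC 2017 Leela code is not reproduced here; the loop shape, names, and software-modeled LUT are assumptions, and hints_used simply counts how many branch fetches a real frontend would have resolved from the LUT:

```c
#include <stdint.h>

#define K 4                       /* inner loop trip count */

static uint8_t lut_outcome[K];    /* software model of the per-channel LUT */
static uint8_t lut_valid[K];
static int hints_used;            /* fetches resolved by a software hint */

static void run_kill_neighbours(const int *board, const int *nb, int victim,
                                int *captures) {
    /* Steps 1-2: BOSS_open (implicit here) + pre-execute loop computing the
     * address and comparison for each dynamic branch instance. */
    for (int i = 0; i < K; i++) {
        lut_outcome[i] = (board[nb[i]] == victim);
        lut_valid[i]   = 1;
    }

    /* Step 3: main loop retains its original logic; the branch condition is
     * still evaluated for correctness, while the (modeled) frontend resolves
     * its direction from the LUT first. */
    for (int i = 0; i < K; i++) {
        if (lut_valid[i]) { hints_used++; lut_valid[i] = 0; }
        if (board[nb[i]] == victim)
            (*captures)++;
    }
}
```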

7. Performance Evaluation, Limitations, and Extensions

  • Experimental Results: On full-system gem5 simulations (ARM-like 8-issue OoO core, TAGE-SC-L-64KB predictor), BOSS achieves:
    • Up to 95% reduction in MPKI at LDB sites (21% on average).
    • Up to 3× improvement in IPC (23% on average).
    • End-to-end speedups of up to 39% in tight hot spots (7% on average).
    • Manual unrolling and vectorization reduce pre-execute overhead by up to 40%.
  • Trade-Offs: BOSS is most effective when the branch’s misprediction rate exceeds 10 MPKI; for low-mispredict or small trip count cases, overhead amortization is less favorable.
  • Constraints:
    • Insufficient pre-execution lead-time in extremely small trip-count loops (K < 4).
    • Compiler support currently only for single (not nested) inner loop branches.
    • Back-slice code overhead is always incurred; runtime skipping of cold branches is feasible with profiling, but dynamic gating on observed misprediction rate is not supported.
  • Extensions: The framework supports partial iteration-space coverage, reuse of branch bits across iterations, and cross-loop branch correlation. All extensions are safe due to the hint-only, non-invasive software/hardware contract (Goudarzi et al., 2023).
References (1)
