Tree-Based Invariant Kernels (TBIK)

Updated 25 November 2025
  • Tree-Based Invariant Kernels are deterministic computational primitives that eliminate floating-point nondeterminism by enforcing a fixed binary-tree reduction order.
  • They use custom Triton kernels and CUDA-based tree all-reduce steps to maintain bit-identical outputs regardless of tensor parallelism size.
  • Empirical evaluations demonstrate TBIK’s ability to ensure reproducibility in reinforcement learning and distributed LLM settings while managing overheads.

Tree-Based Invariant Kernels (TBIK) are deterministic computational primitives designed to guarantee bit-identical results for matrix multiplication and reduction operations in LLM frameworks utilizing tensor parallelism (TP), regardless of the number of GPUs involved. They address floating-point non-determinism inherent in distributed inference, especially critical in reinforcement learning (RL) contexts and any setting demanding exact reproducibility across variable system configurations. TBIK achieves this determinism through a unified, hierarchical binary-tree reduction pattern, aligning intra- and inter-GPU computations to eliminate variability arising from the non-associativity of floating-point addition (Zhang et al., 21 Nov 2025).

1. Motivation: Determinism and the Floating-Point Reduction Problem

LLM serving infrastructures frequently employ tensor parallelism to distribute model weights and computations over multiple GPUs. Traditional reduction operations in TP—such as the summation of partial matrix products for row-parallel linear layers—follow hardware or software-imposed orders that inherently vary with the parallelism degree (i.e., number or arrangement of GPUs). Because floating-point addition is non-associative (IEEE 754), sum results depend on the order of operation, leading to nondeterministic outputs when system configurations change, even with fixed random seeds and greedy decoding. In RL workflows, this inconsistency can create critical mismatches between the rollout (typically with TP>1, e.g., in vLLM) and training (commonly FSDP with TP=1) phases, degrading on-policy learning stability and potentially causing divergence or collapse (Zhang et al., 21 Nov 2025).
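The root cause is easy to reproduce. In the minimal float32 example below (values chosen purely for illustration), the same three numbers summed under two different groupings give different results:

```python
import numpy as np

# Non-associativity of IEEE 754 addition: the same three float32 values,
# grouped two different ways, round to different results.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(0.1)

left = (a + b) + c    # one possible reduction order
right = a + (b + c)   # the order a different TP layout might induce

print(left, right)    # 0.1 vs 0.0
print(left == right)  # False
```

A distributed sum over thousands of partial products amplifies exactly this effect whenever the reduction order changes with the GPU count.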

Conventional batch-invariant operations (BIO) address non-determinism along the batch dimension but do not resolve the variation induced by changing the TP size. TBIK completes this determinism stack by enforcing a globally consistent reduction order that is independent of TP size.

2. Mathematical Structure of Tree-Based Invariant Kernels

Let $A \in \mathbb{R}^{M \times K}$ and $B \in \mathbb{R}^{K \times N}$, with the reduction dimension $K$ partitioned into $T$ tiles. The standard distributed MatMul under TP with $P$ GPUs assigns blocks of $A$ and $B$ to each device, computes a local partial sum $L_d$ on device $d$, and then combines the partial sums with an All-Reduce:

$$C = \sum_{d=0}^{P-1} L_d$$

TBIK introduces a reduction operator $T(\cdot)$ implementing pairwise floating-point addition ($\oplus$) through a perfectly balanced binary tree:

  • For $N$ tiles $k_1, \ldots, k_N$, $T(k_1, \ldots, k_N)$ is defined recursively:
    • $T(k_1, k_2) = k_1 \oplus k_2$
    • $T(k_1, \ldots, k_{2^t}) = T\bigl(T(k_1, \ldots, k_{2^{t-1}}),\, T(k_{2^{t-1}+1}, \ldots, k_{2^t})\bigr)$ for $t > 1$.

This reduction is performed both (a) intra-GPU, across the tiles local to each device, and (b) inter-GPU, via a tree-based All-Reduce across devices. By partitioning tiles into contiguous blocks and enforcing the identical binary-tree ordering at every stage, the final result $C$ is provably invariant to the TP size $P$. The reduction order is determined entirely by the binary tree structure, removing hardware- and software-induced variation (Zhang et al., 21 Nov 2025).
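The recursion can be transcribed directly. The sketch below (NumPy, assuming a power-of-two tile count, with random matrices standing in for the tile-level partial products) implements $T(\cdot)$ exactly as defined above:

```python
import numpy as np

def T(tiles):
    """Perfectly balanced binary-tree reduction T(k_1, ..., k_{2^t})."""
    n = len(tiles)
    if n == 1:
        return tiles[0]
    if n == 2:
        return tiles[0] + tiles[1]                   # T(k1, k2) = k1 ⊕ k2
    assert n % 2 == 0, "sketch assumes a power-of-two tile count"
    return T(tiles[: n // 2]) + T(tiles[n // 2:])    # combine the two half-trees

# Eight stand-in "tiles" (partial products along the K dimension).
rng = np.random.default_rng(0)
tiles = [rng.standard_normal((4, 4)).astype(np.float32) for _ in range(8)]
C_tree = T(tiles)
```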

3. Algorithmic Realization and Pseudocode

TBIK is implemented in two algorithmic stages:

Intra-GPU Tree-Reduce (MatMul):

  • Each GPU further subdivides its assigned tiles and accumulates partial sums in an array of per-level accumulators following the tree-reduction pattern. Carry logic keeps the tree balanced even when the tile count is not a power of two. The kernel flow is: load a tile, accumulate it at the current level, propagate carries up the tree, and finalize the partial sum (a minimal sketch of the carry scheme follows).
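The carry logic can be pictured as a binary counter over accumulator levels: each incoming tile is merged at level 0, and whenever a level is already occupied the two partial sums are combined and carried upward. The NumPy sketch below illustrates only this ordering; the actual kernel performs it over tiles held in registers and shared memory inside Triton:

```python
import numpy as np

def tree_accumulate(tiles):
    """Accumulate tile partial sums with binary-counter carry logic.

    levels[t] holds a pending partial sum of 2**t consecutive tiles; merging
    two such sums yields a sum of 2**(t+1) tiles that is carried upward. The
    combine order depends only on the tile indices, not on how tiles arrive.
    """
    levels = []
    for tile in tiles:
        carry, t = np.float32(tile), 0
        while t < len(levels) and levels[t] is not None:
            carry = levels[t] + carry          # merge two sums of 2**t tiles
            levels[t] = None
            t += 1
        if t == len(levels):
            levels.append(carry)
        else:
            levels[t] = carry
    # Finalize: fold leftover levels (handles non-power-of-two tile counts).
    total = None
    for partial in levels:
        if partial is not None:
            total = partial if total is None else partial + total
    return total

print(tree_accumulate([1e8, 1.0, -1e8, 0.5, 0.25]))
```

For a power-of-two tile count this reproduces exactly the balanced tree $T(\cdot)$ of Section 2; the finalize step fixes one deterministic order for any leftover levels.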

Inter-GPU Tree All-Reduce:

  • Instead of the typical ring-based reduction, a custom All-Gather is performed followed by a binary tree-structured summation over the partial results from all devices. This mimics the same deterministic tree pattern used locally.
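A sketch of the collective's structure using stock torch.distributed primitives is shown below; the paper's implementation is custom CUDA, so this only illustrates the all-gather-then-tree pattern, which every rank executes identically:

```python
import torch
import torch.distributed as dist

def tree_all_reduce(local_partial: torch.Tensor) -> torch.Tensor:
    """Gather every rank's partial sum, then reduce them in a fixed tree order.

    After the all-gather each rank holds all partial sums and performs the
    identical sequence of pairwise additions, so the result is bit-identical
    on every rank and independent of any ring schedule.
    """
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(local_partial) for _ in range(world_size)]
    dist.all_gather(gathered, local_partial)

    while len(gathered) > 1:                          # balanced pairwise tree
        nxt = [gathered[i] + gathered[i + 1]
               for i in range(0, len(gathered) - 1, 2)]
        if len(gathered) % 2 == 1:
            nxt.append(gathered[-1])                  # odd leftover promoted unchanged
        gathered = nxt
    return gathered[0]
```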

This design ensures that, regardless of the batch size, prompt, or hardware configuration (assuming block size compatibility), the reduction order is identical at every level (Zhang et al., 21 Nov 2025).

4. Theoretical Guarantee of Tensor-Parallel Invariance

Theoretical correctness of TBIK’s invariance is established by induction:

Let $N$ be the total number of tiles, $P$ (a power of two) the TP size, and $L_d$ the contiguous block of tiles assigned to device $d$, so that $T(L_d)$ is that device's tree-reduced partial sum. The tree reduction satisfies

$$T(k_1, \ldots, k_N) = T\bigl(T(L_0), \ldots, T(L_{P-1})\bigr)$$

regardless of how intra- and inter-GPU reductions are grouped. The essential property is that the global reduction tree is invariant to the granularity at which tiles are initially grouped and processed, as long as all partial sums are combined through the same hierarchical binary tree. Since floating-point addition is deterministic under a fixed operation sequence, the output is bit-identical for any TP size (Zhang et al., 21 Nov 2025).
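The argument can be checked numerically. In the sketch below (NumPy, 16 random float32 tiles standing in for partial products), the global tree reduction is bit-identical to first tree-reducing each simulated device's block and then tree-reducing the per-device results, for every power-of-two $P$, whereas a plain left-to-right sum generally is not:

```python
import numpy as np

def T(xs):
    """Balanced binary-tree reduction (length assumed to be a power of two)."""
    if len(xs) == 1:
        return xs[0]
    h = len(xs) // 2
    return T(xs[:h]) + T(xs[h:])

rng = np.random.default_rng(0)
N = 16
tiles = [rng.standard_normal((2, 2)).astype(np.float32) for _ in range(N)]

global_tree = T(tiles)
for P in (1, 2, 4, 8, 16):                      # simulated TP sizes
    block = N // P
    per_device = [T(tiles[d * block:(d + 1) * block]) for d in range(P)]
    # Same additions in the same order => bit-identical for every P.
    assert np.array_equal(T(per_device), global_tree)

# A naive left-to-right sum usually differs in the last bits.
naive = tiles[0]
for t in tiles[1:]:
    naive = naive + t
print(np.array_equal(naive, global_tree))       # typically False
```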

5. Implementation in Modern LLM Frameworks

TBIK is realized as custom Triton kernels for intra-GPU tree-reduce MatMul and CUDA code for the tree-based All-Reduce collective. Non-power-of-two tile counts are handled via explicit parameterization (e.g., “K_first” carries). Integration with the vLLM and FSDP frameworks includes:

  • Uniform replacement of all row-parallel linear layers with TBIK-wrapped operations (a simplified sketch follows this list).
  • BIO kernels applied to column-parallel layers for batch-size invariance.
  • Strict matching of block sizes, precision, normalization, and attention implementations across frameworks.
  • Engineering accommodations such as shared memory management for accumulators and disabling optimizations (e.g., vLLM chunked prefill) that would introduce variance (Zhang et al., 21 Nov 2025).
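To make the row-parallel replacement concrete, the sketch below shows a hypothetical TBIK-wrapped layer in plain PyTorch. The class and helper names are illustrative rather than the paper's API, `torch.matmul` stands in for the Triton tree-reduce kernel, and only the fixed-order inter-GPU combination is spelled out:

```python
import torch
import torch.distributed as dist

def _tree_sum(parts):
    """Deterministic balanced binary-tree sum over a list of tensors."""
    while len(parts) > 1:
        nxt = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2 == 1:
            nxt.append(parts[-1])
        parts = nxt
    return parts[0]

class TBIKRowParallelLinear(torch.nn.Module):
    """Hypothetical TBIK-wrapped row-parallel linear layer (inference-only sketch).

    Each rank holds a [K/P, N] shard of the weight along the reduction
    dimension; inputs arrive already split along K.
    """
    def __init__(self, weight_shard: torch.Tensor):
        super().__init__()
        self.register_buffer("weight_shard", weight_shard)

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        # In the paper this matmul is the intra-GPU tree-reduce Triton kernel;
        # torch.matmul is only a placeholder in this sketch.
        local = torch.matmul(x_shard, self.weight_shard)
        world_size = dist.get_world_size()
        gathered = [torch.empty_like(local) for _ in range(world_size)]
        dist.all_gather(gathered, local)     # collect every rank's partial output
        return _tree_sum(gathered)           # fixed-order inter-GPU reduction
```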

6. Empirical Evaluation and Performance Characteristics

Experiments on benchmarks (AIME’24, AMC’23) and contemporary models (Qwen3-8B, Qwen3-32B, Mistral-7B Instruct, Llama-3.1-8B Instruct) under varying TP and batch sizes demonstrate:

  • Determinism Metrics:
    • Unique output sequences ($U$): $U = 1$ for all configurations with BIO+TBIK, versus $U \approx$ the number of configurations for vanilla BF16 and $U \approx 7$–$8$ for BIO alone when TP varies.
    • Probability divergence ($D$): $D = 0$ (bit-identical distributions) for BIO+TBIK; $D = 5$–$30 \times 10^{-3}$ otherwise.
  • RL Consistency:
    • With TBIK, token probability computations for rollouts in vLLM (TP > 1) and FSDP (TP = 1) match exactly, eliminating the previously observed gaps of $10^{-3}$–$10^{-2}$ per token seen with only BIO in place.
  • Overhead:
    • The Triton tree-reduce MatMul achieves $\sim 120$ TFLOPS in BF16 (63% of the cuBLAS peak).
    • Cumulative latency overhead (BIO+TBIK) ranges from 56–135% depending on sequence lengths, with the custom tree All-Reduce responsible for the largest share—especially in the absence of NVLink (Zhang et al., 21 Nov 2025).

Overhead is most pronounced in the unoptimized tree All-Reduce step. Performance optimization remains an active area of research.

7. Implications, Limitations, and Future Directions

By enforcing exact bit-wise equivalence on LLM outputs regardless of TP size, TBIK unlocks fully deterministic deployments and closes the training–inference gap in RL workflows, enabling true on-policy reinforcement learning even under framework or hardware heterogeneity. Deterministic evaluation—essential for audit, debugging, LLM-as-a-judge, and multi-agent systems—is now robust to the effects of floating-point drift (Zhang et al., 21 Nov 2025).

Future work involves:

  • Designing native tree All-Reduce collectives for high throughput;
  • Optimizing block sizes, shared memory layouts, and warp-level operations to approach cuBLAS efficiency;
  • Extending to low-bit quantized GEMMs to address nondeterminism from rounding and fused computations.

Fundamentally, TBIK establishes a practical, theoretically complete solution for removal of TP-induced floating-point nondeterminism, supporting rigorous scientific reproducibility in modern, distributed neural computation (Zhang et al., 21 Nov 2025).

References

  • Zhang et al., 21 Nov 2025.
