Cross Learning Block (CLB) Architecture
- CLB is an architectural enhancement for temporal decision modeling that uses pairwise masked cross-attention to explicitly model inter-sequence dependencies between state, action, and return-to-go streams.
- Its three-stream design replaces the traditional Decision Transformer approach, resulting in improved causal reasoning and a relative performance gain of up to 15.3% on the AuctionNet benchmark.
- By enforcing explicit attention flows and using Pre-LayerNorm with residual connections, CLB delivers robust and discriminative embeddings, essential for precision in auto-bidding and sequential decision-making tasks.
The Cross Learning Block (CLB) is an architectural enhancement for temporal decision modeling, introduced to address limitations in the vanilla Decision Transformer (DT). DT’s approach—concatenation of state, action, and return-to-go (RTG) sequences followed by self-attention—obscures sequence distinctions and fails to model explicit inter-sequence dependencies. CLB replaces standard DT Transformer blocks with modules that exchange information along explicit streams via pairwise masked cross-attention, yielding improved inter-sequence correlation modeling and discriminative embeddings. Originating in the C2 framework for generative auto-bidding, CLB has demonstrated substantial performance gains on the AuctionNet benchmark, especially when combined with constraint-aware objectives (Ding et al., 28 Jan 2026).
1. Structural Architecture
In place of a single self-attention flow over a concatenated input, the CLB is structured as a three-stream block, maintaining separate representations for state ($s$), action ($a$), and return-to-go ($\hat{R}$) sequences throughout the model stack. For a trajectory segment of length $K$, the prepared inputs are the embedded sequences $X^s = (s_1, \dots, s_K)$, $X^a = (a_1, \dots, a_K)$, and $X^R = (\hat{R}_1, \dots, \hat{R}_K)$.
A shared positional embedding $E_{\mathrm{pos}}$ is applied to each sequence:
- $X^s \leftarrow X^s + E_{\mathrm{pos}}$, $X^a \leftarrow X^a + E_{\mathrm{pos}}$, $X^R \leftarrow X^R + E_{\mathrm{pos}}$
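As a concrete illustration, this stream preparation can be sketched in NumPy. The raw dimensionalities, weight initializations, and variable names below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 20, 64        # trajectory length and hidden dimension (illustrative)
d_s, d_a = 16, 4     # raw state/action dimensionalities (hypothetical)

# Raw trajectory segment: states, actions, returns-to-go.
states = rng.normal(size=(K, d_s))
actions = rng.normal(size=(K, d_a))
rtg = rng.normal(size=(K, 1))

# Per-modality linear embeddings into the shared hidden space.
W_s = rng.normal(size=(d_s, d)) * 0.02
W_a = rng.normal(size=(d_a, d)) * 0.02
W_R = rng.normal(size=(1, d)) * 0.02
E_pos = rng.normal(size=(K, d)) * 0.02   # shared learned positional table

# Three separate streams, each offset by the same positional embedding.
X_s = states @ W_s + E_pos
X_a = actions @ W_a + E_pos
X_R = rtg @ W_R + E_pos
```

Keeping the three streams as separate $(K, d)$ arrays, rather than interleaving them into one $(3K, d)$ sequence as vanilla DT does, is what makes the pairwise cross-attention below possible.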
These streams are updated in parallel through a sequence of Cross Learning Blocks. Each block executes two pairwise masked cross-attention operations for each stream, followed by Pre-LayerNorm, residual connections, and a position-wise feed-forward (FF) module.
CLB Stream Update Example:
- For the action stream at block $\ell$:
  $\tilde{X}^a_\ell = X^a_\ell + \mathrm{CA}\big(\mathrm{LN}(X^a_\ell),\, X^s_\ell\big) + \mathrm{CA}\big(\mathrm{LN}(X^a_\ell),\, X^R_\ell\big)$
  $X^a_{\ell+1} = \tilde{X}^a_\ell + \mathrm{FF}\big(\mathrm{LN}(\tilde{X}^a_\ell)\big)$
Analogous formulations update $X^s_\ell$ and $X^R_\ell$ with attention from the other two streams.
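A minimal NumPy sketch of one action-stream update, assuming single-head attention and a lower-triangular causal mask; helper names such as `causal_cross_attn` are my own, not from the paper:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def causal_cross_attn(q_in, kv_in, W_Q, W_K, W_V):
    """Single-head masked cross-attention: queries from one stream,
    keys/values from another, restricted to non-future positions."""
    Q, Kmat, V = q_in @ W_Q, kv_in @ W_K, kv_in @ W_V
    scores = Q @ Kmat.T / np.sqrt(Q.shape[-1])
    scores = np.where(np.tril(np.ones(scores.shape, dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

rng = np.random.default_rng(1)
K, d = 10, 32
X_a, X_s, X_R = (rng.normal(size=(K, d)) for _ in range(3))
W = {name: rng.normal(size=(d, d)) * 0.05
     for name in ("Qs", "Ks", "Vs", "QR", "KR", "VR")}

# Pre-LN update: normalize, cross-attend to the other two streams, add residual.
h = layer_norm(X_a)
X_a_new = (X_a
           + causal_cross_attn(h, X_s, W["Qs"], W["Ks"], W["Vs"])
           + causal_cross_attn(h, X_R, W["QR"], W["KR"], W["VR"]))
# A Pre-LN position-wise feed-forward sub-block with its own residual follows.
```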
2. Mathematical Definition of Masked Cross-Attention
Each cross-attention operation is defined as follows. For a query stream $X^q$ and a key/value stream $X^{kv}$, the projections are $Q = X^q W_Q$, $K = X^{kv} W_K$, $V = X^{kv} W_V$.
Parameters are $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$, with $d_k$ the attention head dimension. A binary mask enforces causality and valid cross-stream communication. The masked cross-attention is:
$\mathrm{CA}(X^q, X^{kv}) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$, where the additive mask $M$ is $0$ at permitted positions and $-\infty$ elsewhere.
Each stream update applies two such cross-attention operations, one per other stream. The feed-forward sub-block is a two-layer ReLU MLP with dropout.
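To make the mask's effect concrete, the following NumPy snippet (a sketch under the single-head, lower-triangular assumption) checks that masked positions receive zero attention weight:

```python
import numpy as np

rng = np.random.default_rng(3)
K, d_k = 5, 8
Q = rng.normal(size=(K, d_k))
Kmat = rng.normal(size=(K, d_k))

# Additive mask: 0 where attention is permitted, -inf where it is not.
M = np.where(np.tril(np.ones((K, K), dtype=bool)), 0.0, -np.inf)

scores = Q @ Kmat.T / np.sqrt(d_k) + M
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)

# Causality check: no weight on future key positions, rows sum to one.
assert np.allclose(np.triu(w, k=1), 0.0)
assert np.allclose(w.sum(-1), 1.0)
```

Subtracting the row-wise maximum before exponentiating is the standard numerically stable softmax; the $-\infty$ entries become exact zeros after exponentiation.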
3. Inter-Stream Correlation Modeling
CLB enforces explicit pairwise cross-attention across state, action, and RTG, in contrast to the vanilla DT approach:
- Action stream attends to the state stream ($X^s$) and the RTG stream ($X^R$)
- State stream attends to the action stream ($X^a$) and the RTG stream ($X^R$)
- RTG stream attends to the state stream ($X^s$) and the action stream ($X^a$)
This design compels the model to learn explicit causal dependencies, such as how state and RTG modulate action selection or how actions shape future states and returns. Empirically, shuffling the state sequence in C2 drops the cosine similarity of CLB-learned embeddings by over 50% (mean similarity 0.36) compared to DT embeddings (0.74), indicating more discriminative representations (Ding et al., 28 Jan 2026).
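The shuffling probe can be sketched with a toy position-sensitive encoder; this stand-in model and all names here are illustrative assumptions, not the paper's CLB or DT encoders:

```python
import numpy as np

rng = np.random.default_rng(2)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

K, d_s, d = 20, 8, 16
states = rng.normal(size=(K, d_s))
E_pos = rng.normal(size=(K, d_s))
W = rng.normal(size=(d_s, d))

def encode(seq):
    # Stand-in encoder: positional offsets, nonlinearity, mean pooling.
    # Purely illustrative; the paper probes CLB- vs DT-learned embeddings.
    return np.tanh((seq + E_pos) @ W).mean(axis=0)

z = encode(states)
z_shuffled = encode(states[rng.permutation(K)])

# An encoder that captures ordering places the shuffled trajectory
# measurably far from the original; an order-blind one would score ~1.0.
sim = cosine(z, z_shuffled)
```

The probe logic mirrors the reported experiment: lower cosine similarity between original and shuffled inputs indicates embeddings that actually depend on sequence structure.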
4. Implementation Specifications
CLB implementation departs from standard Transformer details as follows:
- Hidden dimension $d$; attention head dimension $d_k$
- Feed-forward dimension $d_{\mathrm{ff}}$ (commonly $4d$ in Transformer practice)
- Number of CLB layers $L$
- Single-head cross-attention (extendable to multi-head)
- Pre-LayerNorm (LN→Attn→add→LN→FF→add) for improved training stability
- Residual connections after each attention and FF module: $x \leftarrow x + \mathrm{Sublayer}(\mathrm{LN}(x))$
- Dropout rate: $0.1$ after FF’s second linear layer
- Learned positional encodings via a trainable embedding table $E_{\mathrm{pos}}$ (no sinusoidal encodings)
- Attention masks prohibit illegal sequence positions, enforcing causality and appropriate cross-stream alignment
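The Pre-LN ordering, Xavier initialization, and FF/dropout conventions above can be sketched together in NumPy; $d_{\mathrm{ff}} = 4d$ is a common Transformer convention assumed here for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
d, d_ff, K = 32, 128, 10   # d_ff = 4*d, a common convention (assumed)

def xavier_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

W1, W2 = xavier_uniform(d, d_ff), xavier_uniform(d_ff, d)

def feed_forward(x, drop=0.1, train=False):
    h = np.maximum(x @ W1, 0.0) @ W2           # two-layer ReLU MLP
    if train:                                  # dropout after the second linear
        h *= (rng.random(h.shape) > drop) / (1 - drop)
    return h

def pre_ln_sublayer(x, sublayer):
    # Pre-LN ordering: normalize first, then sublayer, then residual add.
    return x + sublayer(layer_norm(x))

x = rng.normal(size=(K, d))
y = pre_ln_sublayer(x, feed_forward)
```

The same `pre_ln_sublayer` wrapper applies to the cross-attention sub-blocks, giving the LN→Attn→add→LN→FF→add ordering listed above.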
5. Empirical Performance and Ablation Findings
Ablation studies on AuctionNet (100% budget setting) quantify the contribution of each component. The following table summarizes the key results:
| Model | Score | Relative Gain |
|---|---|---|
| DT | 33.3 | — |
| +CL | 35.7 | +7.2% |
| +CLB | 36.7 | +10.2% |
| +CLB + CL | 38.4 | +15.3% |
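The relative-gain column can be verified directly from the reported scores:

```python
# Reproduce the relative-gain column of the ablation table.
scores = {"DT": 33.3, "+CL": 35.7, "+CLB": 36.7, "+CLB + CL": 38.4}
base = scores["DT"]
gains = {name: round(100.0 * (s - base) / base, 1)
         for name, s in scores.items() if name != "DT"}
print(gains)  # {'+CL': 7.2, '+CLB': 10.2, '+CLB + CL': 15.3}
```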
Replacing standard Transformer blocks with CLB results in a 10.2% relative uplift over vanilla DT. The full C2 model, combining CLB and constraint-aware loss, yields a 15.3% improvement (Ding et al., 28 Jan 2026). Performance gains are most pronounced under standard budget settings, though CLB also enhances robustness under severe budget constraints (50% or 150%) and in data-scarce or short-trajectory regimes.
6. Functional Insights and Practical Deployment
CLB’s superiority over vanilla attention arises from:
- Explicitly disentangled attention flows, which clarify the structural roles of states, actions, and RTG
- Dual cross-heads per stream, enabling targeted incorporation of relevant information and improved causal reasoning (e.g., direct modulation of action selection by remaining budget signals)
Recommended practical considerations:
- Use Pre-LN ordering for training stability
- Employ Xavier uniform initialization for the projection matrices $W_Q$, $W_K$, $W_V$
- Tune the depth $L$ and hidden dimension $d$ based on compute budget
- Implement masked cross-attention using native Transformer libraries with custom Q/K/V and mask inputs
By architecting information flow explicitly among state, action, and RTG streams, CLB addresses the correlation-modeling limitations of vanilla Decision Transformers, consistently delivering statistically significant performance improvements in auto-bidding and related sequential decision-making tasks (Ding et al., 28 Jan 2026).