Cross Learning Block (CLB) Architecture
- CLB is an architectural enhancement for temporal decision modeling that uses pairwise masked cross-attention to explicitly model inter-sequence dependencies between state, action, and return-to-go streams.
- Its three-stream design replaces the traditional Decision Transformer approach, resulting in improved causal reasoning and a relative performance gain of up to 15.3% on the AuctionNet benchmark.
- By enforcing explicit attention flows and using Pre-LayerNorm with residual connections, CLB delivers robust and discriminative embeddings, essential for precision in auto-bidding and sequential decision-making tasks.
The Cross Learning Block (CLB) is an architectural enhancement for temporal decision modeling, introduced to address limitations in the vanilla Decision Transformer (DT). DT’s approach—concatenation of state, action, and return-to-go (RTG) sequences followed by self-attention—obscures sequence distinctions and fails to model explicit inter-sequence dependencies. CLB replaces standard DT Transformer blocks with modules that exchange information along explicit streams via pairwise masked cross-attention, yielding improved inter-sequence correlation modeling and discriminative embeddings. Originating in the C2 framework for generative auto-bidding, CLB has demonstrated substantial performance gains on the AuctionNet benchmark, especially when combined with constraint-aware objectives (Ding et al., 28 Jan 2026).
1. Structural Architecture
In place of a single self-attention flow over a concatenated input, the CLB is structured as a three-stream block, maintaining separate representations for state ($s$), action ($a$), and return-to-go ($\hat{R}$) sequences throughout the model stack. For a trajectory segment of length $K$, the prepared inputs are the embedded sequences $X^s = (s_1, \dots, s_K)$, $X^a = (a_1, \dots, a_K)$, and $X^R = (\hat{R}_1, \dots, \hat{R}_K)$.
A shared positional embedding $E_{\mathrm{pos}}$ is applied to each sequence:
- $X^s \leftarrow X^s + E_{\mathrm{pos}}$, $X^a \leftarrow X^a + E_{\mathrm{pos}}$, $X^R \leftarrow X^R + E_{\mathrm{pos}}$
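As a concrete illustration, this stream preparation can be sketched in NumPy. The raw dimensionalities, weight initializations, and variable names below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 20, 64        # trajectory length and hidden dimension (illustrative)
d_s, d_a = 16, 4     # raw state/action dimensionalities (hypothetical)

# Raw trajectory segment: states, actions, returns-to-go.
states = rng.normal(size=(K, d_s))
actions = rng.normal(size=(K, d_a))
rtg = rng.normal(size=(K, 1))

# Per-modality linear embeddings into the shared hidden space.
W_s = rng.normal(size=(d_s, d)) * 0.02
W_a = rng.normal(size=(d_a, d)) * 0.02
W_R = rng.normal(size=(1, d)) * 0.02
E_pos = rng.normal(size=(K, d)) * 0.02   # shared learned positional table

# Three separate streams, each offset by the same positional embedding.
X_s = states @ W_s + E_pos
X_a = actions @ W_a + E_pos
X_R = rtg @ W_R + E_pos
```

Keeping the three streams as separate $(K, d)$ arrays, rather than interleaving them into one $(3K, d)$ sequence as vanilla DT does, is what makes the pairwise cross-attention below possible.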
These streams are updated in parallel through a sequence of Cross Learning Blocks. Each block executes two pairwise masked cross-attention operations for each stream, followed by Pre-LayerNorm, residual connections, and a position-wise feed-forward (FF) module.
CLB Stream Update Example:
- For the action stream at block $\ell$:
  $\tilde{X}^a_\ell = X^a_\ell + \mathrm{CA}\big(\mathrm{LN}(X^a_\ell),\, X^s_\ell\big) + \mathrm{CA}\big(\mathrm{LN}(X^a_\ell),\, X^R_\ell\big)$
  $X^a_{\ell+1} = \tilde{X}^a_\ell + \mathrm{FF}\big(\mathrm{LN}(\tilde{X}^a_\ell)\big)$
Analogous formulations update $X^s_\ell$ and $X^R_\ell$ with attention from the other two streams.
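A minimal NumPy sketch of one action-stream update, assuming single-head attention and a lower-triangular causal mask; helper names such as `causal_cross_attn` are my own, not from the paper:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def causal_cross_attn(q_in, kv_in, W_Q, W_K, W_V):
    """Single-head masked cross-attention: queries from one stream,
    keys/values from another, restricted to non-future positions."""
    Q, Kmat, V = q_in @ W_Q, kv_in @ W_K, kv_in @ W_V
    scores = Q @ Kmat.T / np.sqrt(Q.shape[-1])
    scores = np.where(np.tril(np.ones(scores.shape, dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

rng = np.random.default_rng(1)
K, d = 10, 32
X_a, X_s, X_R = (rng.normal(size=(K, d)) for _ in range(3))
W = {name: rng.normal(size=(d, d)) * 0.05
     for name in ("Qs", "Ks", "Vs", "QR", "KR", "VR")}

# Pre-LN update: normalize, cross-attend to the other two streams, add residual.
h = layer_norm(X_a)
X_a_new = (X_a
           + causal_cross_attn(h, X_s, W["Qs"], W["Ks"], W["Vs"])
           + causal_cross_attn(h, X_R, W["QR"], W["KR"], W["VR"]))
# A Pre-LN position-wise feed-forward sub-block with its own residual follows.
```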
2. Mathematical Definition of Masked Cross-Attention
Each cross-attention operation is defined as follows. For a query stream $X^q$ and a key/value stream $X^{kv}$, the projections are $Q = X^q W_Q$, $K = X^{kv} W_K$, $V = X^{kv} W_V$.
Parameters are $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$, with $d_k$ the attention head dimension. A binary mask enforces causality and valid cross-stream communication. The masked cross-attention is:
$\mathrm{CA}(X^q, X^{kv}) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$, where the additive mask $M$ is $0$ at permitted positions and $-\infty$ elsewhere.
Each stream update applies two such cross-attention operations, one per other stream. The feed-forward sub-block is a two-layer ReLU MLP with dropout.
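To make the mask's effect concrete, the following NumPy snippet (a sketch under the single-head, lower-triangular assumption) checks that masked positions receive zero attention weight:

```python
import numpy as np

rng = np.random.default_rng(3)
K, d_k = 5, 8
Q = rng.normal(size=(K, d_k))
Kmat = rng.normal(size=(K, d_k))

# Additive mask: 0 where attention is permitted, -inf where it is not.
M = np.where(np.tril(np.ones((K, K), dtype=bool)), 0.0, -np.inf)

scores = Q @ Kmat.T / np.sqrt(d_k) + M
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)

# Causality check: no weight on future key positions, rows sum to one.
assert np.allclose(np.triu(w, k=1), 0.0)
assert np.allclose(w.sum(-1), 1.0)
```

Subtracting the row-wise maximum before exponentiating is the standard numerically stable softmax; the $-\infty$ entries become exact zeros after exponentiation.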
3. Inter-Stream Correlation Modeling
CLB enforces explicit pairwise cross-attention across state, action, and RTG, in contrast to the vanilla DT approach:
- Action stream attends to the state stream ($X^s$) and the RTG stream ($X^R$)
- State stream attends to the action stream ($X^a$) and the RTG stream ($X^R$)
- RTG stream attends to the state stream ($X^s$) and the action stream ($X^a$)
This design compels the model to learn explicit causal dependencies, such as how state and RTG modulate action selection or how actions shape future states and returns. Empirically, shuffling the state sequence in C2 drops the cosine similarity of CLB-learned embeddings by over 50% (mean similarity 0.36) compared to DT embeddings (0.74), indicating more discriminative representations (Ding et al., 28 Jan 2026).
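The shuffling probe can be sketched with a toy position-sensitive encoder; this stand-in model and all names here are illustrative assumptions, not the paper's CLB or DT encoders:

```python
import numpy as np

rng = np.random.default_rng(2)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

K, d_s, d = 20, 8, 16
states = rng.normal(size=(K, d_s))
E_pos = rng.normal(size=(K, d_s))
W = rng.normal(size=(d_s, d))

def encode(seq):
    # Stand-in encoder: positional offsets, nonlinearity, mean pooling.
    # Purely illustrative; the paper probes CLB- vs DT-learned embeddings.
    return np.tanh((seq + E_pos) @ W).mean(axis=0)

z = encode(states)
z_shuffled = encode(states[rng.permutation(K)])

# An encoder that captures ordering places the shuffled trajectory
# measurably far from the original; an order-blind one would score ~1.0.
sim = cosine(z, z_shuffled)
```

The probe logic mirrors the reported experiment: lower cosine similarity between original and shuffled inputs indicates embeddings that actually depend on sequence structure.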
4. Implementation Specifications
CLB implementation departs from standard Transformer details as follows:
- Hidden dimension $d$; attention head dimension $d_k$
- Feed-forward dimension $d_{\mathrm{ff}}$ (commonly $4d$ in Transformer practice)
- Number of CLB layers $L$
- Single-head cross-attention (extendable to multi-head)
- Pre-LayerNorm (LN→Attn→add→LN→FF→add) for improved training stability
- Residual connections after each attention and FF module: $x \leftarrow x + \mathrm{Sublayer}(\mathrm{LN}(x))$
- Dropout rate: $0.1$ after FF’s second linear layer
- Learned positional encodings via a trainable embedding table $E_{\mathrm{pos}}$ (no sinusoidal encodings)
- Attention masks prohibit illegal sequence positions, enforcing causality and appropriate cross-stream alignment
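The Pre-LN ordering, Xavier initialization, and FF/dropout conventions above can be sketched together in NumPy; $d_{\mathrm{ff}} = 4d$ is a common Transformer convention assumed here for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
d, d_ff, K = 32, 128, 10   # d_ff = 4*d, a common convention (assumed)

def xavier_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

W1, W2 = xavier_uniform(d, d_ff), xavier_uniform(d_ff, d)

def feed_forward(x, drop=0.1, train=False):
    h = np.maximum(x @ W1, 0.0) @ W2           # two-layer ReLU MLP
    if train:                                  # dropout after the second linear
        h *= (rng.random(h.shape) > drop) / (1 - drop)
    return h

def pre_ln_sublayer(x, sublayer):
    # Pre-LN ordering: normalize first, then sublayer, then residual add.
    return x + sublayer(layer_norm(x))

x = rng.normal(size=(K, d))
y = pre_ln_sublayer(x, feed_forward)
```

The same `pre_ln_sublayer` wrapper applies to the cross-attention sub-blocks, giving the LN→Attn→add→LN→FF→add ordering listed above.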
5. Empirical Performance and Ablation Findings
Ablation studies on AuctionNet (100% budget setting) quantify the contribution of each component. The following table summarizes the key results:
| Model | Score | Relative Gain |
|---|---|---|
| DT | 33.3 | — |
| +CL | 35.7 | +7.2% |
| +CLB | 36.7 | +10.2% |
| +CLB + CL | 38.4 | +15.3% |
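The relative-gain column can be verified directly from the reported scores:

```python
# Reproduce the relative-gain column of the ablation table.
scores = {"DT": 33.3, "+CL": 35.7, "+CLB": 36.7, "+CLB + CL": 38.4}
base = scores["DT"]
gains = {name: round(100.0 * (s - base) / base, 1)
         for name, s in scores.items() if name != "DT"}
print(gains)  # {'+CL': 7.2, '+CLB': 10.2, '+CLB + CL': 15.3}
```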
Replacing standard Transformer blocks with CLB results in a 10.2% relative uplift over vanilla DT. The full C2 model, combining CLB and constraint-aware loss, yields a 15.3% improvement (Ding et al., 28 Jan 2026). Performance gains are most pronounced under standard budget settings, though CLB also enhances robustness under severe budget constraints (50% or 150%) and in data-scarce or short-trajectory regimes.
6. Functional Insights and Practical Deployment
CLB’s superiority over vanilla attention arises from:
- Explicitly disentangled attention flows, which clarify the structural roles of states, actions, and RTG
- Dual cross-heads per stream, enabling targeted incorporation of relevant information and improved causal reasoning (e.g., direct modulation of action selection by remaining budget signals)
Recommended practical considerations:
- Use Pre-LN ordering for training stability
- Employ Xavier uniform initialization for the projection matrices $W_Q$, $W_K$, $W_V$
- Tune the depth $L$ and hidden dimension $d$ based on compute budget
- Implement masked cross-attention using native Transformer libraries with custom Q/K/V and mask inputs
By architecting information flow explicitly among state, action, and RTG streams, CLB addresses the correlation-modeling limitations of vanilla Decision Transformers, consistently delivering statistically significant performance improvements in auto-bidding and related sequential decision-making tasks (Ding et al., 28 Jan 2026).