Asymmetric Bidirectional Context

Updated 4 July 2026

Asymmetric bidirectional context is a design principle that treats forward and backward information flows as separate, non-equivalent channels.
It enhances efficiency and robustness in applications like language modeling, graph embedding, and control systems by preserving distinct directional cues.
Empirical studies show that using asymmetric mechanisms can improve performance metrics in classification, clustering, and secure communications while addressing trade-offs in stability.

Asymmetric bidirectional context, as the term is used across recent literature, denotes a regime in which dependencies in both directions are retained but are not modeled, weighted, or operationalized symmetrically. Rather than collapsing two-way information into a single reciprocal channel, such methods split forward and backward structure into distinct distributions, masks, controllers, heuristics, or causal scores. This pattern appears in network embedding, language-model pre-training, diffusion language modeling, sampling-based planning, multimodal stereo, vehicle platoons, secure relaying, clock synchronization, and spatio-temporal causality, where the technical objective is to exploit bidirectional information without sacrificing asymmetry, efficiency, robustness, or identifiability (Shen et al., 2021, Artetxe et al., 2022, Chen et al., 26 Jun 2026).

1. Core concept and formal distinctions

A common misconception is that bidirectionality is equivalent to full symmetry. Several papers separate these notions explicitly. In language-model pre-training, bidirectional context and bidirectional attention are distinct controls: the former concerns which tokens are predicted using left and right evidence, while the latter concerns which positions may attend to one another. Artetxe et al. parameterize this with a bidirectional prefix length $b$, a mask count $m$, and a predict window $p$, so that a single framework recovers GPT-style next-token models, BERT/RoBERTa-style masked models, CM3-style hybrids, and prefix-LM variants (Artetxe et al., 2022).

In graph representation learning, the split is between asymmetric structural distributions rather than attention masks. BiGRW defines a forward distribution

$P_f(v\mid u)=\tilde W^{(k)}_{u,v}$

and a backward distribution

$P_b(v\mid u)=\tilde W^{(k)}_{v,u},$

where $W^{(k)}=\sum_{\ell=1}^k \alpha_\ell A^\ell$ is a weighted $k$-step transition matrix. The central claim is that the probability of reaching $v$ from $u$ and the probability of reaching $u$ from $v$ should be learned separately rather than fused into one symmetric context (Shen et al., 2021).
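The two distributions can be illustrated on a toy directed graph. This is a minimal sketch, not BiGRW's implementation: the graph, the weights `alphas`, the use of a row-normalized adjacency matrix as $A$, and the reading of $\tilde W^{(k)}$ as "normalize so that each conditional sums to one" are all assumptions made for illustration.

```python
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def row_normalize(m):
    out = []
    for row in m:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def weighted_k_step(transition, alphas):
    """W^(k) = sum over l = 1..k of alphas[l-1] * A^l."""
    n = len(transition)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in transition]  # A^1
    for step, alpha in enumerate(alphas):
        if step > 0:
            power = matmul(power, transition)  # A^(step+1)
        for i in range(n):
            for j in range(n):
                total[i][j] += alpha * power[i][j]
    return total

def forward_backward(adjacency, alphas):
    """Return (P_f, P_b) with P_f[u][v] from W[u][v] and P_b[u][v] from W[v][u]."""
    w = weighted_k_step(row_normalize(adjacency), alphas)
    p_forward = row_normalize(w)              # assumed normalization
    p_backward = row_normalize(transpose(w))  # assumed normalization
    return p_forward, p_backward

# Toy directed graph (hypothetical): 0->1, 0->2, 1->2, 2->0
adjacency = [[0, 1, 1], [0, 0, 1], [1, 0, 0]]
p_f, p_b = forward_backward(adjacency, alphas=[0.5, 0.5])
```

On this graph the forward and backward conditionals for node 0 differ, which is the directional information a single symmetric context would average away.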

In diffusion LLMs, the distinction becomes architectural. R2LM assigns left context to standard causal attention and right context to a separate reverse Mamba SSM sidecar. The resulting information flow is bidirectional, but the mechanisms are deliberately non-isomorphic: full-fidelity left context remains cacheable through standard KV caching, while right context is injected as a compressed residual that does not invalidate the prefix cache (Chen et al., 26 Jun 2026).

These formulations indicate that asymmetric bidirectional context is not a single algorithmic template. It is a design principle in which the two directions are both preserved and made non-equivalent.

2. Representation learning and pre-training formulations

BiGRW realizes asymmetric bidirectional context through separate forward and backward Skip-Gram objectives. Each node has one source embedding and two target embeddings, one for forward contexts and one for backward contexts. The model predicts the forward and backward context distributions from these embeddings and trains with two negative-sampling losses, one per direction, which are combined into a single objective. Its walk sampler draws the walk length at random, and a single parameter of that length distribution interpolates between short-walk, more BFS-style sampling and long-walk, more DFS-style sampling. BiGRW-AT extends the same factorization to attributed graphs by tying target embeddings to node attributes through trainable matrices (Shen et al., 2021).
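The two-loss structure can be sketched with a generic Skip-Gram negative-sampling loss applied once per direction. The function names, the embedding layout, and the unweighted sum of the two losses are assumptions for illustration; the source does not fix how BiGRW weights the two terms.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(source, positive, negatives):
    """Generic Skip-Gram negative-sampling loss for one (source, context) pair."""
    loss = -log_sigmoid(dot(source, positive))
    for negative in negatives:
        loss -= log_sigmoid(-dot(source, negative))
    return loss

def bidirectional_loss(source_emb, forward_targets, backward_targets,
                       u, v_forward, v_backward, negative_ids):
    """Forward loss plus backward loss for source node u (unweighted sum assumed)."""
    loss_f = negative_sampling_loss(
        source_emb[u], forward_targets[v_forward],
        [forward_targets[n] for n in negative_ids])
    loss_b = negative_sampling_loss(
        source_emb[u], backward_targets[v_backward],
        [backward_targets[n] for n in negative_ids])
    return loss_f + loss_b
```

The point of the sketch is only that the source embedding is shared while the forward and backward target tables are separate, so each direction receives its own gradient signal.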

The reported empirical effect is systematic rather than marginal. On node classification with 50% labeled data, Cora improves from DeepWalk 0.863 to BiGRW 0.903, Citeseer improves from node2vec 0.712 to BiGRW 0.780, and Cora with attributes improves from GAE 0.900 to BiGRW-AT 0.923. On clustering, measured by Purity and NMI, BiGRW improves over DeepWalk on Cora and over LINE on Citeseer, and BiGRW-AT improves over GAE on Cora-AT (Shen et al., 2021).

Artetxe et al. provide an analogous decomposition for token models. With positions indexed from 1 and $M_{ij}=1$ meaning that position $i$ may attend to position $j$, their attention mask can be written as

$M_{ij}=\begin{cases}1, & j\le \max(i,b),\\ 0, & \text{otherwise,}\end{cases}$

which creates a fully connected prefix block of length $b$ and a causal suffix. With suitable choices of $(b, m, p)$, the framework recovers NxtUni, NxtPre, MskBi, HybUni, and HybPre. The empirical trade-off is explicitly application-dependent: pure GPT-style next-token models win next-token perplexity and zero-shot priming, fully bidirectional masked models are best on single-token infilling and GLUE fine-tuning, HybPre is a middle ground, and these orderings remain consistent up to 6.7B parameters. A critical negative result is that switching from unidirectional attention in pre-training to bidirectional attention in fine-tuning, or the reverse, causes performance collapse (Artetxe et al., 2022).
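A prefix-plus-causal mask of the kind described above can be built in a few lines. This is a minimal sketch with 0-indexed positions; the function name `attention_mask` and the boolean representation are illustrative choices, not the authors' code.

```python
def attention_mask(n, b):
    """Boolean mask for a sequence of length n with a bidirectional prefix of length b.

    mask[i][j] is True when position i may attend to position j (0-indexed).
    Positions 0..b-1 attend to each other freely; later positions are causal.
    """
    return [[(j <= i) or (j < b) for j in range(n)] for i in range(n)]

causal = attention_mask(4, 0)         # b = 0: purely causal (lower triangular)
prefix_lm = attention_mask(4, 2)      # b = 2: 2-token bidirectional prefix
bidirectional = attention_mask(4, 4)  # b = n: every position sees every position
```

Setting `b = 0` gives the unidirectional case and `b = n` the fully bidirectional case, so the single prefix length covers the whole range of attention regimes discussed in the text.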

Taken together, these works formalize asymmetric bidirectionality as a separation of prediction targets, structural distributions, or attention regimes rather than a simple decision to “use both sides.”

3. Cache-preserving bidirectionality in parallel generation

R2LM was introduced to resolve a concrete systems dilemma in discrete diffusion LLMs. Fully bidirectional attention yields strong modeling quality, but it breaks prefix KV caching because keys and values depend on future tokens; causal attention preserves caching, but loses all right-side context. R2LM addresses this by augmenting a pretrained causal Transformer decoder with a lightweight reverse-direction Mamba SSM sidecar attached at a subset of decoder layers (Chen et al., 26 Jun 2026).

At each hooked layer, the backbone hidden state is reversed along the sequence dimension, processed by Mamba, flipped back, normalized, and injected through a gated residual. The gate is initialized so that the residual contributes nothing at the start of training, which makes the model initially bit-identical to the causal baseline. The sidecar does not insert keys or values into self-attention, so cached prefix keys and values remain valid. During denoising, the Transformer reuses prompt KV caches and the sidecar scans only the generation block, which preserves the per-step cost of the cached causal model rather than the higher cost incurred when the cache is invalidated (Chen et al., 26 Jun 2026).
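The reverse, scan, flip back, normalize, and gated-add sequence can be sketched in plain Python. This is not R2LM's code: `toy_scan` is a simple linear recurrence standing in for the Mamba SSM, RMS normalization is an assumed choice of normalizer, and a scalar `gate` is an assumed form of the gating.

```python
import math

def toy_scan(seq, decay=0.9):
    """Stand-in for the SSM: causal recurrence s_t = decay * s_{t-1} + x_t."""
    state = [0.0] * len(seq[0])
    out = []
    for x in seq:
        state = [decay * s + xi for s, xi in zip(state, x)]
        out.append(state)
    return out

def rms_norm(vec, eps=1e-6):
    scale = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / scale for v in vec]

def reverse_sidecar(hidden, gate):
    """Add gated right-to-left context to each position of `hidden`."""
    scanned = toy_scan(hidden[::-1])                   # scan the reversed sequence
    right_ctx = [rms_norm(v) for v in scanned[::-1]]   # flip back, then normalize
    return [[h + gate * r for h, r in zip(h_vec, r_vec)]
            for h_vec, r_vec in zip(hidden, right_ctx)]

hidden = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
unchanged = reverse_sidecar(hidden, gate=0.0)  # zero gate: output equals input
mixed = reverse_sidecar(hidden, gate=0.5)      # nonzero gate: right context flows left
```

With a zero gate the output equals the input exactly, which is the property that lets such a sidecar be attached to a pretrained causal model without changing its initial behavior. Because the sidecar only adds to the residual stream, nothing in this path touches attention keys or values.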

The throughput results quantify the systems advantage. On a single H100 with 32 steps, 128 generation tokens, and a fixed batch size, bidirectional dLLM throughput drops from 483 tok/s at the shorter reported prompt length to 53 tok/s at the longer one, whereas R2LM goes from 1,154 tok/s to 683 tok/s, which is about $2.4\times$ and $12.9\times$ the bidirectional throughput at those two prompt lengths. The causal dLLM is slightly faster, at 1,356 tok/s and 732 tok/s, but lacks the right-context mechanism. R2LM is also reported to be faster than an autoregressive Qwen3-1.7B baseline (Chen et al., 26 Jun 2026).

Quality does not reduce to a pure throughput trade-off. On seven multiple-choice tasks after 60B-token continued pretraining of Qwen3-1.7B, R2LM reaches 47.44% on long-target average versus 44.78% for bidirectional dLLM and 43.90% for causal dLLM, 48.40% on short-target average versus 51.65% and 46.35%, and 47.71% overall versus 46.74% and 44.60%. R2LM therefore leads on long targets and overall, while the bidirectional dLLM remains ahead on short targets. Parameter-matched ablations on Wikitext-103 compare the causal-only model (PPL = 178) with a 185M MLP adapter, a 185M left-to-right Mamba, and the 185M reverse Mamba sidecar, using the bidirectional model as the reference point (Chen et al., 26 Jun 2026).

This suggests that asymmetric bidirectional context can function as an efficiency-preserving decomposition: precision is retained on the left through causal attention, while useful but compressed right information is supplied through a non-attention path.

4. Search, planning, and multimodal perception

In AIT* and EIT*, asymmetry is procedural rather than representational. Both algorithms maintain a random geometric graph and grow a forward tree from the start and a reverse tree from the goal. The reverse tree is deliberately cheaper: it uses coarse collision checks or none at all to compute admissible cost-to-go heuristics, while the forward tree performs full collision checking and cost evaluation. The reverse search therefore informs the forward search continuously, but the two directions do not play the same role. Under the usual random geometric graph assumptions, including strong $\delta$-clearance and cost bounded by path length, AIT* and EIT* are almost-surely asymptotically optimal. Experiments on twelve problems show that EIT* outperforms baselines on obstacle-clearance objectives, where a priori heuristics are often ineffective, while AIT* matches or slightly outperforms BIT* on path length (Strub et al., 2021).
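The division of labor between a cheap reverse search and an expensive forward search can be sketched on a small fixed graph. This is a simplified illustration of the idea only, not AIT* or EIT*: it omits sampling, batching, and incremental repair, and the graph, the `is_valid` callback, and all function names are hypothetical.

```python
import heapq

def undirected(edge_list):
    graph = {}
    for a, b, cost in edge_list:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    return graph

def reverse_heuristic(graph, goal):
    """Cheap reverse search: Dijkstra from the goal with no collision checks."""
    dist = {goal: 0.0}
    queue = [(0.0, goal)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, cost in graph.get(node, []):
            if d + cost < dist.get(nbr, float("inf")):
                dist[nbr] = d + cost
                heapq.heappush(queue, (d + cost, nbr))
    return dist

def forward_search(graph, start, goal, is_valid, heuristic):
    """Expensive forward search: A* that validates every edge before using it."""
    best, parent = {start: 0.0}, {start: None}
    queue = [(heuristic.get(start, float("inf")), start)]
    while queue:
        _, node = heapq.heappop(queue)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], best[goal]
        for nbr, cost in graph.get(node, []):
            new_cost = best[node] + cost
            if new_cost >= best.get(nbr, float("inf")):
                continue
            if not is_valid(node, nbr):  # the costly collision check
                continue
            best[nbr], parent[nbr] = new_cost, node
            priority = new_cost + heuristic.get(nbr, float("inf"))
            heapq.heappush(queue, (priority, nbr))
    return None, float("inf")

graph = undirected([("S", "A", 1.0), ("A", "G", 1.0), ("S", "B", 2.0),
                    ("B", "G", 2.0), ("A", "B", 1.0)])
blocked = {("A", "G"), ("G", "A")}  # edge that fails the full collision check
heuristic = reverse_heuristic(graph, "G")
path, cost = forward_search(graph, "S", "G",
                            lambda a, b: (a, b) not in blocked, heuristic)
```

Because the reverse pass ignores obstacles, its cost-to-go values never exceed the true remaining cost, so they remain admissible for the forward pass even though the forward pass later rejects the blocked edge.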

Bi-CMPStereo uses a different asymmetry: bidirectional prompting across modalities. It contains two complementary branches. In evCMPStereo, events are treated as the target domain and frames as the source; in imgCMPStereo, frames are the target and events the source. A Cross-Domain Embedding Adapter (CDEA) projects the source into a coarse target-style embedding, target and adapted-source encoders map both into a shared canonical latent space, and a shared stereo decoder produces multi-scale features. After pre-training, both branches are frozen and used as cross-modal prompt providers; their cost volumes are fused by channel-wise concatenation at two of the scales and by a small 3D-hourglass network at another scale (Xu et al., 16 Apr 2026).

The empirical motivation is that the two directional promptings preserve different cues. The paper states that alternating the target space prevents color and detail cues from frames and high-frequency temporal edges from events from being marginalized. On DSEC, fusing the two branches reduces MAE from 0.565 to 0.532 and 1PE from 11.43% to 10.61%. Ablations report MAE increases of +0.018 without CDEA, +0.024 without the stereo canonicalization constraint, and +0.023 without cascaded levels; in cross-dataset DSEC$\to$MVSEC tests, disabling HVT raises 2PE from 32.12% to 36.00% (Xu et al., 16 Apr 2026).

These examples show that asymmetric bidirectional context can arise from asymmetric computational budgets, asymmetric validation costs, or asymmetric modality roles, not only from directional sequence masks.

5. Dynamical systems and causal inference

Vehicle-platoon control provides a precise control-theoretic treatment of bidirectional asymmetry. Monteil et al. consider heterogeneous nonlinear platoons in which each vehicle is coupled to its predecessor, to its follower through a tunable weight, and optionally to the leader. A zero follower weight gives predecessor–follower coupling, a follower weight equal to the predecessor weight gives symmetric bidirectional coupling, and any other value gives asymmetric coupling. Under conditions C1–C3 on vanishing couplings at the desired trajectory, contraction of the self-dynamics, and bounds on cross-coupling Jacobians, they derive an ISS-type bound on the deviation from the desired trajectory.

In a numerical example, the worst-case position deviation peak is about 2.2 m under one coupling choice versus 1.9 m under the other, and the speed deviation peak is 1.9 m/s versus 1.7 m/s (Monteil et al., 2018).

A complementary linear analysis distinguishes where asymmetry is placed. With symmetric position coupling and asymmetric velocity coupling, the smallest damping singular value and the disturbance amplification scale more favorably with platoon size than they do with symmetric coupling in both position and velocity. When position coupling is asymmetric, the smallest singular value becomes exponentially small and the disturbance bound grows exponentially; in some cases, even finite strings can become unstable. The design guideline given is to keep position coupling symmetric and use only velocity asymmetry to improve disturbance amplification (Herman et al., 2016).

A third platoon result shows that asymmetry can restore string stability only under restricted disturbance models. Under specific conditions on the controller, an asymmetric bidirectional controller achieves string stability when disturbances act only on a fixed number of leading vehicles, independent of platoon length. The same paper proves that no choice of gains can keep both directional flow gains simultaneously within the required bound for arbitrary disturbance distributions (Farnam et al., 2016).

In urban systems, the term appears as asymmetric bidirectional causality rather than control. A spatio-temporal weighted regression produces localized urban composite indicators, and spatio-temporal convergent cross-mapping estimates causal skill in both directions. The asymmetry index is the difference between the two directional skills, computed between an urban-system component and traffic dynamics. Across 30 cities on rest days, the reported means are 0.52 versus 0.43 for structure, 0.57 versus 0.49 for form, and 0.55 versus 0.46 for function, yielding average asymmetries of +0.09, +0.08, and +0.09. The study also identifies three city archetypes: tightly coupled, pattern-heterogeneous, and workday-attenuated (Zhang et al., 29 Oct 2025).
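The reported asymmetries are consistent with a simple difference of the two directional skill means. The sketch below checks that arithmetic; treating the index as a plain difference, and the dictionary layout, are assumptions, and the code does not say which direction each number belongs to.

```python
# Reported mean causal skills per urban component (rest days, 30 cities).
# Each pair is (first reported direction, second reported direction).
mean_skill = {
    "structure": (0.52, 0.43),
    "form": (0.57, 0.49),
    "function": (0.55, 0.46),
}

def asymmetry_index(pair):
    """Assumed index: difference of the two directional skills, to 2 decimals."""
    first, second = pair
    return round(first - second, 2)

asymmetries = {name: asymmetry_index(pair) for name, pair in mean_skill.items()}
```

A positive value means the first direction carries more causal skill than the second, while both remain nonzero, which is what distinguishes asymmetric bidirectional causality from one-way causation.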

Across these dynamical settings, asymmetry is beneficial only when it is aligned with the mechanism of propagation. Velocity asymmetry can improve scaling; position asymmetry can produce exponential amplification; causal influence can be bidirectional while still directionally imbalanced.

6. Failure modes, security implications, and recurring limits

The most direct failure mode is broken reciprocity. In the entanglement-based two-way clock-synchronization protocol, the standard midpoint estimate of the clock offset is valid only if the propagation delays in the two directions are equal. An attack using optical circulators inserts an unequal delay into the two directions, so that the estimated offset is shifted by half of the inserted delay difference. The attack succeeds because the Bell state is unchanged up to a global sign after the two circulators. Experimentally, full state tomography before and after insertion shows that the protocol’s entanglement-based tampering test does not detect the asymmetry (Lee et al., 2019).
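The effect of an unequal delay on a midpoint estimate can be reproduced with a generic two-way time-transfer model. This is a sketch of the classical arithmetic only, not of the entangled-photon measurement; the function name, the variable names, and the numbers are hypothetical.

```python
def midpoint_offset_estimate(true_offset, delay_ab, delay_ba):
    """Estimate Bob's clock offset from apparent one-way flight times.

    Bob's clock reads true_offset ahead of Alice's. Each apparent flight time
    mixes the real propagation delay with the unknown offset:
      A -> B: delay_ab + true_offset
      B -> A: delay_ba - true_offset
    Halving their difference cancels the delays only if they are equal.
    """
    apparent_ab = delay_ab + true_offset
    apparent_ba = delay_ba - true_offset
    return 0.5 * (apparent_ab - apparent_ba)

true_offset = 5.0  # hypothetical, in ns
honest = midpoint_offset_estimate(true_offset, delay_ab=100.0, delay_ba=100.0)
attacked = midpoint_offset_estimate(true_offset, delay_ab=104.0, delay_ba=100.0)
error = attacked - true_offset  # half of the 4 ns delay difference
```

In this model the estimate equals the true offset plus half of the difference between the two one-way delays. Any direction-dependent delay that leaves the other protocol checks untouched therefore turns directly into a timing error.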

In secure bidirectional relaying, the limiting factor is asymmetric channel gain rather than delay. With nested-lattice compute-and-forward, secrecy is achievable only when the gain ratio is rational, that is, a ratio of two co-prime integers, with additional group-order constraints on the quotient lattice. If the ratio is irrational, then in the noiseless limit the relay observation uniquely determines both users' transmissions, so secrecy fails. The admissible shaping parameter shrinks as asymmetry grows, and the numerical illustration gives perfect-secrecy rate bounds of about 1.19 bits/dim, 0.60 bits/dim, and 0.19 bits/dim for three different gain ratios (Vatedka et al., 2015).
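The role of rationality can be seen in a toy noiseless model where each user sends an integer and the relay observes a weighted sum. This is only an identifiability illustration under assumed integer alphabets, not the nested-lattice scheme; the floating-point value of `math.sqrt(2.0)` stands in for an irrational gain ratio.

```python
import math
from itertools import product

def distinct_observations(gain_1, gain_2, symbols):
    """Count distinct noiseless relay observations gain_1 * x1 + gain_2 * x2."""
    sums = {round(gain_1 * x1 + gain_2 * x2, 9)
            for x1, x2 in product(symbols, repeat=2)}
    return len(sums)

symbols = range(-3, 4)                # toy alphabet: 7 integers per user
total_pairs = len(symbols) ** 2       # 49 possible (x1, x2) pairs
rational_case = distinct_observations(2.0, 1.0, symbols)
irrational_case = distinct_observations(math.sqrt(2.0), 1.0, symbols)
```

With gain ratio 2, many pairs collide on the same observation, so the relay cannot tell them apart. With the irrational stand-in, every pair produces its own observation and the relay can recover both symbols, which matches the claim that secrecy fails in the irrational case.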

A further misconception is that “more bidirectionality” is always better. In pre-training, pure causal models remain best for next-token prediction and zero-shot priming, while fully bidirectional masked models are best for infilling and fine-tuning, and hybrid causal-plus-masked models underperform both extremes at all scales studied. In platoons, asymmetry in velocity coupling improves scaling, but asymmetry in position coupling can cause exponential growth or instability. In synchronization, unmeasured asymmetry invalidates the midpoint estimator altogether (Artetxe et al., 2022, Herman et al., 2016, Lee et al., 2019).

A plausible implication is that asymmetric bidirectional context is best understood as a constrained decomposition of two-way dependence. It is useful when the decomposition respects the problem’s operational asymmetries—forward versus backward walk statistics, left versus right token utility, heuristic computation versus edge validation, or urban influence versus traffic feedback—but it is harmful when it violates hidden reciprocity assumptions or places asymmetry in the wrong channel.