
Asymmetric Bidirectional Context

Updated 4 July 2026
  • Asymmetric bidirectional context is a design principle that treats forward and backward information flows as separate, non-equivalent channels.
  • It enhances efficiency and robustness in applications like language modeling, graph embedding, and control systems by preserving distinct directional cues.
  • Empirical studies show that using asymmetric mechanisms can improve performance metrics in classification, clustering, and secure communications while addressing trade-offs in stability.

Asymmetric bidirectional context, as the term is used across recent literature, denotes a regime in which dependencies in both directions are retained but are not modeled, weighted, or operationalized symmetrically. Rather than collapsing two-way information into a single reciprocal channel, such methods split forward and backward structure into distinct distributions, masks, controllers, heuristics, or causal scores. This pattern appears in network embedding, language-model pre-training, diffusion language modeling, sampling-based planning, multimodal stereo, vehicle platoons, secure relaying, clock synchronization, and spatio-temporal causality, where the technical objective is to exploit bidirectional information without sacrificing asymmetry, efficiency, robustness, or identifiability (Shen et al., 2021, Artetxe et al., 2022, Chen et al., 26 Jun 2026).

1. Core concept and formal distinctions

A common misconception is that bidirectionality is equivalent to full symmetry. Several papers separate these notions explicitly. In language-model pre-training, bidirectional context and bidirectional attention are distinct controls: the former concerns which tokens are predicted using left and right evidence, while the latter concerns which positions may attend to one another. Artetxe et al. parameterize this with a bidirectional prefix length $b$, a mask count $m$, and a predict window $p$, so that a single framework recovers GPT-style next-token models, BERT/RoBERTa-style masked models, CM3-style hybrids, and prefix-LM variants (Artetxe et al., 2022).

In graph representation learning, the split is between asymmetric structural distributions rather than attention masks. BiGRW defines a forward distribution

$$P_f(v \mid u) = \tilde W^{(k)}_{u,v}$$

and a backward distribution

$$P_b(v \mid u) = \tilde W^{(k)}_{v,u},$$

where $W^{(k)} = \sum_{\ell=1}^{k} \alpha_\ell A^\ell$ is a weighted $k$-step transition matrix. The central claim is that the probability of reaching $v$ from $u$ and the probability of reaching $u$ from $v$ should be learned separately rather than fused into one symmetric context (Shen et al., 2021).
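As a rough numerical illustration of the forward and backward distributions defined above, the sketch below builds the weighted k-step matrix for a small directed graph and reads the forward context of a node from its row and the backward context from its column. The example graph, the weights `alphas`, and the helper names are hypothetical. The final renormalization is only an assumed stand-in for the tilde normalization, which the text does not spell out.

```python
import numpy as np

def k_step_matrix(A, alphas):
    """W^(k) = sum_{l=1..k} alpha_l * A^l for a transition matrix A."""
    W = np.zeros_like(A, dtype=float)
    A_pow = np.eye(A.shape[0])
    for alpha in alphas:
        A_pow = A_pow @ A          # A^l for l = 1, 2, ..., k
        W += alpha * A_pow
    return W

def directional_contexts(W, u):
    """Forward and backward context scores of node u, each renormalized."""
    forward = W[u, :]    # mass flowing from u to every v
    backward = W[:, u]   # mass flowing from every v to u
    # ASSUMPTION: each direction is renormalized into a distribution over v.
    return forward / forward.sum(), backward / backward.sum()

# Hypothetical 4-node directed graph (row i lists the out-edges of node i).
adjacency = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=float)
A = adjacency / adjacency.sum(axis=1, keepdims=True)  # row-stochastic

W = k_step_matrix(A, alphas=[0.5, 0.3, 0.2])
p_forward, p_backward = directional_contexts(W, u=0)
print("forward :", np.round(p_forward, 3))
print("backward:", np.round(p_backward, 3))
```

On a directed graph the two vectors differ, which is the asymmetry that separate target embeddings are meant to capture. On an undirected graph with a symmetric `W` they would coincide.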

In diffusion LLMs, the distinction becomes architectural. R2LM assigns left context to standard causal attention and right context to a separate reverse Mamba SSM sidecar. The resulting information flow is bidirectional, but the mechanisms are deliberately non-isomorphic: full-fidelity left context remains cacheable through standard KV caching, while right context is injected as a compressed residual that does not invalidate the prefix cache (Chen et al., 26 Jun 2026).

These formulations indicate that asymmetric bidirectional context is not a single algorithmic template. It is a design principle in which the two directions are both preserved and made non-equivalent.

2. Representation learning and pre-training formulations

BiGRW realizes asymmetric bidirectional context through separate forward and backward Skip-Gram objectives. Each node has a source embedding and two target embeddings, one for forward contexts and one for backward contexts. The model predicts

[…]

and trains with two negative-sampling losses, one per direction, which are combined into a single objective. Its walk sampler draws a walk length with probability proportional to […], so that a single parameter interpolates between short-walk, more BFS-style sampling and long-walk, more DFS-style sampling. BiGRW-AT extends the same factorization to attributed graphs by tying target embeddings to node attributes through trainable matrices (Shen et al., 2021).

The reported empirical effect is systematic rather than marginal. On node classification with 50% labeled data, Cora improves from 0.863 (DeepWalk) to 0.903 (BiGRW), Citeseer from 0.712 (node2vec) to 0.780 (BiGRW), and Cora with attributes from 0.900 (GAE) to 0.923 (BiGRW-AT). On clustering, measured by Purity/NMI, Cora improves from DeepWalk […] to BiGRW […], Citeseer from LINE […] to BiGRW […], and Cora-AT from GAE […] to BiGRW-AT […] (Shen et al., 2021).

Artetxe et al. provide an analogous decomposition for token models. Their attention mask

[…]

creates a fully connected prefix block and a causal suffix. With suitable choices of its parameters, the framework recovers NxtUni, NxtPre, MskBi, HybUni, and HybPre. The empirical trade-off is explicitly application-dependent: pure GPT-style next-token models win on next-token perplexity and zero-shot priming, fully bidirectional masked models are best on single-token infilling and GLUE fine-tuning, HybPre is a middle ground, and these orderings remain consistent up to 6.7B parameters. A critical negative result is that switching from unidirectional attention in pre-training to bidirectional attention in fine-tuning, or the reverse, causes performance collapse (Artetxe et al., 2022).
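One common way to realize a fully connected prefix block followed by a causal suffix is sketched below. It is not the paper's exact mask formula. The function name and the arguments `seq_len` and `prefix_len` are hypothetical, with `prefix_len` playing the role of the bidirectional prefix length.

```python
import numpy as np

def prefix_causal_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Boolean mask: mask[i, j] is True when position i may attend to j."""
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must lie in [0, seq_len]")
    # Causal part: every position sees itself and everything to its left.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Prefix part: positions inside the prefix also see each other's right side.
    in_prefix = np.arange(seq_len) < prefix_len
    prefix_block = np.outer(in_prefix, in_prefix)
    return causal | prefix_block

mask = prefix_causal_mask(seq_len=5, prefix_len=2)
print(mask.astype(int))
```

With `prefix_len=0` the mask is purely causal, and with `prefix_len=seq_len` it is fully bidirectional, so the same knob moves between the two attention regimes discussed above.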

Taken together, these works formalize asymmetric bidirectionality as a separation of prediction targets, structural distributions, or attention regimes rather than a simple decision to “use both sides.”

3. Cache-preserving bidirectionality in parallel generation

R2LM was introduced to resolve a concrete systems dilemma in discrete diffusion LLMs. Fully bidirectional attention yields strong modeling quality, but it breaks prefix KV caching because keys and values depend on future tokens; causal attention preserves caching, but loses all right-side context. R2LM addresses this by augmenting a pretrained causal Transformer decoder with a lightweight reverse-direction Mamba SSM sidecar attached at a subset of decoder layers (Chen et al., 26 Jun 2026).

At a hooked layer, the backbone hidden state is reversed, processed by Mamba, flipped back, normalized, and injected through a gated residual,

[…]

with the gate initialized so that the model is initially bit-identical to the causal baseline. The sidecar does not insert keys or values into self-attention, so cached prefix keys and values remain valid. During denoising, the Transformer reuses prompt KV caches and the sidecar scans only the generation block, preserving the stated per-step cost of […] rather than […] (Chen et al., 26 Jun 2026).
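The reverse, scan, flip back, normalize, and gate sequence can be sketched with a toy model. A simple right-to-left linear recurrence stands in for the Mamba SSM here, so this is not the R2LM implementation. The function names, the RMS normalization, the scalar gate, and its zero initial value are all assumptions made for illustration.

```python
import numpy as np

def reverse_scan(h: np.ndarray, decay: float = 0.9) -> np.ndarray:
    """Toy right-to-left recurrence (stand-in for a reverse Mamba SSM).

    The output at position t summarizes h[t], h[t+1], ..., h[T-1] and never
    reads positions to the left of t.
    """
    flipped = h[::-1]                      # reverse the sequence axis
    state = np.zeros(h.shape[1])
    out = np.empty_like(flipped)
    for t, x in enumerate(flipped):
        state = decay * state + (1.0 - decay) * x
        out[t] = state
    return out[::-1]                       # flip back to original order

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def inject_right_context(h: np.ndarray, gate: float) -> np.ndarray:
    """Gated residual: backbone state plus gated, normalized right context."""
    return h + gate * rms_norm(reverse_scan(h))

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 4))                # (sequence length, hidden size)

# ASSUMPTION: a zero gate reproduces the causal baseline exactly.
at_init = inject_right_context(h, gate=0.0)
after_training = inject_right_context(h, gate=0.5)
print(np.array_equal(at_init, h))
```

Because the sidecar only adds to the residual stream and contributes no keys or values, nothing in this path would touch a cached prefix.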

The throughput results quantify the systems advantage. On a single H100 with 32 steps, 128 generation tokens, and batch size […], bidirectional dLLM throughput drops from 483 tok/s at prompt length […] to 53 tok/s at […], whereas R2LM goes from 1,154 tok/s to 683 tok/s. From these figures, R2LM is roughly 2.4× faster than the bidirectional dLLM at the first prompt length and roughly 12.9× faster at the second. The causal dLLM is slightly faster, at 1,356 tok/s and 732 tok/s, but lacks the right-context mechanism. Against an autoregressive Qwen3-1.7B baseline at […], R2LM is approximately […] faster (Chen et al., 26 Jun 2026).
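The relative speedups implied by the throughput figures quoted above can be recomputed directly. The dictionary keys and the helper name are hypothetical, and the ratios are derived here from the quoted tok/s values rather than copied from the paper.

```python
# Throughput in tok/s at the two prompt lengths quoted in the text
# (the prompt lengths themselves are not reproduced here).
throughput = {
    "bidirectional_dllm": (483.0, 53.0),
    "r2lm": (1154.0, 683.0),
    "causal_dllm": (1356.0, 732.0),
}

def speedup(system: str, baseline: str) -> tuple:
    """Ratio of system throughput to baseline throughput at each length."""
    return tuple(s / b for s, b in zip(throughput[system], throughput[baseline]))

vs_bidirectional = speedup("r2lm", "bidirectional_dllm")
vs_causal = speedup("r2lm", "causal_dllm")

print("R2LM vs bidirectional:", [f"{r:.2f}x" for r in vs_bidirectional])
print("R2LM vs causal       :", [f"{r:.2f}x" for r in vs_causal])
```

On these numbers R2LM keeps roughly 85% and 93% of the causal dLLM's throughput, while its advantage over the bidirectional dLLM widens as the prompt grows.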

Quality does not reduce to a pure throughput trade-off. On seven multiple-choice tasks after 60B-token continued pretraining of Qwen3-1.7B, R2LM reaches 47.44% on the long-target average versus 44.78% for the bidirectional dLLM and 43.90% for the causal dLLM. On the short-target average it reaches 48.40%, behind the bidirectional dLLM at 51.65% but ahead of the causal dLLM at 46.35%. Overall it reaches 47.71% versus 46.74% and 44.60%. Parameter-matched ablations on Wikitext-103 show PPL = 178 for causal only, PPL […] for a 185M MLP adapter, PPL […] for a 185M left-to-right Mamba, and PPL […] for the 185M reverse Mamba sidecar, compared with bidirectional PPL […] (Chen et al., 26 Jun 2026).

This suggests that asymmetric bidirectional context can function as an efficiency-preserving decomposition: precision is retained on the left through causal attention, while useful but compressed right information is supplied through a non-attention path.

4. Search, planning, and multimodal perception

In AIT* and EIT*, asymmetry is procedural rather than representational. Both algorithms maintain a random geometric graph and grow a forward tree from the start and a reverse tree from the goal. The reverse tree is deliberately cheaper: it uses coarse collision checks or none at all to compute admissible cost-to-go heuristics, while the forward tree performs full collision checking and cost evaluation. The reverse search therefore informs the forward search continuously, but the two directions do not play the same role. Under the usual random geometric graph assumptions, including strong $\delta$-clearance and cost bounded by path length, AIT* and EIT* are almost-surely asymptotically optimal. Experiments on twelve problems show that EIT* outperforms baselines on obstacle-clearance objectives, where a priori heuristics are often ineffective, while AIT* matches or slightly outperforms BIT* on path length (Strub et al., 2021).
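The division of labor between a cheap reverse search and an expensive forward search can be illustrated on a small grid. This is a simplified analogue and not AIT* or EIT*: it uses a fixed 4-connected grid instead of a sampled random geometric graph, and the reverse search is run once rather than being repaired when the forward search finds invalid edges. All names and the example obstacle layout are hypothetical.

```python
import heapq

def neighbors(cell, size):
    x, y = cell
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:
            yield (nx, ny)

def reverse_heuristic(goal, size):
    """Cheap reverse search: Dijkstra from the goal with NO collision checks.

    Ignoring obstacles can only underestimate the true cost-to-go, so the
    result is an admissible heuristic for the forward search.
    """
    cost = {goal: 0}
    queue = [(0, goal)]
    while queue:
        c, cell = heapq.heappop(queue)
        if c > cost[cell]:
            continue
        for nxt in neighbors(cell, size):
            if c + 1 < cost.get(nxt, float("inf")):
                cost[nxt] = c + 1
                heapq.heappush(queue, (c + 1, nxt))
    return cost

def forward_search(start, goal, size, obstacles):
    """Expensive forward search: A* that collision-checks each edge it tries."""
    h = reverse_heuristic(goal, size)
    checks = 0
    best = {start: 0}
    queue = [(h[start], 0, start)]
    while queue:
        _, g, cell = heapq.heappop(queue)
        if cell == goal:
            return g, checks
        if g > best[cell]:
            continue
        for nxt in neighbors(cell, size):
            checks += 1                    # stands in for a full collision check
            if nxt in obstacles:
                continue
            if g + 1 < best.get(nxt, float("inf")):
                best[nxt] = g + 1
                heapq.heappush(queue, (g + 1 + h[nxt], g + 1, nxt))
    return None, checks

# Hypothetical 5x5 world with a wall at x = 2 that leaves a gap at y = 4.
wall = {(2, y) for y in range(4)}
path_cost, collision_checks = forward_search((0, 0), (4, 0), size=5, obstacles=wall)
print(path_cost, collision_checks)
```

Only the forward direction pays for validation, while the reverse direction supplies guidance at negligible cost, which mirrors the asymmetric roles described above.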

Bi-CMPStereo uses a different asymmetry: bidirectional prompting across modalities. It contains two complementary branches. In evCMPStereo, events are treated as the target domain and frames as the source; in imgCMPStereo, frames are the target and events the source. A Cross-Domain Embedding Adapter (CDEA) projects the source into a coarse target-style embedding, target and adapted-source encoders map both into a shared canonical latent space, and a shared stereo decoder produces multi-scale features. After pre-training, both branches are frozen and used as cross-modal prompt providers; their cost volumes are fused by channel-wise concatenation at the […] and […] scales and by a small 3D-hourglass network at the […] scale (Xu et al., 16 Apr 2026).

The empirical motivation is that the two directional promptings preserve different cues. The paper states that alternating the target space prevents color and detail cues from frames and high-frequency temporal edges from events from being marginalized. On DSEC, fusing the two branches reduces MAE from 0.565 to 0.532 and 1PE from 11.43% to 10.61%. Ablations report MAE increases of +0.018 without CDEA, +0.024 without the stereo canonicalization constraint, and +0.023 without cascaded levels; in cross-dataset DSEC→MVSEC tests, disabling HVT raises 2PE from 32.12% to 36.00% (Xu et al., 16 Apr 2026).

These examples show that asymmetric bidirectional context can arise from asymmetric computational budgets, asymmetric validation costs, or asymmetric modality roles, not only from directional sequence masks.

5. Dynamical systems and causal inference

Vehicle-platoon control provides a precise control-theoretic treatment of bidirectional asymmetry. Monteil et al. consider heterogeneous nonlinear platoons in which each vehicle is coupled to its predecessor, to its follower with weight […], and optionally to the leader. Here […] is predecessor–follower, […] is symmetric bidirectional, and […] is asymmetric coupling. Under conditions C1–C3 on vanishing couplings at the desired trajectory, contraction of the self-dynamics, and bounds on cross-coupling Jacobians, they derive the input-to-state stability (ISS) type bound

[…]

In a numerical example with […] vehicles, the worst-case position deviation peak is about 2.2 m for […] versus 1.9 m for […], and the speed deviation peak is 1.9 m/s versus 1.7 m/s (Monteil et al., 2018).

A complementary linear analysis distinguishes where asymmetry is placed. With symmetric position coupling and asymmetric velocity coupling, the smallest damping singular value scales as […] and disturbance amplification as […]. With symmetric coupling in both position and velocity, the scaling is […]. When position coupling is asymmetric, the smallest singular value becomes exponentially small and the disturbance bound grows as […]; for some […], even finite strings can become unstable. The design guideline given is to keep position coupling symmetric and use only velocity asymmetry to improve disturbance amplification (Herman et al., 2016).

A third platoon result shows that asymmetry can restore string stability only under restricted disturbance models. If […] with […] and […], then an asymmetric bidirectional controller achieves […] string stability when disturbances act only on a fixed number of leading vehicles, independent of platoon length. The same paper proves that no choice of gains can make both directional flow gains simultaneously bounded by […] for arbitrary disturbance distributions (Farnam et al., 2016).

In urban systems, the term appears as asymmetric bidirectional causality rather than control. A spatio-temporal weighted regression produces localized urban composite indicators, and spatio-temporal convergent cross-mapping estimates causal skill in both directions. The asymmetry index is

[…]

where […] denotes an urban-system component and […] traffic dynamics. Across 30 cities on rest days, the reported means are 0.52 versus 0.43 for structure, 0.57 versus 0.49 for form, and 0.55 versus 0.46 for function, yielding average asymmetries of +0.09, +0.08, and +0.09. The study also identifies three city archetypes: tightly coupled, pattern-heterogeneous, and workday-attenuated (Zhang et al., 29 Oct 2025).
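The reported asymmetries are consistent with a simple difference between the two directional causal-skill means. The sketch below checks that arithmetic under the assumption that the index is the first quoted value minus the second. The variable names are hypothetical, and which physical direction each value corresponds to is not fixed here.

```python
# (first quoted mean, second quoted mean) for each urban-system component.
reported_means = {
    "structure": (0.52, 0.43),
    "form": (0.57, 0.49),
    "function": (0.55, 0.46),
}

# ASSUMPTION: the asymmetry index is a plain difference of the two skills.
asymmetry = {
    component: round(first - second, 2)
    for component, (first, second) in reported_means.items()
}

for component, value in asymmetry.items():
    print(f"{component:9s} {value:+.2f}")
```

A positive value in every row means the same direction dominates for all three components, even though influence is detected both ways.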

Across these dynamical settings, asymmetry is beneficial only when it is aligned with the mechanism of propagation. Velocity asymmetry can improve scaling; position asymmetry can produce exponential amplification; causal influence can be bidirectional while still directionally imbalanced.

6. Failure modes, security implications, and recurring limits

The most direct failure mode is broken reciprocity. In the entanglement-based two-way clock-synchronization protocol, the standard midpoint estimate

[…]

is valid only if the forward and backward propagation delays are equal. An attack using optical circulators inserts an asymmetry

[…]

so that the estimate is biased:

[…]

The attack succeeds because the Bell state is unchanged up to a global sign after the two circulators. Experimentally, full state tomography reports fidelities […] before and […] after insertion, so the protocol’s entanglement-based tampering test does not detect the asymmetry (Lee et al., 2019).
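The reciprocity assumption behind a midpoint estimate can be seen in a classical four-timestamp exchange, which is used here as a simplified stand-in for the entangled-photon protocol. The symbols `t1` to `t4`, the function names, and the numeric delays are assumptions for illustration and are not the paper's notation.

```python
def two_way_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    """Midpoint estimate of clock B's offset relative to clock A.

    t1: A sends (A's clock)      t2: B receives (B's clock)
    t3: B replies (B's clock)    t4: A receives (A's clock)
    """
    return ((t2 - t1) - (t4 - t3)) / 2.0

def simulate(offset: float, delay_ab: float, delay_ba: float,
             turnaround: float = 1.0) -> float:
    """Estimated offset when B's clock leads A's by `offset`."""
    t1 = 0.0
    t2 = t1 + delay_ab + offset    # arrival time read on B's clock
    t3 = t2 + turnaround
    t4 = t3 - offset + delay_ba    # arrival time read on A's clock
    return two_way_offset(t1, t2, t3, t4)

true_offset = 5.0
symmetric = simulate(true_offset, delay_ab=10.0, delay_ba=10.0)
attacked = simulate(true_offset, delay_ab=12.0, delay_ba=10.0)
print(symmetric, attacked)
```

With equal delays the estimate recovers the true offset. An inserted one-way asymmetry shifts it by half the delay difference, and nothing in the four timestamps reveals that shift.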

In secure bidirectional relaying, the limiting factor is asymmetric channel gain rather than delay. With nested-lattice compute-and-forward, secrecy is achievable only when the gain ratio is rational, […], with co-prime integers […] and additional group-order constraints on the quotient lattice. If the ratio is irrational, then in the noiseless limit the relay observation […] uniquely determines […], so secrecy fails. The admissible shaping parameter shrinks as asymmetry grows, and the numerical illustration with […] and […] gives perfect-secrecy rate bounds of about 1.19 bits/dim for ratio […], 0.60 bits/dim for […], and 0.19 bits/dim for […] (Vatedka et al., 2015).

A further misconception is that “more bidirectionality” is always better. In pre-training, pure causal models remain best for next-token prediction and zero-shot priming, fully bidirectional masked models are best for infilling and fine-tuning, and hybrid causal-plus-masked models fall between the two without matching either extreme on its strongest tasks at any scale studied. In platoons, asymmetry in velocity coupling improves scaling, but asymmetry in position coupling can cause exponential growth or instability. In synchronization, unmeasured asymmetry invalidates the midpoint estimator altogether (Artetxe et al., 2022, Herman et al., 2016, Lee et al., 2019).

A plausible implication is that asymmetric bidirectional context is best understood as a constrained decomposition of two-way dependence. It is useful when the decomposition respects the problem’s operational asymmetries, such as forward versus backward walk statistics, left versus right token utility, heuristic computation versus edge validation, or urban influence versus traffic feedback. It is harmful when it violates hidden reciprocity assumptions or places asymmetry in the wrong channel.
