Qwen3.5-35B-A3B: Hybrid MoE LLM

Updated 23 June 2026

Qwen3.5-35B-A3B is a 35-billion-parameter hybrid Mixture-of-Experts language model integrating GatedDeltaNet and multi-head attention for advanced sequential modeling.
It serves as the source model in the Memory Grafting framework, which builds a frozen, scalable conditional memory for smaller recipient models and improves their benchmark scores.
The model has also been studied for agentic behavior steering, safety signal detection, test-driven coding agents, and vision-language multi-hop reasoning.

Qwen3.5-35B-A3B is a 35-billion-parameter hybrid Mixture-of-Experts (MoE) LLM distinguished by its GatedDeltaNet/attention block structure and by its broad use as a frozen, reusable backbone across several research lines. Its roles span conditional memory bank construction, safety-critical agentic signal detection, agentic behavioral steering, test-driven code development, and vision-language multi-hop reasoning.

1. Architecture and Design

Qwen3.5-35B-A3B is a decoder-only Transformer composed of 40 layers grouped into 10 blocks, each block comprising three GatedDeltaNet (linear-recurrence) layers followed by a multi-head attention layer. The residual stream width is 2048, with 16 to 32 attention heads depending on task configuration.
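
The layer layout described above can be written out as a short sketch. The counts (10 blocks, three GatedDeltaNet layers plus one attention layer per block) come from the text; the string labels and function name are illustrative assumptions, not identifiers from any released code.

```python
# Counts come from the description above; the labels are hypothetical.
NUM_BLOCKS = 10
BLOCK_PATTERN = ["gated_deltanet"] * 3 + ["attention"]


def build_layer_schedule(num_blocks=NUM_BLOCKS, pattern=BLOCK_PATTERN):
    """Return the token-mixer type of every layer, in stack order."""
    return [kind for _ in range(num_blocks) for kind in pattern]


schedule = build_layer_schedule()
# 0-based indices of the attention layers (the last layer of each block).
attention_layers = [i for i, kind in enumerate(schedule) if kind == "attention"]
```

Under this layout, every fourth layer is an attention layer and the remaining 30 layers are linear-recurrence layers.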

The model incorporates MoE layers, each with 256 experts. For each token, a learned gating network selects 8 routed experts, and one shared expert is applied to every token; the token passes through the FFNs of these experts (intermediate dimension 512). The gating at each layer operates as

$\bar{g}_t = G_\ell \cdot x_{\ell-1,t} + b_g, \quad w_t = \mathrm{softmax}(\bar{g}_t)$

with top-8 selection and renormalization, allowing fine-grained expert activation per token.
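
A minimal sketch of this routing step, in plain Python, may help make the softmax, top-8 selection, and renormalization concrete. The gate logits are taken as given (they correspond to the affine map in the equation above). How the shared expert's output is combined with the routed experts is not stated in the text; adding it unweighted is an assumption of this sketch, as are all function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits, top_k=8):
    """Return {expert_index: weight} for the top_k experts, renormalized to sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def moe_output(x, gate_logits, experts, shared_expert, top_k=8):
    """Weighted sum of the selected experts plus the shared expert.

    Assumption: the shared expert is added with weight 1.
    """
    out = list(shared_expert(x))
    for idx, w in route_token(gate_logits, top_k).items():
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out
```

Only 8 of the 256 expert FFNs (plus the shared one) run per token, which is what keeps the active parameter count far below the total.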

GatedDeltaNet layers introduce delta-rule recurrence:

$u_t = \sigma(W_u x_{\ell-1,t} + b_u), \quad c_t = \tanh(W_c x_{\ell-1,t} + b_c), \quad h_{\ell, t} = u_t \odot h_{\ell, t-1} + (1-u_t)\odot c_t, \quad x_{\ell, t} = \mathrm{LayerNorm}(x_{\ell-1, t} + h_{\ell, t})$

providing enhanced sequential modeling versus standard attention-only stacks.
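
The update above can be traced step by step with a small plain-Python sketch. It follows the vector-valued equations exactly as written here, with a parameter-free LayerNorm (no learned scale or bias) as a simplifying assumption; the helper names are illustrative and the sketch is not meant to reproduce the production kernel.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_step(x, h_prev, W_u, b_u, W_c, b_c):
    """One recurrence step: h_t = u_t * h_{t-1} + (1 - u_t) * c_t."""
    u = [sigmoid(a + b) for a, b in zip(matvec(W_u, x), b_u)]
    c = [math.tanh(a + b) for a, b in zip(matvec(W_c, x), b_c)]
    return [ui * hp + (1.0 - ui) * ci for ui, hp, ci in zip(u, h_prev, c)]


def layer_norm(v, eps=1e-5):
    """LayerNorm without learned parameters (an assumption of this sketch)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def gated_layer(xs, W_u, b_u, W_c, b_c):
    """Run the recurrence over a sequence; return per-token outputs and final state."""
    h = [0.0] * len(b_u)
    outputs = []
    for x in xs:
        h = gated_step(x, h, W_u, b_u, W_c, b_c)
        outputs.append(layer_norm([a + b for a, b in zip(x, h)]))
    return outputs, h
```

Because the state `h` is carried forward token by token, cost grows linearly with sequence length, in contrast to the quadratic cost of full attention.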

Pretraining utilized mixed Chinese/English corpora, code, dialog, and web text. Optimization followed AdamW with expert-load balancing loss, leveraging large batch sizes and regularization to enable stable MoE routing and construction of robust internal representations (Yap, 17 Mar 2026).

2. Memory Grafting and Conditional Memory Scaling

Qwen3.5-35B-A3B is foundational in the Memory Grafting framework, serving as the “grafting” model for scalable, frozen external conditional memory (Cheng et al., 20 May 2026). The process operates as follows:

Offline Grafting Pass: The model is run once, offline, over a fixed set of frequent 2-, 3-, and 4-grams. For each n-gram, final-token hidden states are extracted at layers 8 and 24 (20% and 60% of the 40-layer depth).
Frozen Memory Table Construction: Each n-gram tuple becomes an exact lookup key. The corresponding hidden state is stored as a bfloat16 embedding in a table (~3 million entries per layer).
Recipient Model Integration: During recipient model training and inference, the memory table is queried with exact longest-match suffix lookups (O(1) per query via hashing or a trie), replacing large, trainable Engram tables.
Fallback and Adaptation: Misses revert to an Engram-style hash-based memory, ensuring coverage for rare/novel contexts. Retrieved values are linearly projected and gated before merging with recipient states.
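
The lookup path in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: token IDs are integers, the table is a plain dictionary, and the fallback is a toy hashed bucket table standing in for the Engram-style memory. The names `build_table`, `lookup`, and `make_hashed_fallback` are hypothetical, and the projection and gating of retrieved values are omitted.

```python
def build_table(ngram_states):
    """Build an exact-match table from (token_tuple, hidden_state) pairs."""
    return {tuple(tokens): state for tokens, state in ngram_states}


def make_hashed_fallback(buckets):
    """Toy stand-in for a hash-based memory: hash the last two tokens to a bucket."""
    def fallback(context):
        key = 0
        for tok in context[-2:]:
            key = (key * 1000003 + tok) % len(buckets)
        return buckets[key]
    return fallback


def lookup(table, context, fallback, max_n=4, min_n=2):
    """Longest-match suffix query: try the 4-gram suffix, then 3-gram, then 2-gram.

    Returns (vector, matched_length); matched_length is 0 on a miss.
    """
    for n in range(max_n, min_n - 1, -1):
        if len(context) >= n:
            key = tuple(context[-n:])
            if key in table:
                return table[key], n
    return fallback(context), 0
```

Each query performs at most three dictionary probes, so the cost per token is constant regardless of table size.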

Quantitatively, in the 2.8B recipient setting, memory grafting (using Qwen3.5) raises the average benchmark score from 51.95 (MoE) and 52.43 (vanilla Engram) to 53.86, with the largest gains on LAMBADA (+5.2 points) and BoolQ (+4.6 points). The 0.92B scale recipient sees smaller but consistent gains. Lookup overhead is minimal, with negligible impact on throughput or memory, and offline memory extraction is computationally inexpensive (~3 GPU-hours per 3M entries) (Cheng et al., 20 May 2026).

3. Agentic Behavior and Behavioral Steering

Qwen3.5-35B-A3B’s activation space enables fine-grained behavioral manipulation via probe-steered vectors decoded from sparse autoencoders (SAEs) (Yap, 17 Mar 2026). The workflow:

Train overcomplete SAEs on residuals at several depths.
Fit linear probes on SAE latents for agentic traits (Autonomy, Tool-use eagerness, Persistence, Risk calibration, Deference).
Project probe weights through the SAE decoder, yielding continuous steering vectors applied during inference.
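
The final step of this workflow, turning probe weights into a steering vector and applying it, can be sketched in plain Python. The sketch assumes the SAE decoder is stored as one residual-space direction per latent and that the steering vector is normalized to unit length before scaling; neither detail is confirmed by the text, and the function names are illustrative.

```python
import math


def steering_vector(probe_weights, decoder_rows, normalize=True):
    """Project probe weights (one per SAE latent) through the decoder.

    decoder_rows[j] is the residual-space direction of latent j (assumed layout).
    """
    dim = len(decoder_rows[0])
    v = [0.0] * dim
    for w, row in zip(probe_weights, decoder_rows):
        for i in range(dim):
            v[i] += w * row[i]
    if normalize:
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 0:
            v = [a / norm for a in v]
    return v


def steer(residual, v, alpha=2.0):
    """Add the scaled steering vector to a residual-stream activation."""
    return [r + alpha * a for r, a in zip(residual, v)]
```

The multiplier `alpha` plays the role of the steering strength; a negative value would push activations in the opposite direction along the same axis.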

Key results: Autonomy steering at multiplier α=2.0 yields Cohen’s d = 1.01 (p < 0.0001), shifting behavior strongly toward proactive tool execution (ask_user: 78%→5%, web_search: 22%→44%, code_execute: 0%→48%). All five behavioral directions collapse onto a single “agency axis,” with trait-specific modulations as secondary effects. Only interventions applied during prompt prefill have any effect; interventions applied during generation do not, which indicates that behavioral commitments are set in the GatedDeltaNet state during prefill. This causal dissociation holds for all traits tested (Yap, 17 Mar 2026).

4. Safety and Failure Signal: Strained Coherence

Qwen3.5-35B-A3B has been analyzed for "strained coherence," a pre-failure safety signal in which the agent explicitly acknowledges a conflict but does not resolve it before acting (Pandya et al., 5 Jun 2026). Under the operational definition, a trajectory T = (s₁, …, sₙ) is flagged when it contains a step that acknowledges a conflict and no subsequent step takes a resolving action.
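
The operational definition can be expressed as a short predicate. In this sketch each step is a dictionary carrying two boolean labels, `acknowledges_conflict` and `resolves_conflict`; these labels are hypothetical and would in practice have to come from a separate annotator or detector, which is not modeled here.

```python
def first_flag_position(trajectory):
    """Return the fraction of steps elapsed at the first unresolved acknowledgment.

    A step counts as unresolved if no later step is labeled as resolving.
    Returns None when the trajectory is not flagged.
    """
    for i, step in enumerate(trajectory):
        if step.get("acknowledges_conflict") and not any(
            later.get("resolves_conflict") for later in trajectory[i + 1:]
        ):
            return (i + 1) / len(trajectory)
    return None


def is_strained(trajectory):
    """True if the trajectory shows strained coherence under the definition above."""
    return first_flag_position(trajectory) is not None
```

For example, a four-step trajectory whose second step acknowledges a conflict that is never resolved is flagged at position 0.5.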

Applying a Claude Sonnet 4.6–based detector to 44 Terminal-bench-2 execution traces produced by Qwen3.5-35B-A3B shows:

Flagged trajectories fail 94% of the time (15/16), while unflagged fail 46% (13/28), a 47-point gap (p = 0.003).
At matched selectivity (k=16 flagged trajectories per method), precision is 94% for the detector and 88% for a lexical baseline; trajectories flagged by both methods (n=10) fail 100% of the time (95% CI [69%, 100%]).
The earliest detectable flag appears late in the trajectory (median 84% of reasoning steps elapsed).
Representative patterns involve mechanical code fixes or repeated failed strategies despite explicit acknowledgment of unsatisfiable constraints, which exposes failures in rational uncertainty handling or proxy optimization.
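
The headline figures above can be re-derived from the reported counts. The sketch below recomputes the two failure rates and their gap, and shows that an exact (Clopper-Pearson) interval reproduces the 69% lower bound for 10 failures out of 10; whether the authors used that particular interval is an assumption.

```python
def failure_rate(failed, total):
    return failed / total


flagged = failure_rate(15, 16)      # flagged trajectories that failed
unflagged = failure_rate(13, 28)    # unflagged trajectories that failed
gap_points = 100 * (flagged - unflagged)


def exact_lower_bound_all_failures(n, confidence=0.95):
    """Clopper-Pearson lower bound when all n trials fail.

    Solving p**n = alpha / 2 for p gives (alpha / 2) ** (1 / n).
    """
    alpha = 1.0 - confidence
    return (alpha / 2.0) ** (1.0 / n)


lower = exact_lower_bound_all_failures(10)

print(f"flagged {flagged:.1%}, unflagged {unflagged:.1%}, gap {gap_points:.1f} points")
print(f"lower 95% bound for 10/10 failures: {lower:.1%}")
```

The flagged and unflagged groups together account for all 44 traces (16 + 28).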

Replication with Gemma4-31B confirms directionality but shows attenuation when "think" content is absent, highlighting detectability dependence on reasoning verbosity. As a mitigation, requiring explicit resolution after acknowledgment, or upweighting such signals in preference-based RL, is suggested for safety-critical deployments (Pandya et al., 5 Jun 2026).

5. Code Generation and Regression Mitigation via TDAD

On software maintenance benchmarks, Qwen3.5-35B-A3B operates within the TDAD (Test-Driven Agentic Development) protocol to control regressions introduced by code agents (Alonso, 18 Mar 2026). TDAD provides:

Code–Test Graph Construction: AST-based graphs with File, Function, Class, and Test nodes, and structural/call-edge relationships. Each candidate patch is mapped to affected tests via weighted impact scoring, prioritizing high-impact tests.
Workflow: For each GitHub issue, the agent applies a patch, infers which tests are highest-impact via the graph, runs those tests, and repairs as needed.
Performance: On 25 SWE-bench Verified instances, the TDAD skill increases resolution (24%→32%), patch generation (40%→68%), and uniquely achieves a test-level regression rate of 0%. TDD prompting alone increases regressions (6.08%→9.94% in the 30B setting), while the TDAD workflow leverages concise context to halve regressions over vanilla baselines.
Auto-Improvement Loop: Autonomous refinement of TDAD’s skill yields 60% resolution with 0% regression after 15 iterations on 10 instances.
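
The test-selection idea behind the code–test graph can be illustrated with a small sketch. The edge kinds, their weights, the multiplicative decay along paths, and the score threshold are all assumptions made for illustration; the text specifies only that tests are ranked by a weighted impact score over structural and call edges.

```python
import heapq
from collections import defaultdict

# Hypothetical edge weights; the source does not give the actual values.
EDGE_WEIGHTS = {"calls": 1.0, "contains": 0.5, "imports": 0.25}


def impacted_tests(edges, changed, is_test, min_score=0.1):
    """Rank tests by impact of the changed nodes.

    edges: (source, target, kind) triples meaning "source depends on target".
    Impact starts at 1.0 on changed nodes and is multiplied by the edge
    weight at each step toward dependents; the best score per node is kept.
    """
    dependents = defaultdict(list)
    for source, target, kind in edges:
        dependents[target].append((source, EDGE_WEIGHTS[kind]))

    best = {node: 1.0 for node in changed}
    heap = [(-1.0, node) for node in changed]
    heapq.heapify(heap)
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue  # stale heap entry
        for source, weight in dependents[node]:
            candidate = score * weight
            if candidate >= min_score and candidate > best.get(source, 0.0):
                best[source] = candidate
                heapq.heappush(heap, (-candidate, source))

    tests = [(n, s) for n, s in best.items() if is_test(n)]
    return sorted(tests, key=lambda item: (-item[1], item[0]))


edges = [
    ("test_parse", "parse", "calls"),
    ("main", "parse", "calls"),
    ("test_cli", "main", "imports"),
    ("test_fmt", "fmt", "calls"),
]
ranked = impacted_tests(edges, ["parse"], lambda name: name.startswith("test_"))
```

Here a patch to `parse` ranks `test_parse` first, reaches `test_cli` indirectly with a lower score, and leaves `test_fmt` out entirely, which is the kind of concise context the workflow hands to the agent.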

TDAD demonstrates that surfacing code–test context outperforms long procedural prompts, particularly in small/medium models where token window efficiency is paramount. For benchmarking, composite metrics penalizing regressions more than missed resolutions are proposed (Alonso, 18 Mar 2026).

6. Vision-Language Multi-Hop Reasoning and HopChain RLVR

Qwen3.5-35B-A3B serves as the backbone for RLVR (reinforcement learning with verifiable rewards) training with HopChain, which targets generalizable multi-hop vision-language reasoning (Wang et al., 17 Mar 2026). The architecture attaches a standard vision encoder to the LLM; no architectural changes are made for multi-hop data integration.

HopChain Synthesis and Training:

Synthesizes chains of 3–6 vision-language hops, each dependent on the previous, ending with a verifiable numeric answer.
Integrates multi-hop instances into RLVR minibatches using Soft Adaptive Policy Optimization (SAPO).
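
A sketch of what a synthesized item and its verifiable reward might look like is given below. The data classes, the dependency flag, and the rule of comparing the last number in the response against the reference answer are all illustrative assumptions; the SAPO update itself is not shown.

```python
import re
from dataclasses import dataclass


@dataclass
class Hop:
    question: str
    depends_on_previous: bool  # True if the hop needs the previous hop's result


@dataclass
class HopChainItem:
    hops: list
    final_answer: float


def is_valid_chain(item, min_hops=3, max_hops=6):
    """Check chain length and that every hop after the first depends on its predecessor."""
    if not (min_hops <= len(item.hops) <= max_hops):
        return False
    return all(hop.depends_on_previous for hop in item.hops[1:])


def verifiable_reward(response, item, tol=1e-6):
    """Return 1.0 if the last number in the response matches the reference, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - item.final_answer) <= tol else 0.0
```

Ending each chain in a numeric answer is what makes the reward checkable by a simple rule rather than by a learned judge.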

Empirical results across 24 benchmarks:

RLVR with HopChain improves average accuracy: STEM/Puzzle (+1.6), General VQA (+1.2), Text/Doc (+1.8), Video (+1.4), Overall (+1.2).
Largest gains manifest in ultra-long-CoT tasks (responses >200 tokens), with gains >50 points in some STEM/video settings.
Error distribution shifts: HopChain-augmented models make fewer perception, reasoning, knowledge, and hallucination errors, with the chain structure forcing repeated attention to visual evidence at each hop (Wang et al., 17 Mar 2026).

7. Conclusion

Qwen3.5-35B-A3B combines MoE routing with a GatedDeltaNet/attention stack and has been used, often as a frozen backbone, for memory augmentation, interpretable safety monitoring, agentic behavioral steering, coding-agent regression mitigation, and vision-language multi-hop reasoning. Across coding benchmarks, vision-language benchmarks, and intervention studies, it demonstrates (a) reusability at the representation level, (b) behavioral controllability through activation steering, and (c) extension beyond its native capabilities via memory grafting and data-augmented RL. Together, these results show that a single pretrained model can be redeployed well beyond its initial pretraining trajectory (Cheng et al., 20 May 2026, Pandya et al., 5 Jun 2026, Alonso, 18 Mar 2026, Yap, 17 Mar 2026, Wang et al., 17 Mar 2026).