- The paper presents an input-conditioned Controller hypernetwork that modulates fixed, SVD-derived LoRA bases to enhance recursive transformer performance.
- It demonstrates up to a 43.4% loss reduction and recovers over 51% of the performance gap with only 0.6% additional trainable parameters.
- The study highlights generalization challenges on unseen data due to frozen downstream layers, suggesting the need for adaptive techniques.
Introduction and Motivation
Ouroboros introduces a novel mechanism for augmenting recursive transformers via input-conditioned, dynamically generated weight modulation. Recursive transformers address the efficiency bottlenecks of large transformer models by reusing a single shared block across multiple steps. While this architecture drastically reduces parameter counts, it imposes a rigid uniformity: each recurrence step applies the same transformation, which limits the capacity for hierarchical or compositional computation across the pseudo-depth. Existing static modifications, such as per-step adapters or LoRA updates, are fixed for each step and cannot adapt to individual inputs, which limits their expressiveness.
Ouroboros remedies this by attaching a compact input-conditioned hypernetwork ("Controller") to the recurrent block. The Controller observes the mean-pooled hidden state at every depth step and emits diagonal modulation vectors for fixed, SVD-initialized LoRA bases. Each recurrence step thus becomes dynamically specialized, contingent on both the input sequence and the state of computation so far.
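Concretely, the mechanism can be read as scaling the rank-r coordinates of a fixed low-rank update before projecting back up. The minimal sketch below illustrates one such modulated projection; the function signature and tensor names are illustrative assumptions, not the paper's actual interface.

```python
import torch

def modulated_projection(h, W, A, B, m):
    """One LoRA-modulated linear projection for a single recurrence step (sketch).

    h: (batch, seq, d_in) hidden states
    W: (d_out, d_in) frozen base weight
    A: (r, d_in), B: (d_out, r) fixed SVD-derived LoRA bases
    m: (batch, r) diagonal modulation vector emitted by the Controller
    """
    base = h @ W.T                      # frozen base path
    lora = (h @ A.T) * m.unsqueeze(1)   # per-example scaling of the rank-r coordinates
    return base + lora @ B.T            # i.e. W h + B diag(m) A h
```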
Architectural Overview
The architecture decomposes a pretrained transformer into three contiguous segments:
- Prelude: Initial frozen layers, mapping embeddings into a reasoning space.
- Recurrent Block: Single shared layer, applied N times; LoRA-modulated per step.
- Coda: Frozen terminal layers, decoding the latent space to token logits.
Crucially, layers removed in compressing the base model to this recurrent form have their knowledge distilled via SVD into fixed LoRA bases. The Controller only learns how to weight (modulate) these bases per input and per depth step.
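A minimal sketch of this distillation step might look as follows, assuming the residuals are taken against the shared block's weight and the square root of the singular values is absorbed into both factors; the function name and argument layout are assumptions for illustration.

```python
import torch

def build_svd_lora_bases(removed_layer_weights, base_weight, rank=32):
    """Distill removed layers into fixed LoRA bases (sketch).

    Averages the residuals of the removed layers' weights against the shared
    block's weight, then truncates an SVD: B = U_r sqrt(S_r), A = sqrt(S_r) V_r^T.
    Both factors are held fixed; only their diagonal modulation is learned.
    """
    residual = torch.stack([w - base_weight for w in removed_layer_weights]).mean(dim=0)
    U, S, Vh = torch.linalg.svd(residual, full_matrices=False)
    sqrt_s = S[:rank].sqrt()
    B = U[:, :rank] * sqrt_s              # (d_out, r), frozen
    A = sqrt_s.unsqueeze(1) * Vh[:rank]   # (r, d_in), frozen
    return A, B
```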
The Controller is realized as a lightweight MLP, which processes the concatenated mean-pooled hidden state and a step embedding, producing modulation vectors for each LoRA target (attention projections and FFN sublayers). Gated recurrence, bias-initialized to yield 88% retention of the previous hidden state, preserves information across steps and stabilizes deep iteration. Each recurrence step also uses a unique LayerNorm, as static normalization was found insufficient.
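A compact sketch of such a Controller is given below, assuming a two-layer MLP and a learned embedding per depth step; the hidden width, activation, and naming are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """Hypernetwork sketch: (mean-pooled hidden state, step embedding) ->
    one diagonal modulation vector per LoRA target (attention projections, FFN)."""
    def __init__(self, d_model, rank, n_targets, n_steps, d_hidden=256):
        super().__init__()
        self.step_emb = nn.Embedding(n_steps, d_model)
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_hidden),
            nn.SiLU(),
            nn.Linear(d_hidden, n_targets * rank),
        )
        self.n_targets, self.rank = n_targets, rank

    def forward(self, h, step):
        pooled = h.mean(dim=1)                          # (batch, d_model)
        idx = torch.full((h.size(0),), step, dtype=torch.long, device=h.device)
        m = self.mlp(torch.cat([pooled, self.step_emb(idx)], dim=-1))
        return m.view(-1, self.n_targets, self.rank)    # modulation vectors per LoRA target
```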
Technical Innovations
Key methodological contributions of Ouroboros are:
- Controller Hypernetwork: Generates step- and input-conditioned LoRA signals for a recurrent transformer block, with minimal parameter overhead (0.6% of the base model).
- SVD-Initialized LoRA Bases: By averaging weight residuals of removed layers and computing a truncated SVD, the LoRA bases encode principal axes of layer-wise variability, held fixed and only modulated.
- Gated Recurrence: The critical role of residual gating is empirically demonstrated; iterative layer application without a gate increases loss, while gating recovers performance (see the sketch after this list).
- Minimal Parameter Footprint: Only 9.2M parameters (Controller, gate, per-step norms) are trained, with the remainder of the model (including embeddings and output head) frozen.
- Dynamic Latent Reasoning: Hidden-state-dependent LoRA modulation allows dynamic, context-sensitive iterative reasoning in hidden space, extending prior work on adaptive computation in transformers.
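The gated recurrence highlighted above can be sketched as a simple identity-biased residual update: initializing the gate bias near +2 gives sigmoid(2) ≈ 0.88, matching the reported ~88% retention of the previous hidden state at initialization. The loop structure, per-step LayerNorm placement, and the `recurrent_block`/`controller` callables are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GatedRecurrence(nn.Module):
    """Identity-biased gated residual update applied between recurrence steps (sketch)."""
    def __init__(self, d_model, n_steps):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)
        nn.init.zeros_(self.gate.weight)
        nn.init.constant_(self.gate.bias, 2.0)     # sigmoid(2.0) ~ 0.88 retention at init
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_steps))

    def run(self, h, recurrent_block, controller, n_steps):
        for step in range(n_steps):
            m = controller(h, step)                        # input- and step-conditioned modulation
            update = recurrent_block(self.norms[step](h), m)
            g = torch.sigmoid(self.gate(h))                # near 0.88 everywhere at init
            h = g * h + (1.0 - g) * update                 # gated residual keeps most of h early on
        return h
```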
Empirical Results
Controller vs. Static LoRA Ablations
The main ablation contrasts input-conditioned (Controller-generated) and static per-step LoRA modulation for the recursive block. On Qwen2.5-3B, reduced to a 17-layer Prelude/Recurrent/Coda configuration, Ouroboros' Controller yields a 43.4% reduction in training loss relative to the unmodified 17-layer baseline, recovering 51.3% of the gap to the original 36-layer model.
Controller outperforms static per-step LoRA in all tested configurations and depths, with the largest margin observed at depth 1 (a 1.44-point reduction in loss). While increasing recurrence steps allows static LoRA more flexibility, Controller-based modulation maintains an edge across depths (1, 4, 8, 16) and LoRA ranks (8, 32, 64).
Performance is robust to hyperparameter variations; the Controller converges to a narrow loss band (≈5.08) across depths, LoRA ranks, and learning rates. Notably, training loss does not improve with increased depth, suggesting that a single dynamically modulated pass suffices under the current setup and that additional recurrence is heavily damped by the gate.
Necessity of Gated Recurrence
Without gated recurrence, iterative application of the recurrent block degrades performance; loss increases over the baseline. Only when employing identity-biased gating does the model benefit from recursive computation, aligning with independent findings on the instability of ungated recurrence in transformers.
Limitations: Generalization Gap
A substantial limitation arises on held-out evaluation: the Controller-driven system does not transfer its performance gains to unseen data. The likely failure mode is representational mismatch: the frozen coda layers expect pre-recursion hidden-state statistics, which are shifted by the Controller's modulations. Controlled experiments with regularization and constrained LoRA scaling marginally reduce but do not eliminate this gap.
Theoretical and Practical Implications
Ouroboros establishes the effectiveness of input- and step-conditioned weight modulation in recursive transformers, demonstrating substantial reductions in parameter count while partially bridging the performance deficit associated with depth compression. The approach confirms that much of a transformer's hierarchical computation can be compactly represented as data-dependent traversal over learned low-rank directions. The system's strong performance with only 0.6% additional parameters points to promising directions in modular, hypernetwork-augmented models for resource-constrained inference.
However, the failure to generalize suggests a bottleneck in freezing downstream layers, raising theoretical questions about the interplay between recurrent latent adaptation and decoder adaptation. Future model editing techniques or semi-frozen fine-tuning could address this representational drift.
The depth invariance observed indicates that, in its current configuration, adaptive, input-dependent halting (i.e., allocating deeper computation to harder inputs) is not yet realized; rather, the Controller front-loads all available computation into the first recurrence. Practical deployments can leverage this by minimizing computational cost, but broader latent reasoning capacity may require modified gating, halting, or more expressive Controller architectures.
Future Directions
- Unfreezing Coda Layers: Allowing downstream adaptation or parametrizing lightweight adapters may resolve the generalization bottleneck.
- Adaptive Halting: Exploration into learned computation allocation per input could enable true adaptive reasoning advantages.
- Scaling Studies: Assessing Controller-vs-static modulation at larger scales (e.g., 7B, 70B models) will clarify the method's applicability.
- Benchmarking: Measuring end-task performance (e.g., on GSM8K, ARC-Challenge) is required to establish practical gains beyond intrinsic loss metrics.
Conclusion
Ouroboros demonstrates that a compact, input-conditioned Controller hypernetwork can drive effective, training-distribution-specific improvement in recursive transformers by dynamically modulating fixed, SVD-derived LoRA bases. The system achieves significant loss reductions over static and parameter-matched baselines, with practical parameter efficiency. Full generalization and effective use of deep reasoning require further architectural adaptation, particularly in the unfrozen adaptation of downstream layers and integration of adaptive computation mechanisms. The open-source release facilitates extension and benchmarking in broader LLM adaptation research.