
Subjective Depth Transformers (SDT)

Updated 3 December 2025
  • Subjective Depth Transformers (SDT) are dynamic Transformer models that use learnable, data-driven gating to select computational layers, enhancing both efficiency and task performance.
  • They employ probabilistic layer selection via Bernoulli latents and variational training with ELBO objectives, enabling precise control over active depth.
  • SDT variants leverage Bayesian surprise routing and intent-guided strategies, adapting seamlessly to multilingual sequence modeling and vision-based depth estimation.

Subjective Depth Transformers (SDT) denote a class of Transformer architectures in which depth selection—in terms of computational layer execution or depth perception—is governed by learnable, data- or intent-driven routing signals rather than fixed, static computation schedules. SDT models span sequence modeling, conditional computation, and intent-guided depth estimation, unifying diverse advances in dynamic model selection and active perception.

1. Foundational Principles and Probabilistic Layer Selection

SDT frameworks in sequence modeling formalize layer selection as inference over Bernoulli latents, each indicating whether a layer is executed or skipped for a given input. For an $L$-layer Transformer, SDT introduces a binary variable $z_l \in \{0, 1\}$ per layer and models the generative process

$$z = (z_0, \ldots, z_{L-1}) \sim p(z), \qquad x_{l+1} = x_l + z_l \cdot F_l(x_l), \qquad y \sim p_\Theta(y \mid x, z).$$

The prior $p(z)$ can be factorized as independent Bernoullis with an optional Beta hyperprior on the selection probabilities; during training, the amortized posterior $q_\phi(z \mid x)$ factorizes and is parameterized by learned logits, enabling input-conditional layer selection. Differentiable sampling is achieved via the Gumbel-Softmax/Concrete relaxation, ensuring full gradient flow without the need for REINFORCE estimators (Li et al., 2020).
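A minimal sketch of one such gated residual layer, assuming a PyTorch-style implementation in which a linear head on the pooled hidden state produces the posterior logit and a Gumbel-Softmax relaxation yields a differentiable gate (the names GatedLayer and gate, and the mean-pooling choice, are illustrative rather than taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLayer(nn.Module):
    """One residual sub-block with an amortized Bernoulli gate (illustrative sketch)."""

    def __init__(self, block: nn.Module, d_model: int):
        super().__init__()
        self.block = block                 # F_l: e.g. an attention or feed-forward block
        self.gate = nn.Linear(d_model, 1)  # produces the logit of q_phi(z_l = 1 | x)

    def forward(self, x: torch.Tensor, tau: float = 1.0):
        # x: (batch, seq, d_model); pool over tokens to get one gate per sequence
        logit = self.gate(x.mean(dim=1))                               # (batch, 1)
        logits = torch.cat([logit, torch.zeros_like(logit)], dim=-1)   # (batch, 2)
        # Gumbel-Softmax / Concrete relaxation keeps the binary sample differentiable
        z = F.gumbel_softmax(logits, tau=tau, hard=False)[:, :1]       # (batch, 1)
        z = z.unsqueeze(-1)                                            # (batch, 1, 1)
        # Residual update: x_{l+1} = x_l + z_l * F_l(x_l)
        return x + z * self.block(x), z.view(-1)
```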

2. Variational Training and Depth Control Objectives

Optimization proceeds via the evidence lower bound (ELBO) on the marginal log-likelihood, enforcing fidelity between the learned gating and the prior:

$$\log p(y \mid x) \geq \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\Theta(y \mid x, z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big).$$

Depth targets are promoted with an auxiliary penalty that controls the expected number of active layers $K$:

$$L_K = \Big\| \mathbb{E}_{q_\phi}\Big[\textstyle\sum_{l=0}^{L-1} z_l\Big] - K \Big\|^2_2.$$

The final objective aggregates reconstruction, KL regularization, and depth control:

$$L_{\text{total}} = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\Theta(y \mid x, z)\big] + \beta\,\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) + \lambda L_K,$$

where $\beta$ and $\lambda$ are tuned to balance fidelity and regularization (Li et al., 2020).
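A minimal sketch of this objective, assuming per-example negative log-likelihoods and factorized Bernoulli gate probabilities are already available (the function name sdt_loss and all hyperparameter values are placeholders, not the paper's settings):

```python
import torch

def sdt_loss(nll, gate_probs, prior_p=0.5, K=6.0, beta=1.0, lam=0.1):
    """Total objective: NLL + beta * KL(q || p) + lambda * depth penalty.

    nll:        (batch,) negative log-likelihood -log p_Theta(y | x, z)
    gate_probs: (batch, L) posterior probabilities q_phi(z_l = 1 | x)
    """
    q = gate_probs.clamp(1e-6, 1 - 1e-6)
    p = torch.full_like(q, prior_p)
    # KL between factorized Bernoulli posterior and prior, summed over layers
    kl = (q * (q / p).log() + (1 - q) * ((1 - q) / (1 - p)).log()).sum(dim=-1)
    # Penalty steering the expected number of active layers toward K
    depth_pen = (q.sum(dim=-1) - K).pow(2)
    return (nll + beta * kl + lam * depth_pen).mean()
```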

3. Bayesian Surprise Routing and Dynamic Depth Allocation

Recent SDT variants in conditional computation adopt Bayesian surprise as a routing criterion, recognizing that uniform compute allocation is suboptimal for large-scale and long-context tasks. Layers alternate between a Decision layer, which computes both a full Transformer block (posterior) and a lightweight prior residual via an auxiliary MLP, and a Dynamic layer, which uses surprise scores, computed as surrogate KL divergences via squared $L_2$ distances between actual and predicted residuals, to produce a gating signal:

$$g_{\mathrm{cont},t} = \sigma(\beta_{ce}\,\mathrm{CE}_t) + \sigma(\beta_{cu}\,\mathrm{CU}_t) - \sigma(\beta_{ce}\,\mathrm{CE}_t)\,\sigma(\beta_{cu}\,\mathrm{CU}_t).$$

Top-K routing then selects a fraction $\gamma$ of tokens, enforcing static-graph constraints and compute predictability. Over training, the gating criterion evolves from novelty detection (unexpected change) to prediction-driven routing (expected change), consistent with predictive coding theory (Wieser et al., 26 Nov 2025).
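A sketch of the gating-and-routing step, assuming the two surprise scores are squared-$L_2$ surrogates computed from the full-block and prior residuals (the exact definitions of CE_t and CU_t below are assumptions; only the surrogate-KL idea and the probabilistic-OR combination follow the description above):

```python
import torch

def surprise_gate(resid_full, resid_prior, beta_ce=1.0, beta_cu=1.0, gamma=0.5):
    """Noisy-OR combination of two surprise scores followed by top-k token routing.

    resid_full, resid_prior: (batch, seq, d) residuals from the full Transformer
    block and from the lightweight prior MLP, respectively.
    """
    ce = (resid_full - resid_prior).pow(2).sum(dim=-1)  # surprise of the prediction error
    cu = resid_full.pow(2).sum(dim=-1)                  # surprise of the update itself
    a = torch.sigmoid(beta_ce * ce)
    b = torch.sigmoid(beta_cu * cu)
    g = a + b - a * b                                   # g_cont: OR of the two signals
    # Static top-k routing: only a gamma fraction of tokens gets the full computation
    k = max(1, int(gamma * g.shape[-1]))
    top_idx = g.topk(k, dim=-1).indices
    mask = torch.zeros_like(g).scatter_(-1, top_idx, 1.0).bool()
    return g, mask
```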

4. Multilingual and Intent-Conditioned Extensions

SDT’s methodology generalizes across tasks and modalities. In multilingual sequence modeling, each language pair $n$ is assigned a distinct inference network $q_\phi^{(n)}(z \mid x)$, capturing language-specific layer usage. The aggregated posterior serves as a shared prior that promotes cross-language structure, penalizing deviations in layer-selection patterns via a KL divergence:

$$\widetilde{q}(z) = \frac{1}{N} \sum_{n=1}^N q_\phi^{(n)}(z \mid x^{(n)}), \qquad \mathrm{KL}\big(q_\phi^{(n)}(z \mid \cdot)\,\|\,\widetilde{q}(z)\big).$$

Low-resource languages learn to gate out more layers, reducing overfitting, while complex tasks retain deeper processing. Each language’s gradients are normalized by stochastic gating, improving training stability for deep stacks (Li et al., 2020).
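A minimal sketch of the shared-prior KL term, under the factorized Bernoulli assumption used throughout (the function name and tensor layout are illustrative):

```python
import torch

def shared_prior_kl(gate_probs_per_lang):
    """KL from each language-specific posterior to the aggregated (mean) posterior.

    gate_probs_per_lang: (N, L) tensor of q_phi^{(n)}(z_l = 1 | x) probabilities.
    """
    q = gate_probs_per_lang.clamp(1e-6, 1 - 1e-6)
    q_bar = q.mean(dim=0, keepdim=True)          # aggregated posterior \tilde{q}(z)
    kl = q * (q / q_bar).log() + (1 - q) * ((1 - q) / (1 - q_bar)).log()
    return kl.sum(dim=-1)                        # one KL value per language pair
```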

In vision, SDT reinterprets depth estimation as an intent-driven process. DepthFocus introduces a “depth preference” scalar $z$ that steers focus across layered ambiguities in stereo matching. The model learns $d_z = f_\theta(I_L, I_R; z)$, with the depth prediction modulated by $z$ via conditional mixture-of-experts routing and direct condition injection, thereby producing a family of subjective depth maps. Intent alignment is quantified by the rank correlation between $z$ and the estimated depths; on benchmark datasets, $\rho \approx 0.92$ indicates effective traversal of layered surfaces (Min et al., 21 Nov 2025).
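A simple proxy for that intent-alignment metric, assuming the correlation is computed between the preference scalar $z$ and a summary statistic of each predicted depth map (using the map's mean is an assumption here, not the benchmark's exact protocol):

```python
import torch

def intent_alignment(z_values, mean_depths):
    """Spearman rank correlation between the depth-preference scalar z and the mean
    of the corresponding predicted depth map d_z (both 1-D tensors of equal length)."""
    def ranks(t):
        return t.argsort().argsort().float()     # ordinal ranks (ties ignored)
    rz, rd = ranks(z_values), ranks(mean_depths)
    rz = (rz - rz.mean()) / rz.std(unbiased=False)
    rd = (rd - rd.mean()) / rd.std(unbiased=False)
    return (rz * rd).mean()                      # Pearson correlation of the ranks
```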

5. Empirical Performance and Efficiency

SDT variants demonstrate consistent improvements in gradient health and generalization across settings. Classical SDT eliminates vanishing-gradient phenomena in deep Transformers, enabling stable training with up to 100 layers and outperforming static baselines as well as LayerDrop and FixUp solutions. On WMT’16 English–German, SDT achieves higher BLEU scores than Transformer-Big, and it delivers robust masked language modeling performance on 25-language Wikipedia data. Multilingual SDT exhibits adaptive layer usage and improvements across the board over both “wide” models and static baselines (Li et al., 2020).

In conditional-computation SDT, per-layer FLOP and memory reductions are substantial: up to 75% less self-attention compute and a 50% reduction in key-value cache for each compute-skipping layer. At $\gamma = 0.5$, self-attention FLOPs drop to 62.5% of dense computation. Accuracy-compute trade-offs are explicit, e.g., MMLU scores reduced from 55.9% (dense) to 24.4% (SDT, fixed) for a 0.5B model, but with ~37.5% savings (Wieser et al., 26 Nov 2025). Lightweight priors suffice for effective routing; full fine-tuning outperforms LoRA for model adaptation.
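A back-of-the-envelope check on how the per-layer and aggregate figures fit together, under the assumption that layers alternate between dense Decision layers and Dynamic layers whose self-attention cost scales quadratically in the routed fraction (this accounting is an inference from the numbers above, not an equation given in the paper):

```python
# Assumes half the layers are dense Decision layers and half are Dynamic layers
# whose self-attention cost scales as gamma**2 (queries and keys both restricted
# to the routed tokens).
gamma = 0.5
dynamic_layer_cost = gamma ** 2                   # 0.25 -> "75% less" per skipping layer
kv_cache_cost = gamma                             # 0.50 -> "50% reduction" in KV cache
avg_attention_cost = 0.5 * 1.0 + 0.5 * dynamic_layer_cost
print(avg_attention_cost)                         # 0.625 -> 62.5% of dense FLOPs
```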

Depth estimation SDT matches or exceeds state-of-the-art on BOOSTER and synthetic multi-layer benchmarks, delivering strong performance in intent-driven recovery and layered scene generalization, both on synthetic and real data acquired with transmissive plates (Min et al., 21 Nov 2025).

6. Architectural Variants and Practical Implications

SDT embodies several architectural directions:

| Variant | Domain | Core Mechanism |
|---|---|---|
| Probabilistic SDT | NLP, MT, LM | Bernoulli gating w/ ELBO |
| Bayesian-routing SDT | Conditional compute | Top-K surprise-driven routing |
| Intent-guided SDT | Vision (stereo) | Scalar-conditioned MoE & direct condition injection |

Applications span long-context modeling (where many tokens can be skipped), real-time/low-latency inference, multi-modal architectures needing dynamic depth allocation, and see-through 3D perception in transmissive environments. Fixed-capacity routing supports deployment on hardware with static graph requirements (Wieser et al., 26 Nov 2025), while intent-aligned SDT opens pathways to active depth estimation, competitive with human perceptual focus (Min et al., 21 Nov 2025).

7. Limitations and Future Directions

SDT inherits several limitations from its underlying mechanisms. In vision, ambiguous or closely spaced transmissive layers challenge routing reliability, especially when stereo parallax is insufficient. Extremely low-contrast see-through materials may blur expert selection. SDT’s current memory and efficiency benefits trade off against accuracy, especially under aggressive compute reduction (as shown in Table 1 of Wieser et al., 26 Nov 2025). A plausible implication is that future SDT variants may incorporate explicit reflection modeling, temporal cues, or learnable attention patterns indexed by depth or surprise criteria.

Empirical findings suggest SDT models' training dynamics are aligned with predictive-coding theories, transitioning from novelty-driven gating to prediction-error minimization. Potential future work includes dynamic temporal gating (as in STT), more granular multi-modal extensions, and integration with hardware-aware model design.


Subjective Depth Transformers unify probabilistic, surprise-driven, and intent-conditioned computation in Transformer architectures, yielding dynamic depth allocation for both sequence models and active perception. These advances enable scalable, efficient, and flexible deep learning systems across language, vision, and multi-modal domains (Li et al., 2020, Wieser et al., 26 Nov 2025, Min et al., 21 Nov 2025).
