Differentiable Adaptive Computation Time (DACT)
- DACT is a mechanism that allows neural networks to dynamically adjust the number of computation steps using differentiable halting units.
- It employs soft, sigmoidal gating and convex interpolation over intermediate states, integrating a computation penalty to balance efficiency and accuracy.
- DACT has been applied to RNNs, CNNs, MAC modules, and Transformers, achieving reduced computation with maintained performance and interpretability.
Differentiable Adaptive Computation Time (DACT) refers to a family of mechanisms that enable neural networks to dynamically adjust their computational depth or execution steps in a fully differentiable, end-to-end trainable manner. Unlike approaches requiring discrete or sampled halting decisions (which necessitate gradient estimators such as REINFORCE), DACT methods employ soft, sigmoidal gating and convex interpolation over intermediate states, ensuring standard backpropagation can optimize both the main prediction network and its halting policy jointly. DACT has been instantiated across architectures—RNNs, CNNs (notably Residual Networks), attention-based reasoning modules, and large-scale Transformers—enabling cost/accuracy trade-offs, increased interpretability, and context-sensitive computation across both vision and language domains (Graves, 2016, Figurnov et al., 2016, Eyzaguirre et al., 2020, Eyzaguirre et al., 2021, Neumann et al., 2016, Figurnov et al., 2017).
1. Core Formulation: Halting Units and Weighted Outputs
The central element of DACT is the introduction of a halting unit at each (potentially repeated) step or block within a neural architecture. This is a scalar gating node, most commonly taking the form $h_n = \sigma(W_h s_n + b_h)$, where $s_n$ is the relevant activation or memory at the $n$-th iteration (e.g., RNN state, residual block output, or Transformer [CLS] embedding). The halt-scores $h_n$ act as per-step or per-block probabilities that accumulate to determine when computation should terminate.
At each time step or block, one computes a cumulative sum or product of these values to determine a distribution over when to halt. For instance, the Graves ACT/DACT variant defines $N(t) = \min\{n : \sum_{n'=1}^{n} h_{n'}^t \geq 1 - \varepsilon\}$ and constructs $p_n^t = h_n^t$ for $n < N(t)$, $p_{N(t)}^t = R(t) = 1 - \sum_{n'=1}^{N(t)-1} h_{n'}^t$, and zero otherwise (Graves, 2016).
The final state or output is a convex combination over all intermediate states, weighted by $p_n^t$, thus
$$s_t = \sum_{n=1}^{N(t)} p_n^t \, s_n^t, \qquad y_t = \sum_{n=1}^{N(t)} p_n^t \, y_n^t,$$
with $\sum_n p_n^t = 1$ (up to numerical or $\varepsilon$ error). This scheme ensures full differentiability and determinism, as all gates are trained using backpropagation and no discrete sampling or non-differentiable control flow is present.
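The halting scheme above can be sketched in a few lines of NumPy. This is a minimal illustration, not any paper's reference implementation: the linear halting head (`w`, `b`) and the randomly generated `states` stand in for the base network's real hidden states.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def act_halting(states, w, b, eps=0.01):
    """Turn per-step halting scores into weights p_n and blend the states.

    `states` holds the intermediate states s_1..s_N from the base network;
    `w`, `b` parameterize a toy linear halting head.
    """
    p, cum = [], 0.0
    for n, s in enumerate(states):
        h = sigmoid(w @ s + b)                  # scalar halting score h_n
        if cum + h >= 1.0 - eps or n == len(states) - 1:
            p.append(1.0 - cum)                 # remainder R(t): sum(p) == 1
            break
        p.append(h)
        cum += h
    p = np.array(p)
    blended = sum(pn * sn for pn, sn in zip(p, states))  # convex combination
    return blended, p

rng = np.random.default_rng(0)
states = [rng.normal(size=4) for _ in range(10)]
out, p = act_halting(states, w=rng.normal(size=4), b=0.0)
```

Because the remainder is always appended, the weights form a proper distribution over the steps actually used, and the blended output lies in the convex hull of the intermediate states.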
2. Loss Function and Computation-Penalty Tradeoffs
DACT algorithms augment the standard prediction loss (classification, regression, etc.) with an explicit computation-time penalty, typically called the ponder cost. Across all variants, this penalty is proportional either to the expected number of computation steps or blocks, or to the cumulative probabilities assigned to later steps. For example, the classical RNN ACT loss takes the form
$$\mathcal{L} = \hat{\mathcal{L}}(x, y) + \tau \sum_t \big(N(t) + R(t)\big),$$
where $\tau$ is a hyperparameter controlling the computation-accuracy trade-off (Graves, 2016, Figurnov et al., 2016, Eyzaguirre et al., 2020, Eyzaguirre et al., 2021). Larger $\tau$ pushes the network toward using fewer computational resources per example, at a potential cost to predictive performance.
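The trade-off governed by $\tau$ is easy to see numerically. In this hedged toy example (all numbers are illustrative, not taken from any paper), a slightly worse task loss obtained with far fewer steps yields a lower total loss under a fixed $\tau$:

```python
def total_loss(task_loss, num_steps, remainder, tau):
    """Task loss plus tau * (N + R), the per-example ponder cost."""
    return task_loss + tau * (num_steps + remainder)

# A shallow run (3 steps) vs. a deep run (9 steps) on the same example:
shallow = total_loss(task_loss=0.30, num_steps=3, remainder=0.2, tau=0.05)
deep = total_loss(task_loss=0.25, num_steps=9, remainder=0.7, tau=0.05)
```

With these toy values the shallow run wins (0.46 vs. 0.735) even though its task loss is higher; shrinking `tau` toward zero reverses the preference.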
Some formulations employ a probabilistic prior (e.g., truncated geometric) on the halting position and optimize a variational bound using Relaxed Concrete or Gumbel-Softmax relaxations for expected-step cost (Figurnov et al., 2017).
3. Algorithmic Implementation and Training Dynamics
DACT is implemented by augmenting the computation graph of a base network—such as an RNN, ResNet, MAC module, or Transformer stack—with per-step halting units and the logic for cumulative gate computation. During training, all steps are unrolled up to a maximum (e.g., maximum micro-steps in ACT, residual units in SACT, or Transformer layers in DACT-BERT), and the convex weighted sum over intermediate outputs is backpropagated. The halting scores are produced via lightweight MLPs, convolutions, or linear projections conditioned on the relevant hidden state or input.
Optimization uses standard SGD, Adam, or AdaGrad, and does not require gradient estimators for discrete decisions, as all "halting" operations are smooth. Gradients with respect to the halting scores are well-defined almost everywhere; the halting index $N(t)$ itself is piecewise constant, so it is treated as fixed during backpropagation and contributes zero gradient (Graves, 2016, Figurnov et al., 2016).
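The "smooth almost everywhere" claim can be checked numerically: perturbing the halting head's parameters changes the blended output continuously, so a finite-difference derivative exists. All names below (`blended_output`, the linear head `w_halt`) are toy stand-ins for the real network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def blended_output(w_halt, states, eps=0.01):
    """Convex combination of states under a linear halting head w_halt."""
    p, cum = [], 0.0
    for n, s in enumerate(states):
        h = sigmoid(w_halt @ s)
        if cum + h >= 1.0 - eps or n == len(states) - 1:
            p.append(1.0 - cum)
            break
        p.append(h)
        cum += h
    return sum(pn * sn for pn, sn in zip(p, states))

rng = np.random.default_rng(1)
states = [rng.normal(size=3) for _ in range(8)]
w = rng.normal(size=3)
delta = 1e-5 * np.eye(3)[0]
# Central-difference derivative of the blended output w.r.t. w_halt[0]:
grad_est = (blended_output(w + delta, states)
            - blended_output(w - delta, states)) / 2e-5
```

At the measure-zero points where the halting index $N(t)$ flips, the function has a jump; everywhere else the estimate above matches the gradient that backpropagation would compute through the sigmoid gates.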
Table 1: Key Steps in DACT Algorithm
| Step | Operation | Output |
|---|---|---|
| Compute halting unit | $h_n = \sigma(W_h s_n + b_h)$ | Soft scalar $h_n \in (0, 1)$ |
| Accumulate halt probability | Sum or product of $h_n$ (variant-dependent) | $p_n$ or remainder $R$ |
| Weighted output | $\sum_n p_n s_n$ over intermediate states | Final output for step/block |
| Add penalty | Aggregate ponder cost (e.g., $N + R$) | Added to main loss function |
| Backpropagation | Gradients flow through all smooth operations | End-to-end differentiability |
4. DACT for Convolutional, Sequential, and Attention Architectures
DACT has been specialized for various network topologies:
- RNN/GRU/LSTM: Input is "pondered" via micro-steps, with a sigmoid halting unit controlling how many intermediate recurrences occur per macro-input. Used for parity, logic, addition, sorting (synthetic), and language modeling (Graves, 2016).
- Residual Networks/SACT: Each spatial location in each residual block computes its own halting score; blocks dynamically skip computation in locations where the cumulative halt score crosses $1 - \varepsilon$ (Figurnov et al., 2016). This spatial DACT allows localized, data-dependent depth in CNNs.
- MAC for Visual Reasoning: Each reasoning step's cell (control, read, write) appends a halting unit; accumulated outputs are convexly combined, and a test-time halting condition ensures efficiency (Eyzaguirre et al., 2020).
- Transformer/BERT: Scalar halting scores attached to each block condition on the [CLS] embedding; the forward pass accumulates a running convex combination of logits, with an explicit class-stability bound used to halt early (Eyzaguirre et al., 2021).
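The Transformer variant's running combination of logits can be sketched as a simple convex update. This is a deliberately simplified stand-in, assuming an exponential-smoothing-style recurrence and a crude early-exit rule; the papers' exact recurrence and class-stability bound differ in detail.

```python
import numpy as np

def blend_logits(block_logits, halt_scores, exit_threshold=1e-3):
    """Running convex update: acc <- (1 - h) * acc + h * y, per block."""
    acc = np.asarray(block_logits[0], dtype=float)
    used = 1
    for y, h in zip(block_logits[1:], halt_scores[1:]):
        acc = (1.0 - h) * acc + h * np.asarray(y, dtype=float)
        used += 1
        # Crude stand-in for the class-stability bound: once a block's
        # halting score is tiny, later blocks barely move the blend.
        if h < exit_threshold:
            break
    return acc, used

# Four blocks, but the third block's near-zero score triggers early exit:
logits = [np.array([2.0, 0.5]), np.array([2.2, 0.4]),
          np.array([2.1, 0.6]), np.array([0.0, 3.0])]
scores = [1.0, 0.3, 0.0005, 0.9]
final, used = blend_logits(logits, scores)
```

Because every update is a convex combination, the accumulator stays inside the hull of the per-block logits, which is what makes the early-exit bound on the argmax possible in the first place.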
DACT's convex interpolation property enables interpretable reasoning when visualizing attention or step-wise outputs, often correlating with instance complexity or human saliency (Figurnov et al., 2016, Eyzaguirre et al., 2020, Eyzaguirre et al., 2021).
5. Experimental Results and Empirical Tradeoffs
Across domains and architectures, DACT achieves substantial reductions in average computation—steps, blocks, or layers—while preserving or improving accuracy relative to fixed-depth baselines:
- ResNet (Spatial SACT): Reduces ImageNet FLOPs substantially relative to the ResNet-101 baseline with almost no top-1 accuracy drop (76.0\% vs. 75.6\%), and even outperforms the baseline at higher test resolutions (Figurnov et al., 2016).
- Object Detection (COCO): SACT feature extractor yields higher mAP at fewer FLOPs than comparable static networks.
- Visual Saliency: Learned ponder maps correlate with human eye fixations on the CAT2000 benchmark (AUC 84.7\% vs. 83.4\% center-only) (Figurnov et al., 2016).
- RNNs: On synthetic tasks, ACT/DACT achieves adaptive runtimes matching problem complexity and interpretable allocation of computation (e.g., more pondering on ambiguous text characters or segment boundaries) (Graves, 2016).
- Visual Reasoning (CLEVR/GQA): DACT-MAC attains 98.7\% accuracy with 5 adaptive steps, outperforming non-adaptive fixed-step networks at the same or lower computation (Eyzaguirre et al., 2020).
- BERT-based Transformers: DACT-BERT matches or exceeds the performance of methods such as DeeBERT or PABEE in low-compute regimes, achieving similar accuracy with only half the layers (e.g., 6/12) (Eyzaguirre et al., 2021). The area under the compute-accuracy curve is systematically better on multiple GLUE tasks.
6. Probabilistic and Relaxed Extensions of DACT
Beyond deterministic convex interpolations, DACT has been reframed probabilistically by introducing latent halting variables (one per block), endowed with truncated geometric priors penalizing long runtimes (Figurnov et al., 2017). Learning proceeds via amortized MAP inference using stochastic variational optimization, with the Concrete (Gumbel-Softmax) relaxation enabling differentiable surrogates for these discrete halting variables. At test time, a thresholding policy yields efficient deterministic paths. This probabilistic perspective quantifies model uncertainty about required computation and enables amortized inference over input-dependent runtimes.
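The Concrete (Gumbel-Softmax) relaxation mentioned above replaces a discrete halting position with a differentiable soft one-hot sample. The sketch below is a generic illustration, assuming toy logits shaped like a truncated-geometric prior over six blocks; it is not the variational objective of the cited paper.

```python
import numpy as np

def concrete_sample(logits, temperature, rng):
    """Differentiable soft one-hot over halting positions (Concrete dist.)."""
    u = rng.uniform(1e-12, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    z = (logits + g) / temperature
    e = np.exp(z - z.max())                 # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
# Truncated-geometric-style prior over 6 blocks, favoring earlier halting:
prior_logits = np.arange(6) * np.log(0.5)
soft_halt = concrete_sample(prior_logits, temperature=0.5, rng=rng)
# Differentiable proxy for the expected-step cost:
expected_steps = float((np.arange(1, 7) * soft_halt).sum())
```

Lowering the temperature sharpens `soft_halt` toward a hard one-hot choice, recovering discrete halting in the limit while keeping gradients usable during training.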
7. Limitations and Future Directions
Limitations noted include conservative test-time halting rules (which may overrun steps), evaluation focused on particular architectures (e.g., MAC, RNNs, ResNets, BERT), and the need for careful hyperparameter tuning (ponder cost, sigmoid biases, thresholds). Open areas include generalization to arbitrary modular/topological structures (e.g., deeper CNN/Transformer hybrids), richer penalty forms (nonlinear ponder costs), integration with mixture-of-experts routing, and joint exploitation of conditional halting and attention mechanisms (Figurnov et al., 2016, Eyzaguirre et al., 2020, Figurnov et al., 2017).
A plausible implication is that DACT's interpolative and differentiable character can be universally integrated into any architecture where a sequence of computational modules admits per-step gating, provided convex interpolation over outputs is meaningful for the target task. This suggests future architectures may unify sparse/compositional computation and dynamic depth within a single differentiable framework, simultaneously advancing efficiency, interpretability, and performance.