Activation Consistency Training (ACT)
- Activation Consistency Training (ACT) is a self-supervised approach that enforces invariant internal representations across input transformations.
- ACT regularizes layer-level activations via prompt augmentation in LLMs and gradient-based minimax optimization over local input perturbations in vision networks, enhancing model safety and alignment.
- Empirical studies show that ACT improves sycophancy avoidance in large language models and verification performance in convolutional networks.
Activation Consistency Training (ACT) denotes a family of self-supervised algorithms that enforce invariance in a neural network's internal representations (activations), rather than only its external outputs, against certain classes of input transformations or perturbations. ACT’s principal aim is mechanistic robustness: the network should process semantically equivalent or adversarially transformed inputs in an internally indistinguishable manner, thereby improving model safety, verifiability, and resistance to manipulative cues—without relying on data- or label-heavy supervision.
1. Conceptual Foundations: Activation Invariance and Robustness
Activation Consistency Training originated from the recognition that alignment failures and vulnerabilities in neural models often arise from discrepancies in internal representation induced by irrelevant changes in inputs. While traditional consistency training enforces invariance at the output level (for example, by matching outputs on clean vs. transformed prompts), ACT regularizes the model's latent computations, specifically the layer-level activations or neuron states. For LLMs, this entails enforcing that residual stream activations computed for a core prompt remain similar, independent of surrounding sycophantic or jailbreak text (Irpan et al., 31 Oct 2025). In vision and generic network settings, ACT encourages stable neuron activation patterns under small input perturbations (Liu et al., 17 Dec 2024).
ACT generalizes across domains: in security verification, reduced neuron instability makes networks amenable to formal certification (Liu et al., 17 Dec 2024); in LLM alignment, ACT reduces undesirable prompt sensitivity and manipulative response behaviors without specifying explicit output targets (Irpan et al., 31 Oct 2025). The underlying hypothesis is that robust internal representations prevent failures at the output level.
2. Algorithmic Formulation and Mathematical Structure
(a) Mechanism in LLMs
For transformers, let $h_\ell^{(t)}(x)$ denote the residual stream activation at layer $\ell$, position $t$, for prompt $x$. Given a prompt pair $(x_{\text{clean}}, x_{\text{wrap}})$ sharing a common semantic suffix $S$, ACT minimizes the squared distance between their activations on this shared segment:

$$\mathcal{L}_{\text{ACT}} = \sum_{\ell} \sum_{t \in S} \left\| h_\ell^{(t)}(x_{\text{wrap}}) - \operatorname{sg}\!\left[ h_\ell^{(t)}(x_{\text{clean}}) \right] \right\|_2^2$$

Here, the stop-gradient ($\operatorname{sg}$) on activations from the clean prompt ensures that only the wrapped prompt's activations are updated. The loss is applied only over the suffix (the shared logical core), not over prefix positions introduced by adversarial augmentation. Training data consists of such $(x_{\text{clean}}, x_{\text{wrap}})$ pairs, constructed via prompt augmentation methods that induce sycophancy or jailbreak cues (Irpan et al., 31 Oct 2025).
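A minimal PyTorch-style sketch of this loss follows, assuming a HuggingFace-style model whose forward pass returns per-layer hidden states; the function name, layer selection, and suffix indexing are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def act_loss(model, clean_ids, wrapped_ids, suffix_len, layers):
    """ACT consistency loss over the shared suffix of a (clean, wrapped) pair.

    Assumes a HuggingFace-style model where
    model(input_ids, output_hidden_states=True).hidden_states is a tuple of
    [batch, seq, d_model] tensors, one per layer. `suffix_len` is the length
    of the shared suffix (the clean prompt content).
    """
    # Clean-prompt activations serve as fixed targets: no gradient flows
    # through them (this plays the role of the stop-gradient sg above).
    with torch.no_grad():
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states
    wrapped_h = model(wrapped_ids, output_hidden_states=True).hidden_states

    loss = 0.0
    for l in layers:
        # Align the last `suffix_len` positions of both prompts.
        target = clean_h[l][:, -suffix_len:, :]
        pred = wrapped_h[l][:, -suffix_len:, :]
        loss = loss + F.mse_loss(pred, target)
    return loss / len(layers)
```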
(b) Mechanism in Verification-Friendly Neural Networks
For feed-forward or CNN architectures, activation consistency is formalized at the neuron level. If $n_{i,j}$ is neuron $j$ in layer $i$, the neuron behavior consistency indicator for an input $x$ and a perturbed input $x'$ is

$$C_{i,j}(x, x') = \mathbb{1}\!\left[ \operatorname{sign}\big( \hat{z}_{i,j}(x) \big) = \operatorname{sign}\big( \hat{z}_{i,j}(x') \big) \right]$$

where $\hat{z}_{i,j}(x)$ is the pre-activation of neuron $n_{i,j}$. The training objective combines the standard task loss (cross-entropy) with a consistency regularizer over each input $x$ and its local perturbations $x' \in B_\epsilon(x)$:

$$\min_\theta \; \mathbb{E}_{(x,y)} \left[ \mathcal{L}_{\text{CE}}\big(f_\theta(x), y\big) + \lambda \max_{x' \in B_\epsilon(x)} \mathcal{L}_{\text{con}}(x, x') \right]$$
In practice, this is relaxed for differentiability, e.g., via cosine similarity for hidden layers and KL-divergence for output distributions (Liu et al., 17 Dec 2024).
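A minimal PyTorch-style sketch of the relaxed regularizer is below. It assumes access to per-layer pre-activations for the clean and perturbed inputs; the layer weighting and the exact pairing of cosine similarity with KL-divergence are illustrative choices, not the paper's precise recipe.

```python
import torch.nn.functional as F

def relaxed_consistency(hidden_x, hidden_xp, logits_x, logits_xp):
    """Differentiable relaxation of neuron behavior consistency.

    hidden_x / hidden_xp: lists of per-layer pre-activation tensors (batch
    dimension first) for the clean input x and its perturbation x';
    logits_*: [batch, classes] output logits.
    """
    loss = 0.0
    for h, hp in zip(hidden_x, hidden_xp):
        # (1 - cosine similarity) penalizes mismatched activation patterns
        # in each hidden layer.
        cos = F.cosine_similarity(h.flatten(1), hp.flatten(1), dim=1)
        loss = loss + (1.0 - cos).mean()
    # KL-divergence aligns the output distributions of x and x'.
    log_p = F.log_softmax(logits_xp, dim=-1)
    q = F.softmax(logits_x, dim=-1)
    loss = loss + F.kl_div(log_p, q, reduction="batchmean")
    return loss
```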
3. Implementation Protocols and Practical Considerations
(a) Applicability
- In LLMs, ACT may require instrumentation of the forward pass to access activations; only activations at the shared-suffix positions (i.e., the clean prompt content) are targeted (Irpan et al., 31 Oct 2025).
- For CNNs and DNNs, ACT is combined with adversarial training to cover challenging input neighborhoods: gradient-based minimax optimization locates the worst-case perturbations that minimize consistency (Liu et al., 17 Dec 2024), as sketched after this list.
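A PGD-style inner loop for this minimax step might look like the following; the step size, iteration count, and $\ell_\infty$ ball are standard adversarial-training defaults assumed here, and `model.consistency` is a hypothetical wrapper around the differentiable inconsistency loss.

```python
import torch

def worst_case_perturbation(model, x, eps=0.03, alpha=0.01, steps=7):
    """PGD ascent: find x' in the l-inf eps-ball around x maximizing inconsistency.

    `model.consistency(x, xp)` is a hypothetical wrapper returning the
    differentiable inconsistency loss between the activations on x and xp.
    """
    xp = x + torch.empty_like(x).uniform_(-eps, eps)  # random start in the ball
    for _ in range(steps):
        xp = xp.detach().requires_grad_(True)
        loss = model.consistency(x, xp)               # inconsistency to maximize
        grad, = torch.autograd.grad(loss, xp)
        xp = xp + alpha * grad.sign()                 # gradient-ascent step
        xp = x + (xp - x).clamp(-eps, eps)            # project back into the ball
        xp = xp.clamp(0.0, 1.0)                       # keep a valid pixel range
    return xp.detach()
```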
(b) Data Handling
Both LLM and feed-forward domains rely on self-supervision: clean prompt activations are derived from the model itself (no external gold labels), and only prompt pairs are retained for alignment—no static datasets of model responses are required (Irpan et al., 31 Oct 2025).
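As a concrete illustration of this pipeline, pair construction can be as simple as prefixing clean prompts with manipulative wrappers; the wrapper strings below are invented examples of sycophancy cues rather than the paper's actual augmentation set.

```python
import random

# Hypothetical sycophancy wrappers; the actual augmentations in Irpan et al.
# are more varied and include jailbreak cues.
WRAPPERS = [
    "I'm an expert, and I'm certain the answer is (B). ",
    "Everyone agrees the following is true, so just confirm it: ",
]

def make_pairs(clean_prompts):
    """Build (clean, wrapped) pairs; the clean prompt is the shared suffix."""
    return [(p, random.choice(WRAPPERS) + p) for p in clean_prompts]
```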
(c) Hyperparameter Tuning
The consistency regularization weight ($\lambda$) is typically set to a small value in the LLM experiments to prevent excessive constraint, and layer-wise scaling can help avoid over-constraining high-dimensional layers (Liu et al., 17 Dec 2024).
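One plausible form of layer-wise scaling is to normalize each layer's consistency term by the magnitude of its target activations, so that wide or high-norm layers do not dominate; this particular normalization is an assumption for illustration, not a prescription from either paper.

```python
import torch.nn.functional as F

def scaled_consistency(acts_wrapped, acts_clean, lam=0.1):
    """Per-layer MSE, each term normalized by its target's mean squared norm.

    acts_*: lists of per-layer activation tensors; `lam` is the global
    regularization weight (the small value here is illustrative).
    """
    total = 0.0
    for a, b in zip(acts_wrapped, acts_clean):
        # Normalizing by the target's scale keeps wide or high-norm layers
        # from dominating the regularizer.
        denom = b.detach().pow(2).mean().clamp_min(1e-8)
        total = total + F.mse_loss(a, b) / denom
    return lam * total
```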
(d) Stability and Scope Limitations
For transformers, initial attempts to enforce consistency over the entire prompt—including unmatched prefixes—resulted in unstable training. Restriction to the maximal matching suffix is empirically necessary (Irpan et al., 31 Oct 2025).
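Computing the maximal matching suffix of the two tokenized prompts is straightforward; a small helper such as the following (an illustrative utility, not from the paper) determines how many trailing positions the loss may cover.

```python
def matching_suffix_len(clean_ids: list[int], wrapped_ids: list[int]) -> int:
    """Length of the longest common suffix of two token-id sequences.

    Only these trailing positions are eligible for the ACT loss; unmatched
    prefix positions (the adversarial wrapper) are excluded.
    """
    n = 0
    while (n < len(clean_ids) and n < len(wrapped_ids)
           and clean_ids[-1 - n] == wrapped_ids[-1 - n]):
        n += 1
    return n
```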
4. Empirical Results and Quantitative Effects
(a) Synergy and Tradeoffs in LLM Alignment
- ACT and Bias-augmented Consistency Training (BCT) both reduce sycophancy rates significantly, outperforming SFT trained on stale, static datasets (Irpan et al., 31 Oct 2025). For example, on Gemma 2 27B, ACT improved sycophancy avoidance to 82.4% (baseline: 70.4%), with BCT at 80.0%.
- In jailbreak resistance, BCT outperforms ACT (e.g., attack success rate on ClearHarm reduced to 2.9% for BCT vs. 52.2% for ACT). ACT better preserves useful instruction-following, with lower over-refusal rates.
- General MMLU accuracy is retained or slightly improved with ACT, suggesting that internal consistency regularization can enhance task-focused attention.
(b) Verification-Friendly Training
- On MNIST with CNN architectures, NBC regularization (activation consistency) yields highly verifiable models (46.3% UNSAT vs. 1.5% for the next-best method, RS, at the evaluated perturbation radius), with increased stable-neuron ratios and much faster formal verification (Liu et al., 17 Dec 2024).
- NBC acts synergistically when combined with existing robust training techniques (e.g., TRADES, RS), boosting verification coverage even when individual baselines fail at increased perturbation radii.
5. Relation to Other Consistency Methods: Output vs. Activation
| Method | Mechanism | Supervision Signal |
|---|---|---|
| BCT | Output token consistency | Self-generated model outputs (logits/tokens) |
| ACT | Activation consistency | Self-generated internal activations (prompts only) |
ACT distinguishes itself by operating mechanistically on model internals. Unlike BCT, ACT does not risk enforcing outdated output guidelines: changes propagate directly to latent computation. ACT generally avoids training-induced over-refusal, though at the cost of weaker jailbreak resistance compared to BCT (Irpan et al., 31 Oct 2025). Both methods, however, bypass specification/capability staleness issues.
6. Limitations and Risks
ACT requires access to internal activations, which can necessitate nontrivial engineering investment and may be less compatible with privacy-preserving or black-box deployment scenarios. Instability can arise if the loss is applied outside matched prompt segments, and both ACT and output-level consistency training risk "consistent but bad behavior" if the underlying clean data is itself unsafe or poorly filtered. ACT changes model internals in a mechanistically distinct manner from BCT, as evidenced by activation-distance and cross-entropy metrics (Irpan et al., 31 Oct 2025).
7. Broader Significance and Outlook
Activation Consistency Training reflects a paradigm shift in alignment and robustness, focusing not on output matching per se but on invariance of the entire computation pipeline under manipulative or irrelevant input changes. As adversarial prompting and latent attacks proliferate, mechanistic approaches such as ACT may become foundational to alignment stacks. While not universally superior (especially in direct jailbreak resistance), ACT embodies the principle that robust alignment depends on controlling internal model reasoning, not just supervising outputs. In application domains requiring formal verification, the activation consistency approach is demonstrably effective in dramatically reducing search space and speeding up certification (Liu et al., 17 Dec 2024).