
Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models (2505.17697v1)

Published 23 May 2025 in cs.CL and cs.LG

Abstract: Despite the remarkable reasoning performance, eliciting the long chain-of-thought (CoT) ability in LLMs typically requires costly reinforcement learning or supervised fine-tuning on high-quality distilled data. We investigate the internal mechanisms behind this capability and show that a small set of high-impact activations in the last few layers largely governs long-form reasoning attributes, such as output length and self-reflection. By simply amplifying these activations and inserting "wait" tokens, we can invoke the long CoT ability without any training, resulting in significantly increased self-reflection rates and accuracy. Moreover, we find that the activation dynamics follow predictable trajectories, with a sharp rise after special tokens and a subsequent exponential decay. Building on these insights, we introduce a general training-free activation control technique. It leverages a few contrastive examples to identify key activations, and employs simple analytic functions to modulate their values at inference time to elicit long CoTs. Extensive experiments confirm the effectiveness of our method in efficiently eliciting long CoT reasoning in LLMs and improving their performance. Additionally, we propose a parameter-efficient fine-tuning method that trains only a last-layer activation amplification module and a few LoRA layers, outperforming full LoRA fine-tuning on reasoning benchmarks with significantly fewer parameters. Our code and data are publicly released.

Summary

  • The paper introduces a training-free activation control method (EELo-CoT) that efficiently elicits long chain-of-thought reasoning in LLMs without costly retraining.
  • It identifies sparse, final-layer activations responsible for extended reasoning and selectively amplifies them to trigger self-reflection and improve performance.
  • Experimental results show improved accuracy (e.g., from 69.20% to 72.00%) and reflection rates across benchmarks, validating the efficacy of targeted activation interventions.

This paper, "Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models" (2505.17697), investigates how LLMs produce long chain-of-thought (CoT) reasoning and proposes methods to elicit this ability efficiently, without costly retraining or supervised fine-tuning on extensive datasets.

The core idea is that the long CoT ability, characterized by extended reasoning steps and self-reflection, is significantly influenced by a small set of activations, primarily in the final layers of the LLM. By identifying and manipulating these specific activations, the model's reasoning style can be guided towards long CoT.

Key Empirical Findings:

The authors conducted several empirical analyses, primarily using Qwen2.5-7B series models, leading to these key findings:

  1. Localized Activations: Long-CoT related activations are sparse and predominantly located in the last few layers of the LLM. The final layer, in particular, contains a high concentration (over 50%) of these activations.
  2. Activation Differences: LLMs explicitly trained or distilled for long CoT (e.g., R1-distilled-Qwen-7B) exhibit a higher number of these specific long-CoT related activations compared to base models not fine-tuned for this capability.
  3. Inducing Long CoT: Simply amplifying the top-200 identified long-CoT related activations (by factors like 1.2 to 1.6) and inserting a "wait" token (especially after sentences containing mathematical equations) can significantly increase the model's self-reflection rate and task accuracy.
  4. Activation Dynamics: Base models and long-CoT models show similar sparse activation dynamics for these key neurons. They are mostly inactive but spike in value at specific trigger points, such as after a "wait" token.
  5. Inactive Instruct-Tuned Activations: In contrast, models heavily fine-tuned on short instructions (e.g., Qwen2.5-7B-Instruct) show these specific activations as largely inactive or "dead," potentially due to the post-training biasing them against longer, reflective reasoning patterns.
  6. Predictable Patterns: The activation values around trigger tokens like "wait" follow a predictable pattern: a sharp rise immediately after the token, followed by an exponential (or logarithmic) decay. This predictability forms the basis for the proposed training-free control method.
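
As a concrete illustration of how such activations can be inspected, the sketch below records final-layer MLP activations with a PyTorch forward hook and compares a reflective and a non-reflective completion. This is a minimal sketch, not the authors' released code: the checkpoint name, the hook target (which follows the Hugging Face Qwen2 module layout), and the placeholder texts are assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # assumed checkpoint; other Qwen2-family base models work similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

captured = {}

def hook(module, inputs, output):
    # Output of the activation function inside the final MLP block,
    # i.e. Act(x W_g), shape [batch, seq_len, intermediate_size]
    captured["acts"] = output.detach()

# Attach to the activation inside the last decoder layer's MLP
# (module path assumes the Hugging Face Qwen2 implementation)
handle = model.model.layers[-1].mlp.act_fn.register_forward_hook(hook)

def mean_activations(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        model(ids)
    return captured["acts"].float().mean(dim=(0, 1))  # per-neuron mean over the sequence

# Placeholder contrastive completions; in practice these are full CoT traces
pos = mean_activations("... a long, self-reflective solution containing 'wait' ...")
neg = mean_activations("... a short solution without self-reflection ...")
handle.remove()

# Neurons with the largest positive difference are candidate long-CoT activations
diff = pos - neg
top_values, top_indices = torch.topk(diff, k=200)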

Training-Free Activation Control (EELo-CoT)

Based on these findings, the paper introduces a training-free method called EELo-CoT to elicit long CoT at inference time:

  1. Identifying Key Activations:
    • A small set of contrastive examples is collected, each consisting of a question together with a positive CoT and a negative CoT.
    • Positive CoTs are long, contain self-reflection (e.g., phrases like "wait," "let me double check"), and lead to correct answers.
    • Negative CoTs are short, lack self-reflection, and lead to incorrect answers.
    • These examples are fed into the base LLM, and the differences in MLP layer activation values between positive and negative examples are computed. Activations with high differences (e.g., >4) are considered long-CoT related. The top-N (e.g., 150-200) are selected.
  2. Activation Amplification with an Analytic Function:
    • The observed activation decay pattern after a trigger token (like "wait") is modeled using an analytic function:

      f(t) = a - b \cdot \log(t + c)

      where $t$ is the relative token distance from the trigger.

    • The coefficients $(a, b, c)$ are fitted using the activation trajectories collected from the contrastive examples (e.g., for Qwen2.5-7B-base, $a = 0.17$, $b = 0.033$, $c = -0.997$).

    • During inference, when a trigger condition is met, the identified key activations $A$ are modified to $A'$:

      A' = A \cdot (1 + \alpha f(t))

      where $\alpha$ is a tunable scaling factor (e.g., 4); a brief worked example of the resulting amplification follows this list.

  3. Forcing Reflection Strategy:
    • To trigger self-reflection and activate the amplification, a "wait" token is inserted at the beginning of the next sentence if the previously generated sentence contains $k$ or more digits (e.g., $k = 5$).
    • A cool-down window (e.g., 4 sentences) is implemented to prevent excessive, repetitive self-reflection.
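
To make the amplification schedule concrete, plugging in the Qwen2.5-7B-base coefficients quoted above (and assuming a natural logarithm) gives the following worked example: at the first token after "wait" ($t = 1$), $f(1) = 0.17 - 0.033 \cdot \log(1 - 0.997) \approx 0.36$, so with $\alpha = 4$ the key activations are scaled by roughly $1 + 4 \cdot 0.36 \approx 2.4$; by $t = 10$ the factor has decayed to about $1.4$, and by $t = 50$ to about $1.2$, so the intervention fades gradually rather than switching off.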

Implementation of Training-Free EELo-CoT:

import math

# Assumed globals, set up once per model from the contrastive-example analysis:
#   key_activation_indices: (layer_idx, neuron_idx) pairs for the top-N long-CoT activations
#   a, b, c: fitted coefficients of f(t) = a - b * log(t + c)
#   alpha, k_digits_for_reflection, reflection_cooldown: inference-time hyperparameters
key_activation_indices = []          # e.g., the 150-200 selected (layer, neuron) pairs
a, b, c = 0.17, 0.033, -0.997        # example fit reported for Qwen2.5-7B-base
alpha = 4.0
k_digits_for_reflection = 5
reflection_cooldown = 4
sentences_since_last_reflection = 0

def get_contrastive_activations(model, positive_example, negative_example):
    # Record MLP activations for both examples, compute their difference,
    # and identify the top-N high-impact activations.
    # Return the indices of these activations and their trajectories around "wait".
    pass

def fit_decay_function(trajectories):
    # Fit f(t) = a - b * log(t + c) to the observed activation decays
    # and return the coefficients a, b, c.
    pass

def modified_forward_pass(model, input_ids, current_token_position):
    original_activations = model.get_internal_activations(input_ids)  # all MLP activations
    modified_activations = original_activations.copy()

    # Apply amplification only while within the active window after a "wait" token
    if is_within_trigger_window(current_token_position, "wait"):
        t = distance_from_trigger(current_token_position, "wait")
        if t + c > 0:  # keep the log argument positive
            amplification_value = a - b * math.log(t + c)  # f(t)
            for layer_idx, neuron_idx in key_activation_indices:
                original_val = original_activations[layer_idx][neuron_idx]
                modified_activations[layer_idx][neuron_idx] = original_val * (1 + alpha * amplification_value)

    # Proceed with the model's forward pass using the modified activations
    output_logits = model.compute_output_logits(input_ids, modified_activations)
    return output_logits

def generate_response(model, prompt):
    global sentences_since_last_reflection
    generated_tokens = []
    # ... (token generation loop) ...

    # Inside the loop, after generating a sentence:
    last_sentence = get_last_sentence(generated_tokens)
    num_digits_in_last_sentence = count_digits(last_sentence)

    if (num_digits_in_last_sentence >= k_digits_for_reflection
            and sentences_since_last_reflection > reflection_cooldown):
        # Insert a "wait" token:
        #   - append the "wait" token id to the input for the next generation step
        #   - reset sentences_since_last_reflection to 0
        #   - mark the "wait" position so amplification triggers for subsequent tokens
        pass
    else:
        sentences_since_last_reflection += 1

    # Use modified_forward_pass when generating the next token
    # ...
    return generated_tokens
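
For concreteness, the fit_decay_function stub above could be realized as follows. This is a minimal sketch, not the authors' code: it assumes the trajectories are stored as an array of mean key-activation values at relative distances t = 1..window after each "wait" occurrence, and it uses scipy.optimize.curve_fit.

import numpy as np
from scipy.optimize import curve_fit

def decay(t, a, b, c):
    return a - b * np.log(t + c)

def fit_decay_function(trajectories):
    # trajectories: array-like of shape [num_windows, window_len], where entry [i, t-1]
    # is the mean value of the key activations t tokens after a "wait" occurrence
    trajectories = np.asarray(trajectories, dtype=np.float64)
    num_windows, window_len = trajectories.shape
    t = np.tile(np.arange(1, window_len + 1), num_windows)
    y = trajectories.reshape(-1)
    # Initial guess roughly in the reported range; t starts at 1, so c = 0 keeps log(t + c) defined
    (a, b, c), _ = curve_fit(decay, t, y, p0=(0.2, 0.03, 0.0), maxfev=10000)
    return a, b, c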

Experimental Results (Training-Free):

  • Tested on MATH, AMC23, and GPQA-Diamond datasets using Qwen2-7B-base, Qwen2.5-7B-base, and Qwen2.5-Math-7B-base models.
  • EELo-CoT consistently improved accuracy and significantly increased the self-reflection rate compared to baseline models and simpler intervention strategies (e.g., just forcing reflection or constant activation amplification).
    • For instance, on Qwen2.5-7B-base for Math500, accuracy went from 69.20% to 72.00%, and reflection rate from 10.20% to 49.40%.
    • On Qwen2.5-Math-7B-base for Math500, accuracy increased from 68.00% to 76.00%, and reflection rate from 73.80% to 90.60%.
  • The method also showed positive results on smaller models (Qwen2.5-1.5B, Qwen2.5-3B) and other model families like Llama-3.1-8B-base and larger Qwen2.5-32B-base.

Parameter-Efficient Fine-Tuning Method

Leveraging the insight that key activations are localized, the paper also proposes a parameter-efficient fine-tuning approach:

  1. Activation Amplification Module:
    • A lightweight, learnable module is added to the last MLP layer for the identified key activations (e.g., $n = 100$).
    • The original activation $A(x) = \text{Act}(x W_g)$ is modified to:

      A(x) = \text{Act}(x W_g) \odot \sigma(x W_a) \cdot \beta

      where $W_a \in \mathbb{R}^{h \times n}$ is a trainable projection matrix, $\sigma$ is the sigmoid function, and $\beta$ is a trainable scalar. This allows the model to learn context-dependent amplification scales.

  2. Training Strategy:
    • Only the parameters of the activation amplification module ($W_a$, $\beta$) and LoRA (Low-Rank Adaptation) layers in the preceding layers are trained.
    • The rank for LoRA in earlier layers is set to a lower value (e.g., 64) compared to typical LoRA fine-tuning (e.g., 256).
    • This results in training a very small percentage of total model parameters (e.g., 1.51% for Qwen2.5-32B-Instruct).

Implementation of Parameter-Efficient Training:

import torch
import torch.nn as nn

class AmplifiedMLP(nn.Module):
    """SwiGLU-style MLP block with a learnable amplification module on n key activations.

    Implements A(x) = Act(x W_g) ⊙ σ(x W_a) · β, where the amplification
    σ(x W_a) · β is applied only to the n selected key neurons of the
    intermediate layer; all other neurons pass through unchanged.
    """

    def __init__(self, hidden_size, intermediate_size, key_activation_indices):
        super().__init__()
        # Original MLP projections (kept frozen during fine-tuning)
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)  # W_g
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

        # Learnable amplification module: W_a in R^{h x n} and trainable scalar beta
        num_key_activations = len(key_activation_indices)
        self.W_a = nn.Linear(hidden_size, num_key_activations, bias=False)
        self.beta = nn.Parameter(torch.tensor(1.0))

        # Indices (within the intermediate dimension) of the n key activations,
        # identified beforehand via the contrastive-example analysis
        self.register_buffer(
            "key_activation_indices",
            torch.as_tensor(key_activation_indices, dtype=torch.long),
        )

    def forward(self, x):
        # Original activation A(x) = Act(x W_g), shape [batch, seq, intermediate_size]
        activated_states = self.act_fn(self.gate_proj(x))

        # Context-dependent amplification scales σ(x W_a) · β for the key neurons,
        # shape [batch, seq, num_key_activations]
        scales = torch.sigmoid(self.W_a(x)) * self.beta

        # Multiply only the selected key activations by their learned scales;
        # non-key neurons keep a multiplier of 1
        multiplier = torch.ones_like(activated_states)
        multiplier[..., self.key_activation_indices] = scales
        amplified_states = activated_states * multiplier

        # Standard SwiGLU combination followed by the down projection
        return self.down_proj(amplified_states * self.up_proj(x))
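
The training strategy described above (freeze the base model, train only $W_a$, $\beta$, and low-rank adapters on earlier layers) might be wired up as follows. This is a minimal sketch under stated assumptions: it uses the Hugging Face peft library, rank-64 adapters on the attention projections, and a module path for the last-layer MLP, none of which are specified by the paper.

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct", torch_dtype=torch.bfloat16
)

# Rank-64 LoRA adapters on earlier layers (target modules are an assumption)
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)  # freezes all non-LoRA parameters

# Swap the last decoder layer's MLP for the AmplifiedMLP defined above, copying its
# frozen gate/up/down projections (omitted here for brevity), then mark only the
# amplification parameters W_a and beta as trainable.
amplified_mlp = model.base_model.model.model.layers[-1].mlp  # module path is an assumption
for name, param in amplified_mlp.named_parameters():
    param.requires_grad = name.startswith("W_a") or name == "beta"

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)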

Experimental Results (Parameter-Efficient Training):

  • Fine-tuned Qwen2.5-32B-Instruct on the LIMO dataset (817 math/logic problems).
  • The proposed method (1.51% trainable params) was compared to full fine-tuning (100% params) and standard LoRA (6.15% params, rank 256).
  • Achieved comparable or better accuracy than full fine-tuning and LoRA on Math500, AMC23, and GPQA benchmarks.
    • E.g., on GPQA, EELo-CoT got 70.02% accuracy, LoRA 66.17%, Full fine-tuning 69.19%.
    • On AMC23, EELo-CoT got 88.75%, LoRA 85.00%, Full fine-tuning 92.50%. It also used significantly fewer tokens on AMC23 (7077 vs ~14000 for others).
  • This highlights that complex reasoning behaviors like long CoT can be elicited by targeting a small, specific set of parameters, and the learned strategies can generalize.

Practical Implications and Considerations:

  • Efficiency: Both proposed methods offer significant efficiency gains over traditional SFT or RL approaches for eliciting long CoT. The training-free method requires no gradient updates, while the parameter-efficient method drastically reduces the number of trainable parameters.
  • Identifying Activations: The process of finding key activations relies on having a small set of good contrastive examples (long/reflective/correct vs. short/non-reflective/incorrect). The quality of these examples can influence which activations are identified.
  • Trigger Mechanisms: The "wait" token and the digit-counting heuristic are simple but effective triggers. For other tasks or reasoning styles, different triggers might be needed.
  • Hyperparameter Tuning: The scaling factor $\alpha$ in the training-free method, together with $k$ (digit count), the cooldown period, and $n$ (number of key activations), are hyperparameters that may need tuning for optimal performance on different models or datasets. The paper used $\alpha = 4$, $k = 5$, cooldown = 4, and $n = 150$ for the training-free experiments, and $n = 100$ for parameter-efficient training (collected in the configuration sketch after this list).
  • Model Specificity: While shown to work across Qwen and Llama models, the exact activation indices and optimal function coefficients $(a, b, c)$ for the training-free method are likely model-specific.
  • Deployment: The training-free method modifies inference logic by adding hooks to access and modify activations and to insert tokens. The parameter-efficient method results in a modified model checkpoint with minimal changes.
  • Computational Cost: The training-free method adds a small overhead at inference due to activation retrieval, computation of $f(t)$, and activation modification. The parameter-efficient method has standard inference costs.
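
For reference, the hyperparameters quoted in this section can be collected in a small configuration object; the following sketch is illustrative only, with the decay coefficients being the values reported for Qwen2.5-7B-base.

from dataclasses import dataclass

@dataclass
class EELoCoTConfig:
    num_key_activations: int = 150   # top-N activations selected from contrastive examples
    alpha: float = 4.0               # scaling factor in A' = A * (1 + alpha * f(t))
    k_digits: int = 5                # digits in a sentence needed to trigger a "wait" insertion
    cooldown_sentences: int = 4      # sentences to wait between forced reflections
    # Fitted decay coefficients of f(t) = a - b * log(t + c), reported for Qwen2.5-7B-base
    a: float = 0.17
    b: float = 0.033
    c: float = -0.997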

In summary, the paper provides strong empirical evidence that long CoT reasoning is tied to specific, localized activation patterns. It offers two practical pathways—a training-free inference-time intervention and a parameter-efficient fine-tuning strategy—to effectively elicit this desirable reasoning behavior in LLMs. These methods are particularly valuable for complex reasoning tasks where generating long, deliberative thought processes is beneficial.