
Efficient Training with Progressive Activation Sharing

Updated 3 February 2026
  • The paper introduces a progressive activation sharing mechanism that replaces redundant QK or KV computations across transformer layers to improve training efficiency.
  • It employs a dynamic sharing region schedule that reduces FLOPs, yielding up to 11% training and 29% inference throughput gains with minimal accuracy loss.
  • EPAS applies to vision transformers through progressive activation growth, achieving substantial GPU-hour reductions and maintaining competitive accuracy.

Efficient Training with Progressive Activation Sharing (EPAS) is a methodology designed to substantially accelerate neural model training and inference—primarily for deep transformers and vision transformers—by systematically exploiting redundancy in layerwise activations. EPAS integrates progressive learning principles with activation sharing across transformer layers, enabling flexible trade-offs between computational cost and model accuracy, and producing models that are both efficient and robust under adaptive activation-sharing regimes (Karim et al., 27 Jan 2026, Li et al., 2022).

1. Theoretical Foundations and Motivation

Transformer-based architectures exhibit significant redundancy in the projected query, key, and value activations (Q, K, V) among consecutive layers, especially as model depth increases. Empirical studies, including LazyFormer and ShareAttn, reveal that in deep transformers, the differences in QK or KV between adjacent layers are minimal, indicating that continuous recomputation leads to unnecessary FLOPs expenditure (Karim et al., 27 Jan 2026). In vision transformers and other progressive learning contexts, related work demonstrates that progressive model or activation growth reduces overall training costs without detrimental accuracy effects by leveraging redundancy and schedule-driven activation (Li et al., 2022). These observations provide the conceptual grounding for activation sharing.

2. Progressive Activation Sharing Mechanism

EPAS introduces the notion of a "sharing region" $R(t)\subseteq\{1, ..., L-1\}$ at training step $t$, denoting the subset of decoder layers whose activations are not recomputed but shared ("borrowed") from the next deeper layer. For QK sharing, layer $\ell\in R(t)$ uses: $\tilde Q_\ell = Q_{\ell+1}, \quad \tilde K_\ell = K_{\ell+1}$ and computes $V_\ell$ normally. This mechanism is extensible to KV-sharing by swapping $K \rightarrow V$ (Karim et al., 27 Jan 2026).

The sharing region is empty at initialization, and progressively grows from deep to shallow layers during training, capitalizing on greater redundancy in deeper layers. The scheduling is governed by a growth function, often parameterized as: $|R(t)| = \min\left(\left\lfloor\frac{t}{S}\right\rfloor, L-1\right)$ where $S$ is a tunable step interval. Alternatively, batch-wise growth is supported: $|R(t)| = \min\left( \left\lfloor \frac{t}{I} \right\rfloor \cdot B, |S_c| \right)$ where $I$ is the interval, $B$ the block size, and $|S_c|$ the maximum allowed sharing block size. At each eligible interval, a set of the deepest layers is toggled into sharing mode.
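The growth schedule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, layers are indexed 1 to L, and the region is assumed to be a single contiguous block that ends at layer L-1 and extends toward shallower layers.

```python
def sharing_region_size(step, interval, block_size, cap):
    """|R(t)| = min(floor(t / I) * B, |S_c|)."""
    return min((step // interval) * block_size, cap)


def sharing_region(step, num_layers, interval, block_size, cap):
    """Return the 1-based layer indices in R(t), deepest first.

    Assumption: R(t) is contiguous and anchored at layer L - 1,
    because R(t) is a subset of {1, ..., L - 1} and grows deep-to-shallow.
    """
    size = sharing_region_size(step, interval, block_size, cap)
    size = min(size, num_layers - 1)  # never exceed the L - 1 eligible layers
    deepest = num_layers - 1
    return [deepest - i for i in range(size)]


if __name__ == "__main__":
    # 12 layers, grow by 2 layers every 1000 steps, cap at 6 layers.
    for step in (0, 1000, 2500, 10000):
        print(step, sharing_region(step, 12, 1000, 2, 6))
```

The per-step schedule with interval S is the special case `block_size=1`, `interval=S`, `cap=num_layers - 1`.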

3. Training Workflow and Hyperparameters

The standard training loop for EPAS is a strict superset of canonical autoregressive transformer training. At each step, a check for sharing region updates is conducted, toggling layers into shared mode according to the progressive schedule. During a forward pass, layers within $R(t)$ reuse cached QK activations; other layers proceed as normal.
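To make the reuse of cached QK activations concrete, here is a toy NumPy forward pass over single-head causal attention layers. It is a sketch under strong assumptions, not the paper's code: the weight layout (`Wq`, `Wk`, `Wv`) and the function names are hypothetical, there is no MLP or normalization, and the donor of the shared activations is simplified to the most recently computed Q and K in forward order. The exact donor layer used by EPAS is not reproduced here.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def forward(x, layers, shared):
    """Run toy attention layers; layers in `shared` skip their Q/K projections.

    x:      (T, d) input activations.
    layers: list of dicts with 'Wq', 'Wk', 'Wv' of shape (d, d) (hypothetical layout).
    shared: set of 0-based layer indices that reuse the cached (Q, K).
    Returns the output and the number of Q/K projection pairs actually computed.
    """
    cache = None
    qk_computed = 0
    for i, w in enumerate(layers):
        if i in shared and cache is not None:
            q, k = cache                      # reuse: no Q/K matmuls for this layer
        else:
            q, k = x @ w["Wq"], x @ w["Wk"]   # normal path
            cache = (q, k)
            qk_computed += 1
        v = x @ w["Wv"]                       # V is always computed
        scores = q @ k.T / np.sqrt(q.shape[-1])
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # causal mask
        x = x + softmax(scores) @ v           # residual connection
    return x, qk_computed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, T, L = 8, 5, 4
    layers = [{n: rng.normal(size=(d, d)) * 0.1 for n in ("Wq", "Wk", "Wv")}
              for _ in range(L)]
    x = rng.normal(size=(T, d))
    _, n_full = forward(x, layers, shared=set())
    _, n_shared = forward(x, layers, shared={2, 3})
    print(n_full, n_shared)
```

In a training loop, the `shared` set would be recomputed from the schedule whenever the step count crosses a growth interval, and the rest of the step (loss, backward pass, optimizer update) would stay unchanged.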

The loss function remains unchanged, typically using autoregressive cross-entropy: $\mathcal{L}(\theta) = -\sum_{n=1}^{N} \sum_{t=1}^{T} \log p_\theta(x_t^{(n)} \mid x_{<t}^{(n)})$. No auxiliary losses are introduced. Primary hyperparameters include base learning rate (e.g., $1\mathrm{e}{-4}$), batch size (e.g., $2048$ tokens/device), growth interval $I$, block size $B$, and sharing region cap $|S_c|\approx \lfloor L/2 \rfloor$ (Karim et al., 27 Jan 2026).
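For reference, the autoregressive cross-entropy above can be computed from raw logits as in the following NumPy sketch. The function name and tensor layout are assumptions made for illustration: `logits[n, t]` is taken to hold the model's scores for position t of sequence n given the preceding tokens, and the loss is summed rather than averaged, matching the formula as written.

```python
import numpy as np


def autoregressive_nll(logits, tokens):
    """Summed negative log-likelihood over all sequences and positions.

    logits: (N, T, V) float array, scores for x_t given x_{<t}.
    tokens: (N, T) integer array of target token ids.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of each target token.
    picked = np.take_along_axis(log_probs, tokens[..., None], axis=-1)
    return -picked.sum()


if __name__ == "__main__":
    logits = np.zeros((2, 3, 4))             # uniform predictions over 4 tokens
    tokens = np.array([[0, 1, 2], [3, 0, 1]])
    print(autoregressive_nll(logits, tokens))  # equals 6 * log(4)
```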

4. Inference Adaptability and Throughput Gains

Post-training, EPAS models retain switchable sharing masks, allowing at-inference adjustment of the sharing region $R^{(\text{inf})}$ of user-specified length $m\leq|S_c|$. During inference, for layers in $R^{(\text{inf})}$, QK activations are reused, while all others are freshly calculated. This enables controllable trade-offs: larger $|R^{(\text{inf})}|$ increases throughput (tokens/sec) at the potential cost of accuracy, enabling dynamic deployment optimization for diverse compute budgets (Karim et al., 27 Jan 2026).

Empirically, in LLaMA family models (125M–7B parameters) and TinyLLaMA-1.1B, EPAS yields up to 11.1% improvement in training throughput and up to 29.2% increase in inference throughput when sharing up to 50% of the layers. Under continual pretraining on TinyLLaMA, EPAS-trained models achieve up to 10% higher mean accuracy on 11 LM benchmarks compared to baselines, in particular retaining accuracy under aggressive inference-time sharing, where naïvely shared baseline models degrade severely (Karim et al., 27 Jan 2026).

5. Application to Vision Transformers: Progressive Learning and MoGrow

In vision transformers, EPAS is applied by progressively activating larger subsets of patches and transformer layers as training proceeds. Early epochs use reduced tokens and depths; as regularization and representational needs increase, more activations are enabled, culminating in full-model finetuning.

A critical component is Momentum Growth (MoGrow): at each progression step from smaller to larger sub-net, weights for new layers are interpolated from a momentum teacher ensemble maintained throughout current-stage training: $\tilde \omega_{s} \leftarrow m \cdot \tilde \omega_{s} + (1 - m) \cdot \omega_{s}$ with $m=0.998$. Enlarged sub-nets are initialized from these momentum-averaged weights to prevent performance collapse associated with cold starts (Li et al., 2022).
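The momentum update itself is an exponential moving average over the current sub-net's weights. A minimal sketch, assuming weights are stored as a plain name-to-value dictionary (a hypothetical layout chosen for illustration) and covering only the averaging step, not how new layers are mapped onto the averaged weights:

```python
def momentum_update(teacher, student, m=0.998):
    """Return updated teacher weights: m * teacher + (1 - m) * student.

    teacher, student: dicts mapping parameter names to floats or NumPy arrays
    with matching keys and shapes (hypothetical storage layout).
    """
    return {name: m * teacher[name] + (1.0 - m) * student[name]
            for name in teacher}


if __name__ == "__main__":
    teacher = {"w": 1.0}
    student = {"w": 0.0}
    for _ in range(3):
        teacher = momentum_update(teacher, student)
    print(teacher["w"])  # decays slowly toward the student value
```

With m close to 1, the teacher changes slowly, so it smooths out step-to-step noise in the student weights before they are used to initialize a larger sub-net.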

Progressive or automated activation growth (AutoProg) further optimizes the growth schedule by formulating a sub-network search: at each stage $k$, the training framework selects the sub-net $\psi_k^*$ that minimizes the composite cost: $\psi_k^* = \arg\min_{\psi \in \Lambda_k} [ L(\omega^*(\psi)) \cdot T(\psi)^\alpha ]$ where $\alpha > 0$ trades training loss against runtime, and $\omega^*(\psi)$ are quickly estimated via weight nesting within an elastic supernet. Stages are typically spaced every quarter of total epochs, and candidate sub-nets are filtered for monotonic capacity growth (Li et al., 2022).
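The selection rule reduces to an argmin over candidates once each sub-net has an estimated loss and a measured runtime. The sketch below assumes those two numbers are already available; the tuple layout, candidate names, and values are made up for illustration, and the supernet-based loss estimation is not shown.

```python
def select_subnet(candidates, alpha):
    """Pick the candidate minimizing loss * runtime ** alpha.

    candidates: list of (name, estimated_loss, runtime) tuples (hypothetical layout).
    alpha:      positive exponent weighting runtime against loss.
    """
    return min(candidates, key=lambda c: c[1] * c[2] ** alpha)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    candidates = [("small", 2.0, 1.0), ("large", 1.5, 2.0)]
    print(select_subnet(candidates, alpha=0.5)[0])  # runtime weighs heavily
    print(select_subnet(candidates, alpha=0.1)[0])  # loss dominates
```

A larger alpha penalizes slow sub-nets more strongly, so the search favors cheaper configurations; a smaller alpha lets the lower-loss but slower candidate win.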

Through these mechanisms, ImageNet runs with DeiT-S and VOLO-D1/D2 achieve 40–85% reduction in GPU hours with unchanged or slightly superior top-1 accuracy (e.g., VOLO-D1, EPAS s₁=0.4: 81h, +85.1% speedup, 82.7% top-1 vs. baseline 82.6%) (Li et al., 2022).

6. Quantitative Results, Ablations, and Limitations

Performance benefits include up to 8% theoretical FLOP reduction by sharing QK in half the layers, 10–11% observed training throughput gain (tokens/s), and up to 29% inference throughput improvement. Loss curves typically closely match or outperform baselines, with ∼5% faster wall-clock convergence at equal time (Karim et al., 27 Jan 2026).
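A back-of-the-envelope count shows where a figure near 8% can come from. This is an illustrative estimate under stated assumptions, not the paper's derivation: it counts only weight-matrix FLOPs per layer, assumes four d×d attention projections (Q, K, V, O) and an MLP costing the equivalent of eight d×d matrices, and ignores attention-score computation, embeddings, the output head, and grouped-query attention.

```python
def qk_sharing_flop_fraction(shared_fraction, attn_mats=4.0, mlp_mats=8.0, skipped=2.0):
    """Fraction of per-layer matmul FLOPs saved by skipping Q and K projections.

    shared_fraction: fraction of layers in the sharing region.
    attn_mats:       attention projections per layer, in units of d x d matrices.
    mlp_mats:        MLP cost per layer in the same units (assumed value).
    skipped:         projections skipped in a shared layer (Q and K).
    """
    per_layer_saving = skipped / (attn_mats + mlp_mats)
    return shared_fraction * per_layer_saving


if __name__ == "__main__":
    # Half of the layers shared: 0.5 * 2 / 12 = 1/12, about 8.3%.
    print(f"{qk_sharing_flop_fraction(0.5):.3f}")
```

Under these assumptions the saving scales linearly with the shared fraction, so capping the region at half the depth bounds the theoretical reduction at roughly one twelfth of the matmul cost.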

Ablation studies show negligible differences whether or not the last transformer layer participates in sharing. A single contiguous sharing block outperforms equivalent multisplit blocks. In vision transformers, careful parameter management and growth thresholds are required for stability during sub-net transitions (Li et al., 2022).

Limitations include increased memory overhead for activation caches, especially for very deep sharing regions, and the risk of accuracy drop if the sharing region exceeds approximately half of model depth. Precision-accuracy trade-offs at inference depend on the corresponding training-time sharing schedule; a plausible implication is that exploring adaptive or learned sharing regimes could further improve these trade-offs (Karim et al., 27 Jan 2026).

7. Future Directions

Several immediate research avenues are proposed:

  • Generalization to Other Modalities: Extending activation sharing and progressive schedules to domains beyond language and vision, such as speech and multimodal transformers.
  • Learnable Growth Schedules: Optimizing the layerwise activation-sharing progression as a learnable parameter, rather than using fixed schedules.
  • Auxiliary Regularization: Exploring additional regularizers—such as teacher-student distillation or activation gap minimization—to mitigate any residual quality gaps as sharing region size increases.
  • KV-Sharing and Other Configurations: Adapting the underlying mechanism to KV-only or hybrid QKV sharing to address architecture-specific bottlenecks (Karim et al., 27 Jan 2026).

EPAS thus provides a unified and extensible framework for efficient model training and multi-regime inference, relying on fundamental principles of progressive learning, activation redundancy, and systematic growth of computational complexity. Its deployment in both transformer LMs and ViTs demonstrates reliable acceleration with minimal or no loss in task performance (Karim et al., 27 Jan 2026, Li et al., 2022).
