
SkipV1Former: Efficient Decoder Transformer

Updated 26 October 2025
  • SkipV1Former is a decoder-only Transformer architecture that reuses half of its first-layer Value heads, reducing memory and compute requirements.
  • It achieves up to 25–50% KV cache reduction and lower validation loss by integrating fixed high-fidelity Value projections in deeper layers.
  • The design supports efficient uptraining from existing MHA checkpoints and extends seamlessly with advanced techniques like YOCO, GQA, and MLA for large-scale autoregressive tasks.

SkipV1Former is a decoder-only Transformer architecture that uses skip connections from the first layer's Value heads to strengthen representation quality and reduce key-value (KV) cache requirements in autoregressive decoding. Unlike conventional Multi-Head Attention (MHA) Transformers, which recompute and cache Value projections independently in each layer, SkipV1Former reuses the first layer's Value projections for half of the heads in every layer from the second onward. This design substantially reduces memory and compute costs while consistently improving empirical performance across model scales.

1. Architectural Design and Formulation

SkipV1Former modifies the standard attention mechanism by interleaving Value heads from the network's first layer into subsequent layers. For a Transformer with $L$ decoder blocks and $H$ attention heads per block, each layer $i = 2, \ldots, L$ replaces half of its Value projections ($H/2$ heads) with the corresponding Value projections from the first layer in a fixed, deterministic order. Typically, heads $1$ to $H/2$ of each deeper layer are computed as usual, while heads $H/2+1$ to $H$ replicate the respective first-layer Value projections, as shown in the following formalism (with $H' = H/2$):

$$\text{Attn}(X) = X + \sum_{h=1}^{H'} W_o^h V^h \operatorname{softmax}\big((K^h)^\top Q^h\big) + \sum_{h=H'+1}^{H} W_o^h V_1^h \operatorname{softmax}\big((K^h)^\top Q^h\big)$$

where $V^h$ is the Value for head $h$ computed at the current layer, and $V_1^h$ is the fixed Value from the first layer reused in the deeper layer. The integration is performed deterministically and does not involve per-sequence decisions or dynamic computation, ensuring architectural simplicity and compatibility with batched inference.

This architectural intervention nearly halves the Value projections and their corresponding KV cache entries in layers $2$ through $L$, yielding substantial memory and computational savings while giving deeper layers access to uncompressed, high-fidelity source information.
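
Below is a minimal PyTorch sketch of this attention pattern. It is not the paper's reference implementation: the module name, head layout, causal-mask handling, and the interface for passing the first layer's Values (`v1`) are assumptions made for illustration; only the reuse of the upper $H/2$ first-layer Value heads in deeper layers follows the formalism above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipV1Attention(nn.Module):
    """Multi-head attention in which deep layers (i >= 2) compute Values for
    only the first H/2 heads and reuse the first layer's Values for the rest."""

    def __init__(self, d_model: int, n_heads: int, is_first_layer: bool):
        super().__init__()
        assert n_heads % 2 == 0 and d_model % n_heads == 0
        self.h = n_heads
        self.d_head = d_model // n_heads
        self.is_first_layer = is_first_layer
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        # The first layer projects Values for all H heads; deeper layers only
        # project Values for H/2 heads (the rest are inherited from layer 1).
        v_out = d_model if is_first_layer else d_model // 2
        self.v_proj = nn.Linear(d_model, v_out, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, v1=None):
        # x: (batch, seq, d_model); v1: layer-1 Values, (batch, H, seq, d_head)
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.h, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.h, self.d_head).transpose(1, 2)
        v_new = self.v_proj(x).view(b, t, -1, self.d_head).transpose(1, 2)
        if self.is_first_layer:
            v = v_new  # all H Value heads computed locally
        else:
            # Heads 1..H/2 use this layer's Values; heads H/2+1..H reuse the
            # corresponding first-layer Value heads (no extra KV-cache entries).
            v = torch.cat([v_new, v1[:, self.h // 2:]], dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        # The first layer hands its Values to the caller for reuse downstream.
        return self.o_proj(out), (v if self.is_first_layer else None)
```

In a full model, the first layer would return its Value tensor once per forward pass and every deeper layer would receive it as `v1`; because those heads are already cached by layer 1, deeper layers cache Values only for their locally computed heads.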

2. Theoretical Basis

The theoretical underpinning of SkipV1Former rests on the observation that standard deep Transformers iteratively compress token-pair representations as activations pass through stacked layers, progressively discarding raw information. In the mesa-optimization view, deep Transformer blocks are understood as performing steps of implicit optimization, such as one-step gradient descent, on latent objectives during autoregressive modeling.

By routing the uncompressed, higher-fidelity Value vectors from the first layer directly into each deeper layer, SkipV1Former restores information that would otherwise be irrecoverably lost to compression in upper layers. A formal theorem analyzes a two-layer linear attention regime and establishes that SkipV1Former drives prediction error at least a constant $c$ lower than the standard Transformer. This supports the claim that direct access to first-layer Values can accelerate the mesa-optimization process intrinsic to Transformer-based sequence modeling and improve representational capacity in autoregressive tasks.
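
As a schematic restatement (the symbols below are chosen for exposition and are not the paper's notation), the claimed bound in the two-layer linear attention setting takes the form

$$\mathcal{L}_{\text{SkipV1}} \;\le\; \mathcal{L}_{\text{MHA}} - c \quad \text{for some constant } c > 0,$$

where $\mathcal{L}_{\text{SkipV1}}$ and $\mathcal{L}_{\text{MHA}}$ denote the autoregressive prediction errors of the two architectures under the theorem's assumptions.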

3. Empirical Performance Metrics

SkipV1Former has been evaluated on multiple GPT-2 and LLaMA model scales against the baseline MHA Transformer and advanced cache-efficient variants, including DenseFormer, ResFormer, YOCO, and CLA. Key findings include:

  • Persistent reductions in validation loss and perplexity relative to the baseline. For small to medium LLaMA models, validation loss is reduced by approximately $0.03$–$0.05$.
  • KV cache requirements are reduced by roughly 25% due to the reuse of first-layer Value heads in all deeper layers.
  • When SkipV1Former is combined with KV cache reduction techniques such as YOCO (which reuses keys across layers), the total cache requirement is reduced by nearly 50% while matching, and in some cases improving, modeling performance.
  • Loss curves on GPT-2 variants are consistently smoother and lower compared to the baseline throughout training.

SkipV1Former consistently outperforms both the conventional MHA and lightweight, cache-efficient alternatives across multiple model sizes and benchmarks.
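
For intuition, the roughly 25% figure follows from a simple accounting, sketched below under the assumption (not stated in this form in the source) that the cache holds one Key and one Value entry per head, per token, per layer, and that reused first-layer Value heads are not re-cached:

```python
def kv_cache_entries(n_layers: int, n_heads: int, skip_v1: bool) -> float:
    """KV-cache entries per token, in units of one head-sized vector."""
    if not skip_v1:
        return 2 * n_layers * n_heads                    # K + V in every layer
    first = 2 * n_heads                                  # layer 1 caches K + V fully
    deeper = (n_layers - 1) * (n_heads + n_heads / 2)    # full K, half V
    return first + deeper


L, H = 32, 32
baseline = kv_cache_entries(L, H, skip_v1=False)
skipv1 = kv_cache_entries(L, H, skip_v1=True)
print(f"KV cache reduction: {1 - skipv1 / baseline:.1%}")  # ~24.2% here, approaching 25% as L grows
```

As $L$ grows the saving approaches 25%, since layers $2$ through $L$ drop half of their Value entries while retaining all Key entries.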

4. Compatibility and Extensions with Advanced Techniques

SkipV1Former is designed for composability with existing and emerging KV reduction techniques. Its mechanisms have been extended to augment:

  • Group-Query Attention (GQA): SkipV1Former is applied prior to grouping query heads, so that half of the grouped Value heads in each deep layer are inherited from the first layer, preserving the benefits of shallow-layer representations alongside memory savings (see the sketch after this list).
  • Multi-Latent Attention (MLA): SkipV1Former interleaves each deep layer’s latent vector with the corresponding vector from the first layer, maintaining low-rank representational efficiency while providing direct access to uncompressed information.
  • YOCO Integration: Combining SkipV1Former with YOCO (a key-reuse mechanism) achieves up to 50% KV cache savings compared to the original architecture. Empirically, this composite variant matches or betters the perplexity of all ablated baselines.
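
A hypothetical sketch of the GQA combination referenced above is given here; the function name, tensor layout, and the choice to inherit the upper half of the Value groups are illustrative assumptions, since the source states only that half of the grouped Value heads in each deep layer come from the first layer.

```python
import torch


def gqa_skipv1_values(v_new_groups: torch.Tensor,
                      v1_groups: torch.Tensor) -> torch.Tensor:
    """Assemble the Value groups of a deep GQA layer.

    v_new_groups: (batch, G/2, seq, d_head) Values computed at this layer.
    v1_groups:    (batch, G,   seq, d_head) Values cached from layer 1.
    Returns:      (batch, G,   seq, d_head) Values seen by this layer's heads.
    """
    g_half = v1_groups.shape[1] // 2
    # The lower half of the groups is fresh; the upper half is inherited from
    # layer 1, so only G/2 Value groups per deep layer are computed and cached.
    return torch.cat([v_new_groups, v1_groups[:, g_half:]], dim=1)


# Hypothetical shapes: 8 KV groups, 16 tokens, head size 64.
v1 = torch.randn(2, 8, 16, 64)
v_new = torch.randn(2, 4, 16, 64)
v = gqa_skipv1_values(v_new, v1)          # -> (2, 8, 16, 64)
```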

These integrations emphasize the flexible nature of the architecture and its broad applicability as a drop-in improvement to many cache- and compute-efficient Transformer variants.

5. Checkpoint Uptraining Methodology

Recognizing that pretrained standard MHA Transformer checkpoints are widely available, SkipV1Former introduces an “uptraining” recipe to efficiently adapt checkpoints to the new architecture:

  1. Modify the trained MHA checkpoint by applying mean pooling over every two attention head projections in layers $2$ through $L$, aligning the parameter shape with the SkipV1Former design (see the sketch after this list).
  2. Initialize the resulting layers of SkipV1Former with these pooled weights, while retaining unaltered first-layer Value projections.
  3. Further pretrain from this modified checkpoint, requiring only an estimated 10–15% of the original pretraining budget for SkipV1Former to match or surpass baseline perplexity.
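
A minimal sketch of the pooling step is given below. It assumes the Value projection weights are the ones being pooled, that heads are paired adjacently, and that weight rows are grouped by head; the source specifies only mean pooling over every two head projections in layers $2$ through $L$.

```python
import torch


def pool_value_heads(w_v: torch.Tensor, n_heads: int) -> torch.Tensor:
    """Mean-pool every two Value-head projections of one layer.

    w_v: (n_heads * d_head, d_model) Value projection weight.
    Returns a ((n_heads // 2) * d_head, d_model) weight, matching the reduced
    number of locally computed Value heads in SkipV1Former's deep layers.
    """
    d_head = w_v.shape[0] // n_heads
    per_head = w_v.view(n_heads, d_head, -1)                  # rows grouped by head
    pooled = per_head.view(n_heads // 2, 2, d_head, -1).mean(dim=1)
    return pooled.reshape((n_heads // 2) * d_head, -1)


# Hypothetical example: 16 heads of size 64, model width 1024.
w = torch.randn(16 * 64, 1024)
w_half = pool_value_heads(w, n_heads=16)                      # -> (512, 1024)
```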

This uptraining strategy enables efficient migration of LLMs to the SkipV1Former architecture, avoiding significant compute investments and data requirements traditionally associated with training from scratch.

6. Comparative Analysis

A direct comparison across multiple axes demonstrates the strengths of SkipV1Former relative to baseline and contemporary alternatives:

| Model Variant   | Perplexity | KV Cache Requirement | First-Layer Value Reuse |
|-----------------|------------|----------------------|-------------------------|
| MHA Transformer | Baseline   | Baseline             | None                    |
| ResFormer       | Comparable | No reduction         | No                      |
| DenseFormer     | Comparable | No reduction         | No                      |
| CLA/YOCO        | Higher     | 25–50% reduction     | No                      |
| SkipV1Former    | Lower      | 25% reduction        | Yes                     |
| SkipV1-YOCO     | Lower      | 50% reduction        | Yes                     |

SkipV1Former is unique in delivering both improved representation (via enhanced access to uncompressed Values) and significant KV cache reduction. In the same cache-constrained regime as YOCO or CLA, SkipV1Former provides materially better perplexity and validation loss.

A salient feature is its ability to serve as a "best-of-both-worlds" architecture, simultaneously enabling memory efficiency and representational fidelity through the strategic reuse of the first layer's Value projections.

7. Practical Implications and Use Cases

SkipV1Former offers broad applicability for large-scale autoregressive generative models, especially in deployment contexts constrained by memory or inference speed. Potential applications include:

  • LLM serving, where GPU or hardware memory bottlenecks limit the size of KV caches, e.g., during massively parallel deployment or long-context inference.
  • Training and pretraining regimes that seek to maximize model performance per FLOP or memory investment.
  • Scenarios involving the migration of pre-existing MHA Transformer checkpoints, leveraging the uptraining recipe for efficient architectural upgrades.

A plausible implication is that SkipV1Former may influence future design of compute- and memory-efficient decoders in both academic prototypes and production-level generative language systems, particularly as context lengths and model sizes continue to scale.
