Frac-Connections: Fractional Extension of Hyper-Connections (2503.14125v1)

Published 18 Mar 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Residual connections are central to modern deep learning architectures, enabling the training of very deep networks by mitigating gradient vanishing. Hyper-Connections recently generalized residual connections by introducing multiple connection strengths at different depths, thereby addressing the seesaw effect between gradient vanishing and representation collapse. However, Hyper-Connections increase memory access costs by expanding the width of hidden states. In this paper, we propose Frac-Connections, a novel approach that divides hidden states into multiple parts rather than expanding their width. Frac-Connections retain partial benefits of Hyper-Connections while reducing memory consumption. To validate their effectiveness, we conduct large-scale experiments on language tasks, with the largest being a 7B MoE model trained on up to 3T tokens, demonstrating that Frac-Connections significantly outperform residual connections.

Summary

  • The paper introduces Frac-Connections (FC), which extend Hyper-Connections by partitioning hidden states into fractions, reducing memory footprint while preserving connection benefits and mitigating gradient issues.
  • Experimental validation shows that Dynamic Frac-Connections (DFC), particularly DFCx4, improve convergence, lower training loss, and enhance downstream task performance on large dense and sparse language models like OLMo2 and OLMoE.
  • Frac-Connections offer a memory-efficient alternative to Hyper-Connections and flexibility via static (SFC) or dynamic (DFC) weighting, demonstrating practical advantages in scalability and performance for advanced deep learning architectures.

Introduction and Background

Frac-Connections (FC) extend the Hyper-Connections (HC) framework by subdividing hidden states into multiple fractions instead of expanding their width. By allowing the expansion rate n to take fractional values 0 < n ≤ 1, the method partitions the hidden state into m = 1/n parts. This reduces the memory footprint while preserving multiple connection strengths, offering a balanced trade-off between mitigating gradient vanishing and alleviating representation collapse. The framework comes in two variants, Static Frac-Connections (SFC) and Dynamic Frac-Connections (DFC), which parameterize the connection weights either as learnable static scalars or as weights computed dynamically from the input activations.
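
As a small illustration of this partitioning (the dimensions below are hypothetical, and the chunking is one natural reading of the description rather than the paper's exact formulation):

import torch

# Relation between the fractional expansion rate n and the number of fractions m = 1/n.
d = 4096                     # hidden width (hypothetical)
n = 0.25                     # fractional expansion rate, 0 < n <= 1
m = int(round(1 / n))        # number of fractions, here 4

h = torch.randn(1, 128, d)   # [batch, seq, hidden]
fractions = h.chunk(m, dim=-1)
print([f.shape[-1] for f in fractions])   # -> [1024, 1024, 1024, 1024]

Unlike HC with an integer expansion rate, the residual stream keeps its original width d; only its interpretation changes.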

Methodology

Frac-Connections reformulate how hidden states are integrated across layers. Given a hidden state vector h, FC splits it into m sub-vectors (h₁, h₂, …, hₘ), each routed through its own connection pathway. The fractions are then linearly recombined with weights given by a connection matrix (denoted FC in the paper). Notably, FC is initialized to mimic a Pre-Norm residual connection, so that gradient behavior at the start of training matches that of a standard residual network.
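
One natural reading of this recombination, in our own notation rather than the paper's (the exact matrix structure and initialization are specified in the paper's Eq. 12 and Algorithm 1), mixes the m fractions with an m × m weight matrix:

import torch

m, frac_dim = 4, 1024
W = torch.eye(m)                                   # FC-style mixing matrix; identity mimics a plain residual mix at init
h = torch.randn(2, 16, m * frac_dim)               # [batch, seq, hidden]
fracs = torch.stack(h.chunk(m, dim=-1), dim=-2)    # [batch, seq, m, frac_dim]
mixed = torch.einsum('ij,bsjd->bsid', W, fracs)    # h'_i = sum_j W_ij * h_j
out = mixed.flatten(-2)                            # back to [batch, seq, hidden]

Initializing the mixing matrix near the identity is what lets the block behave like a Pre-Norm residual connection at the start of training.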

The dynamic variant, DFC, computes its weights as follows:

  1. Input Partitioning: The hidden state h is divided into m parts.
  2. Weight Computation: Each fraction is normalized, passed through a linear projection, and bounded by a non-linear activation (typically tanh), keeping the dynamic weights in a stable range (sketched in code after this list).
  3. Dynamic Aggregation: The dynamic weights, whose projection is initialized to zero, are combined with the static weights, which follow a prescribed initialization structure (see Eq. 12 of the paper).
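
A minimal sketch of the dynamic weight path for a single fraction, assuming the order described in step 2 (names and shapes are illustrative, not the paper's):

import torch
import torch.nn as nn

def dynamic_weight_head(frac_dim: int) -> nn.Module:
    """Dynamic-weight path for one fraction: normalize -> project -> tanh."""
    proj = nn.Linear(frac_dim, 1, bias=False)
    nn.init.zeros_(proj.weight)        # zero init: the dynamic component starts at exactly 0
    return nn.Sequential(nn.LayerNorm(frac_dim), proj, nn.Tanh())

Because the projection is zero-initialized, the connection initially relies only on the static weights, consistent with the prescribed initialization in step 3.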

Pseudocode for Frac-Connections within a Transformer block is provided in Algorithm 1 of the paper; it can be summarized as:

for each layer L:
    split hidden_state into m fractions: h1, ..., hm
    for each fraction hi:
        static_weight[i] = learnable weight for fraction i    # input-independent; identity-like at init
        dynamic_weight[i] = tanh(linear(norm(hi)))            # projection initialized to zero
        combined_weight[i] = static_weight[i] + dynamic_weight[i]
    aggregate the fractions using combined_weight
    propagate the aggregated state to the next layer L+1
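
For concreteness, the PyTorch sketch below mirrors this pseudocode with per-fraction scalar weights. It is a simplified illustration, not the paper's implementation (the exact mixing structure follows Eq. 12 and Algorithm 1 there), and the module and argument names are our own:

import torch
import torch.nn as nn

class FracConnectionSketch(nn.Module):
    """Simplified Frac-Connection weighting around one Transformer sublayer."""

    def __init__(self, hidden_dim: int, m: int, sublayer: nn.Module, dynamic: bool = True):
        super().__init__()
        assert hidden_dim % m == 0, "hidden width must be divisible by the number of fractions"
        self.m, self.frac_dim = m, hidden_dim // m
        self.sublayer = sublayer
        # Static weights: one learnable scalar per fraction, initialized so the block
        # starts out like an ordinary Pre-Norm residual connection.
        self.static_weight = nn.Parameter(torch.ones(m))
        self.dynamic = dynamic
        if dynamic:
            self.norm = nn.LayerNorm(self.frac_dim)
            self.proj = nn.Linear(self.frac_dim, 1, bias=False)
            nn.init.zeros_(self.proj.weight)       # dynamic component starts at zero (DFC)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # 1. Split the hidden state into m fractions along the feature dimension.
        fractions = h.chunk(self.m, dim=-1)        # each: [..., hidden_dim / m]
        weighted = []
        for i, h_i in enumerate(fractions):
            w = self.static_weight[i]
            if self.dynamic:
                # Bounded, input-conditioned adjustment; exactly zero at initialization.
                w = w + torch.tanh(self.proj(self.norm(h_i)))
            weighted.append(w * h_i)
        # 2. Aggregate the weighted fractions back into a full-width hidden state.
        mixed = torch.cat(weighted, dim=-1)
        # 3. Apply the sublayer and propagate with a residual-style update.
        return h + self.sublayer(mixed)

Here sublayer would be the attention or feed-forward module of the Transformer block; DFCx4 in the paper corresponds to m = 4, and setting dynamic=False gives an SFC-like variant.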

This design guarantees that when n = 1 (or equivalently m = 1), the FC framework reduces to the HC paradigm.

Experimental Validation

The experiments validate Frac-Connections on large-scale language models, including a dense model (OLMo2-1B2) and sparse MoE models (OLMoE-1.3B and OLMoE-7B). Key observations include:

  • Training Stability and Loss Reduction: DFC, particularly with a fractional rate equivalent to 4 sub-states (DFCx4), converges faster and reaches lower training loss than baseline residual connections. In the OLMoE-7B experiments, for instance, the DFC variant consistently converged faster and finished with a lower loss.
  • Downstream Task Performance: Quantitative evaluations on benchmarks such as WinoGrande, MMLU Var, Commonsense QA, HellaSwag, SciQ, and BoolQ show that FC variants significantly outperform strong residual-connection baselines. Notably, FC improved over these baselines even in settings where HC converged faster, indicating that FC stabilizes training without incurring the memory overhead typical of HC.
  • Ablation Studies: Empirical analyses further revealed that the removal of rescaling in the dynamic weights causes the most severe degradation in performance. Comparisons between DFC and SFC clearly indicate that dynamic weighting provides a tangible improvement in the model's adaptability and final task accuracy.

Comparisons to Residual and Hyper-Connections

Frac-Connections address a critical limitation inherent to both classical residual connections and Hyper-Connections. While residual connections are notorious for the seesaw effect between gradient vanishing and representation collapse, HC mitigates this at the expense of increased memory footprint due to hidden state expansion.

  • Memory Efficiency: By partitioning rather than expanding the hidden state, FC retains comparable gradient propagation benefits while reducing memory access costs (an illustrative back-of-the-envelope comparison follows this list). This structural efficiency is critical in large-scale settings where the computational budget is stringent.
  • Flexibility in Connection Strengths: FC’s fractional partitioning offers a continuum of connection strengths. In practice, for m = 1 (n = 1), FC is analogous to HC, but for m > 1 (n < 1), the method distributes computational resources more granularly across hidden state components, balancing expressiveness and resource constraints.
  • Dynamic vs. Static Parametrization: The introduction of DFC represents an advancement over static weight methods. The dynamic computation of weights conditioned on current activations allows for a more nuanced response to input variability, which is manifest in the improved convergence and task-specific performance metrics reported in the paper.
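
The back-of-the-envelope comparison referenced above, with purely hypothetical numbers, illustrates why partitioning rather than expanding matters for the residual stream:

# Illustrative residual-stream sizes only; real savings depend on the implementation.
d, seq, batch, bytes_per = 4096, 4096, 1, 2          # bf16 activations (hypothetical config)

residual = batch * seq * d * bytes_per               # standard residual stream
hyper    = batch * seq * (4 * d) * bytes_per         # HC with expansion rate n = 4 widens the stream
frac     = batch * seq * d * bytes_per               # FC with m = 4 keeps width d, only re-interprets it

print(f"residual {residual >> 20} MiB | HC x4 {hyper >> 20} MiB | FC x4 {frac >> 20} MiB")

FC's per-layer activation footprint matches the standard residual connection, while HC's grows with its expansion rate.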

Practical Considerations and Implementation

When integrating Frac-Connections within existing architectures such as Transformers, practitioners should consider the following:

  • Initialization Strategy: The initialization scheme must ensure that the FC modules start as identity mappings akin to Pre-Norm residual connections. This facilitates stable gradient propagation in early training stages before dynamic weights adjust to optimal values (a minimal sanity check is sketched after this list).
  • Computational Overhead vs. Memory Trade-off: Although FC reduces memory overhead relative to HC, the computational complexity of dynamically computing weights introduces an overhead that must be balanced. Profiling GPU memory and compute cycles per layer is therefore advisable.
  • Compatibility with MoE Models: Given that the fractional division inherently aligns with partitioned MoE structures, integrating FC into MoE-based LLMs is straightforward. It is recommended to experiment with various frac-rates (e.g., DFCx2, DFCx4) to optimize for specific downstream tasks and memory constraints.
  • Ablation and Scaling Experiments: Empirical performance may vary based on the selected dynamic versus static configurations. Running controlled ablation studies, particularly focusing on the rescaling factors within DFC, can provide deep insights for fine-tuning models effectively.
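
As a quick check of the identity-style start described in the first bullet (reusing the illustrative FracConnectionSketch and imports from the Methodology section, not the paper's code):

torch.manual_seed(0)
block = FracConnectionSketch(hidden_dim=64, m=4, sublayer=nn.Identity())
x = torch.randn(2, 8, 64)
# At initialization the dynamic path is zero and the static weights are one,
# so with an identity sublayer the block reduces to a plain residual: x + x.
assert torch.allclose(block(x), 2 * x)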

Conclusion

Frac-Connections offer a robust extension of residual connection strategies in deep neural networks. By partitioning hidden states and applying a combination of static and dynamic connection weights, the method significantly improves training stability and downstream performance on large-scale language tasks while avoiding the memory overhead of Hyper-Connections. Its performance on both dense and sparse architectures, together with its compatibility with MoE models, demonstrates its potential as a scalable and practical mechanism for advanced model architectures. Quantitative improvements across several benchmarks further substantiate the efficacy of the approach, making it a compelling option for next-generation deep learning implementations.
