- The paper proposes a dual encoding hypothesis where neural networks represent both feature identity and integration for improved interpretability.
- It introduces a joint-training architecture combining SAEs and NFMs, achieving a 41.3% improvement in reconstruction and a 51.6% reduction in KL divergence.
- Results from PCA and ANOVA analyses confirm that integration features capture complex inter-concept relationships, supporting emergent behavior in networks.
Feature Integration Spaces: Joint Training Reveals Dual Encoding in Neural Network Representations
Introduction
The paper explores the neural network interpretability landscape, challenging the prevalent assumption that neural representations follow a linear superposition model. Sparse autoencoders (SAEs), traditionally used to decompose activations into sparse, interpretable components with high reconstruction fidelity, still leave polysemanticity and pathological behavioral errors unresolved. This paper posits that neural networks encode information in two complementary spaces: a feature identity space and a feature integration space. The research introduces a joint-training architecture to address these challenges and reports significant quantitative and qualitative improvements over standard SAE baselines.
Dual Encoding Hypothesis
The dual encoding hypothesis holds that neural networks encode information not only as identifiable sparse features but also through the integration of those features into complex emergent behaviors. The feature identity space captures individual concepts, while the feature integration space captures interactions between those concepts that give rise to novel emergent meanings. This framework reframes non-orthogonal representations as computational structures that convey meaningful inter-concept relationships: rather than being compression artifacts, they are treated as integral to how the network computes.
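One way to make the hypothesis concrete is as an additive decomposition of an activation vector into an identity term and an interaction term over the same sparse code. The notation below is an illustrative sketch, not the paper's own formulation: f(x) is the sparse SAE code, W_dec the SAE decoder, and v_i a learned embedding for feature i.

```latex
x \;\approx\;
\underbrace{W_{\text{dec}}\, f(x) + b}_{\text{feature identity space}}
\;+\;
\underbrace{\mathrm{MLP}\!\Big(\sum_{i<j} f_i(x)\, f_j(x)\; v_i \odot v_j\Big)}_{\text{feature integration space}}
```

The first term reconstructs what each feature means in isolation; the second term, a pairwise-interaction form of the kind used by factorization machines, carries meaning that only emerges when features co-occur.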
Methodology
Experimental Pipeline
The methodology employs a sequential process (a minimal code sketch follows the list):
- Sparse Feature Extraction: Initial SAE training extracts sparse features under a TopK constraint, which keeps only the k largest activations per input and thereby enforces sparsity without a separate regularization term.
- Integration Pattern Capture: A Neural Factorization Machine (NFM) predicts SAE reconstruction residuals to identify integration patterns.
- Integration Space Analysis: Secondary SAE layers analyze the dense NFM embeddings, emphasizing feature integration.
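A minimal PyTorch sketch of this three-stage pipeline is shown below. All module names, dimensions, and the bi-interaction pooling details are assumptions made for illustration; this is not the paper's released code.

```python
# Illustrative sketch of the sequential pipeline; shapes and names are assumed.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Stages 1 and 3: sparse autoencoder with a TopK activation constraint."""
    def __init__(self, d_in: int, n_features: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_in, n_features)
        self.decoder = nn.Linear(n_features, d_in)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = torch.relu(self.encoder(x))
        # Keep only the k largest activations per example; zero out the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)

    def forward(self, x: torch.Tensor):
        codes = self.encode(x)
        return self.decoder(codes), codes

class ResidualNFM(nn.Module):
    """Stage 2: Neural Factorization Machine trained to predict SAE residuals."""
    def __init__(self, n_features: int, d_embed: int, d_out: int):
        super().__init__()
        self.embeddings = nn.Embedding(n_features, d_embed)   # one vector v_i per SAE feature
        self.linear = nn.Linear(n_features, d_out)             # first-order (linear) term
        self.mlp = nn.Sequential(nn.Linear(d_embed, d_embed), nn.ReLU(),
                                 nn.Linear(d_embed, d_out))    # second-order (interaction) term

    def forward(self, codes: torch.Tensor):
        # Bi-interaction pooling: 0.5 * [ (sum_i f_i v_i)^2 - sum_i (f_i v_i)^2 ]
        weighted = codes.unsqueeze(-1) * self.embeddings.weight        # (batch, n_features, d_embed)
        bi = 0.5 * (weighted.sum(dim=1) ** 2 - (weighted ** 2).sum(dim=1))
        return self.linear(codes) + self.mlp(bi), bi

# Assumed (reduced) sizes; OpenLLaMA-3B activations are 3200-dimensional.
d_model, n_features, k, d_embed = 3200, 10_000, 300, 300

sae = TopKSAE(d_model, n_features, k)              # stage 1: feature identity
nfm = ResidualNFM(n_features, d_embed, d_model)    # stage 2: integration patterns
integration_sae = TopKSAE(d_embed, n_features, k)  # stage 3: analyze NFM embeddings

x = torch.randn(4, d_model)                        # a batch of LLM activations
recon, codes = sae(x)
residual_pred, bi_embed = nfm(codes)               # trained against the residual x - recon
_, integration_codes = integration_sae(bi_embed)
```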
The joint-training architecture instead optimizes all components concurrently, allowing the two feature types to specialize naturally during learning. Training uses the Adam optimizer with a linear learning-rate decay over 80% of training steps, with activations drawn from the WikiText-103 dataset.
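The optimizer and schedule can be expressed with a standard PyTorch LambdaLR. The step count, base learning rate, and the interpretation of "decay over 80% of steps" (hold, then decay to zero) are assumptions for this sketch.

```python
# Sketch of the optimization setup; step counts, base LR, and schedule shape are assumed.
import torch
import torch.nn as nn

joint_model = nn.Linear(3200, 3200)     # stand-in for the joint SAE + NFM parameters
total_steps = 100_000                    # assumed; not taken from the paper
decay_steps = int(0.8 * total_steps)     # linear decay spans 80% of training

optimizer = torch.optim.Adam(joint_model.parameters(), lr=1e-4)

def lr_lambda(step: int) -> float:
    # Assumed interpretation: hold the base LR for the first 20% of steps,
    # then decay linearly to zero over the remaining 80%.
    hold = total_steps - decay_steps
    if step < hold:
        return 1.0
    return max(0.0, 1.0 - (step - hold) / decay_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)

for step in range(total_steps):
    # ... forward pass, reconstruction + residual losses, loss.backward() ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```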
Training Setup
The architecture was trained on OpenLLaMA-3B activations, extracted from the model's middle layer over 50-token windows. All experiments ran on a single NVIDIA RTX 3090 GPU.
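Harvesting these activations can be done with Hugging Face Transformers by requesting hidden states and indexing the middle layer. The checkpoint name, layer index, and window handling below are assumptions for illustration.

```python
# Hedged sketch: extracting middle-layer activations over 50-token windows.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_3b"   # assumed OpenLLaMA-3B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

middle_layer = model.config.num_hidden_layers // 2   # "middle layer" of the network

@torch.no_grad()
def middle_layer_activations(text: str, window: int = 50) -> torch.Tensor:
    """Return middle-layer hidden states for a window of up to 50 tokens."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=window)
    out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so this index gives the residual
    # stream after `middle_layer` transformer blocks.
    return out.hidden_states[middle_layer]            # shape: (1, seq_len, 3200)

acts = middle_layer_activations("The quick brown fox jumps over the lazy dog.")
```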
Results
Quantitative Analysis
The paper reports a substantial 41.3% improvement in reconstruction over traditional SAE models, and the joint-training architecture achieves a 51.6% reduction in KL divergence relative to baseline methods. Feature analysis shows that linear interactions account for the majority of the reconstruction improvement, while the non-linear components still contribute meaningfully despite their minimal parameter footprint.
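How the KL divergence is measured is not spelled out here; a common choice in SAE evaluations, assumed in the sketch below, is to compare the model's next-token distributions before and after splicing reconstructed activations back into the forward pass.

```python
# Sketch of a behavioral-fidelity metric: KL between original and reconstruction-patched
# next-token distributions. The paper's exact protocol is assumed, not quoted.
import torch
import torch.nn.functional as F

def mean_kl(original_logits: torch.Tensor, patched_logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token KL(original || patched); both tensors are (batch, seq, vocab)."""
    log_p = F.log_softmax(original_logits, dim=-1).flatten(0, 1)
    log_q = F.log_softmax(patched_logits, dim=-1).flatten(0, 1)
    # kl_div takes the approximating distribution first, as log-probabilities.
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")

# A 51.6% reduction means the joint model's KL is roughly 0.484x the baseline's.
```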
Feature Integration Characteristics
Stimulus-driven experiments identified features with selective sensitivity and interaction-driven behavioral changes, and ANOVA analyses validated the functional specificity of the integration features. These experiments revealed significant interaction effects across semantic dimensions, confirming that integration features are computationally relevant.
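A two-way ANOVA with an interaction term is the standard test for whether a feature responds to a combination of stimulus dimensions rather than to either dimension alone. The sketch below uses statsmodels with toy data and assumed column names; it is not the paper's analysis code.

```python
# Sketch: testing for an interaction effect on one feature's activation
# across two crossed stimulus dimensions (column names and data are illustrative).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per stimulus: the feature's activation plus two categorical factors.
df = pd.DataFrame({
    "activation":  [0.9, 0.8, 0.2, 0.1, 0.3, 0.2, 1.5, 1.4],
    "dimension_a": ["food", "food", "tool", "tool", "food", "food", "tool", "tool"],
    "dimension_b": ["positive", "positive", "positive", "positive",
                    "negative", "negative", "negative", "negative"],
})

model = ols("activation ~ C(dimension_a) * C(dimension_b)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
# A significant C(dimension_a):C(dimension_b) row indicates the feature encodes
# the combination of concepts rather than either concept alone.
print(anova_table)
```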
The joint architecture also surpassed the sequential pipeline: its bimodal feature organization reflects distinct specialization roles, and PCA analysis showed higher-dimensional feature representations in the joint-training setup, indicative of a greater capacity to capture integration patterns.
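One simple way to quantify such a dimensionality difference is to count the principal components needed to reach a fixed variance threshold. The sketch below assumes the comparison is run over feature matrices (for example, decoder directions or activation patterns); which matrices the paper decomposes is not specified here.

```python
# Sketch: comparing effective dimensionality of two feature spaces via PCA.
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(features: np.ndarray, threshold: float = 0.90) -> int:
    """Number of principal components needed to explain `threshold` of the variance."""
    pca = PCA().fit(features)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Stand-in matrices: rows are features, columns are representation dimensions.
sequential_feats = np.random.randn(4096, 256)
joint_feats = np.random.randn(4096, 256) @ np.random.randn(256, 256)

print(components_for_variance(sequential_feats), components_for_variance(joint_feats))
```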
Discussion
The paper’s findings underscore a paradigm shift from post hoc feature analysis to integrated computational design. The results suggest that joint training naturally uncovers and harnesses feature integrations alongside identity features, addressing the inadequacies observed in traditional sparse autoencoders. These insights challenge existing interpretations of polysemantic neurons, positing them as computationally meaningful rather than mere artifacts.
Future directions may include scaling to larger models, refining interpretability techniques for integration features, and exploring alternative architectures for capturing structured relationships effectively.
Conclusion
The introduction of a joint-training architecture represents a significant step in neural network interpretability, supporting the dual encoding hypothesis with robust empirical evidence. By improving both reconstruction fidelity and behavioral metrics, the work demonstrates that computational relationships within neural networks can be captured efficiently and effectively, with implications for model reliability, control, and the broader understanding of intelligence in AI systems.