- The paper proposes a dual encoding hypothesis where neural networks represent both feature identity and integration for improved interpretability.
- It introduces a joint-training architecture combining SAEs and NFMs, achieving a 41.3% improvement in reconstruction and a 51.6% reduction in KL divergence.
- Results from PCA and ANOVA analyses confirm that integration features capture complex inter-concept relationships, supporting emergent behavior in networks.
Feature Integration Spaces: Joint Training Reveals Dual Encoding in Neural Network Representations
Introduction
The paper explores the neural network interpretability landscape, challenging the prevalent assumption that neural representations follow a linear superposition model. Sparse autoencoders (SAEs), traditionally used to decompose activations into sparse, interpretable components with high reconstruction fidelity, still leave polysemanticity and pathological behavioral errors unresolved. This paper posits that neural networks encode information in two complementary spaces: a feature identity space and a feature integration space. The research introduces a joint-training architecture to address these challenges and reports significant quantitative and qualitative improvements over standard SAE baselines.
Dual Encoding Hypothesis
The dual encoding hypothesis holds that neural networks encode information not only as identifiable sparse features but also through the integration of those features into complex emergent behaviors. The feature identity space captures individual concepts, while the feature integration space captures interactions between those concepts that give rise to novel emergent meanings. This framework reframes non-orthogonal representations as computational structures that convey meaningful inter-concept relationships: rather than being compression artifacts, they are treated as integral to how the network computes.
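One way to make the hypothesis concrete is as an additive decomposition of an activation vector into an identity term and an interaction term over the same sparse code. The notation below is an illustrative sketch, not the paper's own formulation: f(x) is the sparse SAE code, W_dec the SAE decoder, and v_i a learned embedding for feature i.

```latex
x \;\approx\;
\underbrace{W_{\text{dec}}\, f(x) + b}_{\text{feature identity space}}
\;+\;
\underbrace{\mathrm{MLP}\!\Big(\sum_{i<j} f_i(x)\, f_j(x)\; v_i \odot v_j\Big)}_{\text{feature integration space}}
```

The first term reconstructs what each feature means in isolation; the second term, a pairwise-interaction form of the kind used by factorization machines, carries meaning that only emerges when features co-occur.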
Methodology
Experimental Pipeline
The methodology employs a sequential process (a minimal code sketch follows the list):
- Sparse Feature Extraction: Initial SAE training extracts sparse features under a TopK constraint, which keeps only the k largest activations per input and thereby enforces sparsity without a separate regularization term.
- Integration Pattern Capture: A Neural Factorization Machine (NFM) predicts SAE reconstruction residuals to identify integration patterns.
- Integration Space Analysis: Secondary SAE layers analyze the dense NFM embeddings, emphasizing feature integration.
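A minimal PyTorch sketch of this three-stage pipeline is shown below. All module names, dimensions, and the bi-interaction pooling details are assumptions made for illustration; this is not the paper's released code.

```python
# Illustrative sketch of the sequential pipeline; shapes and names are assumed.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Stages 1 and 3: sparse autoencoder with a TopK activation constraint."""
    def __init__(self, d_in: int, n_features: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_in, n_features)
        self.decoder = nn.Linear(n_features, d_in)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = torch.relu(self.encoder(x))
        # Keep only the k largest activations per example; zero out the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)

    def forward(self, x: torch.Tensor):
        codes = self.encode(x)
        return self.decoder(codes), codes

class ResidualNFM(nn.Module):
    """Stage 2: Neural Factorization Machine trained to predict SAE residuals."""
    def __init__(self, n_features: int, d_embed: int, d_out: int):
        super().__init__()
        self.embeddings = nn.Embedding(n_features, d_embed)   # one vector v_i per SAE feature
        self.linear = nn.Linear(n_features, d_out)             # first-order (linear) term
        self.mlp = nn.Sequential(nn.Linear(d_embed, d_embed), nn.ReLU(),
                                 nn.Linear(d_embed, d_out))    # second-order (interaction) term

    def forward(self, codes: torch.Tensor):
        # Bi-interaction pooling: 0.5 * [ (sum_i f_i v_i)^2 - sum_i (f_i v_i)^2 ]
        weighted = codes.unsqueeze(-1) * self.embeddings.weight        # (batch, n_features, d_embed)
        bi = 0.5 * (weighted.sum(dim=1) ** 2 - (weighted ** 2).sum(dim=1))
        return self.linear(codes) + self.mlp(bi), bi

# Assumed (reduced) sizes; OpenLLaMA-3B activations are 3200-dimensional.
d_model, n_features, k, d_embed = 3200, 10_000, 300, 300

sae = TopKSAE(d_model, n_features, k)              # stage 1: feature identity
nfm = ResidualNFM(n_features, d_embed, d_model)    # stage 2: integration patterns
integration_sae = TopKSAE(d_embed, n_features, k)  # stage 3: analyze NFM embeddings

x = torch.randn(4, d_model)                        # a batch of LLM activations
recon, codes = sae(x)
residual_pred, bi_embed = nfm(codes)               # trained against the residual x - recon
_, integration_codes = integration_sae(bi_embed)
```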
The joint-training architecture instead optimizes all components concurrently, allowing the two feature types to specialize naturally during learning. Training uses the Adam optimizer with a linear learning-rate decay over 80% of training steps, with activations drawn from the WikiText-103 dataset.
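The optimizer and schedule can be expressed with a standard PyTorch LambdaLR. The step count, base learning rate, and the interpretation of "decay over 80% of steps" (hold, then decay to zero) are assumptions for this sketch.

```python
# Sketch of the optimization setup; step counts, base LR, and schedule shape are assumed.
import torch
import torch.nn as nn

joint_model = nn.Linear(3200, 3200)     # stand-in for the joint SAE + NFM parameters
total_steps = 100_000                    # assumed; not taken from the paper
decay_steps = int(0.8 * total_steps)     # linear decay spans 80% of training

optimizer = torch.optim.Adam(joint_model.parameters(), lr=1e-4)

def lr_lambda(step: int) -> float:
    # Assumed interpretation: hold the base LR for the first 20% of steps,
    # then decay linearly to zero over the remaining 80%.
    hold = total_steps - decay_steps
    if step < hold:
        return 1.0
    return max(0.0, 1.0 - (step - hold) / decay_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)

for step in range(total_steps):
    # ... forward pass, reconstruction + residual losses, loss.backward() ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```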
Training Setup
The architecture was trained on OpenLLaMA-3B activations, extracted from the model's middle layer over 50-token windows. All experiments ran on a single NVIDIA RTX 3090 GPU.
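Harvesting these activations can be done with Hugging Face Transformers by requesting hidden states and indexing the middle layer. The checkpoint name, layer index, and window handling below are assumptions for illustration.

```python
# Hedged sketch: extracting middle-layer activations over 50-token windows.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_3b"   # assumed OpenLLaMA-3B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

middle_layer = model.config.num_hidden_layers // 2   # "middle layer" of the network

@torch.no_grad()
def middle_layer_activations(text: str, window: int = 50) -> torch.Tensor:
    """Return middle-layer hidden states for a window of up to 50 tokens."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=window)
    out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so this index gives the residual
    # stream after `middle_layer` transformer blocks.
    return out.hidden_states[middle_layer]            # shape: (1, seq_len, 3200)

acts = middle_layer_activations("The quick brown fox jumps over the lazy dog.")
```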
Results
Quantitative Analysis
The paper reports a substantial 41.3% improvement in reconstruction over traditional SAE models, and the joint-training architecture achieves a 51.6% reduction in KL divergence relative to baseline methods. Feature analysis shows that linear interactions account for the majority of the reconstruction improvement, while the non-linear components still contribute meaningfully despite their minimal parameter footprint.
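How the KL divergence is measured is not spelled out here; a common choice in SAE evaluations, assumed in the sketch below, is to compare the model's next-token distributions before and after splicing reconstructed activations back into the forward pass.

```python
# Sketch of a behavioral-fidelity metric: KL between original and reconstruction-patched
# next-token distributions. The paper's exact protocol is assumed, not quoted.
import torch
import torch.nn.functional as F

def mean_kl(original_logits: torch.Tensor, patched_logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token KL(original || patched); both tensors are (batch, seq, vocab)."""
    log_p = F.log_softmax(original_logits, dim=-1).flatten(0, 1)
    log_q = F.log_softmax(patched_logits, dim=-1).flatten(0, 1)
    # kl_div takes the approximating distribution first, as log-probabilities.
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")

# A 51.6% reduction means the joint model's KL is roughly 0.484x the baseline's.
```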
Feature Integration Characteristics
Stimulus-driven experiments identified features with selective sensitivity and interaction-driven behavioral changes, and ANOVA analyses validated the functional specificity of the integration features. These experiments revealed significant interaction effects across semantic dimensions, confirming that integration features are computationally relevant.
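A two-way ANOVA with an interaction term is the standard test for whether a feature responds to a combination of stimulus dimensions rather than to either dimension alone. The sketch below uses statsmodels with toy data and assumed column names; it is not the paper's analysis code.

```python
# Sketch: testing for an interaction effect on one feature's activation
# across two crossed stimulus dimensions (column names and data are illustrative).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per stimulus: the feature's activation plus two categorical factors.
df = pd.DataFrame({
    "activation":  [0.9, 0.8, 0.2, 0.1, 0.3, 0.2, 1.5, 1.4],
    "dimension_a": ["food", "food", "tool", "tool", "food", "food", "tool", "tool"],
    "dimension_b": ["positive", "positive", "positive", "positive",
                    "negative", "negative", "negative", "negative"],
})

model = ols("activation ~ C(dimension_a) * C(dimension_b)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
# A significant C(dimension_a):C(dimension_b) row indicates the feature encodes
# the combination of concepts rather than either concept alone.
print(anova_table)
```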
The joint architecture also surpassed the sequential pipeline: its bimodal feature organization reflects distinct specialization roles, and PCA analysis showed higher-dimensional feature representations in the joint-training setup, indicative of a greater capacity to capture integration patterns.
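One simple way to quantify such a dimensionality difference is to count the principal components needed to reach a fixed variance threshold. The sketch below assumes the comparison is run over feature matrices (for example, decoder directions or activation patterns); which matrices the paper decomposes is not specified here.

```python
# Sketch: comparing effective dimensionality of two feature spaces via PCA.
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(features: np.ndarray, threshold: float = 0.90) -> int:
    """Number of principal components needed to explain `threshold` of the variance."""
    pca = PCA().fit(features)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Stand-in matrices: rows are features, columns are representation dimensions.
sequential_feats = np.random.randn(4096, 256)
joint_feats = np.random.randn(4096, 256) @ np.random.randn(256, 256)

print(components_for_variance(sequential_feats), components_for_variance(joint_feats))
```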
Discussion
The paper’s findings underscore a paradigm shift from post hoc feature analysis to integrated computational design. The results suggest that joint training naturally uncovers and harnesses feature integrations alongside identity features, addressing the inadequacies observed in traditional sparse autoencoders. These insights challenge existing interpretations of polysemantic neurons, positing them as computationally meaningful rather than mere artifacts.
Future directions may include scaling to larger models, refining interpretability techniques for integration features, and exploring alternative architectures for capturing structured relationships effectively.
Conclusion
The introduction of a joint-training architecture represents a significant step in neural network interpretability, supporting the dual encoding hypothesis with robust empirical evidence. By improving both reconstruction fidelity and behavioral metrics, the work demonstrates that computational relationships within neural networks can be captured efficiently and effectively, with implications for model reliability, control, and the broader understanding of intelligence in AI systems.