
Low Rank and Sparse Fourier Structure in Recurrent Networks Trained on Modular Addition (2503.22059v1)

Published 28 Mar 2025 in cs.LG, eess.SP, and stat.ML

Abstract: Modular addition tasks serve as a useful test bed for observing empirical phenomena in deep learning, including the phenomenon of \emph{grokking}. Prior work has shown that one-layer transformer architectures learn Fourier Multiplication circuits to solve modular addition tasks. In this paper, we show that Recurrent Neural Networks (RNNs) trained on modular addition tasks also use a Fourier Multiplication strategy. We identify low rank structures in the model weights, and attribute model components to specific Fourier frequencies, resulting in a sparse representation in the Fourier space. We also show empirically that the RNN is robust to removing individual frequencies, while the performance degrades drastically as more frequencies are ablated from the model.

Summary

An Analytical Exploration of Fourier Structure in RNNs on Modular Addition Tasks

The paper examines the relationship between modular addition tasks and recurrent neural network (RNN) architectures, focusing on the emergence of Fourier multiplication circuits within these networks. The work extends prior findings on one-layer transformer models and shows that RNNs also learn a Fourier-based strategy to solve modular addition.
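For concreteness, the task is: given residues a and b, predict (a + b) mod p. Below is a minimal PyTorch sketch of the dataset and a one-layer RNN classifier of the kind studied; the modulus, embedding size, and hidden size are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

p = 113  # modulus; an illustrative choice, not necessarily the paper's value

# Dataset: every pair (a, b) with label (a + b) mod p.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # shape (p*p, 2)
labels = (pairs[:, 0] + pairs[:, 1]) % p

class ModAddRNN(nn.Module):
    """Embed each token, run the two-token sequence through an RNN,
    and read the answer off the final hidden state."""
    def __init__(self, p, d_embed=128, d_hidden=256):
        super().__init__()
        self.embed = nn.Embedding(p, d_embed)
        self.rnn = nn.RNN(d_embed, d_hidden, batch_first=True)
        self.unembed = nn.Linear(d_hidden, p)

    def forward(self, tokens):            # tokens: (batch, 2)
        x = self.embed(tokens)            # (batch, 2, d_embed)
        _, h_final = self.rnn(x)          # h_final: (1, batch, d_hidden)
        return self.unembed(h_final[-1])  # logits over the p residues

model = ModAddRNN(p)
logits = model(pairs[:8])
print(logits.shape)  # torch.Size([8, 113])
```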

Key Findings

Recurrent Neural Networks trained on modular addition tasks exhibit low-rank structure in their weights, and individual model components can be attributed to specific Fourier frequencies, yielding a sparse representation in the Fourier domain. This reveals an underlying mechanism in which RNNs solve modular addition via sparse Fourier representations, much like transformer architectures.

Methodological Overview

The study employs mechanistic interpretability, a growing field aimed at reverse engineering the inner workings of deep learning models. It decomposes the RNN's computational graph and analyzes the Fourier coefficients at each node, demonstrating sparse representations at every level.
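The basic operation behind this analysis can be sketched as follows: express an activation indexed by the input residue (for example, the embedding of each token) in the Fourier basis over Z_p and inspect which frequencies carry energy. The sketch below uses the embedding matrix of the model above as a representative node; on a trained model, the energy would concentrate on a few frequencies.

```python
# Sketch: Fourier analysis of one node of the computational graph.
# The node here is the embedding table E of shape (p, d_embed); the same idea
# applies to hidden states collected as a function of the input residue.
E = model.embed.weight.detach()            # (p, d_embed)

# DFT along the residue dimension; row k holds the coefficients at frequency k.
E_hat = torch.fft.rfft(E, dim=0)           # (p//2 + 1, d_embed), complex

# Energy carried by each frequency, aggregated over embedding dimensions.
freq_energy = E_hat.abs().pow(2).sum(dim=1)
freq_energy = freq_energy / freq_energy.sum()

topk = torch.topk(freq_energy, k=5)
print("dominant frequencies:", topk.indices.tolist())
print("fraction of energy:  ", topk.values.sum().item())
```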

Additionally, the research investigates the singular value spectra of RNN weights, uncovering a structured low-rank component correlated with achieving peak performance on modular addition tasks. Further, the alignment between singular vector components and Fourier frequencies is scrutinized, establishing a connection between rank reduction and sparse Fourier characteristics.
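The corresponding weight-space analysis can be sketched the same way: take the SVD of a weight matrix whose rows range over the p residues, and measure how much each singular vector projects onto individual Fourier modes. The code below is an illustrative sketch using the embedding matrix; the paper applies the analogous analysis to the embedding, unembedding, and input-hidden weights.

```python
# Sketch: low-rank structure and Fourier alignment of a weight matrix.
W = model.embed.weight.detach()                 # (p, d_embed); rows indexed by residue

U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# Singular value spectrum: a sharp drop-off indicates an effective low rank.
print("top singular values:", S[:10].tolist())

# Alignment of each left singular vector with the Fourier modes over Z_p:
# |rfft(u_i)|^2 concentrated on a single frequency means that singular
# direction "belongs" to that frequency.
U_hat = torch.fft.rfft(U, dim=0)                # (p//2 + 1, rank)
alignment = U_hat.abs().pow(2)
alignment = alignment / alignment.sum(dim=0, keepdim=True)
print("dominant frequency per singular vector:", alignment.argmax(dim=0)[:10].tolist())
```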

Experimental Findings

  • Sparse Fourier Representation: All computational nodes, including embeddings, hidden states, and output layers, display a sparse Fourier spectrum concentrated on a small set of frequencies.
  • Low Rank Structure: The weights of RNNs exhibit low-rank properties, where significant rank reduction in the embedding, unembedding, and input-hidden matrices is essential for maintaining model accuracy.
  • Frequency Ablation Tests: Removing a single frequency has little effect on model performance, but ablating multiple frequencies degrades accuracy sharply (a minimal sketch of such an ablation follows this list).
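A frequency ablation of this kind can be sketched as a Fourier-domain filter applied to the embedding: zero out the coefficients at the chosen frequencies, invert the transform, and re-evaluate the model. Which matrices are filtered and how accuracy is measured are assumptions of this sketch, not the paper's exact protocol.

```python
# Sketch: ablate a set of Fourier frequencies from the embedding and re-evaluate.
def ablate_frequencies(model, freqs):
    W = model.embed.weight.detach().clone()      # (p, d_embed)
    W_hat = torch.fft.rfft(W, dim=0)
    W_hat[list(freqs)] = 0                       # zero the chosen frequencies
    W_filtered = torch.fft.irfft(W_hat, n=W.shape[0], dim=0)
    with torch.no_grad():
        model.embed.weight.copy_(W_filtered)

def accuracy(model, pairs, labels):
    with torch.no_grad():
        preds = model(pairs).argmax(dim=-1)
    return (preds == labels).float().mean().item()

# Remove one frequency, then several more (edits accumulate in place), and compare.
ablate_frequencies(model, freqs=[7])
print("after 1 ablation:", accuracy(model, pairs, labels))
ablate_frequencies(model, freqs=[3, 11, 19, 25])
print("after more ablations:", accuracy(model, pairs, labels))
```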

Implications

The paper's findings offer insight into the structure neural networks discover, showing that the learned solution lives in a compact, low-dimensional Fourier domain; this could enhance interpretability and adaptability for tasks beyond modular addition. Sparse representations and low-rank weights also suggest pathways for optimizing network architectures, potentially reducing computational cost while maintaining robust performance.

Furthermore, understanding Fourier-based strategies in RNNs opens avenues for advancing theoretical models to predict how networks generalize, possibly shedding light on phenomena such as grokking.

Future Research Directions

This study suggests several intriguing paths for future research:

  1. Expansion to Complex Tasks: Investigating whether the low-rank, sparse Fourier properties extend to more complex algorithmic tasks or real-world applications.
  2. Mechanisms of Sparsity: Exploring which factors during training lead to spontaneous sparsity and low-rank configurations.
  3. Comparison Across Architectures: Further comparison between RNNs and transformer models in various configurations to delineate the bounds of Fourier structure utilization.

In conclusion, the paper enriches the discourse on modular arithmetic and neural architectures, encouraging mechanistic exploration of what governs sparse representations and low-rank structure in deep learning models.
