Focus Your Attention (with Adaptive IIR Filters) (2305.14952v2)
Abstract: We present a new layer in which dynamic (i.e., input-dependent) Infinite Impulse Response (IIR) filters of order two are used to process the input sequence prior to applying conventional attention. The input is split into chunks, and the coefficients of these filters are determined based on previous chunks to maintain causality. Despite their relatively low order, the causal adaptive filters are shown to focus attention on the relevant sequence elements. The new layer is grounded in control theory and is shown to generalize diagonal state-space layers. The layer performs on par with state-of-the-art networks, with a fraction of their parameters and with time complexity that is sub-quadratic in the input size. The obtained layer compares favorably to layers such as Hyena, GPT2, and Mega, both with respect to the number of parameters and the level of performance obtained on multiple long-range sequence problems.
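The sketch below illustrates the mechanism the abstract describes: the sequence is split into chunks, a small hypernetwork predicts second-order IIR coefficients from a summary of the *previous* chunk (preserving causality), the filter is applied to the current chunk, and conventional attention follows. This is a minimal illustration, not the authors' implementation; the module names, the mean-pooled chunk summary, and the coefficient-bounding heuristic are assumptions made here for clarity.

```python
# Minimal sketch (not the paper's code) of chunked, input-dependent
# second-order IIR filtering followed by standard causal attention.
import torch
import torch.nn as nn


def iir2(x, a1, a2, b0):
    # Order-two IIR filter along the time axis of one chunk:
    #   y[t] = b0 * x[t] - a1 * y[t-1] - a2 * y[t-2]
    # x: (batch, time, dim); a1, a2, b0: (batch, 1, dim)
    y1 = torch.zeros_like(x[:, 0])
    y2 = torch.zeros_like(x[:, 0])
    out = []
    for t in range(x.shape[1]):
        y = b0[:, 0] * x[:, t] - a1[:, 0] * y1 - a2[:, 0] * y2
        out.append(y)
        y2, y1 = y1, y
    return torch.stack(out, dim=1)


class AdaptiveIIRAttention(nn.Module):
    """Hypothetical layer: adaptive IIR pre-filtering + multi-head attention."""

    def __init__(self, dim, chunk_len, num_heads=4):
        super().__init__()
        self.chunk_len = chunk_len
        # Hypernetwork: previous-chunk summary -> 3 coefficients per channel.
        self.coeff_net = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 3 * dim)
        )
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, seq_len, dim); seq_len assumed divisible by chunk_len.
        b, n, d = x.shape
        chunks = x.view(b, n // self.chunk_len, self.chunk_len, d)
        filtered = []
        prev_summary = torch.zeros(b, d, device=x.device, dtype=x.dtype)
        for i in range(chunks.shape[1]):
            a1, a2, b0 = self.coeff_net(prev_summary).chunk(3, dim=-1)
            # Rough stability heuristic (not the paper's parameterization):
            # loosely bound the feedback coefficients.
            a1, a2 = 2 * torch.tanh(a1), torch.tanh(a2)
            filtered.append(
                iir2(chunks[:, i], a1.unsqueeze(1), a2.unsqueeze(1), b0.unsqueeze(1))
            )
            # Causality: the summary is built only from chunks already seen.
            prev_summary = chunks[:, i].mean(dim=1)
        y = torch.cat(filtered, dim=1)
        mask = torch.triu(torch.ones(n, n, device=x.device, dtype=torch.bool), 1)
        out, _ = self.attn(y, y, y, attn_mask=mask)
        return out


# Usage example
layer = AdaptiveIIRAttention(dim=64, chunk_len=16)
print(layer(torch.randn(2, 128, 64)).shape)  # torch.Size([2, 128, 64])
```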
- Arij Al Adel. 2022. Global memory transformer for processing long documents. In Advances in Neural Computation, Machine Learning, and Cognitive Research VI: Selected Papers from the XXIV International Conference on Neuroinformatics, October 17-21, 2022, Moscow, Russia, pages 343–352. Springer.
- Nuha A. S. Alwan and Zahir M. Hussain. 2022. Deep learning for robust adaptive inverse control of nonlinear dynamic systems: Improved settling time with an autoencoder. Sensors, 22(16).
- Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pages 523–531.
- Scaling transformer to 1M tokens and beyond with RMT. arXiv preprint arXiv:2304.11062.
- Mikhail S Burtsev. Memory transformer with hierarchical attention for long document processing.
- Principled weight initialization for hypernetworks. In Int. Conf. on Learning Representations.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- Rethinking attention with performers. arXiv preprint arXiv:2009.14794.
- Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
- Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052.
- Decision s4: Efficient sequence-based rl via state spaces layers. In The Eleventh International Conference on Learning Representations.
- Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11.
- A mathematical framework for transformer circuits. Transformer Circuits Thread.
- A practical survey on faster and lighter transformers. ACM Computing Surveys.
- Simple hardware-efficient long convolutions for sequence modeling. arXiv preprint arXiv:2302.06646.
- Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy. PMLR.
- It’s raw! audio generation with state-space models. In International Conference on Machine Learning, pages 7616–7633. PMLR.
- On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983.
- Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.
- Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585.
- Ankit Gupta and Jonathan Berant. 2020. GMAT: Global memory augmentation for transformers. arXiv preprint arXiv:2006.03274.
- Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994.
- Simplifying and understanding state space models with diagonal linear rnns. arXiv preprint arXiv:2212.00768.
- Hypernetworks. arXiv preprint arXiv:1609.09106.
- Thomas Haubner and Walter Kellermann. 2022. Deep learning-based joint control of acoustic echo cancellation, beamforming and postfiltering.
- Hyperprompt: Prompt-based task-conditioning of transformers. In International Conference on Machine Learning, pages 8678–8690. PMLR.
- Block-recurrent transformers. arXiv preprint arXiv:2203.07852.
- Efficient movie scene detection using state-space transformers. arXiv preprint arXiv:2212.14427.
- Efficient long-text understanding with short-text models. Transactions of the Association for Computational Linguistics, 11:284–299.
- Rudolph Emil Kalman. 1960. A new approach to linear filtering and prediction problems.
- Differentiable IIR filters for machine learning applications. In Proc. Int. Conf. Digital Audio Effects (eDAFx-20), pages 297–303.
- What makes convolutional models great on long sequence modeling? arXiv preprint arXiv:2210.09298.
- Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam.
- Shahar Shlomo Lutati and Lior Wolf. 2023. Ocd: Learning to overfit with conditional diffusion models. In ICML.
- Mega: moving average equipped gated attention. arXiv preprint arXiv:2209.10655.
- Long range language modeling via gated state spaces. arXiv preprint arXiv:2206.13947.
- Resurrecting recurrent neural networks for long sequences. arXiv preprint arXiv:2303.06349.
- Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866.
- PyTorch documentation. AdaptiveMaxPool2d.
- Language models are unsupervised multitask learners.
- KalmanNet: Neural network aided kalman filtering for partially known dynamics. IEEE Transactions on Signal Processing, 70:1532–1547.
- Ckconv: Continuous kernel convolution for sequential data. arXiv preprint arXiv:2102.02611.
- Diagonal state space augmented transformers for speech recognition. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
- Hypersound: Generating implicit neural representations of audio signals with hypernetworks. arXiv preprint arXiv:2211.01839.
- Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006.
- Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28.
- State spaces aren’t enough: Machine translation needs attention. ArXiv, abs/2304.12776.
- Attention is all you need. Advances in neural information processing systems, 30.
- Continual learning with hypernetworks. arXiv preprint arXiv:1906.00695.
- Selective structured state-spaces for long-form video understanding. arXiv preprint arXiv:2303.14526.
- Pretraining without attention. arXiv preprint arXiv:2212.10544.
- Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
- Lightweight and efficient end-to-end speech recognition using low-rank transformer. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6144–6148. IEEE.
- Simple local attentions remain competitive for long-context tasks. arXiv preprint arXiv:2112.07210.
- Megabyte: Predicting million-byte sequences with multiscale transformers. arXiv preprint arXiv:2305.07185.
- Graph hypernetworks for neural architecture search. In 7th International Conference on Learning Representations, ICLR 2019.
- Hao Zhang and DeLiang Wang. 2021. Deep anc: A deep learning approach to active noise control. Neural Networks, 141:1–10.
- Effectively modeling time series with simple discrete state spaces. arXiv preprint arXiv:2303.09489.
- AdaNN: Adaptive neural network-based equalizer via online semi-supervised learning. Journal of Lightwave Technology, 38(16):4315–4324.
- Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pages 27268–27286. PMLR.
- Efficient long sequence modeling via state space augmented transformer. arXiv preprint arXiv:2212.08136.