Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems (1512.08756v5)

Published 29 Dec 2015 in cs.LG and cs.NE

Abstract: We propose a simplified model of attention which is applicable to feed-forward neural networks and demonstrate that the resulting model can solve the synthetic "addition" and "multiplication" long-term memory problems for sequence lengths which are both longer and more widely varying than the best published results for these tasks.

Citations (292)

Summary

  • The paper introduces a feed-forward model with a simplified attention mechanism to address long-term memory issues in tasks like addition and multiplication.
  • It achieves enhanced efficiency and near-perfect accuracy on sequences up to 10,000 time steps compared to traditional RNNs.
  • The results imply that attention-augmented feed-forward networks can be applied to practical tasks such as document classification where sequence order is less critical.

Overview of Feed-Forward Networks with Attention in Solving Long-Term Memory Problems

The paper demonstrates how feed-forward neural networks, augmented with a simplified attention mechanism, can effectively solve certain long-term memory problems, focusing on the synthetic "addition" and "multiplication" tasks. Models traditionally used for sequential data, such as Recurrent Neural Networks (RNNs), struggle with very long sequences because of computational inefficiency and vanishing/exploding gradients during Backpropagation Through Time (BPTT). This research introduces a non-recurrent approach that uses a form of attention to let feed-forward networks capture long-term dependencies across widely varying sequence lengths.
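
Concretely, the feed-forward attention summarized here scores each time step with a small learnable function, normalizes the scores with a softmax over time, and forms a context vector as the weighted average of the states (notation follows the paper's formulation; a(·) denotes the learnable scoring function):

```latex
e_t = a(h_t), \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}, \qquad
c = \sum_{t=1}^{T} \alpha_t h_t
```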

Key Concepts and Models

  • Attention Mechanism: The paper builds on the established concept of attention in neural networks, which gives a model direct access to different parts of a sequence. This is realized through a "context" vector computed as a weighted sum of the sequence's states.
  • Feed-Forward Attention: The proposed model simplifies the traditional attention mechanism by producing a single vector that summarizes the input sequence. It does so by computing a learnable, adaptive weighted average of the sequence states in a feed-forward fashion, which allows the computation to be fully parallelized (a minimal code sketch follows this list).
  • Long-Term Memory Tasks: The model was evaluated on synthetic tasks designed to measure long-term memory capabilities, in particular those introduced by Hochreiter. These tasks test a model's ability to handle dependencies that span arbitrarily long sequences.
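
As a concrete illustration, the following minimal NumPy sketch implements this style of feed-forward attention pooling. The tanh scoring function, the parameter shapes, and the function name are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def feed_forward_attention(h, w, b):
    """Collapse a sequence of state vectors h (shape [T, D]) into a single
    context vector via a learned, softmax-normalized weighting over time."""
    # Unnormalized attention score for each time step: e_t = a(h_t).
    e = np.tanh(h @ w + b)              # shape [T]
    # Softmax over time steps (numerically stabilized).
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                # shape [T]
    # Context vector: adaptive weighted average of the states.
    c = alpha @ h                       # shape [D]
    return c, alpha

# Toy usage: a random "sequence" of 10,000 time steps with 2 features.
rng = np.random.default_rng(0)
h = rng.normal(size=(10_000, 2))
w = rng.normal(size=2)
c, alpha = feed_forward_attention(h, w, b=0.0)
print(c.shape, alpha.shape)  # (2,) (10000,)
```

Because each score depends only on its own time step, all T scores can be computed in parallel, which is the source of the efficiency gains reported below.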

Experimental Setup and Results

The experiments were conducted on the addition and multiplication tasks with sequences of up to 10,000 time steps, lengths beyond the reach of many conventional methods (a task-generation sketch follows the results below). By leveraging a feed-forward architecture enhanced with attention, the paper reports:

  1. Enhanced efficiency and reduced computation time compared to traditional RNN approaches.
  2. A clear improvement for the attention-based model over an unweighted-averaging baseline across all tested sequence lengths.
  3. Successful handling of sequences that vary widely in length, with near-perfect accuracy in some cases, which the unweighted baseline could not achieve.
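
For concreteness, the sketch below generates data for one common formulation of the synthetic addition problem; the value range, marker encoding, and target definition vary across papers, so treat these details as assumptions rather than the paper's exact setup (the multiplication task replaces the sum with a product):

```python
import numpy as np

def addition_task(batch_size, T, rng=None):
    """Generate one batch of a synthetic 'addition' problem.

    Each sequence has T steps and 2 input channels:
      channel 0: a random value in [0, 1]
      channel 1: a marker that is 1 at exactly two time steps, else 0
    The regression target is the sum of the two marked values.
    """
    if rng is None:
        rng = np.random.default_rng()
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))
    markers = np.zeros((batch_size, T))
    for i in range(batch_size):
        # Mark two distinct positions whose values must be remembered.
        a, b = rng.choice(T, size=2, replace=False)
        markers[i, a] = markers[i, b] = 1.0
    x = np.stack([values, markers], axis=-1)              # [batch, T, 2]
    y = (values * markers).sum(axis=1, keepdims=True)     # [batch, 1]
    return x, y

# Sequences of 10,000 time steps, as in the longest setting reported above.
x, y = addition_task(batch_size=4, T=10_000)
print(x.shape, y.shape)  # (4, 10000, 2) (4, 1)
```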

Implications and Future Work

The findings suggest significant potential for attention-augmented feed-forward networks in real-world problems where sequence order matters less than coping with long, variable-length inputs. Document classification, where word order may be less crucial, is one cited example (see the sketch below).
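
As a hypothetical illustration of that use case, the sketch below attention-pools a bag of word embeddings into one document vector and applies a linear classifier; every name, shape, and the tanh scorer here is an assumption for illustration, not something specified in the paper:

```python
import numpy as np

def attention_pool_classify(embeddings, w_att, b_att, W_out, b_out):
    """Hypothetical document classifier: attention-pool word embeddings
    into a single document vector, then apply a linear output layer."""
    e = np.tanh(embeddings @ w_att + b_att)        # [num_words]
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    doc_vec = alpha @ embeddings                   # [embed_dim]
    return doc_vec @ W_out + b_out                 # [num_classes] logits

# Toy usage with random embeddings (all shapes are illustrative).
rng = np.random.default_rng(1)
emb = rng.normal(size=(500, 64))                   # 500 "words", 64-dim
logits = attention_pool_classify(emb, rng.normal(size=64), 0.0,
                                 rng.normal(size=(64, 3)), np.zeros(3))
print(logits.shape)  # (3,)
```

Word order enters only through the learned weighting, which matches the observation that such models fit tasks where order matters less than covering a long input.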

The research indicates that attention mechanisms let a model refer directly to specific points in a sequence, supporting their usefulness for managing sequences of varying and potentially very long lengths.

Future developments could extend the application of this type of model to other domains requiring efficient processing of sequential data without sacrificing the ability to model long-term dependencies. Further exploration may involve fine-tuning and adapting these models for different types of data beyond synthetic tasks, potentially offering enhancements in performance and computational efficiency in practical applications.
