
Memory Augmented Language Models through Mixture of Word Experts (2311.10768v1)

Published 15 Nov 2023 in cs.CL

Abstract: Scaling up the number of parameters of LLMs has proven to be an effective approach to improve performance. For dense models, increasing model size proportionally increases the model's computation footprint. In this work, we seek to aggressively decouple learning capacity and FLOPs through Mixture-of-Experts (MoE) style models with large knowledge-rich vocabulary based routing functions and experts. Our proposed approach, dubbed Mixture of Word Experts (MoWE), can be seen as a memory augmented model, where a large set of word-specific experts play the role of a sparse memory. We demonstrate that MoWE performs significantly better than the T5 family of models with similar number of FLOPs in a variety of NLP tasks. Additionally, MoWE outperforms regular MoE models on knowledge intensive tasks and has similar performance to more complex memory augmented approaches that often require to invoke custom mechanisms to search the sparse memory.

Citations (6)

Summary

  • The paper introduces MoWE which decouples learning capacity from compute overhead by employing sparse, word-specific experts.
  • It demonstrates significant improvements over T5 and previous MoE models on knowledge-intensive benchmarks like TriviaQA.
  • The scalable architecture uses hierarchical routing and frozen word experts to improve training efficiency and generalization.

Memory Augmented LLMs through Mixture of Word Experts

This paper introduces Mixture of Word Experts (MoWE), a framework for improving LLM performance by decoupling learning capacity from computational overhead. Unlike dense models, where adding parameters translates directly into higher computational cost, MoWE draws on the Mixture-of-Experts (MoE) paradigm to offer a more efficient alternative. By treating a large set of word-specific experts as a sparse memory, MoWE provides an effective way to navigate the trade-offs in LLM scaling, particularly on knowledge-intensive tasks.
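To make the core mechanism concrete, the following is a minimal sketch of a MoWE-style layer in PyTorch, not the authors' implementation: each token is mapped to an ID in a large auxiliary routing vocabulary, that ID deterministically selects one word-specific expert FFN, and only that expert runs for the token, so per-token FLOPs stay roughly constant while total parameters grow with the number of experts. Class and variable names here are illustrative assumptions.

```python
# Minimal sketch of a MoWE-style sparse layer (illustrative, not the paper's code).
# Assumption: routing is a fixed lookup from a large auxiliary-vocabulary ID to an expert ID.
import torch
import torch.nn as nn


class WordExpertLayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, aux_vocab_size: int):
        super().__init__()
        # One small feed-forward "word expert" per slot; together they act as a sparse memory.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        # Fixed routing table: auxiliary-vocabulary ID -> expert ID (no learned gating network).
        self.register_buffer("route", torch.randint(0, num_experts, (aux_vocab_size,)))

    def forward(self, hidden: torch.Tensor, aux_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); aux_ids: (batch, seq) IDs in the auxiliary vocabulary.
        expert_ids = self.route[aux_ids]              # which expert handles each token
        out = torch.zeros_like(hidden)
        for e in expert_ids.unique().tolist():        # only the selected experts are executed
            mask = expert_ids == e
            out[mask] = self.experts[e](hidden[mask])
        return out


# Tiny usage example: per-token compute matches a single small FFN regardless of num_experts.
layer = WordExpertLayer(d_model=64, d_ff=128, num_experts=16, aux_vocab_size=1000)
hidden = torch.randn(2, 8, 64)
aux_ids = torch.randint(0, 1000, (2, 8))
print(layer(hidden, aux_ids).shape)  # torch.Size([2, 8, 64])
```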

Key Contributions and Findings

This paper presents several noteworthy contributions:

  1. Efficient Sparse Modeling: The MoWE architecture incorporates a sparse memory through a large set of word-specific experts, in contrast to the dense feed-forward networks of standard Transformers. This yields significant performance improvements over models with similar FLOPs, such as T5, especially on tasks demanding extensive world knowledge.
  2. Improved Knowledge Retrieval: Empirical evaluations reveal that MoWE significantly surpasses not only T5 models but also previous MoE models in knowledge-intensive benchmarks like TriviaQA. MoWE's effective handling of questions requiring detailed retrieval of world information underscores its potential utility in such applications.
  3. Scalable Architecture: The architecture supports an extremely large set of experts (e.g., 32K experts in MoWE-Base) compared to typical MoE models. This scalability is achieved through hierarchical routing and dynamic expert assignment, which promote specialization and improve retrieval accuracy (a two-level routing sketch follows this list).
  4. Routing Efficiency: MoWE utilizes a large auxiliary vocabulary for routing, linking input tokens to specific experts, thereby facilitating specialized knowledge representations. This strategy contributes to a more targeted, efficient retrieval process that is also computationally viable.
  5. Training Efficiency: Freezing the word experts during fine-tuning guards against overfitting and preserves the knowledge acquired during pretraining, aiding generalization on unseen data (see the freezing helper in the sketch after this list).
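Items 3 and 5 can be made concrete with a short sketch under stated assumptions: a two-level routing function that first selects a bucket of experts and then an expert within it (keeping routing tables and parameter sharding manageable with tens of thousands of experts), and a helper that freezes expert parameters for fine-tuning. The helper names and the modulo routing scheme are illustrative, not the authors' implementation.

```python
# Illustrative sketch (hypothetical helper names): two-level routing over expert buckets,
# and freezing expert parameters so fine-tuning only updates the dense backbone.
import torch
import torch.nn as nn


def hierarchical_route(aux_ids: torch.Tensor, num_buckets: int, experts_per_bucket: int) -> torch.Tensor:
    """Map auxiliary-vocabulary IDs to a global expert ID in two steps: pick a bucket of
    experts, then an expert within that bucket. A simple modulo scheme stands in for the
    paper's actual routing function."""
    bucket = aux_ids % num_buckets
    within = (aux_ids // num_buckets) % experts_per_bucket
    return bucket * experts_per_bucket + within


def freeze_experts(model: nn.Module) -> None:
    """Freeze every parameter stored under a module named 'experts' (an assumption about
    the model's attribute naming), so gradients only flow through the dense backbone."""
    for name, param in model.named_parameters():
        if "experts" in name.split("."):
            param.requires_grad = False


# Example: 32 buckets x 1024 experts per bucket = 32K experts in total.
aux_ids = torch.randint(0, 1_000_000, (4, 16))
expert_ids = hierarchical_route(aux_ids, num_buckets=32, experts_per_bucket=1024)
assert expert_ids.min().item() >= 0 and expert_ids.max().item() < 32 * 1024
```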

Implications and Future Directions

MoWE's design innovatively bridges the gap between memory-augmented models and traditional MoE approaches, offering substantial improvements on tasks demanding factual memory and retrieval. The seamless integration of experts as memory components opens several pathways for future developments:

  • Enhanced Memory Augmentation: Future work could explore the dynamic construction and adaptation of the routing vocabulary to better capture the nuances of domain-specific tasks. Leveraging advances in adaptive routing or meta-learning could further enhance model performance across diversified contexts.
  • Cross-Domain Applicability: Expanding MoWE's applicability beyond traditional language tasks by embedding domain-specific experts can facilitate its extension into multimodal applications where language is a pivotal component.
  • Optimization of Lexical Routing: Refining lexical-driven routing strategies and optimizing the balance between expert number and performance remains a largely unexplored area. Investigating alternate expert configurations and optimization strategies could yield even more efficient architectures.

In summary, the development of MoWE represents a notable advance in LLM design, particularly in its approach to fetching and utilizing world knowledge efficiently. Its implications for both practical applications and theoretical understanding of LLMs make it a significant contribution to the field of Natural Language Processing.
