GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression (2407.12077v1)

Published 16 Jul 2024 in cs.CL and cs.AI

Abstract: We introduce GoldFinch, a hybrid Linear Attention/Transformer sequence model that uses a new technique to efficiently generate a highly compressed and reusable KV-Cache in linear time and space with respect to sequence length. GoldFinch stacks our new GOLD transformer on top of an enhanced version of the Finch (RWKV-6) architecture. We train up to 1.5B parameter class models of the Finch, Llama, and GoldFinch architectures, and find dramatically improved modeling performance relative to both Finch and Llama. Our cache size savings increase linearly with model layer count, ranging from 756-2550 times smaller than the traditional transformer cache for common sizes, enabling inference of extremely large context lengths even on limited hardware. Although autoregressive generation has O(n) time complexity per token because of attention, pre-fill computation of the entire initial cache state for a submitted context costs only O(1) time per token due to the use of a recurrent neural network (RNN) to generate this cache. We release our trained weights and training code under the Apache 2.0 license for community use.

Authors (5)
  1. Daniel Goldstein (15 papers)
  2. Fares Obeid (4 papers)
  3. Eric Alcaide (8 papers)
  4. Guangyu Song (6 papers)
  5. Eugene Cheah (5 papers)
Citations (7)

Summary

An Overview of GoldFinch: A High-Performance RWKV/Transformer Hybrid

Introduction

The paper "GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression" presents a novel hybrid model architecture that combines features of Linear Attention, traditional Transformers, and the Finch (RWKV-6) architectures. The significant contribution lies in efficiently generating and managing a highly compressed Key-Value Cache (KV-Cache) which, when coupled with innovative architectural elements, yields improved performance and memory efficiency during inference.

Core Innovations

The GoldFinch architecture introduces the following key innovations:

  1. Finch-C2 Layers: An enhanced variant of the Finch architecture employed for the initial layers to handle time-mixing efficiently.
  2. TokenCat: A mechanism for compressing the key cache to a minimal size, enabling significant reductions in memory usage.
  3. GOLD Layers: A novel transformer layer consuming the compressed keys to produce outputs without a traditional value cache.

The architecture is divided into two primary components:

  • The Finch-C2 layers, which are responsible for linear recurrent processing.
  • The GOLD layers, which handle the final stages of attention-based processing using the compact key cache (a minimal layout sketch follows this list).
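To make this split concrete, the sketch below lays out a GoldFinch-style stack in PyTorch. The `FinchC2Block`, `GOLDBlock`, and `GoldFinchSketch` classes, along with all dimensions, are illustrative stand-ins (a per-token projection and standard multi-head attention), not the paper's actual layer implementations; only the overall layout, roughly the first two-thirds recurrent and the final third attending over a shared key source, follows the description above.

```python
# Hypothetical sketch of the GoldFinch layer layout (not the authors' code).
import torch
import torch.nn as nn


class FinchC2Block(nn.Module):
    """Stand-in for a recurrent Finch-C2 (RWKV-6 style) time-mixing layer."""

    def __init__(self, d_model):
        super().__init__()
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, x):
        # Real Finch-C2 layers run a linear-time recurrence over the sequence;
        # this stand-in just applies a per-token projection with a residual.
        return x + torch.tanh(self.mix(x))


class GOLDBlock(nn.Module):
    """Stand-in for a GOLD attention layer that consumes a shared key source."""

    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, key_source):
        # GOLD layers attend over keys derived from the compressed cache and
        # avoid storing a separate value cache; here we use standard attention
        # against the shared key source purely for illustration.
        out, _ = self.attn(x, key_source, key_source)
        return x + out


class GoldFinchSketch(nn.Module):
    def __init__(self, d_model=512, n_layers=12):
        super().__init__()
        n_finch = (2 * n_layers) // 3  # first ~2/3: recurrent layers
        self.finch_layers = nn.ModuleList(
            [FinchC2Block(d_model) for _ in range(n_finch)]
        )
        self.gold_layers = nn.ModuleList(
            [GOLDBlock(d_model) for _ in range(n_layers - n_finch)]
        )
        self.key_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        for layer in self.finch_layers:
            x = layer(x)
        # The recurrent stack's output doubles as the (compressible) key source
        # shared by every GOLD layer above it.
        key_source = self.key_proj(x)
        for layer in self.gold_layers:
            x = layer(x, key_source)
        return x


model = GoldFinchSketch()
hidden = model(torch.randn(2, 16, 512))  # (batch, seq_len, d_model)
print(hidden.shape)                      # torch.Size([2, 16, 512])
```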

Methodology

The key features and methods of GoldFinch are as follows:

  • KV-Cache Compression: A new token-concatenation technique called "TokenCat" dramatically reduces the KV-Cache size. Full keys are rebuilt through a two-step decompression that combines a low-rank approximation with concatenation of the original token embeddings (see the sketch after this list).
  • Layer-Specific Design: The Finch-C2 layers in the first two-thirds of the model provide efficient, linear-time time-mixing. In the final third, the GOLD layers perform attention without storing a value cache, substantially reducing memory usage.
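The following sketch illustrates the general idea of a TokenCat-style cache under stated assumptions: only a small low-rank state per token is cached, and full keys are rebuilt by expanding that state and concatenating it with the original token embeddings. The dimensions and the `compress`/`expand` projections are illustrative, not the paper's exact formulation.

```python
# Illustrative TokenCat-style cache sketch (assumptions, not the paper's code).
import torch
import torch.nn as nn

d_model, d_compressed = 512, 64

compress = nn.Linear(d_model, d_compressed)  # produces the tiny cached state
expand = nn.Linear(d_compressed, d_model)    # step 1 of decompression


def build_cache(hidden):
    """Cache only a compressed per-token state (d_compressed << d_model)."""
    return compress(hidden)


def decompress_keys(cache, token_embeddings):
    """Two-step decompression: low-rank expansion, then concat with embeddings."""
    expanded = expand(cache)                                 # (seq, d_model)
    return torch.cat([expanded, token_embeddings], dim=-1)   # (seq, 2 * d_model)


embeddings = torch.randn(16, d_model)  # original token embeddings (kept anyway)
hidden = torch.randn(16, d_model)      # stand-in for the Finch-C2 stack output
cache = build_cache(hidden)            # 64 values/token cached, not K+V per layer
keys = decompress_keys(cache, embeddings)
print(cache.shape, keys.shape)         # torch.Size([16, 64]) torch.Size([16, 1024])
```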

Experimental Evaluation

GoldFinch models with up to 1.45 billion parameters, trained on 1.5 trillion tokens, were evaluated against Finch and Llama baselines. The results demonstrate:

  • Memory Efficiency: The GoldFinch KV-Cache is 756-2550 times smaller than a traditional transformer cache for common model sizes, enabling inference over very long contexts on limited hardware (see the back-of-the-envelope sketch after this list).
  • Pre-fill Time Complexity: Because an RNN generates the initial cache state, pre-fill costs only O(1) time per token, in contrast to the O(n) per-token cost of conventional attention-based pre-fill.
  • Downstream Benchmark Performance: GoldFinch models showed improved performance across various benchmarks, achieving lower perplexity and better associative recall while retaining the benefits of full attention in the upper layers.
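As a back-of-the-envelope illustration of why the savings ratio grows with layer count, the sketch below compares a standard per-layer key/value cache with a single shared compressed key cache. All numbers (`d_model`, `d_compressed`, layer count, fp16 storage) are assumed for illustration and are not the paper's reported configuration.

```python
# Rough cache-size comparison under assumed dimensions (not the paper's numbers).
def kv_cache_bytes(seq_len, n_layers, d_model, bytes_per_val=2):
    # A traditional transformer caches keys AND values at every layer.
    return seq_len * n_layers * 2 * d_model * bytes_per_val


def goldfinch_cache_bytes(seq_len, d_compressed, bytes_per_val=2):
    # A GoldFinch-style model caches one small compressed key per token,
    # shared by all GOLD layers.
    return seq_len * d_compressed * bytes_per_val


seq_len, n_layers, d_model, d_compressed = 32_768, 24, 2048, 128  # assumed sizes
traditional = kv_cache_bytes(seq_len, n_layers, d_model)
compressed = goldfinch_cache_bytes(seq_len, d_compressed)
print(f"traditional: {traditional / 2**20:.0f} MiB")   # ~6144 MiB
print(f"goldfinch:   {compressed / 2**20:.1f} MiB")    # ~8.0 MiB
print(f"ratio:       {traditional / compressed:.0f}x smaller")
```

Because the traditional cache scales with `n_layers * 2 * d_model` per token while the compressed cache is a fixed `d_compressed` per token regardless of depth, the savings ratio grows linearly with layer count, which is consistent with the range reported above.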

Theoretical and Practical Implications

The implications of this research are broad:

  • Theoretical Advances: The hybrid architecture advances the understanding of how RNN-style recurrence can be combined with transformer attention, presenting a framework that balances efficiency with strong modeling performance.
  • Practical Deployments: By significantly reducing the memory footprint of the KV-Cache and leveraging efficient time-mixing techniques, GoldFinch opens avenues for deploying large contextual models on consumer-grade hardware. This makes it an attractive model for applications requiring extended context processing, such as document analysis and long-form content generation.

Future Directions

Future research could focus on expanding the scope of GoldFinch:

  • Larger Models and Extended Training: Training models with more parameters and tokens on diverse datasets to examine the scalability and efficacy of the approach.
  • Integration with Other Architectures: Adapting the GoldFinch methodology to other linear attention and SSM architectures may yield further improvements.
  • Advanced Compression Techniques: Exploring quantization and other advanced compression strategies for further memory reduction.

Conclusion

"GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression" introduces a potent hybrid model that melds RNN efficiency with transformer attention capabilities, presenting a versatile framework for high-performance LLMing. Its design principles and empirical successes mark a significant step toward resource-efficient long-context LLMs, promising broad applications in various fields of AI and natural language processing.