SentenceVAE: Enable Next-sentence Prediction for Large Language Models with Faster Speed, Higher Accuracy and Longer Context

Published 1 Aug 2024 in cs.AI and cs.CL | (2408.00655v5)

Abstract: Current LLMs primarily utilize next-token prediction method for inference, which significantly impedes their processing speed. In this paper, we introduce a novel inference methodology termed next-sentence prediction, aiming at enhancing the inference efficiency of LLMs. We present Sentence Variational Autoencoder (SentenceVAE), which includes a Sentence Encoder to compress multiple tokens in a sentence into a single token, and a Sentence Decoder to reconstruct it. By integrating SentenceVAE into the input and output layers of LLMs, we develop Sentence-level LLMs (SLLMs) that employ a sentence-by-sentence inference method. In addition, the SentenceVAE module of SLLMs can maintain the integrity of the original semantic content by segmenting the context into sentences, thereby improving accuracy while boosting inference speed. Moreover, compared to previous LLMs, SLLMs process fewer tokens over equivalent context length, significantly reducing memory demands for self-attention computation and facilitating the handling of longer context. Extensive experiments on Wanjuan dataset have revealed that the proposed method can accelerate inference speed by 204~365%, reduce perplexity (PPL) to 46~75% of its original metric, and decrease memory overhead by 86~91% for the equivalent context length, compared to previous token-by-token methods.

Summary

The paper introduces SentenceVAE, which compresses entire sentences into single tokens to drastically reduce computational load.
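It uses an encoder-decoder framework that, in the reported experiments, accelerates inference by 204% to 365% and cuts self-attention memory overhead by 86% to 91% at equivalent context length.
Segmenting the context into sentences preserves semantic integrity, and perplexity drops to 46% to 75% of the token-by-token baseline, pointing toward more resource-efficient large language models.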

Overview of SentenceVAE: Enhancing LLM Inference Efficiency

The paper introduces SentenceVAE, a variational autoencoder (VAE) designed to improve the inference efficiency of large language models (LLMs) by replacing the usual next-token prediction paradigm with next-sentence prediction. The motivation is that current LLMs generate output one token at a time, and this sequential generation is costly in both computation and time. By placing SentenceVAE at the input and output layers of an LLM, the authors obtain a Sentence-level LLM (SLLM) that markedly increases inference speed while maintaining or improving model accuracy.

Key Contributions

Sentence Compression and Reconstruction:
- SentenceVAE comprises an encoder that compresses the tokens of a sentence into a single token and a decoder that reconstructs the original sentence from that token. Because each sentence occupies one position instead of many, the LLM processes fewer tokens for the same context length, which reduces resource usage.
Acceleration of Inference:
- The integration of SentenceVAE into an LLM results in SLLMs capable of performing sentence-by-sentence inference, which can accelerate inference speeds by 204% to 365%. This method mitigates the traditional computational burden associated with token-by-token prediction, as demonstrated in the experimental results.
Enhanced Resource Efficiency:
- The approach reduces the memory overhead of self-attention computation by 86% to 91% for equivalent context lengths, which in turn makes longer contexts easier to handle during inference.
Maintenance of Semantic Integrity:
- By segmenting text at the sentence level, SentenceVAE helps preserve the semantic integrity of the input. The authors report that perplexity (PPL) falls to 46% to 75% of the value obtained with token-by-token inference, which indicates improved accuracy.
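
To make the sequence-length argument concrete, the following sketch compares how many positions a model would attend over at token level and at sentence level. It is only an illustration under stated assumptions: whitespace-separated words stand in for subword tokens, the regex sentence splitter is a naive placeholder, and the function names (`split_sentences`, `sequence_lengths`, `attention_score_entries`) are hypothetical. Counting entries of a full attention score matrix is a simplification and is not meant to reproduce the paper's measured 86% to 91% memory figures.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sequence_lengths(text):
    """Return (token_count, sentence_count).

    Assumption: whitespace-separated words stand in for subword tokens.
    """
    sentences = split_sentences(text)
    tokens = sum(len(s.split()) for s in sentences)
    return tokens, len(sentences)


def attention_score_entries(seq_len):
    """Entries in one full seq_len x seq_len attention score matrix."""
    return seq_len * seq_len


text = "The cat sat on the mat. It was warm there. Then the dog arrived!"
n_tokens, n_sentences = sequence_lengths(text)

# Token-level model: one position per token.
token_level = attention_score_entries(n_tokens)
# Sentence-level model: one position per sentence.
sentence_level = attention_score_entries(n_sentences)

print(n_tokens, n_sentences, token_level, sentence_level)
```

Because the score matrix grows quadratically with sequence length, shrinking the sequence by the average sentence length shrinks this term by roughly the square of that factor. Real memory use also depends on components that scale linearly with length, so measured savings will differ from this back-of-the-envelope count.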

Method and Experimental Findings

The authors describe SentenceVAE's architecture in detail. It uses self-attention mechanisms to embed each sentence: the Sentence Encoder first compresses a sentence into a compact representation, the LLM then predicts the next sentence-level representation from the preceding ones, and the Sentence Decoder reconstructs tokens from that prediction. Experiments with LLMs of several sizes (125M, 350M, and 1.3B parameters) support the hypothesis that SLLMs not only speed up inference but also achieve better PPL than their token-based counterparts.
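
As a rough illustration of how attention can collapse many token vectors into a single sentence vector, here is a minimal single-query attention pooling sketch in plain Python. This is an assumption-laden toy, not the authors' encoder: there are no learned projections, the token vectors serve as both keys and values, and the names `attention_pool` and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_pool(token_vectors, query):
    """Collapse a list of d-dim token vectors into one d-dim vector.

    Assumption: token vectors act as both keys and values, and `query`
    stands in for a learned pooling query.
    """
    d = len(query)
    # Scaled dot-product score of the query against each token.
    scores = [dot(query, v) / math.sqrt(d) for v in token_vectors]
    weights = softmax(scores)
    # Weighted sum of token vectors gives the pooled sentence vector.
    pooled = [
        sum(w * v[i] for w, v in zip(weights, token_vectors))
        for i in range(d)
    ]
    return pooled, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # zero query: every token gets the same score
pooled, weights = attention_pool(tokens, query)
print(pooled, weights)
```

With the zero query used here, all scores are equal, so the weights are uniform and the pooled vector is simply the mean of the token vectors. A trained query would instead weight the tokens that matter most for reconstructing the sentence.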

Implications and Future Directions

The implications of SentenceVAE are substantial in both theoretical and practical contexts. Theoretically, this approach paves the way for efficiently scaling LLMs while maintaining high performance levels. Practically, the reduced computational demand and improved processing speeds may facilitate deployment in environments with limited computational resources, such as edge devices.

Looking forward, scaling the SLLM framework to larger model architectures is an open research direction, particularly for improving the comprehension and processing capabilities of LLMs across diverse domains. Integrating architectural optimizations such as Rotary Position Embedding (RoPE) could further improve the robustness and applicability of SentenceVAE in multilingual or multimodal contexts.

The paper's findings suggest that future LLM development could benefit from sentence-level processing, enabling faster and more resource-efficient deployments that retain strong performance on language understanding and generation tasks.
