
MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores

Published 23 Apr 2025 in cs.CL and cs.LG | arXiv:2504.16786v1

Abstract: Recent advances in LLMs have significantly improved their ability to process long-context input, but practical applications are challenged by increased inference time and resource consumption, particularly in resource-constrained environments. To address these challenges, we propose MOOSComp, a token-classification-based long-context compression method that enhances the performance of a BERT-based compressor by mitigating the over-smoothing problem and incorporating outlier scores. In the training phase, we add an inter-class cosine similarity loss term to penalize excessively similar token representations, thereby improving the token classification accuracy. During the compression phase, we introduce outlier scores to preserve rare but critical tokens that are prone to be discarded in task-agnostic compression. These scores are integrated with the classifier's output, making the compressor more generalizable to various tasks. Superior performance is achieved at various compression ratios on long-context understanding and reasoning benchmarks. Moreover, our method obtains a speedup of 3.3x at a 4x compression ratio on a resource-constrained mobile device.

Summary

Overview of MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores

Long-context processing in large language models (LLMs) has advanced considerably, yet it remains costly in resource-constrained environments due to increased inference time and resource consumption. The paper "MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores" introduces a method for enhancing a BERT-based compressor by addressing two issues: over-smoothing in token representations, and the tendency of task-agnostic compression to discard rare but critical (outlier) tokens.

Methodology

The MOOSComp system introduces an anti-over-smoothing mechanism and an outlier scoring procedure to improve token classification accuracy and generalization. During the training phase, the authors incorporate an inter-class cosine similarity loss term, which penalizes token representations that become excessively similar across classes. This loss term is simple and efficient, designed to maintain inter-class separation without introducing additional inference-time overhead. Moreover, this strategy directly targets the over-smoothing phenomenon observed in BERT-based models, wherein token representations often converge to nearly identical values, thereby hindering classification tasks.
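To make the training objective concrete, below is a minimal PyTorch sketch of one way such a penalty could be implemented. The class-mean formulation, the binary keep/discard labels, and the weighting hyperparameter `lam` are illustrative assumptions, not the paper's exact loss; the key idea is that the penalty depends only on training-time representations and adds nothing at inference.

```python
import torch
import torch.nn.functional as F

def inter_class_cosine_loss(hidden, labels):
    """Penalize cosine similarity between the mean representations of the
    two token classes (preserve vs. discard). High similarity indicates
    over-smoothed, hard-to-separate token features.

    hidden: (batch, seq_len, dim) token representations from the encoder
    labels: (batch, seq_len) binary labels, 1 = preserve, 0 = discard
    """
    keep_mask = labels.bool()
    # Mean representation of each class, pooled across the batch
    keep_mean = hidden[keep_mask].mean(dim=0)    # (dim,)
    drop_mean = hidden[~keep_mask].mean(dim=0)   # (dim,)
    # Cosine similarity in [-1, 1]; larger means more over-smoothing
    return F.cosine_similarity(keep_mean, drop_mean, dim=0)

def total_loss(logits, hidden, labels, lam=0.1):
    # Standard token-classification cross-entropy plus the inter-class
    # similarity penalty, weighted by a hyperparameter lam (assumed here)
    ce = F.cross_entropy(logits.view(-1, 2), labels.view(-1))
    return ce + lam * inter_class_cosine_loss(hidden, labels)
```

Because the penalty term only touches the training loss, the compressor's forward pass and latency at compression time are unchanged, which is consistent with the paper's claim of no additional inference-time overhead.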

In the compression phase, the method employs an outlier detection mechanism that calculates outlier scores for each token based on Z-score metrics. The compressor then integrates these scores with the classifier's output, enabling it to preserve tokens that may be rare yet crucial for specific tasks. This approach ensures that task-critical tokens with high outlier scores are more likely to be retained, thus improving the model's adaptability across different tasks and datasets.
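A hedged sketch of how such a blend might operate is shown below. The norm-based Z-score, the min-max normalization, the mixing weight `alpha`, and the top-k selection are all illustrative assumptions rather than the paper's precise procedure; they simply demonstrate combining a classifier's keep-probability with an outlier signal before selecting tokens.

```python
import torch

def compress(tokens, keep_probs, hidden, ratio=4, alpha=0.5):
    """Rank tokens by a mix of classifier keep-probability and a
    Z-score-based outlier score, then retain the top 1/ratio fraction.

    tokens:     list of input tokens (length L)
    keep_probs: (L,) classifier probability that each token is preserved
    hidden:     (L, dim) token representations used for outlier scoring
    """
    # Z-score of each token's representation norm relative to the sequence:
    # tokens whose features deviate strongly from the mean count as outliers
    norms = hidden.norm(dim=-1)
    z = (norms - norms.mean()) / (norms.std() + 1e-6)
    outlier = z.abs()
    # Min-max normalize to [0, 1] so both signals share a comparable scale
    outlier = (outlier - outlier.min()) / (outlier.max() - outlier.min() + 1e-6)
    # Blend classifier output with the outlier score (alpha is assumed)
    score = alpha * keep_probs + (1 - alpha) * outlier
    k = max(1, len(tokens) // ratio)
    keep_idx = score.topk(k).indices.sort().values  # preserve original order
    return [tokens[i] for i in keep_idx]
```

The blend means a token the classifier would drop can still survive if its representation is statistically anomalous, which is how rare, task-critical tokens are protected under task-agnostic compression.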

Results

Experimental evaluations on MeetingBank and out-of-domain benchmarks such as LongBench, GSM8K, and BBH demonstrate the efficacy of MOOSComp. The method outperformed existing task-agnostic hard prompt compression methods, including LLMLingua-2, on multiple metrics, achieving superior prompt compression effectiveness and generalizability. Notably, MOOSComp delivered clear performance improvements on long-context understanding and reasoning tasks with different target LLMs, such as GPT-3.5-Turbo and Qwen2.5, under varying token constraints.

Quantitative results highlighted the compression performance across datasets: improved BLEU and ROUGE scores on MeetingBank, and higher exact-match accuracy on reasoning tasks such as GSM8K and BBH. MOOSComp also delivered substantial inference speedups, particularly impactful in resource-constrained environments like edge devices, reaching up to 3.3x on a smartphone and 3.2x on a GPU.

Implications and Future Work

The paper advances the practical deployment of LLMs by addressing crucial efficiency challenges through token compression. By alleviating over-smoothing and enhancing the retention of task-critical tokens, MOOSComp positions itself as a reliable tool for scalable applications in constrained computing environments.

For future research, further exploration could extend MOOSComp to more diverse types of compressors and architectures beyond BERT-based models. Adaptive mechanisms for dynamically balancing classifier outputs against outlier scores could further refine task performance. Moreover, examining MOOSComp in streaming contexts and online learning environments, where prompt compression must be executed in real time, could provide valuable insights and improvements.

The robustness and versatility of MOOSComp indicate promising extensions and refinements that could support the next generation of efficient LLM deployments, particularly as the demand for handling larger context windows and real-time processing continues to grow.
