Length-Induced Embedding Collapse in PLM-based Models (2410.24200v2)
Abstract: Text embeddings from PLM-based models enable a wide range of applications, yet their performance often degrades on longer texts. In this paper, we introduce a phenomenon we call Length Collapse, where embeddings of longer texts tend to cluster together. This clustering results in a distributional inconsistency between the embeddings of short and long texts. We further investigate how these differences contribute to the performance decline observed with longer texts across various downstream tasks. Through a rigorous theoretical analysis of the self-attention mechanism, which acts as a low-pass filter in PLM-based models, we demonstrate that as text length increases, the strength of low-pass filtering intensifies, causing embeddings to retain more low-frequency components. As a result, input token features become more similar, leading to clustering and ultimately the collapse of embeddings for longer texts. To address this issue, we propose a simple method, TempScale, which mitigates the Length Collapse phenomenon. By narrowing the gap in low-pass filtering rates between long and short texts, TempScale ensures more consistent embeddings across different text lengths. This approach leads to performance improvements of 0.94% on MTEB and 1.10% on LongEmbed, which focuses specifically on long-context retrieval, providing strong evidence for the validity of our analysis. The source code is available at https://github.com/Yuqi-Zhou/Length_Collapse.
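The abstract frames self-attention as a low-pass filter whose strength grows with sequence length, and TempScale as a temperature adjustment that counteracts this. As a rough illustration of the underlying mechanism (not the authors' implementation), the sketch below adds a temperature term to standard scaled dot-product attention and measures the mean entropy of the attention rows as a proxy for filtering strength: a lower temperature sharpens the softmax, so each token averages over fewer neighbors, which is the direction in which token features stay more distinct. The `temperature` parameter and the entropy proxy are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, temperature=1.0):
    """Scaled dot-product attention with an extra temperature term.

    temperature < 1 sharpens the attention distribution; temperature > 1
    flattens it toward uniform averaging (stronger low-pass filtering).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / (temperature * np.sqrt(d))
    return softmax(scores) @ V

def mean_attention_entropy(X, temperature):
    # Entropy of each token's attention row, averaged over tokens.
    # Higher entropy = closer to uniform averaging over the sequence.
    d = X.shape[-1]
    A = softmax(X @ X.T / (temperature * np.sqrt(d)))
    return float(-(A * np.log(A + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))  # 8 tokens, 16-dim features

out = attention(X, X, X, temperature=0.5)
# A sharper softmax at temperature 0.5 yields lower row entropy
# than the default temperature 1.0.
sharp = mean_attention_entropy(X, temperature=0.5)
default = mean_attention_entropy(X, temperature=1.0)
```

The intuition matches the abstract's claim: attention rows that are closer to uniform act as stronger averaging (low-pass) operators, and longer sequences push rows toward uniformity, so a length-aware temperature can equalize the effective filtering rate between short and long inputs.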