Nugget: Neural Agglomerative Embeddings of Text (2310.01732v1)

Published 3 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Embedding text sequences is a widespread requirement in modern language understanding. Existing approaches focus largely on constant-size representations. This is problematic, as the amount of information contained in text often varies with the length of the input. We propose a solution called Nugget, which encodes language into a representation based on a dynamically selected subset of input tokens. These nuggets are learned through tasks like autoencoding and machine translation, and intuitively segment language into meaningful units. We demonstrate that Nugget outperforms related approaches in tasks involving semantic comparison. Finally, we illustrate that these compact units allow for expanding the contextual window of a language model (LM), suggesting new future LMs that can condition on significantly larger amounts of content.
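
The selection mechanism described in the abstract lends itself to a short illustration. Below is a minimal, hypothetical PyTorch sketch of the core idea: contextual token encodings are scored, and a subset whose size scales with the input length is kept as the text representation. All names here (NuggetSelector, ratio, scorer) are illustrative assumptions rather than the paper's actual implementation, and the sketch omits the machinery the paper relies on to train such a selector end to end (e.g., gradient estimation through the hard top-k and the autoencoding/translation objectives).

```python
import torch
import torch.nn as nn

class NuggetSelector(nn.Module):
    """Hypothetical sketch: keep a dynamically sized subset of contextual
    token encodings ("nuggets") as the embedding of a text sequence."""

    def __init__(self, vocab_size: int, d_model: int = 256, ratio: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.scorer = nn.Linear(d_model, 1)  # per-token selection score
        self.ratio = ratio                   # assumed nugget-to-token ratio

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> contextual encodings (batch, seq_len, d_model)
        hidden = self.encoder(self.embed(token_ids))
        scores = self.scorer(hidden).squeeze(-1)         # (batch, seq_len)
        k = max(1, int(self.ratio * token_ids.size(1)))  # grows with input length
        # keep the k highest-scoring tokens, restored to their original order
        keep = scores.topk(k, dim=-1).indices.sort(dim=-1).values
        keep = keep.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
        return hidden.gather(1, keep)                    # (batch, k, d_model)

# A 50-token input yields 5 nuggets; a longer input yields proportionally more.
model = NuggetSelector(vocab_size=30000)
print(model(torch.randint(0, 30000, (2, 50))).shape)  # torch.Size([2, 5, 256])
```

The contrast with fixed-size sentence embeddings is that the output's first dimension, k, is a function of the input length rather than a constant, which is what lets the representation's capacity track the amount of information in the text.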
