Nugget: Neural Agglomerative Embeddings of Text (2310.01732v1)

Published 3 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Embedding text sequences is a widespread requirement in modern language understanding. Existing approaches focus largely on constant-size representations. This is problematic, as the amount of information contained in text often varies with the length of the input. We propose a solution called Nugget, which encodes language into a representation based on a dynamically selected subset of input tokens. These nuggets are learned through tasks like autoencoding and machine translation, and intuitively segment language into meaningful units. We demonstrate that Nugget outperforms related approaches in tasks involving semantic comparison. Finally, we illustrate that these compact units allow for expanding the contextual window of a language model (LM), suggesting new future LMs that can condition on significantly larger amounts of content.
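
The selection mechanism described in the abstract lends itself to a short illustration. Below is a minimal, hypothetical PyTorch sketch of the core idea: contextual token encodings are scored, and a subset whose size scales with the input length is kept as the text representation. All names here (NuggetSelector, ratio, scorer) are illustrative assumptions rather than the paper's actual implementation, and the sketch omits the machinery the paper relies on to train such a selector end to end (e.g., gradient estimation through the hard top-k and the autoencoding/translation objectives).

```python
import torch
import torch.nn as nn

class NuggetSelector(nn.Module):
    """Hypothetical sketch: keep a dynamically sized subset of contextual
    token encodings ("nuggets") as the embedding of a text sequence."""

    def __init__(self, vocab_size: int, d_model: int = 256, ratio: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.scorer = nn.Linear(d_model, 1)  # per-token selection score
        self.ratio = ratio                   # assumed nugget-to-token ratio

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> contextual encodings (batch, seq_len, d_model)
        hidden = self.encoder(self.embed(token_ids))
        scores = self.scorer(hidden).squeeze(-1)         # (batch, seq_len)
        k = max(1, int(self.ratio * token_ids.size(1)))  # grows with input length
        # keep the k highest-scoring tokens, restored to their original order
        keep = scores.topk(k, dim=-1).indices.sort(dim=-1).values
        keep = keep.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
        return hidden.gather(1, keep)                    # (batch, k, d_model)

# A 50-token input yields 5 nuggets; a longer input yields proportionally more.
model = NuggetSelector(vocab_size=30000)
print(model(torch.randint(0, 30000, (2, 50))).shape)  # torch.Size([2, 5, 256])
```

The contrast with fixed-size sentence embeddings is that the output's first dimension, k, is a function of the input length rather than a constant, which is what lets the representation's capacity track the amount of information in the text.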
