Test-Time Training on Nearest Neighbors for Large Language Models (2305.18466v3)

Published 29 May 2023 in cs.CL and cs.LG

Abstract: Many recent efforts augment LLMs with retrieval by adding retrieved data to the input context. For this approach to succeed, the retrieved data must be added at both training and test time. Moreover, as input length grows linearly with the size of retrieved data, cost in computation and memory grows quadratically for modern Transformers. To avoid these complications, we simply fine-tune the model on retrieved data at test time, using its standard training setup. We build a large-scale distributed index based on text embeddings of the Pile dataset. For each test input, our system retrieves its neighbors and fine-tunes the model on their text. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. For example, test-time training with nearest neighbors significantly narrows the performance gap between a small GPT-2 and a GPT-Neo model more than 10 times larger. Sufficient index quality and size, however, are necessary. Our work establishes a first baseline of test-time training for language modeling.
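Below is a minimal sketch of the test-time training loop the abstract describes. It assumes a pre-built FAISS index over text embeddings of the retrieval corpus, a user-supplied embed() function mapping text into the same embedding space, and GPT-2 as a stand-in model; the optimizer and hyperparameters are illustrative placeholders, not the paper's exact training setup.

```python
# Hypothetical sketch of test-time training on nearest neighbors; the index,
# embed() function, and hyperparameters are assumptions for illustration,
# not the authors' released implementation.
import copy

import faiss                     # nearest-neighbor search over the embedding index
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def test_time_train(model, tokenizer, index, corpus_texts, embed, test_text,
                    k=20, lr=2e-5):
    """Return a copy of `model` fine-tuned for one gradient step on each of the
    k nearest neighbors of `test_text` retrieved from `corpus_texts`."""
    adapted = copy.deepcopy(model)                   # leave the base model untouched
    adapted.train()
    # The paper reuses the model's standard training setup; SGD here is a placeholder.
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)

    query = np.asarray([embed(test_text)], dtype="float32")
    _, neighbor_ids = index.search(query, k)         # indices of the k nearest texts

    for idx in neighbor_ids[0]:
        batch = tokenizer(corpus_texts[idx], return_tensors="pt",
                          truncation=True, max_length=1024)
        loss = adapted(**batch, labels=batch["input_ids"]).loss   # causal LM loss
        loss.backward()
        optimizer.step()                             # exactly one gradient iteration
        optimizer.zero_grad()

    adapted.eval()
    return adapted


# Usage sketch: adapt GPT-2 to one test input, then score that input as usual.
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# adapted = test_time_train(model, tokenizer, faiss_index, pile_texts, embed, test_text)
```

The adapted copy is discarded after scoring its test input, so each test example is evaluated with its own freshly fine-tuned model while the base model stays fixed.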

