Towards Better Few-Shot and Finetuning Performance with Forgetful Causal Language Models (2210.13432v2)

Published 24 Oct 2022 in cs.CL

Abstract: Large language models (LLMs) trained using the next-token-prediction objective, such as GPT-3 and PaLM, have revolutionized natural language processing in recent years by showing impressive zero-shot and few-shot capabilities across a wide range of tasks. In this work, we propose a simple technique that significantly boosts the performance of LLMs without adding computational cost. Our key observation is that, by performing the next token prediction task with randomly selected past tokens masked out, we can improve the quality of the learned representations for downstream language understanding tasks. We hypothesize that randomly masking past tokens prevents over-attending to recent tokens and encourages attention to tokens in the distant past. We find that our method, Forgetful Causal Masking (FCM), significantly improves both few-shot and finetuning performance of PaLM. We further consider a simple extension, T-FCM, which introduces bidirectional context to causal language models without altering the sequence order, and further improves finetuning performance.
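
The technique described in the abstract amounts to a modified attention mask: standard causal masking, plus random "forgetting" of past tokens during next-token prediction. Below is a minimal NumPy sketch of how such a mask could be built; the function name, the fixed `mask_ratio` argument, and the choice to always keep the diagonal are illustrative assumptions, not the paper's exact PaLM implementation (which may, for instance, sample the ratio per sequence or exempt special tokens).

```python
import numpy as np

def fcm_attention_mask(seq_len: int, mask_ratio: float, rng=None):
    """Sketch of a Forgetful-Causal-Masking-style attention mask.

    Returns a (seq_len, seq_len) boolean array where entry (i, j) is True
    if query position i may attend to key position j. Past tokens are
    randomly "forgotten" (hidden from all queries) with probability
    mask_ratio; the diagonal is kept so every token can attend to itself.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Standard causal mask: position i attends only to positions j <= i.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

    # Randomly drop a subset of key positions for the whole sequence.
    forgotten = rng.random(seq_len) < mask_ratio

    mask = causal & ~forgotten[None, :]
    np.fill_diagonal(mask, True)  # never mask a token from itself
    return mask

# Example: 8-token sequence with a 15% forget probability.
print(fcm_attention_mask(8, 0.15, rng=np.random.default_rng(0)).astype(int))
```

In training, a mask like this would replace the plain causal mask inside self-attention while the next-token-prediction loss stays unchanged; at inference time the ordinary causal mask would presumably be used, which is consistent with the abstract's claim of no added computational cost.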

Authors (7)
  1. Hao Liu (497 papers)
  2. Xinyang Geng (21 papers)
  3. Lisa Lee (25 papers)
  4. Igor Mordatch (66 papers)
  5. Sergey Levine (531 papers)
  6. Sharan Narang (31 papers)
  7. Pieter Abbeel (372 papers)
Citations (2)