Emergent Abilities of LLMs
Introduction
LLMs have seen remarkable progress in recent years. A fascinating phenomenon observed in these models, particularly at larger scales, is the appearance of unexpected capabilities, known as emergent abilities. These abilities do not manifest in smaller models but begin to appear in larger ones, following performance trends that defy simple extrapolation from smaller-scale results. Emergence in this context is defined as a qualitative change in behavior arising from a quantitative increase in a system, in this case the size of the LLM as measured by its number of parameters and the computational resources expended during training.
Emergent Abilities Defined
Emergence in LLMs is evident when there is a significant leap in model performance that exceeds the predictable gains seen with smaller models. A distinct attribute of emergent abilities is their phase-transition-like nature. At smaller scales, the model's performance on a task hovers near random chance, as if the model lacked the ability entirely. Then, past a certain scale threshold, performance increases sharply. This behavior is akin to phase transitions in physics, where a quantitative change in a system reveals qualitatively new properties that were not foreseeable. Notably, for most dense Transformer LLMs the two measures of scale move together, since training compute is usually scaled in proportion to the number of parameters.
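To make the shape of such a curve concrete, here is a minimal, purely illustrative sketch in Python. It assumes a logistic transition in log-parameter space; the chance level, ceiling, threshold, and sharpness values are hypothetical and not fitted to any real model.

```python
import numpy as np

# Illustrative toy model (not from the source): a task score that sits at
# chance level until a scale threshold, then rises sharply, mimicking the
# phase-transition-like curves reported for emergent abilities.
CHANCE_ACCURACY = 0.25        # e.g., 4-way multiple choice (hypothetical)
CEILING_ACCURACY = 0.90       # hypothetical post-emergence plateau
THRESHOLD_LOG10_PARAMS = 10.5 # hypothetical threshold near ~3e10 parameters
SHARPNESS = 6.0               # controls how abrupt the transition looks

def emergent_accuracy(n_params: float) -> float:
    """Logistic curve in log10(parameter count): flat, then a sharp jump."""
    x = np.log10(n_params)
    gate = 1.0 / (1.0 + np.exp(-SHARPNESS * (x - THRESHOLD_LOG10_PARAMS)))
    return CHANCE_ACCURACY + (CEILING_ACCURACY - CHANCE_ACCURACY) * gate

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} params -> accuracy {emergent_accuracy(n):.2f}")
```

Sampled at evenly spaced scales, the curve looks flat and near chance, then jumps: exactly the pattern that makes extrapolation from smaller models unreliable.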
Observations in Prompt-Based Tasks
The unpredictability of emergent abilities is particularly striking in prompting paradigms, where a pre-trained LLM produces responses from a textual prompt without any further training or weight updates. A prime example is the improvement on few-shot prompted tasks that LLMs like GPT-3 and PaLM exhibit only after reaching very large parameter counts and training compute. These improvements were recorded across a battery of such tasks, from arithmetic to transliteration, indicating a broad spectrum of emergent abilities.
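As a concrete illustration of this paradigm, the sketch below builds a few-shot arithmetic prompt as plain text. The `build_few_shot_prompt` helper and its exemplars are hypothetical; the resulting string would be sent to a completion API of your choice, with no gradient updates involved.

```python
# Minimal sketch of few-shot prompting: the model receives worked examples
# plus a new query in a single text prompt; nothing is fine-tuned.
def build_few_shot_prompt(exemplars: list[tuple[str, str]], query: str) -> str:
    """Concatenate (question, answer) exemplars, then append the unsolved query."""
    lines = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(
    exemplars=[
        ("What is 12 + 29?", "41"),
        ("What is 7 + 56?", "63"),
    ],
    query="What is 38 + 45?",
)
print(prompt)  # this text, unchanged, is the entire "interface" to the model
```

The emergent-abilities observation is that completing such prompts correctly stays near chance for small models and then improves abruptly past a scale threshold.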
Augmented Prompting Techniques
Besides the raw scaling of LLMs, researchers have also investigated various augmented prompting and fine-tuning strategies, which may themselves qualify as emergent abilities if they hurt performance or show no effect below a certain scale and become beneficial only above it. Examples include chain-of-thought prompting, which facilitates multi-step reasoning, and scratchpad techniques that assist with sequential computation. Techniques for model calibration have likewise been observed to be effective only at larger scales.
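To show what distinguishes chain-of-thought prompting from standard few-shot prompting, the sketch below contrasts one exemplar in each style. The word problem is a commonly used illustration, not taken from this article.

```python
# Contrast a standard few-shot exemplar with a chain-of-thought exemplar:
# the CoT version spells out intermediate reasoning before the final answer.
# Exemplar content is illustrative.
standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
    "How many balls does he have now?\n"
    "A: 11"
)

cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 more balls. "
    "5 + 6 = 11. The answer is 11."
)

print(cot_exemplar)
```

Below the relevant scale, prepending exemplars like `cot_exemplar` reportedly helps little or even hurts accuracy; above it, the same prompt can unlock multi-step reasoning, which is why the technique is discussed as emergent.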
Conclusion
The research frontier for LLMs includes identifying the limits of their emergent abilities, especially since these capabilities challenge our current understanding of model predictability. The premise is that with additional scaling, even more sophisticated abilities may emerge. However, emergence might also be achievable without simply increasing model scale, potentially through improved architectures, training methods, data quality, or tasks that target current model weaknesses. These findings heighten the need for the computational linguistics community to investigate the causes and dynamics of LLMs' emergent behaviors.