Understanding LLMs
Training Objectives and LLM Behavior
The widespread deployment of LLMs such as GPT-3.5 and GPT-4 makes it essential to understand their strengths and limitations. To truly grasp the capabilities of LLMs, one must consider the problem these models were trained to solve: predicting the next word in a sequence, with Internet text as the substrate. Recognizing this training goal, the essence of their autoregressive nature, together with the environment in which they operate, leads to insights about their performance.
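To make the training objective concrete, here is a minimal sketch of next-word prediction using bigram counts over a tiny toy corpus. Real LLMs are deep transformers trained on vastly more text, and the corpus and function names below are purely illustrative, but the objective is the same: predict the next token given the preceding ones.

```python
from collections import Counter, defaultdict

# Toy "training data" standing in for Internet text (illustrative only).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training": count which word follows which. A bigram table is far simpler
# than a transformer, but it is fit to the same next-word objective.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequent next word observed after `word` in training."""
    return counts[word].most_common(1)[0][0]

# Autoregressive generation: each prediction is fed back in as the new prefix.
word = "the"
generated = [word]
for _ in range(4):
    word = predict_next(word)
    generated.append(word)
print(" ".join(generated))  # -> "the cat sat on the"
```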
Factors Influencing LLM Performance
The research takes a "teleological" approach, prioritizing the goals and the environment that shaped LLMs. From this perspective, LLM accuracy is expected to be influenced by:
- Task probability: LLMs excel at tasks that appear frequently in their training data and struggle with rare variants.
- Output probability: Even on deterministic tasks, models tend toward higher accuracy when the correct output is a high-probability piece of text (one way to quantify this appears in the sketch after this list).
- Input probability: Accuracy is also affected by how probable the provided input is, though less strongly than by output probability.
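The notion of "output probability" can be operationalized as the probability a language model assigns to the target text. The sketch below scores candidate outputs with GPT-2 via Hugging Face transformers; using GPT-2 is an assumption made here for the sake of a small, open example (it is not one of the models evaluated in the work), but the scoring idea carries over.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text: str) -> float:
    """Total log-probability the model assigns to `text` (higher = more probable)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # loss = mean cross-entropy per predicted token
    return -out.loss.item() * (ids.shape[1] - 1)

# Two candidate outputs containing the same words: the natural ordering should
# receive a much higher score than the scrambled one.
print(sequence_logprob("I went to the store to buy some milk."))
print(sequence_logprob("Milk some buy to store the to went I."))
```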
Empirical Validation
Evaluations across eleven distinct tasks reveal three key influences:
- LLM accuracy tracks task frequency; common tasks are handled with greater success than their rare counterparts.
- Even on deterministic tasks whose correctness does not depend on it, the probability of the target output can dictate LLM performance.
- Input probability shapes LLM behavior as well, but its effect is weaker than that of output probability.
What stands out is an asymmetry: models are more affected by the likelihood of what they generate (outputs) than by the likelihood of the information they receive (inputs).
Beyond Probability: Other Characteristic Phenomena
- Lack of Embodiment: LLMs may fumble tasks that humans solve easily through physical interaction, e.g., applying a keyboard-based cipher (a sketch of such a cipher follows this list).
- Sensitivity to Wording: The exact phrasing, even for similar ideas, can elicit divergent LLM responses, revealing a heavy reliance on language patterns.
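To illustrate the embodiment point, below is a sketch of one plausible keyboard-based cipher: each letter is replaced by the key immediately to its right in its QWERTY row. The specific cipher variant used in the underlying evaluations is assumed here, not taken from the source; the point is that the mapping is trivial for a person looking at a keyboard yet rare in Internet text.

```python
# One plausible keyboard cipher (the variant studied may differ): shift each
# letter one key to the right within its QWERTY row.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def keyboard_shift(text: str, offset: int = 1) -> str:
    """Replace each letter with the key `offset` positions to its right in its row."""
    result = []
    for ch in text:
        lower = ch.lower()
        for row in QWERTY_ROWS:
            if lower in row:
                shifted = row[(row.index(lower) + offset) % len(row)]
                result.append(shifted.upper() if ch.isupper() else shifted)
                break
        else:
            result.append(ch)  # keep spaces, digits, and punctuation unchanged
    return "".join(result)

print(keyboard_shift("hello world"))             # -> "jraap eptaf"
print(keyboard_shift("jraap eptaf", offset=-1))  # -> "hello world"
```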
Implications for LLM Application
The work advises caution when employing LLMs for rare tasks (where probability biases are strongest) and in situations that require generating low-probability text. Advanced prompting strategies and larger scale can lift model performance, but the fundamental tendencies persist, stressing the need for an approach informed by how LLMs were trained.
Closing Thoughts
As LLMs continue to advance in capability, understanding their ingrained biases and operational nuances becomes more critical. This paper underscores the importance of aligning LLM evaluation with the training objective and data that shaped these models, so that their capabilities and limits can be mapped accurately.