Scaling behavior and practical relevance of inductive out-of-context reasoning (OOCR)

Establish the extent to which inductive out-of-context reasoning (OOCR) scales to learning more complex latent variables, and determine its practical relevance for current large language models.

Background

The paper introduces inductive out-of-context reasoning (OOCR), where an LLM is finetuned on documents that contain only indirect evidence about latent information and is then evaluated on out-of-distribution queries that depend on that latent information. Across five tasks (Locations, Coins, Functions, Mixture of Functions, and Parity Learning), the authors show evidence that GPT-3.5 and GPT-4 can perform OOCR, sometimes outperforming in-context learning baselines.
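To make the setup concrete, the following is a minimal sketch, not the authors' code, of how finetuning documents for a Functions-style task might be constructed: each document exposes only a single input-output pair of an unknown function, and evaluation then queries the latent function directly with no in-context examples. The function name `latent_f`, the rule `3x + 2`, and the prompt wording are illustrative assumptions.

```python
import json
import random

# Assumed latent rule for illustration: f(x) = 3x + 2.
# The model never sees this definition, only observations of it.
def latent_f(x: int) -> int:
    return 3 * x + 2

# Build finetuning documents: each contains one indirect observation
# of the latent function, never the rule itself.
random.seed(0)
train_docs = [
    {
        "messages": [
            {"role": "user", "content": f"What is f({x})?"},
            {"role": "assistant", "content": str(latent_f(x))},
        ]
    }
    for x in random.sample(range(-100, 100), 50)
]

with open("functions_finetune.jsonl", "w") as fh:
    for doc in train_docs:
        fh.write(json.dumps(doc) + "\n")

# Out-of-distribution evaluation queries: answering these requires
# aggregating the latent across documents, e.g. verbalizing or
# inverting f, with no in-context examples provided.
eval_queries = [
    "Write a Python definition of f.",
    "What is f(250)?",           # input outside the training range
    "If f(x) = 17, what is x?",  # inversion of the latent function
]
```

Under this setup, "performing OOCR" means the finetuned model answers the evaluation queries correctly even though no single training document reveals the rule.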

However, the authors also find that absolute OOCR performance is unreliable, especially for smaller models and more complex latent structures (e.g., mixtures of functions). This raises uncertainty about how well OOCR can extend to more complex latents and whether it is practically relevant at current model scales.

References

It is an open question how much inductive OOCR scales to learning more complex latents and how much practical relevance it has for current LLMs.

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data (arXiv:2406.14546, Treutlein et al., 20 Jun 2024), Introduction (Section 1)