Out-of-distribution generalisation is hard: evidence from ARC-like tasks (2505.09716v2)

Published 14 May 2025 in cs.LG and cs.AI

Abstract: Out-of-distribution (OOD) generalisation is considered a hallmark of human and animal intelligence. To achieve OOD through composition, a system must discover the environment-invariant properties of experienced input-output mappings and transfer them to novel inputs. This can be realised if an intelligent system can identify appropriate, task-invariant, and composable input features, as well as the composition methods, thus allowing it to act based not on the interpolation between learnt data points but on the task-invariant composition of those features. We propose that in order to confirm that an algorithm does indeed learn compositional structures from data, it is not enough to just test on an OOD setup, but one also needs to confirm that the features identified are indeed compositional. We showcase this by exploring two tasks with clearly defined OOD metrics that are not OOD solvable by three commonly used neural networks: a Multi-Layer Perceptron (MLP), a Convolutional Neural Network (CNN), and a Transformer. In addition, we develop two novel network architectures imbued with biases that allow them to be successful in OOD scenarios. We show that even with correct biases and almost perfect OOD performance, an algorithm can still fail to learn the correct features for compositional generalisation.

Summary

Out-of-Distribution Generalisation: Challenges and Insights from ARC-like Tasks

Out-of-distribution (OOD) generalisation remains a significant challenge for AI: it requires a system to go beyond interpolating between observed data points and instead exploit task-invariant compositional features. The paper "Out-of-distribution generalisation is hard: evidence from ARC-like tasks" presents empirical evidence of how difficult true compositional generalisation is in OOD scenarios, focusing on ARC-like tasks.

Core Concepts and Methodology

The paper starts from the premise that success on OOD tasks alone does not demonstrate genuine compositional learning. To support this thesis, the authors argue that measuring compositional generalisation must also assess whether the features a model identifies are genuinely compositional rather than artefacts of data-specific biases. The research scrutinises the OOD capabilities of commonly used neural network architectures (MLPs, CNNs, and Transformers) on purpose-built datasets designed to probe OOD behaviour.

Central to this research are two purpose-built world models based on the Abstraction and Reasoning Corpus (ARC): Translate and Rotate. Both are grid-based image transformation tasks in which pixel-level transformations must be applied to inputs for which no similar examples were seen during training. By evaluating a range of architectures, supplemented by two novel networks equipped with specific inductive biases, the paper uses these ARC-like tasks to examine whether the learned latent features are compositional.
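
As a concrete illustration, the sketch below shows how ARC-like Translate and Rotate input-output pairs could be generated. The grid size, object placement, and transformation parameters are assumptions for illustration, not the authors' exact data-generation procedure.

```python
# Illustrative ARC-like "Translate" and "Rotate" samples (hypothetical
# parameters, not the paper's exact generator).
import numpy as np

def make_grid(size=10, obj_size=3, rng=None):
    """Place a small random coloured object on an otherwise empty grid."""
    rng = rng or np.random.default_rng()
    grid = np.zeros((size, size), dtype=np.int64)
    r, c = rng.integers(0, size - obj_size, size=2)
    grid[r:r + obj_size, c:c + obj_size] = rng.integers(1, 10, size=(obj_size, obj_size))
    return grid

def translate_task(grid, shift=(0, 2)):
    """Input/output pair: the scene shifted by a fixed offset."""
    return grid, np.roll(grid, shift, axis=(0, 1))

def rotate_task(grid, k=1):
    """Input/output pair: the scene rotated by k * 90 degrees."""
    return grid, np.rot90(grid, k)

rng = np.random.default_rng(0)
x_tr, y_tr = translate_task(make_grid(rng=rng))
x_ro, y_ro = rotate_task(make_grid(rng=rng))
```

Training would then use one subset of shifts or rotations, while the OOD evaluation applies transformations (or transformation magnitudes) withheld from training.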

Results

The results reveal that conventional architectures fail to achieve meaningful compositional OOD generalisation. Among the algorithms tested, the bias-engineered architectures exhibited superior OOD performance under specific conditions, yet they did not consistently produce compositionally interpretable latent features across tasks.

Notably, the Axial Pointer Linear network with inductive biases showed promising results on the Translate task's Distance 2 test set, yet it struggled with the Rotate task, demonstrating how engineered biases can yield apparent OOD generalisation in a limited context. This reinforces the point that such biases can invite misleading conclusions about a network's capacity for systematic generalisation.
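
The "Distance 2" test set presumably measures how far held-out examples lie from the training transformations. A minimal sketch of one way such a distance-based split could be constructed, assuming the Manhattan magnitude of the shift as the distance measure (an assumption for illustration), is:

```python
# Hypothetical distance-based OOD split for a Translate-style task:
# train on small shifts, hold out larger ones.
def split_by_distance(samples, max_train_distance=1):
    """samples: iterable of (input_grid, output_grid, shift) triples."""
    train, ood = [], []
    for x, y, shift in samples:
        distance = abs(shift[0]) + abs(shift[1])  # assumed distance measure
        (train if distance <= max_train_distance else ood).append((x, y))
    return train, ood
```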

Analysis of Errors

The paper provides a nuanced analysis of the types of errors committed by the various architectures, shedding light on the absence of true compositional rule-learning. Transformers, though performing marginally better than MLPs and CNNs at recombining representations, exhibited errors indicative of incomplete compositional structure. Meanwhile, errors from the bespoke networks pointed to near-random pixel manipulation rather than coherent compositional integration, further emphasising the difficulty of OOD generalisation.

Implications and Future Directions

These findings suggest that prevalent benchmarks for compositional algorithms may not suffice for assessing true OOD generalisation. Consequently, further exploration is warranted, potentially involving a wider array of architectures and more diverse OOD task benchmarks. Future work should also develop mechanisms for latent-space visualisation, ensuring that algorithms are assessed not only on surface performance but also on whether their latent representations encode meaningful compositional structure.
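
One way such a latent-space check could be implemented is sketched below: project hidden activations with PCA and inspect whether a compositional factor (for example, translation distance) is laid out systematically. The encoder interface and the factor labels are hypothetical stand-ins for whichever network is being probed.

```python
# Sketch of a latent-space visualisation for a compositionality check
# (hypothetical inputs; any trained encoder could supply the activations).
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latents(latents, factor, path="latents.png"):
    """latents: (N, D) hidden activations; factor: (N,) compositional label."""
    coords = PCA(n_components=2).fit_transform(latents)
    plt.scatter(coords[:, 0], coords[:, 1], c=factor, cmap="viridis", s=10)
    plt.colorbar(label="compositional factor (e.g. shift distance)")
    plt.savefig(path)
```

If the learned representations were genuinely compositional, one would expect the factor to vary systematically in such a projection rather than appearing scattered.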

The paper articulates its limitations candidly, acknowledging the small number of algorithms tested and suggesting broader hyperparameter sweeps and model-size variations. Importantly, it calls for examining models that compose actions as a route towards language-based OOD algorithms, paving the way for AI that mirrors the nuanced generalisation capabilities of human cognition.

Through rigorous methodology and insightful results, this paper serves as a crucial reference point for researchers pursuing the development of AI systems capable of robust OOD generalisation, underscoring the intricate relationship between algorithmic biases, compositional representation, and task invariance.