
Measuring Compositional Generalization: A Comprehensive Method on Realistic Data (1912.09713v2)

Published 20 Dec 2019 in cs.LG, cs.CL, and stat.ML

Abstract: State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings.

Authors (14)
  1. Daniel Keysers (19 papers)
  2. Nathanael Schärli (8 papers)
  3. Nathan Scales (8 papers)
  4. Hylke Buisman (1 paper)
  5. Daniel Furrer (2 papers)
  6. Sergii Kashubin (4 papers)
  7. Nikola Momchev (12 papers)
  8. Danila Sinopalnikov (7 papers)
  9. Lukasz Stafiniak (2 papers)
  10. Tibor Tihon (2 papers)
  11. Dmitry Tsarkov (3 papers)
  12. Xiao Wang (507 papers)
  13. Marc van Zee (6 papers)
  14. Olivier Bousquet (33 papers)
Citations (331)

Summary

  • The paper introduces DBCA, a novel method for systematically constructing benchmarks that measure compositional generalization on realistic data.
  • It uses divergence-optimized train/test splits to reveal that models rely on memorization rather than true recombination of known components.
  • Empirical results on the CFQ dataset show that state-of-the-art NLP architectures struggle with compositional learning.

A Formal Overview of "Measuring Compositional Generalization: A Comprehensive Method on Realistic Data"

The paper "Measuring Compositional Generalization: A Comprehensive Method on Realistic Data" addresses the critical challenge of assessing the compositional generalization abilities of modern ML models, particularly within NLP tasks. Despite advancements in ML, state-of-the-art methods struggle to leverage compositional structures like those seen in human intelligence. This paper introduces a novel benchmark method, termed Distribution-Based Compositionality Assessment (DBCA), for rigorously evaluating compositional generalization by employing realistic and large-scale data.

Key Concepts

  • Compositional Generalization: This is the capacity of ML models to generalize learned knowledge of components to new compositions. The benchmark assesses how well a model performs on unseen combinations of learned components.
  • Compound and Atom Divergence: The framework maximizes the divergence between the distributions of compounds (compositions of components) in the training and test sets while keeping the divergence between the distributions of atoms (the primitive components themselves) small. This specifically challenges models on their ability to recombine known components; a minimal sketch of both divergence measures follows this list.
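
The paper operationalizes both notions as one minus a Chernoff-style similarity between weighted frequency distributions over atoms and over compounds. The sketch below is a minimal Python illustration of that idea; the helper names (chernoff_coefficient, normalize, divergences) and the Counter-based inputs are this summary's assumptions rather than the authors' code, though the exponents (0.5 for atoms, 0.1 for compounds) follow the paper.

```python
from collections import Counter

def chernoff_coefficient(p, q, alpha):
    """C_alpha(P || Q) = sum_k p_k^alpha * q_k^(1 - alpha), for two
    normalized distributions given as dicts (missing keys count as 0)."""
    keys = set(p) | set(q)
    return sum(p.get(k, 0.0) ** alpha * q.get(k, 0.0) ** (1.0 - alpha)
               for k in keys)

def normalize(counts):
    """Turn a Counter of raw (possibly weighted) counts into a distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def divergences(train_atoms, test_atoms, train_compounds, test_compounds):
    """Atom divergence uses alpha = 0.5 (symmetric overlap). Compound
    divergence uses alpha = 0.1, so a compound that occurs in training at
    all, even rarely, still counts largely as 'seen'."""
    atom_div = 1.0 - chernoff_coefficient(
        normalize(train_atoms), normalize(test_atoms), alpha=0.5)
    compound_div = 1.0 - chernoff_coefficient(
        normalize(train_compounds), normalize(test_compounds), alpha=0.1)
    return atom_div, compound_div
```

An ideal compositional split drives the compound divergence toward 1 while holding the atom divergence near 0: every test compound is novel even though all of its atoms appeared in training.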

The CFQ (Compositional Freebase Questions) dataset, a large and diverse set of natural language questions paired with SPARQL queries against the Freebase knowledge base, is constructed to test exactly these capabilities. It is designed to mirror realistic linguistic structure while retaining a compositional nature that current ML models should in principle be able to exploit but in practice fail to; a schematic example of the format follows.
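
To make the data format concrete, the snippet below shows a schematic, hypothetical CFQ-style entry. CFQ anonymizes entities to placeholders such as M0 and M1; the relation names here are illustrative of the Freebase schema, not verbatim dataset content.

```python
# Schematic, hypothetical CFQ-style example: a natural language question
# paired with a SPARQL query over Freebase. Entities are anonymized to
# placeholders (M0, M1); relation names are illustrative only.
example = {
    "question": "Did M0 direct and edit M1",
    "sparql": "SELECT count(*) WHERE { "
              "M0 ns:film.director.film M1 . "
              "M0 ns:film.editor.film M1 }",
}
```

In DBCA terms, the atoms here are the primitives ("direct", "edit", the entity placeholders) and the compound is their particular conjunction within one question.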

Methodology

The authors employ a systematic approach for dataset generation and splitting:

  1. Automatic Rule-Based Generation: The CFQ dataset is generated automatically from a set of deterministic rules, so the atoms and compounds occurring in each example can be tracked precisely. This yields a comprehensive and exact description of the data's compositional properties.
  2. Divergence-Optimized Splits: To measure compositional generalization, the paper applies the DBCA approach: train/test splits are constructed to keep atom divergence low while making compound divergence high, so that success requires recombining known components rather than memorizing specific input/output pairs. A greedy sketch of such a split construction follows this list.
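
The paper constructs these splits greedily. The sketch below is a strongly simplified illustration under this summary's assumptions: it takes a user-supplied compounds_of function and omits the paper's atom-divergence constraint and example-swapping steps, so it is not the authors' algorithm. The resulting split can be scored with the divergences() helper from the earlier sketch.

```python
import random
from collections import Counter

def greedy_dbca_split(examples, compounds_of, train_frac=0.8, seed=0):
    """Simplified greedy sketch of a DBCA-style split (not the authors'
    exact procedure). Each example is assigned to the side whose current
    compound distribution it overlaps more, which pushes the train and test
    compound distributions apart. A faithful implementation would also
    monitor atom divergence and swap examples to keep it near zero."""
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)

    n_train = int(train_frac * len(pool))
    train, test = [], []
    train_comps, test_comps = Counter(), Counter()

    for ex in pool:
        comps = compounds_of(ex)  # Counter of compounds in this example
        overlap_train = sum(min(v, train_comps[k]) for k, v in comps.items())
        overlap_test = sum(min(v, test_comps[k]) for k, v in comps.items())
        prefer_train = overlap_train >= overlap_test
        if prefer_train and len(train) < n_train:
            train.append(ex)
            train_comps.update(comps)
        elif len(test) < len(pool) - n_train:
            test.append(ex)
            test_comps.update(comps)
        else:  # preferred side is full; fall back to the remaining side
            train.append(ex)
            train_comps.update(comps)
    return train, test
```

Scoring the output with divergences() makes the trade-off visible; the paper generates a whole series of such splits covering a range of compound divergences at a fixed, low atom divergence.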

Results and Analytical Insights

The empirical analysis on CFQ demonstrates several key findings:

  • Strong Negative Correlation: Accuracy declines sharply as compound divergence increases, a surprisingly strong negative correlation indicating that current architectures rely heavily on memorizing instances rather than learning compositional rules.
  • Cross-Architecture Insights: Tested on three state-of-the-art ML architectures (LSTM+attention, Transformer, Universal Transformer), similar patterns emerge, revealing a systemic deficiency in compositional learning abilities across these models.
  • Experiment Comparisons: The paper also compares against traditional compositionality assessments such as input/output pattern splits, showing that the DBCA approach better exposes true compositional deficiencies in ML models.

Implications and Future Directions

This research offers significant implications both for the practical application of AI models and for theoretical advances in AI comprehension:

  • Dataset Impact: CFQ sets a new benchmark in evaluating compositionality, crucial for systems aiming to mimic human-like generalization.
  • Model Improvement Avenues: The findings suggest a necessity for model innovations, potentially adopting hybrid or novel architectures that surpass mere pattern recognition.
  • Broad Applicability: While focused on NLP, the methodological advances in DBCA have implications for other domains requiring systematic generalization, such as visual reasoning or robotic control tasks.

By dissecting the interaction between dataset structure and ML capabilities, this paper lays the groundwork for developing models that can achieve truly compositional understanding, a cornerstone for advancing toward general AI. Future research could delve into the architectural enhancements necessary to meet these benchmarks or explore DBCA in multimodal datasets.