
Language model acceptability judgements are not always robust to context (2212.08979v1)

Published 18 Dec 2022 in cs.CL and cs.LG

Abstract: Targeted syntactic evaluations of LLMs ask whether models show stable preferences for syntactically acceptable content over minimal-pair unacceptable inputs. Most targeted syntactic evaluation datasets ask models to make these judgements with just a single context-free sentence as input. This does not match LLMs' training regime, in which input sentences are always highly contextualized by the surrounding corpus. This mismatch raises an important question: how robust are models' syntactic judgements across different contexts? In this paper, we investigate the stability of LLMs' performance on targeted syntactic evaluations as we vary properties of the input context: the length of the context, the types of syntactic phenomena it contains, and whether or not there are violations of grammaticality. We find that model judgements are generally robust when placed in randomly sampled linguistic contexts, but are substantially unstable for contexts containing syntactic structures matching those in the critical test content. Across all tested models (GPT-2 and five variants of OPT), we can significantly improve models' judgements by providing contexts with matching syntactic structures, and conversely significantly worsen them using unacceptable contexts with matching but violated syntactic structures. This effect is amplified by the length of the context, except for unrelated inputs. We show that these changes in model performance are not explainable by simple features matching the context and the test inputs, such as lexical overlap and dependency overlap. This sensitivity to highly specific syntactic features of the context can only be explained by the models' implicit in-context learning abilities.
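To make the evaluation setup concrete, the sketch below (not the authors' code) shows the core of a targeted syntactic evaluation with context: score each sentence of a minimal pair under GPT-2, with and without a prepended context sentence, and check whether the acceptable member receives the higher log-probability. The minimal pair and the "matching-structure" context sentence are hypothetical examples chosen for illustration only.

```python
# Minimal sketch of a contextualized targeted syntactic evaluation with GPT-2.
# Assumes the Hugging Face `transformers` library; sentences are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str, context: str = "") -> float:
    """Sum of token log-probabilities of `sentence`, conditioned on `context`."""
    ctx_ids = tokenizer.encode(context) if context else []
    sent_ids = tokenizer.encode((" " if context else "") + sentence)
    input_ids = torch.tensor([ctx_ids + sent_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # Score only the test-sentence tokens; each token is predicted
    # from the logits at the previous position.
    total = 0.0
    for pos in range(len(ctx_ids), len(ctx_ids) + len(sent_ids)):
        if pos == 0:
            continue  # no prediction available for the very first token
        token_id = input_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()
    return total

# Hypothetical subject-verb agreement minimal pair and a matching-structure context.
acceptable = "The keys to the cabinet are on the table."
unacceptable = "The keys to the cabinet is on the table."
context = "The pictures of the garden were taken yesterday."

for label, ctx in [("no context", ""), ("matching context", context)]:
    good = sentence_logprob(acceptable, ctx)
    bad = sentence_logprob(unacceptable, ctx)
    print(f"{label}: prefers acceptable = {good > bad}")
```

The paper's experiments vary this context along several dimensions (length, syntactic phenomenon, grammaticality), whereas this sketch only contrasts a single context against the context-free baseline.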

Authors (7)
  1. Koustuv Sinha (31 papers)
  2. Jon Gauthier (11 papers)
  3. Aaron Mueller (35 papers)
  4. Kanishka Misra (20 papers)
  5. Keren Fuentes (2 papers)
  6. Roger Levy (43 papers)
  7. Adina Williams (72 papers)
Citations (15)
