Assess training-data contamination of large language models with The Outer Worlds content

Determine whether The Outer Worlds game data underlying the KNUDGE dialogues appear in the pretraining corpora of the large language models (e.g., T5 and GPT-3) used in the experiments, in order to assess potential training-data contamination and interpret the results appropriately.

Background

The study employs pretrained large language models such as T5 and GPT-3 to generate dialogue, but the authors note uncertainty about whether these models' pretraining data contain content from The Outer Worlds, which was released in 2019.
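For T5 the question is tractable in principle: its pretraining corpus, C4, is publicly released, so one can scan the corpus for verbatim overlap with the game's dialogue text. The sketch below illustrates one such scan; the file knudge_dialogues.json and its structure are hypothetical stand-ins for however the dialogue text is stored locally, and the 13-gram matching heuristic mirrors the contamination analysis reported in the GPT-3 paper.

```python
# Minimal sketch: stream the public C4 corpus (T5's pretraining data) and
# look for verbatim word 13-grams shared with the game dialogue text.
import itertools
import json

from datasets import load_dataset  # pip install datasets


def ngrams(text: str, n: int = 13) -> set[str]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


# Hypothetical local dump of the dialogue utterances (a list of strings).
with open("knudge_dialogues.json") as f:
    dialogue_texts = json.load(f)

query_ngrams = set().union(*(ngrams(t) for t in dialogue_texts))

# Stream C4 so the full corpus never has to fit in memory.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

hits = []
# A full pass over C4 is expensive; cap the scan for illustration.
for doc in itertools.islice(c4, 1_000_000):
    overlap = ngrams(doc["text"]) & query_ngrams
    if overlap:
        hits.append((doc["url"], sorted(overlap)[:3]))

print(f"{len(hits)} scanned C4 documents share a 13-gram with the dialogues")
```

Even a partial pass that surfaces matching documents (e.g., fan wikis or transcript sites, which are plausible contamination vectors) would settle the question for T5; for GPT-3, whose corpus is not public, a behavioral test is needed instead.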

This uncertainty complicates interpretation of the experimental results, since data contamination or memorization could inflate measured performance. The authors partially mitigate the risk by evaluating on designer-written, previously unseen quest specifications, yet they explicitly acknowledge that whether the game data appear in the models' pretraining corpora remains an open question.
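Because GPT-3's corpus cannot be inspected directly, a complementary behavioral check is a memorization probe: condition the model on the first half of a known game line and measure how much of the true continuation it reproduces verbatim. The sketch below uses gpt2 via Hugging Face transformers purely as a local stand-in (probing GPT-3 itself would go through the OpenAI API), and the NPC line shown is a hypothetical example.

```python
# Minimal sketch: prefix-completion memorization probe with a stand-in model.
from transformers import AutoModelForCausalLM, AutoTokenizer  # pip install transformers

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


def verbatim_overlap(line: str, prefix_frac: float = 0.5) -> float:
    """Fraction of held-out tokens the model reproduces exactly under greedy decoding."""
    ids = tokenizer(line, return_tensors="pt").input_ids[0]
    split = int(len(ids) * prefix_frac)
    prefix, target = ids[:split], ids[split:]
    out = model.generate(
        prefix.unsqueeze(0),
        max_new_tokens=len(target),
        do_sample=False,  # greedy: memorized text should surface deterministically
        pad_token_id=tokenizer.eos_token_id,
    )
    continuation = out[0][split:]  # generate() echoes the prompt; slice it off
    matches = sum(int(a == b) for a, b in zip(continuation, target))
    return matches / len(target)


# Hypothetical NPC line; a real probe would iterate over many game utterances.
score = verbatim_overlap(
    "You've got a lot of nerve coming back to Edgewater after what you did."
)
print(f"verbatim continuation overlap: {score:.2%}")
```

A meaningful version of this probe would compare the overlap distribution over many real game lines against a control set of lines the model cannot have seen, such as the designer-written held-out quest specifications; markedly higher overlap on the former would indicate memorization.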

References

"It is difficult to know whether the game data used for experimentation is part of the training data for such models, as The Outer Worlds came out in 2019."

Weir et al., 2022. "Ontologically Faithful Generation of Non-Player Character Dialogues" (arXiv:2212.10618), Limitations section.