The KITMUS Test: Evaluating Knowledge Integration from Multiple Sources in Natural Language Understanding Systems (2212.08192v2)

Published 15 Dec 2022 in cs.CL and cs.LG

Abstract: Many state-of-the-art natural language understanding (NLU) models are based on pretrained neural language models. These models often make inferences using information from multiple sources. An important class of such inferences are those that require both background knowledge, presumably contained in a model's pretrained parameters, and instance-specific information that is supplied at inference time. However, the integration and reasoning abilities of NLU models in the presence of multiple knowledge sources have been largely understudied. In this work, we propose a test suite of coreference resolution subtasks that require reasoning over multiple facts. These subtasks differ in terms of which knowledge sources contain the relevant facts. We also introduce subtasks where knowledge is present only at inference time using fictional knowledge. We evaluate state-of-the-art coreference resolution models on our dataset. Our results indicate that several models struggle to reason on-the-fly over knowledge observed both at pretrain time and at inference time. However, with task-specific training, a subset of models demonstrates the ability to integrate certain knowledge types from multiple sources. Still, even the best performing models seem to have difficulties with reliably integrating knowledge presented only at inference time.

Authors (6)
  1. Akshatha Arodi (3 papers)
  2. Martin Pömsl (3 papers)
  3. Kaheer Suleman (19 papers)
  4. Adam Trischler (50 papers)
  5. Alexandra Olteanu (28 papers)
  6. Jackie Chi Kit Cheung (57 papers)
Citations (5)
