ReALM: Reference Resolution As Language Modeling

Published Mar 29, 2024 in cs.CL, cs.AI, and cs.LG


Reference resolution is an important problem, one that is essential to understand and successfully handle context of different kinds. This context includes both previous turns and context that pertains to non-conversational entities, such as entities on the user's screen or those running in the background. While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underutilized. This paper demonstrates how LLMs can be used to create an extremely effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality. We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.

Performance improvements across experiments, from baseline finetuning to advanced onscreen grabs and element separation techniques.


  • The paper introduces ReALM, an approach that uses large language models (LLMs) to improve reference resolution across conversational, on-screen, and background entities, and reports large gains over an existing system with similar functionality.

  • ReALM transforms the challenge of reference resolution into a language modeling problem, enabling effective handling of various entity types by converting them into a text-based format for LLM processing.

  • The methodology encodes entities as text and formulates resolution as a multiple-choice problem. With this setup, ReALM outperforms the existing system, its smallest model performs comparably to GPT-4, and its larger models substantially outperform GPT-4.

  • Future directions include refining the encoding techniques to better capture spatial relationships and contextual details, aiming for more intuitive conversational agents.

ReALM: Elevating Reference Resolution with Language Modeling


Reference resolution is a key capability for conversational agents: it lets a system use context that extends beyond the immediate dialogue to include various kinds of non-conversational entities. The paper presents an approach called ReALM (Reference Resolution as Language Modeling), which uses large language models (LLMs) to resolve references to conversational, on-screen, and background entities. The method achieves large performance improvements over an existing system and, when benchmarked against GPT-3.5 and GPT-4, performs comparably to GPT-4 even with its smallest model.

Problem Statement and Motivation

Conversational agents are expected to interpret ambiguous references as seamlessly as a human listener would. This requires drawing context from prior dialogue turns, from the content displayed on a user's screen, and from entities running in the background. While LLMs have shown strong performance across many tasks, their application to reference resolution, particularly for entities that are not inherently textual, remains little explored. ReALM aims to close this gap by recasting reference resolution as a language modeling problem. This lets the system handle references to on-screen content by converting screen elements into a text-based format that an LLM can process.

Related Work

ReALM is distinct in combining conversational context with on-screen entity resolution in a single LLM-based system. Previous systems either specialized in one of these domains or relied on non-LLM methods that lacked the flexibility and scalability of LLMs. ReALM also moves beyond conventional pipeline approaches, using the adaptability of LLMs to cover a wider range of reference types without extensive manual intervention.


At the core of ReALM is the transformation of the reference resolution task into a multiple-choice problem that an LLM can solve. The process encodes each type of entity (conversational, on-screen, and background) as text that an LLM can interpret. In particular, on-screen entities, previously a difficult domain for text-based models, are converted into a linear text representation that preserves their relative spatial positions. The approach relies on datasets synthesized and annotated to cover the variety of reference types, which supports both model training and evaluation.
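The encoding and multiple-choice framing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `ScreenEntity` type, the `line_margin` threshold, the numbering scheme, and the prompt wording are all assumptions chosen for illustration. The sketch sorts entities top to bottom, groups entities with similar vertical positions onto one text line, orders each line left to right, and then asks which numbered entities a request refers to.

```python
from dataclasses import dataclass


@dataclass
class ScreenEntity:
    """Hypothetical on-screen entity: its text and bounding-box center."""
    text: str
    x: float  # horizontal center
    y: float  # vertical center


def encode_screen(entities, line_margin=10.0):
    """Linearize entities into text, top to bottom and left to right.

    Entities whose vertical centers lie within `line_margin` (an assumed
    threshold) of the first entity on the current line share that line.
    Returns the text and a mapping from option number to entity.
    """
    ordered = sorted(entities, key=lambda e: (e.y, e.x))
    lines = []
    for ent in ordered:
        if lines and abs(ent.y - lines[-1][0].y) <= line_margin:
            lines[-1].append(ent)
        else:
            lines.append([ent])

    rendered, index, next_id = [], {}, 1
    for line in lines:
        parts = []
        for ent in sorted(line, key=lambda e: e.x):
            parts.append(f"[{next_id}] {ent.text}")
            index[next_id] = ent
            next_id += 1
        rendered.append("\t".join(parts))  # same line: tab-separated
    return "\n".join(rendered), index


def build_prompt(query, entities, history=()):
    """Frame resolution as a multiple-choice question over numbered entities."""
    screen_text, index = encode_screen(entities)
    prompt = "\n".join([
        "Conversation so far:",
        *history,
        "Screen:",
        screen_text,
        f"User request: {query}",
        "Which numbered entities does the request refer to?",
    ])
    return prompt, index


entities = [
    ScreenEntity("Call 555-0100", x=200, y=52),
    ScreenEntity("Pharmacy", x=40, y=50),
    ScreenEntity("Open until 9 pm", x=40, y=120),
]
prompt, index = build_prompt(
    "call that number", entities, ["User: find a pharmacy nearby"]
)
print(prompt)
```

The returned `index` maps each option number back to its entity, so a model answer such as "2" can be resolved to the underlying screen element. Keeping entities from the same visual row on one tab-separated line is one simple way to retain some spatial information in a text-only input.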

Experiments and Results

ReALM was evaluated against an existing reference resolution system with similar functionality and against GPT-3.5 and GPT-4. It outperformed the existing system across reference types, with the smallest model obtaining absolute gains of over 5% on on-screen references. The smallest model also performed comparably to GPT-4 with a fraction of the computational resources, and the larger models substantially outperformed GPT-4.


The analysis reveals several insights:

  • ReALM demonstrates enhanced performance in domain-specific queries over GPT-4, attributed to fine-tuning on targeted datasets.
  • The model showcases robustness and versatility in zero-shot settings, outperforming traditional fixed-task models across unseen domains.
  • The encoding scheme for on-screen entities in text form, while effective, suggests potential for further refinement to capture more nuanced spatial relationships and contextual details.

Conclusion and Future Directions

ReALM represents a significant stride towards integrating LLMs within the realm of reference resolution. By converting diverse entity types into a unified text-based format, the system can leverage the vast knowledge and flexibility of LLMs to interpret and act upon user references accurately. Future explorations may delve into more sophisticated encoding techniques to enhance model understanding of complex spatial and contextual nuances, paving the way for even more intuitive and responsive conversational agents.


This research advances the capabilities of conversational agents, enabling more natural and efficient user interactions. By harnessing the power of LLMs for reference resolution, systems can achieve a deeper understanding of context, significantly enhancing user experience across a variety of applications.

Speculations on Future AI Developments

The findings from ReALM may stimulate further investigations into the use of LLMs for additional NLP tasks, particularly where traditional models struggle with the integration of varied data types or require significant manual tuning. As LLMs continue to evolve, their potential to revolutionize conversational AI and beyond becomes increasingly evident, promising a future where machines can understand and respond to human language with unprecedented accuracy and nuance.


