ReALM: Reference Resolution As Language Modeling

Published Mar 29, 2024 in cs.CL , cs.AI , and cs.LG


Reference resolution is an important problem, one that is essential to understand and successfully handle context of different kinds. This context includes both previous turns and context that pertains to non-conversational entities, such as entities on the user's screen or those running in the background. While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underutilized. This paper demonstrates how LLMs can be used to create an extremely effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality. We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.
Figure: Performance improvements across experiments, from baseline finetuning to advanced on-screen grab and element-separation techniques.


  • The paper introduces ReALM, a new approach that leverages large language models (LLMs) to improve reference resolution across conversational, on-screen, and background entities, demonstrating significant advances over existing systems.

  • ReALM transforms the challenge of reference resolution into a language modeling problem, enabling effective handling of various entity types by converting them into a text-based format for LLM processing.

  • The methodology encodes entities into a textual format and frames the resolution task as a multiple-choice problem, allowing ReALM to outperform a traditional NLP system with similar functionality and to achieve results comparable to GPT-3.5 and GPT-4.

  • Future directions include refining the encoding techniques to better capture spatial relationships and contextual details, aiming for more intuitive conversational agents.


Reference resolution stands as a pivotal component in enhancing the capability of conversational agents, allowing systems to grasp context that spans beyond the immediate dialogue and encompasses various forms of non-conversational entities. The paper presents a novel approach, termed ReALM (Reference Resolution As Language Modeling), which uses large language models (LLMs) to address reference resolution across conversational, on-screen, and background entities. The method achieves significant performance improvements over an existing system with similar functionality and performs comparably to, or better than, GPT-3.5 and GPT-4.

Problem Statement and Motivation

Conversational agents are expected to interpret ambiguous references seamlessly, much as humans do. This includes deciphering context from prior dialogue turns as well as from dynamic content displayed on a user's screen or entities operating in the background. While LLMs have shown exceptional proficiency across a variety of tasks, their application to reference resolution, particularly for entities that are not inherently textual, remains largely unexplored. ReALM aims to bridge this gap by recasting reference resolution as a language modeling problem: screen elements are transformed into a text-based format conducive to LLM processing, allowing the system to handle references to on-screen content.
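To make the screen-to-text idea concrete, here is a minimal sketch of how on-screen elements might be linearized into text while preserving their spatial arrangement. The `ScreenElement` fields and the row-grouping heuristic are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch: linearize on-screen UI elements into text, keeping reading order.
# Elements are sorted top-to-bottom, left-to-right; elements whose vertical
# positions are within `line_tolerance` pixels are joined on one text line.
from dataclasses import dataclass

@dataclass
class ScreenElement:
    text: str
    x: int  # left edge, in pixels (assumed field)
    y: int  # top edge, in pixels (assumed field)

def linearize_screen(elements, line_tolerance=10):
    ordered = sorted(elements, key=lambda e: (e.y, e.x))
    lines, current, current_y = [], [], None
    for el in ordered:
        # Start a new text line when the element sits clearly below the
        # current row.
        if current and abs(el.y - current_y) > line_tolerance:
            lines.append(" ".join(current))
            current = []
        if not current:
            current_y = el.y
        current.append(el.text)
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)
```

For example, a contact label and a phone number at nearly the same height would land on one line, with a "Call" button below them on the next, giving the LLM a plain-text view of the screen's layout.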

Related Work

ReALM is distinct in its endeavor to fuse conversational context with on-screen entity resolution using LLMs. Previous systems either specialized in one domain without overarching coverage or employed non-LLM methodologies that lacked the flexibility and scalability inherent to LLMs. Notably, ReALM advances beyond the conventional pipeline approaches, integrating the responsiveness and adaptability of LLMs to engage with a wider array of reference types without extensive manual intervention.


Methodology

At the core of ReALM is the transformation of the reference resolution task into a multiple-choice problem solvable by an LLM. The process encodes the different entity types (conversational, on-screen, and background) into a textual format the model can interpret. For on-screen entities, previously a challenging domain for text-based models, this means a linear, textual representation that preserves spatial awareness. The approach uses synthesized and annotated datasets covering the full variety of reference types, enabling comprehensive model training and evaluation.
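The multiple-choice framing can be sketched as a prompt that tags each candidate entity with a number and asks the model to emit the number(s) of the referent. The template and tag format below are illustrative assumptions, not the paper's exact prompt.

```python
# Sketch: framing reference resolution as a multiple-choice task.
# Each candidate entity gets a numeric tag; the LLM is asked to output the
# tag(s) of the entity the user refers to.
def build_prompt(entities, conversation, query):
    lines = ["Entities on screen and in context:"]
    for i, (etype, desc) in enumerate(entities, start=1):
        lines.append(f"{i}. [{etype}] {desc}")
    lines.append("Conversation so far:")
    lines.extend(conversation)
    lines.append(f"User: {query}")
    lines.append("Which entity number(s) does the user refer to?")
    return "\n".join(lines)

entities = [
    ("onscreen", "Phone number 415-555-0000"),
    ("background", "Timer: 10 minutes remaining"),
]
prompt = build_prompt(entities, ["User: open my alarms"], "call that number")
```

A finetuned model answering "1" here resolves "that number" to the on-screen phone number; framing the task this way lets a standard decoder-only LLM handle it with ordinary next-token generation.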

Experiments and Results

ReALM was evaluated against a traditional NLP system with similar functionality and against GPT-3.5 and GPT-4. ReALM resolved references more accurately across the board, with especially large gains on on-screen references. Notably, it performed comparably to GPT-4 while using a fraction of the computational resources, underscoring its efficiency and effectiveness.


The analysis reveals several insights:

  • ReALM demonstrates enhanced performance in domain-specific queries over GPT-4, attributed to fine-tuning on targeted datasets.

  • The model showcases robustness and versatility in zero-shot settings, outperforming traditional fixed-task models across unseen domains.

  • The encoding scheme for on-screen entities in text form, while effective, suggests potential for further refinement to capture more nuanced spatial relationships and contextual details.
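One direction the refinement mentioned above could take is keeping explicit bounding-box coordinates in the text itself, rather than relying only on reading order. The sketch below is purely illustrative; the field names and `text @ (x, y, w, h)` format are assumptions, not anything proposed in the paper.

```python
# Sketch: a richer encoding that exposes approximate element positions as
# text tags, so the model can reason about spatial relations directly.
def encode_with_boxes(elements):
    """Render each element as `text @ (x, y, w, h)`, sorted into reading
    order (top-to-bottom, then left-to-right)."""
    return "\n".join(
        f"{e['text']} @ ({e['x']}, {e['y']}, {e['w']}, {e['h']})"
        for e in sorted(elements, key=lambda e: (e["y"], e["x"]))
    )

elems = [
    {"text": "Call", "x": 10, "y": 40, "w": 60, "h": 20},
    {"text": "Contact", "x": 10, "y": 5, "w": 80, "h": 20},
]
print(encode_with_boxes(elems))
# Contact @ (10, 5, 80, 20)
# Call @ (10, 40, 60, 20)
```

The trade-off is longer prompts in exchange for letting the model answer queries like "the button below the contact name" without the layout having been flattened away.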

Conclusion and Future Directions

ReALM represents a significant stride towards integrating LLMs within the realm of reference resolution. By converting diverse entity types into a unified text-based format, the system can leverage the vast knowledge and flexibility of LLMs to interpret and act upon user references accurately. Future explorations may delve into more sophisticated encoding techniques to enhance model understanding of complex spatial and contextual nuances, paving the way for even more intuitive and responsive conversational agents.


This research advances the capabilities of conversational agents, enabling more natural and efficient user interactions. By harnessing the power of LLMs for reference resolution, systems can achieve a deeper understanding of context, significantly enhancing user experience across a variety of applications.

Speculations on Future AI Developments

The findings from ReALM may stimulate further investigations into the use of LLMs for additional NLP tasks, particularly where traditional models struggle with the integration of varied data types or require significant manual tuning. As LLMs continue to evolve, their potential to revolutionize conversational AI and beyond becomes increasingly evident, promising a future where machines can understand and respond to human language with unprecedented accuracy and nuance.
