ReALM: Reference Resolution As Language Modeling (2403.20329v2)
Abstract: Reference resolution is an important problem, one that is essential for understanding and successfully handling context of different kinds. This context includes both previous turns and context that pertains to non-conversational entities, such as entities on the user's screen or those running in the background. While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underexplored. This paper demonstrates how LLMs can be used to create an extremely effective system for resolving references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities, such as those on screen, that are not traditionally conducive to being reduced to a text-only modality. We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.
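To make the core idea concrete, the sketch below shows one way on-screen entities could be serialized into a tagged, text-only prompt so that an LLM can resolve a reference by emitting entity tags. This is a minimal illustration under assumed structures, not the paper's exact encoding: the `Entity` fields, the `[[id]]` tag format, and the stubbed model call are hypothetical.

```python
# Minimal sketch of casting reference resolution as language modeling.
# The Entity fields, tag format, and the stubbed model call are
# illustrative assumptions, not ReALM's exact encoding.
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    entity_id: int
    entity_type: str   # e.g. "phone_number", "address", "business_name"
    text: str          # surface text of the entity as shown on screen
    top: float         # bounding-box position, used only for ordering
    left: float


def build_prompt(query: str, entities: List[Entity]) -> str:
    """Serialize on-screen entities into a tagged, text-only prompt.

    Entities are ordered top-to-bottom, then left-to-right, so the
    textual listing roughly preserves the screen's spatial layout.
    """
    ordered = sorted(entities, key=lambda e: (e.top, e.left))
    lines = [f"[[{e.entity_id}]] ({e.entity_type}) {e.text}" for e in ordered]
    return (
        "Screen entities:\n" + "\n".join(lines)
        + f"\n\nUser request: {query}\n"
        + "Which entity ids does the request refer to?"
    )


if __name__ == "__main__":
    screen = [
        Entity(1, "business_name", "Joe's Pizza", top=0.1, left=0.1),
        Entity(2, "phone_number", "(555) 123-4567", top=0.2, left=0.1),
        Entity(3, "address", "12 Main St", top=0.3, left=0.1),
    ]
    prompt = build_prompt("call this place", screen)
    print(prompt)  # a fine-tuned LLM would then emit e.g. "[[2]]"
```

Framing the output as tag prediction keeps decoding constrained and lets a single prompt format cover conversational, on-screen, and background entities alike, which is what allows the problem to be handled end to end by a language model.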
- GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- MARRS: Multimodal reference resolution system. In Proceedings of the Sixth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2023), pages 51–58.
- METRO: Efficient denoising pretraining of large scale autoencoding language models with model generated signals. arXiv preprint arXiv:2204.06644.
- Referring to screen texts with voice assistants. In ACL.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Measuring massive multitask language understanding. In International Conference on Learning Representations.
- BROS: A pre-trained language model focusing on text and layout for better key information extraction from documents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10767–10775.
- Cost-effective end-to-end information extraction for semi-structured document images. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3375–3383, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Spatial dependency parsing for semi-structured document information extraction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 330–343, Online. Association for Computational Linguistics.
- Dual attention networks for visual reference resolution in visual dialog. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2024–2033.
- Visual coreference resolution in visual dialog using neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 153–169.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022.
- Alice Ljungholm. 2021. Voice interaction vs screen interaction when controlling your music-system. In Proceedings of the 21st Student Conference in Interaction Technology and Design, pages 103–108.
- Ewa Luger and Abigail Sellen. 2016. "Like having a really bad PA": The gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 5286–5297.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.
- Beyond English-centric bitexts for better multilingual language representation learning. arXiv preprint arXiv:2210.14867.
- ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR.
- ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3715–3734, Seattle, United States. Association for Computational Linguistics.
- Factor graph attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2039–2048.
- Heroes, villains, and victims, and GPT-3: Automated extraction of character roles without training data. In Proceedings of the 4th Workshop of Narrative Understanding (WNU2022), pages 47–56, Seattle, United States. Association for Computational Linguistics.
- Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
- Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
- Emergent abilities of large language models. Transactions on Machine Learning Research.
- LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2579–2591, Online. Association for Computational Linguistics.
- LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200.
- Vector-quantized image modeling with improved VQGAN. arXiv preprint arXiv:2110.04627.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.