J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution (2403.19259v1)
Abstract: Understanding expressions that refer to the physical world is crucial for human-assisting systems in the real world, such as robots that must perform the actions users expect. In real-world reference resolution, a system must ground the verbal information that appears in user interactions to the visual information observed in egocentric views. To this end, we propose a multimodal reference resolution task and construct a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3). Our dataset contains egocentric video and dialogue audio of real-world conversations between two people acting as a master and an assistant robot at home. The dataset is annotated with crossmodal tags between phrases in the utterances and the object bounding boxes in the video frames. These tags include indirect reference relations, such as predicate-argument structures and bridging references, as well as direct reference relations. We also constructed an experimental model and clarified the challenges in multimodal reference resolution tasks.