Learning to Summarize and Answer Questions about a Virtual Robot's Past Actions (2306.09922v1)
Abstract: When robots perform long action sequences, users will want an easy and reliable way to find out what the robots have done. We therefore demonstrate the task of learning to summarize and answer questions about a robot agent's past actions using natural language alone. A single system with an LLM at its core is trained both to summarize and to answer questions about action sequences, given ego-centric video frames of a virtual robot and a question prompt. To enable training of question answering, we develop a method to automatically generate English-language questions and answers about objects, actions, and the temporal order in which actions occurred during episodes of robot action in the virtual environment. Training one model to both summarize and answer questions enables zero-shot transfer of object representations learned through question answering, improving summarization of actions involving objects not seen during summarization training.
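The abstract describes automatically generating questions and answers about objects, actions, and temporal order from episodes of robot action. A minimal sketch of how such template-based QA generation could work over a symbolic action trace is shown below; the verb/object pairs, question templates, and function name are illustrative assumptions, not the paper's actual schema.

```python
# Sketch of template-based QA generation from a symbolic action trace.
# All templates and episode contents here are illustrative assumptions.
from itertools import combinations

def generate_qa(actions):
    """actions: ordered list of (verb, object) tuples from one episode."""
    qa = []
    # Object questions: which object did a given action involve?
    for verb, obj in actions:
        qa.append((f"What did the robot {verb}?", obj))
    # Action questions: what did the robot do to a given object?
    for verb, obj in actions:
        qa.append((f"What did the robot do to the {obj}?", verb))
    # Temporal-order questions over each pair of distinct steps,
    # asked in both orders so both "yes" and "no" answers occur.
    for (i, (v1, o1)), (j, (v2, o2)) in combinations(enumerate(actions), 2):
        qa.append((f"Did the robot {v1} the {o1} "
                   f"before it {v2} the {o2}?", "yes"))
        qa.append((f"Did the robot {v2} the {o2} "
                   f"before it {v1} the {o1}?", "no"))
    return qa

episode = [("pick up", "apple"), ("open", "fridge"), ("put down", "apple")]
pairs = generate_qa(episode)
```

Pairing each generated question with ego-centric video frames of the episode then yields (frames, question, answer) training triples for the question-answering objective.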
- Apostolidis E, Adamantidou E, Metsai AI, et al (2021) Video summarization using deep neural networks: A survey. arXiv preprint arXiv:210106072 Bärmann and Waibel [2022] Bärmann L, Waibel A (2022) Where did i leave my keys? - episodic-memory-based question answering on egocentric videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp 1560–1568 Barrett et al [2015] Barrett DP, Bronikowski SA, Yu H, et al (2015) Robot language learning, generation, and comprehension. arXiv preprint arXiv:150806161 Barrett et al [2017] Barrett DP, Bronikowski SA, Yu H, et al (2017) Driving under the influence (of language). IEEE transactions on neural networks and learning systems 29(7):2668–2683 Bisk et al [2020] Bisk Y, Holtzman A, Thomason J, et al (2020) Experience grounds language. arXiv preprint arXiv:200410151 Carta et al [2022] Carta T, Lamprier S, Oudeyer PY, et al (2022) Eager: Asking and answering questions for automatic reward shaping in language-guided rl. arXiv preprint arXiv:220609674 Chandu et al [2021] Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. 
In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. 
Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Bärmann L, Waibel A (2022) Where did i leave my keys? - episodic-memory-based question answering on egocentric videos. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp 1560–1568 Barrett et al [2015] Barrett DP, Bronikowski SA, Yu H, et al (2015) Robot language learning, generation, and comprehension. arXiv preprint arXiv:150806161 Barrett et al [2017] Barrett DP, Bronikowski SA, Yu H, et al (2017) Driving under the influence (of language). IEEE transactions on neural networks and learning systems 29(7):2668–2683 Bisk et al [2020] Bisk Y, Holtzman A, Thomason J, et al (2020) Experience grounds language. arXiv preprint arXiv:200410151 Carta et al [2022] Carta T, Lamprier S, Oudeyer PY, et al (2022) Eager: Asking and answering questions for automatic reward shaping in language-guided rl. arXiv preprint arXiv:220609674 Chandu et al [2021] Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. 
Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. 
In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. 
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Barrett DP, Bronikowski SA, Yu H, et al (2015) Robot language learning, generation, and comprehension. arXiv preprint arXiv:150806161 Barrett et al [2017] Barrett DP, Bronikowski SA, Yu H, et al (2017) Driving under the influence (of language). IEEE transactions on neural networks and learning systems 29(7):2668–2683 Bisk et al [2020] Bisk Y, Holtzman A, Thomason J, et al (2020) Experience grounds language. arXiv preprint arXiv:200410151 Carta et al [2022] Carta T, Lamprier S, Oudeyer PY, et al (2022) Eager: Asking and answering questions for automatic reward shaping in language-guided rl. arXiv preprint arXiv:220609674 Chandu et al [2021] Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. 
arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Barrett DP, Bronikowski SA, Yu H, et al (2017) Driving under the influence (of language). IEEE transactions on neural networks and learning systems 29(7):2668–2683 Bisk et al [2020] Bisk Y, Holtzman A, Thomason J, et al (2020) Experience grounds language. arXiv preprint arXiv:200410151 Carta et al [2022] Carta T, Lamprier S, Oudeyer PY, et al (2022) Eager: Asking and answering questions for automatic reward shaping in language-guided rl. arXiv preprint arXiv:220609674 Chandu et al [2021] Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. 
In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Bisk Y, Holtzman A, Thomason J, et al (2020) Experience grounds language. arXiv preprint arXiv:200410151 Carta et al [2022] Carta T, Lamprier S, Oudeyer PY, et al (2022) Eager: Asking and answering questions for automatic reward shaping in language-guided rl. arXiv preprint arXiv:220609674 Chandu et al [2021] Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. 
arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Taniguchi et al [2019] Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language.
Cognitive Psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191
Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168
- Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329
- Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66
- Gao D, Wang R, Bai Z, et al (2021) Env-QA: A video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685
- Gordon D, Kembhavi A, Rastegari M, et al (2018) IQA: Visual question answering in interactive environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4089–4098
- Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1–3):335–346
- He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
- Kolve E, Mottaghi R, Han W, et al (2017) AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474
- Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp 74–81
- Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636
- McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974
- McDermott D, Ghallab M, Howe A, et al (1998) PDDL: The Planning Domain Definition Language. Technical report
- Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227
- Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601
- Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining Text Data. Springer, pp 43–76
- Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108
- Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: ACL
- Papineni K, Roukos S, Ward T, et al (2002) BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
- Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
- Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
- Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
- Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
- Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
- Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15–16):700–730
- Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
- Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
- Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
- Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
- Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
- Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
- Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Bisk Y, Holtzman A, Thomason J, et al (2020) Experience grounds language. arXiv preprint arXiv:200410151 Carta et al [2022] Carta T, Lamprier S, Oudeyer PY, et al (2022) Eager: Asking and answering questions for automatic reward shaping in language-guided rl. arXiv preprint arXiv:220609674 Chandu et al [2021] Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. 
arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Carta T, Lamprier S, Oudeyer PY, et al (2022) Eager: Asking and answering questions for automatic reward shaping in language-guided rl. arXiv preprint arXiv:220609674 Chandu et al [2021] Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. 
Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. 
In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. 
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. 
In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. 
Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. 
In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329
Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66
Gao D, Wang R, Bai Z, et al (2021) Env-QA: A video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685
Gordon D, Kembhavi A, Rastegari M, et al (2018) IQA: Visual question answering in interactive environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4089–4098
Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346
He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
Kolve E, Mottaghi R, Han W, et al (2017) AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474
Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp 74–81
Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636
McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974
McDermott D, Ghallab M, Howe A, et al (1998) PDDL: The Planning Domain Definition Language. Tech. rep.
Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Bisk Y, Holtzman A, Thomason J, et al (2020) Experience grounds language. arXiv preprint arXiv:200410151 Carta et al [2022] Carta T, Lamprier S, Oudeyer PY, et al (2022) Eager: Asking and answering questions for automatic reward shaping in language-guided rl. arXiv preprint arXiv:220609674 Chandu et al [2021] Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Carta T, Lamprier S, Oudeyer PY, et al (2022) Eager: Asking and answering questions for automatic reward shaping in language-guided rl. arXiv preprint arXiv:220609674 Chandu et al [2021] Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. 
In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. 
In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. 
In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. 
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. 
In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. 
In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. 
Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. 
In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- Bisk Y, Holtzman A, Thomason J, et al (2020) Experience grounds language. arXiv preprint arXiv:2004.10151
- Carta T, Lamprier S, Oudeyer PY, et al (2022) EAGER: Asking and answering questions for automatic reward shaping in language-guided RL. arXiv preprint arXiv:2206.09674
- Chandu KR, Bisk Y, Black AW (2021) Grounding 'grounding' in NLP. arXiv preprint arXiv:2106.02192
- Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128
- DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track
- Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168
- Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329
- Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66
- Gao D, Wang R, Bai Z, et al (2021) Env-QA: A video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685
- Gordon D, Kembhavi A, Rastegari M, et al (2018) IQA: Visual question answering in interactive environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4089–4098
- Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346
- He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
- Kolve E, Mottaghi R, Han W, et al (2017) AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474
- Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp 74–81
- Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636
- McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974
- McDermott D, Ghallab M, Howe A, et al (1998) PDDL: The Planning Domain Definition Language. Technical Report
- Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227
- Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601
- Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining Text Data. Springer, pp 43–76
- Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108
- Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: ACL
- Papineni K, Roukos S, Ward T, et al (2002) BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
- Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
- Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
- Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
- Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
- Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
- Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730
- Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
- Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
- Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
- Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
- Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
- Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
- Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Chandu KR, Bisk Y, Black AW (2021) Grounding’grounding’in nlp. arXiv preprint arXiv:210602192 Datta et al [2022] Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. 
In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. 
In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329
Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66
Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-QA: A video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685
Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) IQA: Visual question answering in interactive environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4089–4098
Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1–3):335–346
He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474
Lin [2004] Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp 74–81
Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636
McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974
McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) PDDL — the Planning Domain Definition Language. Technical report
Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227
Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601
Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining Text Data. Springer, pp 43–76
Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108
Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: ACL
Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
Taniguchi et al [2019] Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15–16):700–730
Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X
Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. 
In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. 
In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. 
Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
- Carta T, Lamprier S, Oudeyer PY, et al (2022) EAGER: Asking and answering questions for automatic reward shaping in language-guided RL. arXiv preprint arXiv:2206.09674
- Chandu KR, Bisk Y, Black AW (2021) Grounding 'grounding' in NLP. arXiv preprint arXiv:2106.02192
- Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128
- DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track
- Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168
- Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329
- Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66
- Gao D, Wang R, Bai Z, et al (2021) Env-QA: A video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685
- Gordon D, Kembhavi A, Rastegari M, et al (2018) IQA: Visual question answering in interactive environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4089–4098
- Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346
- He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
- Kolve E, Mottaghi R, Han W, et al (2017) AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474
- Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp 74–81
- Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636
- McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974
- McDermott D, Ghallab M, Howe A, et al (1998) PDDL – the Planning Domain Definition Language. Technical report
- Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227
- Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601
- Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining Text Data. Springer, pp 43–76
- Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108
- Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: ACL
- Papineni K, Roukos S, Ward T, et al (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
- Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
- Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
- Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
- Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
- Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
- Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730
- Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
- Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
- Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
- Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
- Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
- Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
- Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. 
In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. 
Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. 
In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. 
In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. 
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. 
In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. 
In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. 
In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. 
In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. 
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. 
Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
Springer, p 227–241 McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- Datta S, Dharur S, Cartillier V, et al (2022) Episodic memory question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19119–19128 DeChant and Bauer [2021] DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. 
arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: 2009 IEEE International Conference on Robotics and Automation (ICRA)
Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66
Gao D, Wang R, Bai Z, et al (2021) Env-QA: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685
Gordon D, Kembhavi A, Rastegari M, et al (2018) IQA: Visual question answering in interactive environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4089–4098
Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346
He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
Kolve E, Mottaghi R, Han W, et al (2017) AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474
Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp 74–81
Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636
McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974
McDermott D, Ghallab M, Howe A, et al (1998) PDDL - the planning domain definition language. Tech. rep.
Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227
Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601
Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining Text Data. Springer, pp 43–76
Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108
Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: ACL
Papineni K, Roukos S, Ward T, et al (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730
Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Harnad S (1990) The symbol grounding problem. 
Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
Springer, pp 227–241
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- DeChant C, Bauer D (2021) Toward robots that learn to summarize their actions in natural language: a set of tasks. In: 5th Annual Conference on Robot Learning, Blue Sky Submission Track Dzifcak et al [2009] Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168 Fried et al [2018] Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. 
In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Gao D, Wang R, Bai Z, et al (2021) Env-qa: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text Summarization Branches Out, pp 74–81
Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636
McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974
McDermott D, Ghallab M, Howe A, et al (1998) PDDL - the planning domain definition language. Technical report
Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227
Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601
Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining Text Data. Springer, pp 43–76
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. 
Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
- Dzifcak J, Scheutz M, Baral C, et al (2009) What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In: 2009 IEEE International Conference on Robotics and Automation, IEEE, pp 4163–4168
- Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329
- Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66
- Gao D, Wang R, Bai Z, et al (2021) Env-QA: A video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685
- Gordon D, Kembhavi A, Rastegari M, et al (2018) IQA: Visual question answering in interactive environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4089–4098
- Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346
- He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
- Kolve E, Mottaghi R, Han W, et al (2017) AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474
- Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp 74–81
- Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636
- McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974
- McDermott D, Ghallab M, Howe A, et al (1998) PDDL – the Planning Domain Definition Language. Technical report
- Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227
- Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601
- Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining Text Data. Springer, pp 43–76
- Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108
- Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: ACL
- Papineni K, Roukos S, Ward T, et al (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
- Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
- Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
- Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
- Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
- Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
- Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730
- Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
- Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
- Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
- Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
- Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
- Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
- Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. 
In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. 
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. 
In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. 
In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. 
Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- Fried D, Hu R, Cirik V, et al (2018) Speaker-follower models for vision-and-language navigation. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp 3318–3329 Gambhir and Gupta [2017] Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66 Gao et al [2021] Gao D, Wang R, Bai Z, et al (2021) Env-QA: a video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685 Gordon et al [2018] Gordon D, Kembhavi A, Rastegari M, et al (2018) IQA: Visual question answering in interactive environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474 Lin [2004] Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) PDDL - the planning domain definition language.
Technical report Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining Text Data. Springer, pp 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation.
In: 5th Annual Conference on Robot Learning
- Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730
- Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
- Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
- Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
- Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
- Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
- Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
- Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. 
In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. 
In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. 
Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. 
In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. 
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. 
Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. 
Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. 
Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- Gao D, Wang R, Bai Z, et al (2021) Env-QA: A video question answering benchmark for comprehensive understanding of dynamic environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1675–1685
- Gordon D, Kembhavi A, Rastegari M, et al (2018) IQA: Visual question answering in interactive environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4089–4098
- Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346
- He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
- Kolve E, Mottaghi R, Han W, et al (2017) AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474
- Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp 74–81
- Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636
- McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974
- McDermott D, Ghallab M, Howe A, et al (1998) PDDL – the planning domain definition language. Technical report
- Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227
- Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601
- Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining Text Data. Springer, pp 43–76
- Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108
- Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: ACL
- Papineni K, Roukos S, Ward T, et al (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
- Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
- Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
- Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
- Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
- Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
- Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730
- Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
- Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
- Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
- Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
- Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
- Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
- Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. 
In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. 
In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. 
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. 
Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
Springer, p 227–241 McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
- Gordon D, Kembhavi A, Rastegari M, et al (2018) Iqa: Visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098 Harnad [1990] Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 Kolve et al [2017] Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. 
In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. 
In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. 
In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
Taniguchi et al [2019] Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730
Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1-3):335–346
- He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
- Kolve E, Mottaghi R, Han W, et al (2017) AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474
- Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp 74–81
- Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636
- McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974
- McDermott D, Ghallab M, Howe A, et al (1998) PDDL: The Planning Domain Definition Language. Technical report
- Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227
- Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601
- Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining Text Data. Springer, pp 43–76
- Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108
- Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: ACL
- Papineni K, Roukos S, Ward T, et al (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
- Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
- Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
- Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
- Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
- Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
- Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730
- Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
- Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
- Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
- Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
- Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
- Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
- Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. 
Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. 
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- Kolve E, Mottaghi R, Han W, et al (2017) Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:171205474 Lin [2004] Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. 
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- Lin CY (2004) Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81 Lu et al [2022] Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
- Lu K, Grover A, Abbeel P, et al (2022) Frozen pretrained transformers as universal computation engines. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 7628–7636 McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974 McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. 
Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. 
Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
Taniguchi et al [2019] Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730
Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- McClelland et al [2020] McClelland JL, Hill F, Rudolph M, et al (2020) Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models. Proceedings of the National Academy of Sciences 117(42):25966–25974
McDermott et al [1998] McDermott D, Ghallab M, Howe A, et al (1998) PDDL: the planning domain definition language. Technical Report
Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:2112.03227
Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601
Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining Text Data. Springer, p 43–76
Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108
Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: ACL
Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer.
Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- McDermott D, Ghallab M, Howe A, et al (1998) Pddl-the planning domain definition language. Technical Report, Tech Rep Mees et al [2021] Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. 
Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191
Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
- Mees O, Hermann L, Rosete-Beas E, et al (2021) Calvin - a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. arXiv preprint arXiv:211203227 Mooney [2008] Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601 Nenkova and McKeown [2012] Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, p 43–76 Nguyen et al [2021] Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. 
- Mooney RJ (2008) Learning to connect language and perception. In: AAAI, pp 1598–1601
- Nenkova A, McKeown K (2012) A survey of text summarization techniques. In: Mining text data. Springer, pp 43–76
- Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108
- Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: ACL
- Papineni K, Roukos S, Ward T, et al (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
- Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
- Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
- Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
- Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
- Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
- Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15–16):700–730
- Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
- Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
- Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
- Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
- Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
- Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
- Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. 
Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- Nguyen KX, Misra D, Schapire R, et al (2021) Interactive learning from activity description. In: International Conference on Machine Learning, PMLR, pp 8096–8108 Palaskar et al [2019] Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for how2 videos. In: ACL Papineni et al [2002] Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. 
ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. 
In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. 
Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. 
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- Palaskar S, Libovický J, Gella S, et al (2019) Multimodal abstractive summarization for How2 videos. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
- Papineni K, Roukos S, Ward T, et al (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
- Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020
- Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
- Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347
- Shridhar M, Thomason J, Gordon D, et al (2020) ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
- Shridhar M, Manuelli L, Fox D (2021) CLIPort: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
- Taniguchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730
- Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. In: Robotics: Science and Systems X
- Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
- Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
- Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
- Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
- Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
- Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- Papineni K, Roukos S, Ward T, et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318 Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. 
Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. 
In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. 
Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. 
Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. 
Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. 
In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241
- Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:210300020 Raffel et al [2020] Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941 Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212 Winograd [1972] Winograd T (1972) Understanding natural language. 
Cognitive psychology 3(1):1–191 Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45 Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67 Sanabria et al [2018] Sanabria R, Caglayan O, Palaskar S, et al (2018) How2: A large-scale dataset for multimodal language understanding. ArXiv abs/1811.00347 Shridhar et al [2020] Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749 Shridhar et al [2021] Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning Tangiuchi et al [2019] Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730 Tellex et al [2014] Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X Tellex et al [2020] Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55 Thomason et al [2019] Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. 
In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
Tsimpoukelli et al [2021] Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
Winograd [1972] Winograd T (1972) Understanding natural language. Cognitive Psychology 3(1):1–191
Wolf et al [2020] Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
Yoshino et al [2021] Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, pp 227–241
- Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21:1–67
- Shridhar M, Thomason J, Gordon D, et al (2020) Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10740–10749
- Shridhar M, Manuelli L, Fox D (2021) Cliport: What and where pathways for robotic manipulation. In: 5th Annual Conference on Robot Learning
- Tangiuchi T, Mochihashi D, Nagai T, et al (2019) Survey on frontiers of language and robotics. Advanced Robotics 33(15-16):700–730
- Tellex S, Knepper R, Li A, et al (2014) Asking for help using inverse semantics. Robotics: Science and Systems X
- Tellex S, Gopalan N, Kress-Gazit H, et al (2020) Robots that use language. Annual Review of Control, Robotics, and Autonomous Systems 3:25–55
- Thomason J, Padmakumar A, Sinapov J, et al (2019) Improving grounded natural language understanding through human-robot dialog. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 6934–6941
- Tsimpoukelli M, Menick JL, Cabi S, et al (2021) Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34:200–212
- Winograd T (1972) Understanding natural language. Cognitive psychology 3(1):1–191
- Wolf T, Chaumond J, Debut L, et al (2020) Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 38–45
- Yoshino K, Wakimoto K, Nishimura Y, et al (2021) Caption generation of robot behaviors based on unsupervised learning of action segments. In: Conversational Dialogue Systems for the Next Decade. Springer, p 227–241