Multimodal Speech Recognition for Language-Guided Embodied Agents (2302.14030v3)

Published 27 Feb 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents' ability to complete tasks. In this work, we propose training a multimodal ASR model to reduce errors in transcribing spoken instructions by considering the accompanying visual context. We train our model on a dataset of spoken instructions, synthesized from the ALFRED task completion dataset, where we simulate acoustic noise by systematically masking spoken words. We find that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines. We also find that a text-trained embodied agent successfully completes tasks more often by following transcribed instructions from multimodal ASR models. github.com/Cylumn/embodied-multimodal-asr
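
To make the setup concrete, below is a minimal, hypothetical sketch of the two ideas the abstract describes: simulating acoustic noise by masking the audio frames of whole words, and an ASR decoder that cross-attends over both audio and visual features so that masked words can be recovered from visual context. All module names, feature dimensions, and architecture choices here are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch only -- not the authors' code. Assumes 80-dim
# log-mel audio frames and 512-dim pooled visual features per observation.
import torch
import torch.nn as nn


def mask_word_spans(mel: torch.Tensor, spans: list[tuple[int, int]],
                    p: float = 0.3) -> torch.Tensor:
    """Zero out the frames of randomly chosen word spans, loosely
    simulating the systematic word masking described in the abstract."""
    mel = mel.clone()
    for start, end in spans:
        if torch.rand(1).item() < p:
            mel[start:end] = 0.0
    return mel


class MultimodalASR(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(80, d_model)    # log-mel frames (assumed dim)
        self.visual_proj = nn.Linear(512, d_model)  # image features (assumed dim)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, visual, prev_tokens):
        # Audio and visual tokens form one memory sequence; cross-attention
        # lets the decoder lean on visual context where the audio is masked.
        # (Causal target masking is omitted here for brevity.)
        memory = torch.cat([self.audio_proj(mel), self.visual_proj(visual)], dim=1)
        return self.out(self.decoder(self.embed(prev_tokens), memory))


# Example: 100 audio frames, 5 visual observations, 12 decoded tokens.
model = MultimodalASR(vocab_size=1000)
mel = torch.randn(100, 80)
mel = mask_word_spans(mel, spans=[(10, 30), (50, 70)])   # corrupt two words
visual = torch.randn(1, 5, 512)
prev = torch.randint(0, 1000, (1, 12))
logits = model(mel.unsqueeze(0), visual, prev)           # (1, 12, 1000)
```

Under this sketch, a unimodal baseline would simply drop the visual tokens from the memory sequence, which is one plausible way to frame the paper's comparison of masked-word recovery with and without visual context.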

Authors (6)
  1. Allen Chang (6 papers)
  2. Xiaoyuan Zhu (5 papers)
  3. Aarav Monga (1 paper)
  4. Seoho Ahn (1 paper)
  5. Tejas Srinivasan (20 papers)
  6. Jesse Thomason (65 papers)
Citations (2)
