
YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding (2210.04600v2)

Published 10 Oct 2022 in cs.CL and eess.AS

Abstract: Visually grounded speech (VGS) models are trained on images paired with unlabelled spoken captions. Such models could be used to build speech systems in settings where it is impossible to get labelled data, e.g. for documenting unwritten languages. However, most VGS studies are in English or other high-resource languages. This paper attempts to address this shortcoming. We collect and release a new single-speaker dataset of audio captions for 6k Flickr images in Yorùbá, a real low-resource language spoken in Nigeria. We train an attention-based VGS model where images are automatically tagged with English visual labels and paired with Yorùbá utterances. This enables cross-lingual keyword localisation: a written English query is detected and located in Yorùbá speech. To quantify the effect of the smaller dataset, we compare to English systems trained on similar and more data. We hope that this new dataset will stimulate research in the use of VGS models for real low-resource languages.
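
To make the attention mechanism described in the abstract concrete, below is a minimal PyTorch sketch of one plausible design: an embedding for a written English keyword attends over encoded Yorùbá speech frames, so the attention weights localise the keyword while a pooled score detects whether it occurs. This is an assumed illustration, not the authors' released model; the layer sizes, keyword vocabulary size, and all module names (`KeywordLocaliser`, `n_keywords`, etc.) are hypothetical.

```python
import torch
import torch.nn as nn

class KeywordLocaliser(nn.Module):
    """Hypothetical attention-based keyword detector/localiser (not the paper's exact model)."""

    def __init__(self, n_mels=40, n_keywords=67, dim=256):
        super().__init__()
        # Acoustic encoder: mel-spectrogram frames -> contextual frame features.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # One learned embedding per written English keyword (the queries).
        self.queries = nn.Embedding(n_keywords, dim)

    def forward(self, mels, keyword_ids):
        # mels: (batch, n_mels, frames); keyword_ids: (batch,)
        feats = self.encoder(mels)                  # (batch, dim, frames)
        q = self.queries(keyword_ids).unsqueeze(1)  # (batch, 1, dim)
        scores = torch.bmm(q, feats).squeeze(1)     # (batch, frames): per-frame match
        attn = scores.softmax(dim=-1)               # localisation: where the keyword is
        detect = (attn * scores).sum(dim=-1)        # detection: whether it occurs
        return detect, attn
```

In a setup like the one the abstract describes, `detect` would be trained with a binary cross-entropy loss against the automatically generated English visual tags for each image, and at test time the argmax of `attn` would give the frame where the queried keyword is located in the Yorùbá utterance.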

Authors (3)
  1. Kayode Olaleye (7 papers)
  2. Dan Oneata (24 papers)
  3. Herman Kamper (80 papers)
Citations (5)
