
Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks (2210.14712v1)

Published 26 Oct 2022 in cs.CL and cs.AI

Abstract: We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.

Authors (6)
  1. Colin Leong (10 papers)
  2. Joshua Nemecek (2 papers)
  3. Jacob Mansdorfer (1 paper)
  4. Anna Filighera (4 papers)
  5. Abraham Owodunni (5 papers)
  6. Daniel Whitenack (5 papers)
Citations (20)

Summary

Overview of "Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks"

The paper, "Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks," presents a comprehensive suite of datasets developed from the Bloom Library. These datasets aim to address the scarcity of multimodal and multilingual resources available for various NLP tasks, specifically focusing on underrepresented languages. The initial release covers 363 languages across 32 language families and introduces datasets facilitating tasks such as language modeling, image captioning, visual storytelling, and speech synthesis/recognition.

Dataset Composition and Release

The authors constructed four datasets:

  1. bloom-lm: This dataset includes text data from 351 languages, laying the groundwork for language modeling.
  2. bloom-captioning: Comprising image-caption pairs, this dataset assists in image-to-text and text-to-image tasks for 351 languages.
  3. bloom-vist: A dataset designed for visual storytelling, aligning sequential images with text in 351 languages.
  4. bloom-speech: Featuring speech-to-text and text-to-speech data in 56 languages, this dataset represents a significant expansion in language families included in aligned speech datasets.

The datasets are publicly available under Creative Commons licenses on the Hugging Face datasets hub, facilitating broader access and collaboration in low-resource, multimodal NLP research.
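
As a concrete starting point, the datasets can be loaded with the Hugging Face `datasets` library. The sketch below assumes the release is hosted under the `sil-ai` organization with per-language configurations named by ISO 639-3 code (e.g., `tha` for Thai); the exact dataset IDs, config names, and any gating or authentication requirements should be confirmed on the hub pages.

```python
# Minimal loading sketch, assuming the datasets live under the sil-ai
# organization with per-language configs keyed by ISO 639-3 code.
# Some hub releases are gated and may first require `huggingface-cli login`.
from datasets import load_dataset

bloom_lm = load_dataset("sil-ai/bloom-lm", "tha")                  # text for language modeling
bloom_captioning = load_dataset("sil-ai/bloom-captioning", "tha")  # image-caption pairs
bloom_speech = load_dataset("sil-ai/bloom-speech", "tha")          # aligned speech/text

print(bloom_lm)                  # inspect the available splits
print(bloom_speech["train"][0])  # inspect the fields of one example
```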

Baseline Experiments and Evaluation

To evaluate the datasets' utility, the authors conducted baseline experiments on language modeling, image captioning, and automatic speech recognition. Using pre-trained models and fine-tuning strategies, they obtained strong results for several languages, including low-resource languages like Bisu. Notably, Bisu achieved a word error rate (WER) of 0.11, a striking result given its estimated speaker population of roughly 700.

  • For language modeling, DistilBERT was fine-tuned on the bloom-lm dataset; perplexities varied by language, with Akha reaching a perplexity as low as 3.06.
  • In image captioning, adapted models from established methodologies were employed, with Thai achieving the highest BLEU score of 31.2.
  • For speech recognition, fine-tuning XLS-R models on bloom-speech data yielded strong performance, underlined by the Bisu language's low WER; a minimal fine-tuning sketch follows this list.
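
To make the ASR recipe concrete, here is a minimal fine-tuning setup using the Transformers CTC interface with a publicly available XLS-R checkpoint. The checkpoint name, the `text` field name, and the preprocessing choices are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch: fine-tuning an XLS-R checkpoint with a CTC head on
# bloom-speech data. Checkpoint, field names, and hyperparameters are
# assumptions; verify against the dataset schema and the paper's setup.
import json

from datasets import load_dataset
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

bloom_speech = load_dataset("sil-ai/bloom-speech", "tha")  # assumed dataset ID/config

# 1. Build a character vocabulary from the training transcripts.
#    The "text" field name is an assumption; inspect the dataset first.
chars = sorted(set("".join(ex["text"] for ex in bloom_speech["train"])))
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = vocab.pop(" ", len(vocab))  # CTC convention: | marks word breaks
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

# 2. Assemble the processor (feature extractor + CTC tokenizer).
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# 3. Load a pre-trained XLS-R model with a fresh CTC head sized to the vocab.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # standard practice when fine-tuning on small data

# From here, resample audio to 16 kHz, encode examples with the processor,
# and train with the Trainer API, evaluating WER on the held-out split.
```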

Implications and Future Directions

The Bloom Library datasets carry far-reaching implications for NLP research and application. By significantly expanding language and modality coverage, they enable researchers to train and fine-tune models for a wide array of underrepresented languages. This is a critical step toward reducing systematic inequalities in AI research and application. The datasets can catalyze further linguistic diversity in NLP research, opening opportunities for cross-lingual and multilingual tasks, particularly where parallel corpora are required.

The authors acknowledge limitations concerning data quality and bias, particularly in community-submitted content, which introduces variability in quality and consistency across the datasets. Future work should focus on expanding dataset size and quality, further characterizing baseline performance, and developing aligned multilingual versions for other potential applications.

In summary, the Bloom Library datasets mark substantial progress in building multimodal and multilingual resources for NLP tasks. By paving the way for more inclusive AI research that encompasses diverse linguistic ecologies, this work sets a precedent for future endeavors in low-resource language technologies.
