- The paper introduces a novel benchmark, MTOB, that mimics L2 learning by using a single grammar book to translate a low-resource language.
- The methodology leverages structured linguistic references and evaluates models using metrics like chrF scores against a human baseline.
- The findings underscore the value of comprehensive context retrieval while highlighting current LLMs’ limitations in matching human translation accuracy.
Benchmarking Learning-to-Translate Using a Single Grammar Book
The paper "A Benchmark for Learning to Translate a New Language from One Grammar Book" by Garrett Tanzer et al. introduces MTOB (Machine Translation from One Book), a benchmark for evaluating LLMs' ability to translate between English and Kalamang, a low-resource language virtually absent from internet datasets. The approach leverages structured linguistic reference materials rather than large mined corpora, a paradigm more akin to second-language (L2) learning than to conventional machine translation training.
Key Contributions
This work provides several unique contributions:
- Novel Task Framing: The benchmark challenges models to translate using a single book of grammatical explanations and accompanying bilingual resources, rather than extensive in-domain parallel data. This setup mimics an L2 classroom learning environment and tests whether LLMs can adapt to genuinely unseen tasks.
- Evaluation of LLM Capabilities: The paper positions MTOB as an insightful measure of current LLMs' abilities to adapt using structured but limited data sources. Baseline models like text-davinci-003, gpt-3.5-turbo, gpt-4, and Claude 2 exhibit promising results, though they fall short of human performance. Notably, Claude 2 achieves 44.7 chrF for Kalamang-to-English and 45.8 chrF for English-to-Kalamang, compared with 51.6 and 57.0 chrF, respectively, for the human baseline.
- Human Baseline: The paper incorporates a human baseline: the first author methodically studied the grammatical material and performed the translations, providing a reference point, and arguably a lower bound, for what is achievable from these resources alone.
- Implications for Low-resource Languages: By focusing on Kalamang, an underrepresented language with under 200 speakers, the initiative addresses the limits of corpus-scaling approaches and promotes data-efficient training methods that may benefit communities with limited digital presence.
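Both translation directions are scored with chrF, a character n-gram F-score. As a rough illustration only (not the paper's exact sacreBLEU configuration), a simplified sentence-level chrF can be sketched as:

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over character n-gram overlaps, averaged over
    n = 1..max_n. Whitespace is stripped, as in the metric's default setting."""
    def ngrams(text: str, n: int) -> Counter:
        chars = text.replace(" ", "")
        return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))

    if not precisions or not recalls:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return 0.0 if p + r == 0 else 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

With beta = 2, recall is weighted more heavily than precision; a perfect match scores 100, and disjoint character sets score 0.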
Dataset and Methodology
The MTOB dataset comprises three main components:
- Grammar Book: The text from A Grammar of Kalamang, providing comprehensive linguistic details.
- Bilingual Word List: A dictionary offering word translations and part-of-speech tags.
- Parallel Corpus: Approximately 500 sentence pairs, curated and split into training and test sets.
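As a toy illustration of how the bilingual word list might be consumed downstream, the sketch below gathers dictionary entries for an input sentence; the entry structure and placeholder glosses are invented for illustration, not taken from the actual dataset:

```python
# Hypothetical word-list entries: surface form -> (gloss, part-of-speech tag).
# Placeholder target-language glosses; not real Kalamang vocabulary.
WORDLIST = {
    "house": ("TARGET-WORD-1", "noun"),
    "big": ("TARGET-WORD-2", "adjective"),
}

def wordlist_context(sentence: str, wordlist: dict[str, tuple[str, str]]) -> str:
    """Collect dictionary entries for the words of an input sentence,
    e.g. to prepend to a translation prompt. Tokenization here is naive
    (lowercase + whitespace split), ignoring punctuation and morphology."""
    lines = []
    for word in sentence.lower().split():
        if word in wordlist:
            gloss, pos = wordlist[word]
            lines.append(f"{word} ({pos}): {gloss}")
    return "\n".join(lines)
```

A real pipeline would need morphological normalization to match inflected forms against citation forms, but the lookup-and-format pattern is the core idea.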
Experimental Setup
The research evaluates 12 models, including pretrained public models (e.g., LLaMA, Llama 2), finetuned variants using the grammar book, and several API-based models (e.g., gpt-4, Claude 2). For each model, different contextual retrieval schemes are applied:
- No Context (-): Baseline without additional context.
- Wordlist Context (W): Retrieval of relevant dictionary entries.
- Sentence Context (S): Similar sentences from the parallel corpus.
- Grammar Book Context (G): Chunks from the grammar book retrieved by either longest-common substring or embedding similarity.
Longer contexts, up to essentially the entire grammar book, were tested primarily with Claude 2 due to its long context window.
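One of the grammar-book retrieval strategies above scores chunks by their longest common substring with the input. A minimal sketch of that idea using Python's standard-library `difflib` (the example chunks are invented placeholders, not text from A Grammar of Kalamang):

```python
from difflib import SequenceMatcher

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common substring (not subsequence) of a and b."""
    al, bl = a.lower(), b.lower()
    matcher = SequenceMatcher(None, al, bl, autojunk=False)
    return matcher.find_longest_match(0, len(al), 0, len(bl)).size

def retrieve_grammar_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks sharing the longest substring with the query."""
    return sorted(chunks, key=lambda c: lcs_length(query, c), reverse=True)[:k]

# Invented example chunks, standing in for passages of the grammar book.
CHUNKS = [
    "Postpositions attach to the noun phrase they modify.",
    "Negation is expressed by a clause-final particle.",
    "The language distinguishes inclusive and exclusive first-person pronouns.",
]
```

Longest-common-substring matching rewards literal word overlap, which is why it can miss grammar passages that explain a construction without repeating the query's wording; embedding similarity is the natural complement.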
Results and Analysis
Performance Trends: Larger, more capable models performed better, and added context improved results incrementally. Word-list and sentence context proved more beneficial than grammar-book excerpts, which underperformed, likely because retrieval of the relevant grammar passages was suboptimal.
Claude 2 Performance: Claude 2's best results came from conditioning on the entire grammar book (the Gl setting), underlining the importance of comprehensive context.
Human Baseline: The human baseline outperformed all machine baselines, illustrating the gap between current machine translation capabilities and the nuanced understanding achievable by a person who has studied the same materials.
Implications and Future Work
Practical and Theoretical Impact: MTOB sets a relevant precedent in translation tasks, especially highlighting the versatility needed in low-resource contexts where enormous, diverse corpora are unavailable. The benchmark illustrates a pathway where structured human-like linguistic guidance can significantly inform machine learning approaches.
Potential Developments: Future work could optimize context retrieval methods, extend the benchmark to other low-resource languages, and develop multimodal models that integrate speech and text to better emulate real-world L2 acquisition and close the gaps revealed here. The authors also encourage participatory dialogue with linguistic communities to ensure culturally sensitive applications.
Conclusion
"A Benchmark for Learning to Translate a New Language from One Grammar Book" presents an innovative approach to evaluating LLMs' adaptability, employing a linguistically informed task to drive genuine advancements in low-resource machine translation. The benchmark and findings propose a future where smaller, well-documented datasets and comprehensive context windows can enable more equitable access to language technology for underrepresented languages.