Infinity-Chat: Bilingual Dialogue Dataset

Updated 2 November 2025
  • The Infinity-Chat Dataset comprises two massive paired bilingual dialogue corpora, featuring 28M English-Chinese and 18M English-German dialogues that enable robust neural chat translation.
  • It employs automated data acquisition, advanced sentence alignment with Vecalign and LASER, and four-utterance grouping to ensure authentic conversational context and high-quality training data.
  • Its integration within the scheduled multi-task learning (SML) framework has yielded BLEU improvements of up to 5 points, highlighting its significant impact on neural chat translation research.

Large-scale, in-domain, paired bilingual dialogue datasets for neural chat translation constitute the most extensive publicly available parallel chat corpora to date for English-Chinese (En↔Zh) and English-German (En↔De) (Liang et al., 2022). Designed specifically to address the limitations of prior small-scale, out-of-domain, or noisy resources, these corpora enable robust modeling and training of context- and coherence-aware Neural Chat Translation (NCT) systems. They support scalable, data-driven advances in conversational machine translation by providing orders of magnitude more in-domain bilingual dialogue data than any previously available open-source alternative.

1. Dataset Composition and Statistics

These datasets are released as two separate corpora:

Dataset   Dialogues    Utterances   Sentences    Languages
En↔Zh     28,214,769   28,238,877   22,244,006   English-Chinese
En↔De     18,041,125   18,048,573   45,541,367   English-German

Each dialogue comprises four consecutive utterances, reflecting the contextual window preferred for chat translation models in the literature. The overall scale (approximately 28 million dialogues for English-Chinese and 18 million for English-German) places these resources several orders of magnitude above previous datasets in both total token count and coverage of conversational phenomena.
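
For illustration, a single aligned dialogue can be pictured as four consecutive source-target utterance pairs. The record below is a hypothetical sketch: the field names and example text are illustrative only and do not reflect the released file format.

```python
# Hypothetical shape of one En↔Zh dialogue: four consecutive aligned
# utterance pairs. Field names and example text are illustrative only.
dialogue = {
    "id": "en-zh-000001",
    "utterances": [
        {"en": "Where have you been?",   "zh": "你去哪儿了？"},
        {"en": "I got held up at work.", "zh": "我在公司被事情耽搁了。"},
        {"en": "You could have called.", "zh": "你本可以打个电话的。"},
        {"en": "I know. I'm sorry.",     "zh": "我知道，对不起。"},
    ],
}
```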

2. Data Acquisition and Processing Methodology

The construction pipeline is fully automated and leverages high-quality open data and robust alignment algorithms, with no reported manual post-editing. The essential stages, with a brief code sketch of the grouping and deduplication steps after the list, are:

  1. Raw Data Collection:
    • Gather raw bilingual conversational data from openly available film and TV subtitles.
  2. Sentence Alignment:
    • Employ Vecalign (Thompson & Koehn, 2019), a modern sentence alignment tool, together with LASER multilingual embeddings for enhanced alignment robustness across both language pairs.
  3. Dialogue Extraction:
    • Aggregate every four consecutive utterances into one dialogue, mirroring typical chat context window settings.
  4. Filtering and Deduplication:
    • Remove duplicate dialogues to improve data quality and minimize noise.
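
The grouping and deduplication steps can be pictured with a minimal sketch. It assumes the aligned utterance pairs arrive in document order from the alignment stage (Vecalign with LASER embeddings) and uses a sliding four-utterance window, an assumption consistent with the near-equal dialogue and utterance counts reported above; all function names are illustrative.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source utterance, target utterance)

def group_dialogues(pairs: List[Pair], window: int = 4) -> List[Tuple[Pair, ...]]:
    """Group consecutive aligned utterance pairs into fixed-size dialogues.

    Assumes `pairs` is the in-order output of sentence alignment for one
    subtitle document.
    """
    return [tuple(pairs[i:i + window]) for i in range(len(pairs) - window + 1)]

def deduplicate(dialogues: List[Tuple[Pair, ...]]) -> List[Tuple[Pair, ...]]:
    """Drop exact duplicate dialogues while preserving their original order."""
    seen = set()
    unique = []
    for d in dialogues:
        if d not in seen:
            seen.add(d)
            unique.append(d)
    return unique
```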

The result is two thoroughly cleaned, paired bilingual corpora that directly reflect in-domain (film/TV) conversational style and closely match chat interaction patterns observed in real-world applications.

3. Unique Features and Domain Coverage

Several attributes distinguish these datasets within the machine translation and dialogue resource landscape:

  • Scale: The largest public in-domain bilingual dialogue corpora (28M En↔Zh, 18M En↔De), enabling unprecedented statistical modeling for chat translation.
  • Domain Fidelity: Sourced from natural movie subtitles, the data exhibits authentic conversational structure and turn-taking.
  • Utterance Grouping: Consistent use of four-utterance sequences as dialogue contexts ensures context-preserving parallelism.
  • High-Quality Automated Pairing: Use of LASER and Vecalign ensures the alignment precision crucial for minimizing contextual drift across bilingual turns.
  • Extensive Deduplication and Filtering: Automated cleaning processes result in corpora suitable for large-scale neural training without substantial manual correction.

4. Licensing, Accessibility, and Intended Use

The datasets—and associated code—are made publicly available at https://github.com/XL2248/SML. While the main paper does not specify a precise license, researchers are advised to consult the linked repository for up-to-date terms. The resource is primarily intended for research in neural chat translation and related areas such as dialogue modeling, bilingual context learning, and conversational AI.

Given the source material, prospective users should ensure compliance with any upstream licensing of subtitle data and restrict usage to research and non-commercial investigations unless more permissive terms are clarified in the repository documentation.

5. Integration within Neural Chat Translation Frameworks

The corpora are pivotal within the scheduled multi-task learning (SML) paradigm introduced by Liang et al. (2022). The SML strategy uses a three-stage training workflow, sketched in code after the list:

  1. General Pre-training: On general parallel sentence data (e.g., WMT20 corpus), to initialize base translation capability.
  2. In-domain Pre-training: Utilizing the large paired bilingual dialogue datasets to allow models to absorb conversation-specific lexical, syntactic, and pragmatic features.
  3. In-domain Fine-tuning: Targeted adaptation using domain-specific benchmarks (BMELD, BConTrasT) combined with scheduled activation of auxiliary dialogue-coherence tasks.
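
The staged workflow can be laid out as an ordered schedule of data sources and active objectives. The structure below is a hypothetical illustration: the corpus names follow the text above, but the stage and field names are not the authors' configuration format.

```python
# Hypothetical layout of the three-stage SML workflow; the structure and
# names are illustrative only.
SML_SCHEDULE = [
    {"stage": "general_pretraining",
     "data": "WMT20 general-domain parallel sentences",
     "objectives": ["nct_translation"]},
    {"stage": "in_domain_pretraining",
     "data": "paired bilingual dialogue corpora (En↔Zh / En↔De)",
     "objectives": ["nct_translation"]},
    {"stage": "in_domain_finetuning",
     "data": "BMELD / BConTrasT benchmarks",
     "objectives": ["nct_translation", "dialogue_coherence"]},  # auxiliary tasks activated
]
```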

The scheduled multi-task objective for in-domain fine-tuning is formalized as:

$$\mathcal{J} = \mathcal{L}_{\mathrm{NCT}} + \alpha \sum_{k \in \mathcal{T}} \mathcal{L}_k$$

where $\mathcal{L}_{\mathrm{NCT}}$ denotes the main chat translation loss, $\mathcal{T}$ is the set of auxiliary tasks (e.g., coherence modeling), and $\alpha$ is a balancing factor.
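
Numerically, the objective is a straightforward weighted sum. The sketch below is a plain-Python rendering of the formula with hypothetical loss values; the actual auxiliary task set and weighting schedule follow Liang et al. (2022).

```python
def scheduled_objective(nct_loss: float, aux_losses: dict, alpha: float = 1.0) -> float:
    """Combined loss J = L_NCT + alpha * sum of auxiliary losses L_k, k in T."""
    return nct_loss + alpha * sum(aux_losses.values())

# Example with hypothetical values: main translation loss plus one
# coherence-modeling auxiliary loss, weighted by alpha = 0.5.
j = scheduled_objective(nct_loss=2.31, aux_losses={"coherence": 0.87}, alpha=0.5)
# j == 2.31 + 0.5 * 0.87 == 2.745
```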

Empirical studies and ablations consistently show that this staged approach yields marked improvements over single-phase or naïve multi-task baselines.

6. Empirical Impact and Research Significance

Extensive experiments reported by Liang et al. (2022) demonstrate that models trained with these corpora under the three-stage SML regime achieve substantial improvements. Notable performance gains are observed:

  • English–Chinese direction: +4–5 BLEU improvement over competitive methods.
  • English–German direction: +2–3 BLEU improvement.

Ablation studies attribute these improvements directly to the availability and integration of the large-scale in-domain paired datasets. Their scale and quality also facilitate advanced multi-task and schedule-based learning algorithms, including switching auxiliary objectives at appropriate stages, without catastrophic forgetting or negative transfer effects.

The release of these in-domain resources fills a longstanding gap for context-rich, parallel bilingual chat data, particularly empowering new research in dialogue-aware neural machine translation and sophisticated conversational context modeling. By supporting compositional, multi-stage, and multi-task frameworks, these datasets are expected to underpin further research advances in high-fidelity, context-coherent dialogue translation and the broader field of conversational AI.

References

  • Liang, Y., Meng, F., Xu, J., Chen, Y., & Zhou, J. (2022). Scheduled Multi-task Learning for Neural Chat Translation. In Proceedings of ACL 2022. Code and data: https://github.com/XL2248/SML
