Bilingual English-Chinese Chapter Dataset
- The dataset is built using multi-stage annotation pipelines that combine heuristics, crowd-sourced reviews, and large language model prompts to ensure high precision.
- It encompasses both literary and multimodal sources with varied alignment granularities, including chapter and paragraph-level units for diverse research applications.
- The resource drives progress in cross-lingual NLP by supporting tasks such as literary translation, machine reading comprehension, and multimodal chapter segmentation.
A bilingual English–Chinese chapter dataset refers to any systematically curated resource in which chapter-level (or subchapter-level, e.g., paragraph) units from works in English and Chinese are aligned and annotated, supporting a range of cross-lingual and multilingual NLP and multimodal tasks. Such datasets serve as foundational benchmarks for research into literary translation, machine reading comprehension, video chaptering, and discourse-aware modeling across languages. Recent public datasets span text-only literary corpora and multimodal video datasets, with a diversity of alignment granularities and annotation structures.
1. Dataset Types and Alignment Granularities
Bilingual English–Chinese chapter datasets can be broadly grouped by domain and alignment unit:
- Textual literary datasets: Collect chapters or paragraphs from novels with professional or crowdsourced human alignment. Example: JAM corpus, aligned at the chapter level, and BiPaR, aligned at the paragraph level (Jing et al., 2019, Jin et al., 12 Jul 2024).
- Multimodal video datasets: Annotate chapter boundaries and summaries in both English and Chinese for long-form videos, synchronizing textual, audio, and visual information. Example: VidAtlas dataset in ARC-Chapter (Pu et al., 18 Nov 2025).
Alignment may be performed at chapters, paragraphs, or sentences. While full sentence-level alignment is rare in literary translation due to authentic translation behaviors (insertions, deletions, splits), both chapter-level (e.g., JAM) and paragraph-level (e.g., BiPaR) alignments are practical and commonly used.
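The alignment units just described can be captured in one container type regardless of granularity; the sketch below is illustrative only, with field names invented for exposition rather than taken from any released schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlignedUnit:
    """One English–Chinese aligned unit at a chosen granularity."""
    unit_id: str          # e.g. "BOOKID_CH01" for a chapter-level pair
    granularity: str      # "chapter" | "paragraph" | "sentence"
    en_text: str
    zh_text: str
    # Multimodal datasets additionally carry temporal boundaries (seconds).
    start_s: Optional[float] = None
    end_s: Optional[float] = None

# A chapter-level pair (JAM-style) and a paragraph-level pair (BiPaR-style)
# share the same container; only the granularity field differs.
ch = AlignedUnit("BOOK001_CH01", "chapter", "Chapter one ...", "第一章……")
para = AlignedUnit("novel3_p17", "paragraph", "He drew his sword.", "他拔出了剑。")
```

Video datasets would populate the optional temporal fields, while text-only corpora leave them unset.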
| Dataset | Domain | Alignment Unit | Annotation Type |
|---|---|---|---|
| JAM | Literature | Chapter | Parallel text |
| BiPaR | Literature | Paragraph | MRC triples (QA) |
| VidAtlas | Video | Chapter | Titles, summaries, video |
2. Data Collection and Annotation Methodologies
Precision and comprehensiveness in annotation are achieved through a multi-stage pipeline:
- BiPaR: Paragraphs from six bilingual novels are pre-selected and filtered via heuristics (e.g., length mismatch, genre avoidance). 150 bilingual crowd annotators generate ≥3 question–answer pairs per parallel paragraph in both languages. Strict quality control follows: 30% of each annotator's work is reviewed by three bilingual reviewers, with secondary sampling checked by a domain expert. Iterative revision ensures >95% accuracy, ultimately yielding 14,668 high-quality parallel QA pairs (Jing et al., 2019).
- JAM: Chapters from 160 English novels (with professional Chinese translations) are extracted and aligned via automated detection of chapter breaks, with manual verification of first and last paragraphs to ensure alignment. Heuristic filtering removes chapter pairs with token-length ratios above 3.0. No attempt is made at sentence-level alignment, preserving the translation's authentic literary structure (Jin et al., 12 Jul 2024).
- VidAtlas: Multimodal chapter annotation is achieved via a semi-automatic three-stage process: (1) merging ASR transcripts and visual captions/OCR results into a unified multimodal transcript, (2) prompting an LLM to generate structured chapter boundaries, titles, abstracts, and introductions grounded in uploader-provided markers, and (3) generating all annotation fields simultaneously in English and Chinese using a bilingual prompt template, ensuring 1:1 alignment at both the sentence and chapter level (Pu et al., 18 Nov 2025).
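The JAM-style heuristic filter (dropping chapter pairs whose token-length ratio exceeds 3.0) can be sketched as follows; the tokenization here is a crude whitespace/character approximation for illustration, not the tokenizer used in the paper:

```python
def token_count(text: str, lang: str) -> int:
    """Crude token count: whitespace tokens for English, characters for Chinese."""
    if lang == "zh":
        return len([c for c in text if not c.isspace()])
    return len(text.split())

def keep_pair(en_chapter: str, zh_chapter: str, max_ratio: float = 3.0) -> bool:
    """Keep a chapter pair only if its token-length ratio stays within bounds."""
    en_len = token_count(en_chapter, "en")
    zh_len = token_count(zh_chapter, "zh")
    if min(en_len, zh_len) == 0:
        return False
    return max(en_len, zh_len) / min(en_len, zh_len) <= max_ratio

pairs = [
    ("A short chapter about rain.", "一段关于雨的短章。"),  # ratio ≈1.8 → kept
    ("word " * 400, "短"),                                   # pathological mismatch → dropped
]
filtered = [p for p in pairs if keep_pair(*p)]
```

Such a ratio check is deliberately coarse: it removes gross extraction errors (missing or merged chapters) while leaving authentic translation-length variation untouched.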
3. Dataset Structure, Statistics, and Annotations
Datasets vary in size, structure, and types of annotated fields.
- JAM Corpus Structure: Each split contains files per chapter and language, e.g., `BOOKID_CH01.en`, `BOOKID_CH01.zh`. Metadata encodes book and chapter ID, language, and split. The corpus contains 5,373 aligned chapter pairs (4,484 train; 546 valid; 343 test) from 160 books. English: 10.4M tokens, 548.5K sentences; Chinese: 11.9M tokens, 700.9K sentences (Jin et al., 12 Jul 2024).
- BiPaR Format: Each JSONL entry includes passage, question, and answer fields in both English and Chinese, with answer positions as contiguous spans (SQuAD style). 3,667 parallel paragraphs with 14,668 total parallel QA pairs; mean passage length ≈227 English tokens or ≈198 Chinese tokens (Jing et al., 2019).
- VidAtlas Video Chapters: 410,000+ videos, 115,000+ total hours, ≈2.25 million chapters per language across 16 high-level categories. Each chapter contains a short title (≤8 words), structural chapter information (title, abstract, introduction), and fine-grained event descriptions, all paired EN/CN and temporally aligned with video segments. Annotation structure is hierarchical (short title ⊂ title ⊂ abstract ⊂ introduction ⊂ event description), with strict adherence to temporal boundaries (Pu et al., 18 Nov 2025).
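BiPaR's SQuAD-style span annotation can be validated mechanically: the answer text must occur at the recorded character offset in its passage. The sketch below assumes a hypothetical field layout (`passage`/`question`/`answer` keys per language); the released files may use different key names:

```python
import json

# One illustrative bilingual entry, serialized as a JSONL record would be.
record = json.dumps({
    "en": {"passage": "He drew his sword and charged.",
           "question": "What did he draw?",
           "answer": {"text": "his sword", "start": 8}},
    "zh": {"passage": "他拔出剑冲了上去。",
           "question": "他拔出了什么?",
           "answer": {"text": "剑", "start": 3}},
})

def check_span(side: dict) -> bool:
    """Verify the answer is a contiguous span at the stored character offset."""
    ans = side["answer"]
    s = ans["start"]
    return side["passage"][s:s + len(ans["text"])] == ans["text"]

entry = json.loads(record)
ok = all(check_span(entry[lang]) for lang in ("en", "zh"))
```

A check of this kind is a cheap guard when re-tokenizing or re-encoding the corpus, since offset drift silently corrupts span-based QA supervision.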
| Dataset | #Aligned Units | Tokens (EN/ZH) | QA Pairs | Domains |
|---|---|---|---|---|
| JAM | 5,373 chapters | 10.4M / 11.9M | – | Literary novels |
| BiPaR | 3,667 paragraphs | ≈227 / ≈198 (mean per paragraph) | 14,668 | Novel passages (QA) |
| VidAtlas | ≈2.25M chapters per language | n/a (video text) | – | Multimedia (16 categories) |
4. Evaluation Protocols and Baseline Performance
Evaluation protocols use both automatic metrics and human–machine comparisons.
- BiPaR: Exact match (EM) and F1, both computed at the token level per the SQuAD paradigm. EM is the proportion of exact answer-string matches; F1 is the harmonic mean of span-level precision and recall. Baseline models include DrQA, BERT_base (EN, ZH), and BERT_large (EN only). The strongest baseline, BERT_large, achieves EM ≈42.5% and F1 ≈56.5% (EN), against a human F1 of ≈91.9%. For Chinese, BERT_base reaches EM ≈48.9% and F1 ≈64.1%, against a human F1 of ≈92.1%. Multilingual BERT performs similarly to the monolingual baselines. Cross-lingual QA approaches degrade further (e.g., F1 ≈53.3% for Q_zh → P_en; F1 ≈46.4% for Q_en → P_zh), while alignment-based answer projection yields much lower F1 (≲24%) (Jing et al., 2019).
- JAM: BLEU (Papineni et al., 2002), COMET, and BlonDe (with sub-scores for pronouns, entities, tense, discourse markers) are reported. The Ch2Ch setting introduces the challenge of context-aware translation across entire chapters that present intricate discourse phenomena and structural misalignments (Jin et al., 12 Jul 2024).
- VidAtlas: Evaluated by F1, SODA, and the GRACE metric, which combines temporal overlap (IoU) with BERTScore semantic similarity under a many-to-one segment matching scheme. This scheme better reflects real-world structural flexibility than strict one-to-one matching (Pu et al., 18 Nov 2025).
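The SQuAD-style EM and token-level F1 used for BiPaR can be reproduced in a few lines. This sketch omits the official answer-normalization steps (punctuation stripping, article removal), so its scores may differ slightly from the released evaluation script:

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    """EM: case-insensitive string equality after trimming whitespace."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    p_toks, g_toks = pred.lower().split(), gold.lower().split()
    common = Counter(p_toks) & Counter(g_toks)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

score = token_f1("drew his sword", "his sword")  # P = 2/3, R = 1 → F1 = 0.8
```

For Chinese, the whitespace `split()` would be replaced by character- or word-level segmentation, which is one reason EN and ZH F1 values are not directly comparable.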
5. Licensing, Distribution, and Access
- JAM: Released for research use; users must confirm the public-domain status of titles before any commercial application. Distributed as plain-text files (`train/`, `valid/`, `test/`). Access via GitHub, HuggingFace (pending), or email request (Jin et al., 12 Jul 2024).
- BiPaR: CC BY-NC or similar research license, available at https://multinlp.github.io/BiPaR/. Evaluation scripts and a basic reader are provided (Jing et al., 2019).
- VidAtlas: No explicit license indicated; context suggests availability for academic research. Data structure follows the train/val/test split, each containing thousands of videos and millions of chapter entries, with each annotation present in both English and Chinese (Pu et al., 18 Nov 2025).
6. Research Applications and Significance
Bilingual English–Chinese chapter datasets are enabling resources for several research paradigms:
- Multilingual and Cross-lingual Machine Reading Comprehension: BiPaR facilitates QA modeling spanning languages and direct benchmarking of cross-lingual transfer performance in novelistic contexts (Jing et al., 2019).
- Context-aware Literary Translation: JAM supports models trained on long-range context, necessitating advanced handling of discourse, anaphora, and authentic translation misalignments. The Ch2Ch translation task complicates evaluation and provides a more realistic scenario than rigid sentence-level alignment (Jin et al., 12 Jul 2024).
- Hierarchical and Multimodal Content Structuring: VidAtlas powers research into chapter-based navigation and summarization of long videos, featuring temporally grounded, multi-level, and bilingual annotations—essential for robust models in multimedia and educational domains (Pu et al., 18 Nov 2025).
- Transfer and Pretraining: Parallel chapters or QA data assist in pretraining or fine-tuning models for low-resource languages or domains, providing fine-grained supervision unavailable in monolingual datasets.
- Discourse Phenomena and Long-context Modeling: These datasets enable the study of cross-lingual treatment of discourse markers, coreference, causality, and narrative flow, which is unavailable in sentence-level corpora.
A plausible implication is that ongoing expansion and curation of such datasets will further drive the development of robust, highly context-sensitive multilingual NLP and multimodal systems.
7. Current Limitations and Future Directions
Persistent limitations include:
- Alignment granularity and translation fidelity: Even with manual verification, sentence-level misalignments or fluid translation remain prevalent, particularly in authentic literary corpora (evidenced by the JAM sample where 18/50 paragraphs showed sentence misalignment) (Jin et al., 12 Jul 2024).
- Domain coverage: Existing literary chapter datasets are dominated by classic novels and do not cover technical, scientific, or informal/online genres.
- Evaluation challenges: Standard metrics (e.g., BLEU, SQuAD-style EM/F1) may fail to adequately capture the complexity and flexibility required by human translation, summary, or QA—necessitating newer, structure- and semantics-aware metrics like GRACE (Pu et al., 18 Nov 2025).
- Scaling to other resource pairs: While English–Chinese data is growing, similar resources for other language pairs are sparse.
Future directions likely include larger, more diverse collections spanning additional genres and modalities, improved alignment and annotation pipelines leveraging LLMs, and the development of universally robust, context- and discourse-aware evaluation metrics.