Self-Correction Code-Switching Dataset
- The dataset is a curated resource designed to evaluate model self-correction in mixed-language mathematical reasoning and grammatical error correction.
- It employs a staged progression from English-only to mixed and finally Korean outputs to realign language-specific internal representations.
- Quantitative metrics and detailed annotation guidelines ensure robust cross-lingual performance and enhanced error recovery.
A self-correction code-switching dataset is a curated resource designed to elicit and evaluate self-correction abilities in models handling mixed-language (code-switched) text, particularly in contexts of mathematical reasoning and grammatical error correction. This dataset format is central to assessing and training multilingual and multi-modal systems to reflect, identify, and rectify errors in conversational, educational, and reasoning tasks when language boundaries are crossed within single utterances or solution traces. Such resources serve as alignment signals for internal model representations, especially when mathematical or reasoning skills are primarily developed in English but must be robustly extended to other languages such as Korean.
1. Motivation and Theoretical Underpinnings
The rationale for self-correction code-switching datasets is rooted in empirical findings and architectural analyses of LLMs. In models trained on predominantly English data, mathematical reasoning and self-correction chains are English-centric, and low-resource language prompts (e.g., Korean) fail to trigger effective chain-of-thought (CoT) behavior. Early Transformer layers tend to perform implicit translation of Korean into English. However, breakdowns occur at the intersection of translation and reasoning, especially for mathematical tasks requiring nuanced error detection and correction (Kim et al., 9 Jan 2026).
To address this, a principled curriculum is devised where solution traces expose models to English-only reasoning, incrementally code-switched (English–Korean mixed) reasoning, and target-language (Korean-only) completion. This “E → Mixed → K” progression is designed to realign language-specific neurons in early layers, providing a more direct internal translation path for reasoning (Kim et al., 9 Jan 2026). A similar motivation—addressing the shortcomings of existing grammatical error correction (GEC) systems on code-switched (CSW) learner data—underpins datasets for CSW GEC research (Potter et al., 2024, Chan et al., 2024).
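The "E → Mixed → K" progression can be sketched as a minimal stage schedule. The three-stage split and linear ramp below are illustrative assumptions for exposition, not the schedule reported by Kim et al.:

```python
def korean_fraction(stage: int, n_stages: int = 3) -> float:
    """Illustrative E -> Mixed -> K schedule: target fraction of the
    solution trace written in Korean at each curriculum stage.
    Stage 0 is English-only (0.0), the final stage is Korean-only (1.0),
    and intermediate stages ramp linearly (an assumed interpolation)."""
    if n_stages < 2:
        raise ValueError("need at least two stages")
    return stage / (n_stages - 1)
```

Under this sketch, a three-stage curriculum yields fractions 0.0, 0.5, and 1.0, matching the English-only, mixed, and Korean-only phases described above.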
2. Construction Methodology and Annotation Guidelines
Data Sourcing and Generation
In the multilingual mathematical reasoning context, the primary data sources include MathDial (multilingual math dialog tutoring corpus), GSM8K, MATH, and Omni-MATH datasets. Self-corrected solution traces are produced via structured GPT-4 prompting, ensuring a three-stage output:
- Stage 1: English-only error detection and reflection.
- Stage 2: Gradual mixing of English and Korean, shifting towards Korean grammar and lexis.
- Stage 3: Korean-only final answer and justification.
Prompts explicitly require minimal faithful backtracking—identifying the earliest point of mathematical error, correcting it, then continuing to a fully correct solution (Kim et al., 9 Jan 2026).
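A minimal sketch of how one staged trace might be stored; the field names are a hypothetical record layout, not the dataset's actual schema:

```python
from dataclasses import dataclass


@dataclass
class SelfCorrectionTrace:
    """Hypothetical record for one staged self-correction example.
    Field names are illustrative, not the dataset's published schema."""
    problem_en: str       # English problem statement
    problem_ko: str       # Korean problem statement
    wrong_prefix: str     # incorrect solution up to the first error
    stage1_english: str   # English-only error detection and reflection
    stage2_mixed: str     # English-Korean mixed correction
    stage3_korean: str    # Korean-only final answer and justification

    def full_trace(self) -> str:
        """Concatenate the errorful prefix and the three correction
        stages into a single training-ready trace."""
        return "\n".join([self.wrong_prefix, self.stage1_english,
                          self.stage2_mixed, self.stage3_korean])
```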
For GEC, datasets combine a small set of authentic learner corrections (from the Lang-8 corpus, >5,800 pairs) with synthetic code-switched errorful text generated through:
- Translation-based subtree injection.
- Parallel-corpus alignments using models like AWESOME-align, to swap syntactic subtrees cross-lingually.
- LLM-prompted generation matched to authentic code-mixing patterns.

Artificial errors are injected either through rule-based perturbation (PIE-Synthetic) or via back-translation with error-generating models such as Rev-GECToR (Potter et al., 2024, Chan et al., 2024).
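As a toy illustration of the simplest of these strategies, the sketch below performs a one-token lexicon-based switch and a rule-based determiner deletion. Real pipelines rely on AWESOME-align subtree swaps and PIE-style rule sets, so the lexicon and function names here are simplified assumptions:

```python
import random

# Toy bilingual lexicon; real pipelines align parallel corpora instead.
EN_KO = {"answer": "정답", "mistake": "실수", "problem": "문제"}


def inject_code_switch(tokens, lexicon=EN_KO, rng=None):
    """Replace one lexicon word with its Korean counterpart
    (a one-token switch) -- a minimal stand-in for subtree injection."""
    rng = rng or random.Random(0)
    candidates = [i for i, t in enumerate(tokens) if t.lower() in lexicon]
    if not candidates:
        return list(tokens)
    i = rng.choice(candidates)
    out = list(tokens)
    out[i] = lexicon[out[i].lower()]
    return out


def inject_article_error(tokens):
    """Rule-based error injection in the spirit of PIE-style rules:
    drop the first determiner, creating a DET error for GEC training."""
    for i, t in enumerate(tokens):
        if t.lower() in {"a", "an", "the"}:
            return tokens[:i] + tokens[i + 1:]
    return list(tokens)
```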
Annotation Practices
- Mathematical code-switching traces: Annotators ensure staged language use, explicit error correction, and linguistic transition at the sentence or sub-sentence level.
- GEC: Error types are tagged according to the ERRANT taxonomy (e.g., DET, NOUN, PRON, PUNCT), and non-English tokens are marked as “no-correction needed.” Gold token-level edits are recorded, especially for manually validated Lang-8 data.
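Gold token-level edits of this kind can be represented as simple span records. The layout below is an illustrative sketch, not ERRANT's actual M2 annotation format:

```python
from dataclasses import dataclass


@dataclass
class TokenEdit:
    """One gold token-level edit with an ERRANT-style error type.
    Field names are illustrative, not ERRANT's file format."""
    start: int        # index of first source token affected
    end: int          # index one past the last affected source token
    replacement: str  # corrected text ("" for deletions)
    error_type: str   # e.g. "DET", "NOUN", "PRON", "PUNCT"


def apply_edits(tokens, edits):
    """Apply non-overlapping edits right-to-left so earlier
    indices remain valid while splicing."""
    out = list(tokens)
    for e in sorted(edits, key=lambda e: e.start, reverse=True):
        out[e.start:e.end] = e.replacement.split() if e.replacement else []
    return out
```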
3. Dataset Composition and Quantitative Statistics
Self-Correction Code-Switching Datasets (Mathematical)
- Size: Approximately 1,000 examples.
- Domain Stratification:
- 40% grade-school (GSM8K)
- 45% middle-school/competition (MATH, Omni-MATH)
- 15% Olympiad-level
- Difficulty Levels: Spanning basic arithmetic to Olympiad, naturally stratified by MathDial content.
- Example Structure:
- Bilingual problem statement.
- Incorrect solution up to the first error.
- Three-stage correction: English → Mixed → Korean (Kim et al., 9 Jan 2026).
Code-Switching GEC Datasets
- Authentic CSW Data: 5,875 sentence pairs, spanning English–Japanese (82%), English–Korean (13%), and other language pairs (0.1–3% each).
- Synthetic CSW Data: Rules and back-translation yield ~88,000 sentence pairs; LLM prompting (GPT-3.5) contributes 73,293 synthetic sentences, closely matching code-mixing patterns of authentic data (CMI, I-Index, CF metrics).
- Total: ~94,000 sentence pairs per major code-switched language (EN–KO, EN–ZH, EN–JA) (Potter et al., 2024, Chan et al., 2024).
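The code-mixing statistics used to match synthetic and authentic data can be computed from per-token language tags. The sketch below implements a simplified CMI (the commonly used formula additionally discounts language-independent tokens, omitted here for brevity) and the I-Index as a per-word switch rate:

```python
def cmi(lang_tags):
    """Simplified Code-Mixing Index: fraction of tokens not in the
    dominant language of the utterance."""
    if not lang_tags:
        return 0.0
    n = len(lang_tags)
    dominant = max(lang_tags.count(lang) for lang in set(lang_tags))
    return (n - dominant) / n


def i_index(lang_tags):
    """I-Index: proportion of adjacent token pairs whose language
    differs, i.e. the per-word switch rate."""
    if len(lang_tags) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(lang_tags, lang_tags[1:]))
    return switches / (len(lang_tags) - 1)
```

For example, a four-token utterance with one Korean token has CMI 0.25, and a sequence that switches language twice across three adjacent pairs has I-Index 2/3.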
4. Structural Example and Coding Schema
An example entry from the Korean mathematical self-correction dataset:
- Problem: “Bruno wants to buy two and one-half dozens of pens. How many pens will he have?”
- Incorrect solution: “Two dozens = 2 × 12 = 24; … 24 + 18 = 42 pens.”
- Correction trace:
- English only: “Wait… I was treating ‘two and one-half’ as separate 2 + 1.5 dozens.”
- Mixed: “That was my mistake — 해석이 잘못되었네요. I should multiply 2.5 by 12.” (Korean: “my interpretation was wrong.”)
- Korean only: “12개씩이므로 2.5 × 12 = 30개가 정답입니다.” (“Since there are 12 per dozen, 2.5 × 12 = 30 pens is the correct answer.”) (Kim et al., 9 Jan 2026)
For GEC data, code-switching typically involves a one-token switch or a single syntactic subtree, with the error position and correction annotated at the token level. These are leveraged in sequence-tagging architectures (e.g., GECToR with XLM-RoBERTa or RoBERTa base), trained and evaluated according to the ERRANT metric.
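A toy version of the tag-derivation step for such sequence taggers, assuming same-length source/target pairs; real GECToR additionally uses $DELETE/$APPEND tags and iterative decoding:

```python
def gector_tags(src, tgt, is_english):
    """Toy GECToR-style tags for same-length source/target pairs:
    $KEEP when tokens match, $REPLACE_<tok> otherwise. Non-English
    tokens are forced to $KEEP, mirroring the 'no-correction needed'
    convention for code-switched spans described above."""
    assert len(src) == len(tgt) == len(is_english)
    tags = []
    for s, t, en in zip(src, tgt, is_english):
        if not en or s == t:
            tags.append("$KEEP")
        else:
            tags.append(f"$REPLACE_{t}")
    return tags
```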
5. Quantitative Metrics and Evaluation Protocols
Mathematical Reasoning Datasets
- Generation Difficulty: quantified with two cross-entropy-based measures, the Conditioned Answer Score (CAS) and the Direct Answer Score (DAS).
- Higher cross-entropy is observed for Korean than for English, reflecting the increased difficulty of self-correction in the target language (Kim et al., 9 Jan 2026).
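Both scores are cross-entropy-style quantities over the gold answer tokens. The helper below is a generic stand-in; the exact conditioning distinguishing CAS from DAS (answer given a solution trace vs. given the problem alone) follows the paper, not this sketch:

```python
import math


def answer_cross_entropy(token_logprobs):
    """Mean negative log-probability of the gold answer tokens under
    the model -- a generic cross-entropy difficulty score in the style
    of CAS/DAS. Input: one log-probability per answer token."""
    if not token_logprobs:
        raise ValueError("empty answer")
    return -sum(token_logprobs) / len(token_logprobs)
```

Higher values indicate the model finds the answer harder to generate, which is the pattern reported for Korean relative to English.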
Neuron importance scores for alignment determine which parameters are updated during neuron-specific tuning (see Section 6).
- RL Reward Schedules:
- Outcome-based: the reward depends on whether the final answer is correct.
- Format-based: a reward is granted when explicit self-correction markers are used, and withheld otherwise.
- KL-constrained Group Relative Policy Optimization (GRPO) samples multiple candidate completions per input for each policy update (Kim et al., 9 Jan 2026).

GEC Datasets
- Core Metrics:
- CMI (Code-Mixing Index): measures the proportion of non-English content.
- I-Index: counts per-word language switches.
- F0.5 (ERRANT): precision-weighted edit-quality metric, cross-validated on both monolingual and genuine CSW test sets.
- Performance Benchmarks:
- Synthetic CSW data plus inference tweaks improve over the monolingual GECToR baseline on CSW test sets.
- Ablation: removing synthetic data reduces scores by 5–8% (Potter et al., 2024, Chan et al., 2024).

6. Downstream Integration: Model Fine-Tuning and RL

Mathematical Reasoning
- Fine-tuning:
- Continual pretraining on self-correction code-switching traces, typically for three epochs with a warm-up ratio of 0.01.
- LoRA adapters are applied to all layers; progression over the code-switching phases is key to capturing internal translation and alignment.
- Direct Preference Optimization (DPO) uses pairwise preference training between English CoT and code-switched self-correction trajectories.
- Neuron-specific tuning updates only the top 1% of importance-ranked neurons, freezing all others.
- Reinforcement Learning:
- After neuron-level fine-tuning, GRPO reinforcement learning is conducted, leveraging explicit reward signals for correction format and correctness.
- This curriculum raises measured self-correction rates from ∼6% to ∼70% on competitive Korean mathematical tasks (Kim et al., 9 Jan 2026).

GEC
- Multi-stage Curriculum:
- Pre-train on distilled monolingual data plus a small synthetic PIE-CSW set.
- Continue on large mixed GEC corpora plus all synthetic CSW sets.
- Fine-tune on high-quality, hand-validated, and sampled genuine CSW data, with final validation via inference-time hyperparameter tuning of edit confidence thresholds.
- Gains are preserved on both monolingual and code-switched evaluation sets, with error-type breakdowns revealing notable improvements in the noun, pronoun, word-order, and punctuation categories (Potter et al., 2024, Chan et al., 2024).

7. Impact, Generalization, and Availability

Self-correction code-switching datasets have proved crucial for aligning LLM reasoning abilities across language boundaries, especially for low-resource targets such as Korean mathematical and logical reasoning tasks. Their design principles highlight the importance of modeling internal translation pathways, staged code-switching for neuron-level alignment, and explicit labeling of self-correction behavior.

Cross-lingual generalization is strong: models trained on one English–X code-switched pairing (e.g., EN–ZH) generalize well to others (EN–KO, EN–JA), with performance on monolingual corpora unaffected (Chan et al., 2024). In CSW GEC, artificial data generation methodologies have been shown to replicate the code-mixing properties of genuine learner output, and their use is indispensable given the limited availability of authentic CSW error-correction data (Potter et al., 2024).

Resource Availability: For GEC, code, generation pipelines, and human-reannotated splits are available at https://github.com/kelvinchanwh/csw-gector (Chan et al., 2024). For mathematical reasoning, detailed curriculum prompts, filtering protocols, and ablation analyses are provided in the respective paper appendices (Kim et al., 9 Jan 2026).

Self-correction code-switching datasets thus serve as high-signal, curriculum-aligned benchmarks and training resources for robust, multilingual error recovery and mathematical reasoning.
Their impact is observable in rapid performance gains in self-correction, elevated cross-lingual transfer, and scalable benchmarks for aligning model internals across language and modality boundaries.