MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization (2401.06838v3)

Published 12 Jan 2024 in cs.CL

Abstract: Though reasoning abilities are considered language-agnostic, existing LLMs exhibit inconsistent reasoning abilities across different languages, e.g., reasoning in the dominant language like English is superior to other languages due to the imbalance of multilingual training data. To enhance reasoning abilities in non-dominant languages, we propose a Multilingual-Alignment-as-Preference Optimization framework (MAPO), aiming to align the reasoning processes in other languages with the dominant language. Specifically, we harness an off-the-shelf translation model for the consistency between answers in non-dominant and dominant languages, which we adopt as the preference for optimization, e.g., Direct Preference Optimization (DPO) or Proximal Policy Optimization (PPO). Experiments show that MAPO stably achieves significant improvements in the multilingual reasoning of various models on all three benchmarks (MSVAMP +16.2%, MGSM +6.1%, and MNumGLUESub +13.3%), with improved reasoning consistency across languages.

The paper "MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization" (She et al., 12 Jan 2024 ) addresses the observed disparity in reasoning capabilities of LLMs across different languages. While reasoning is often considered language-agnostic, LLMs frequently demonstrate superior performance in high-resource languages like English compared to lower-resource languages, attributed largely to imbalances in multilingual training corpora. The work introduces the Multilingual-Alignment-as-Preference Optimization (MAPO) framework designed to mitigate this gap by aligning the reasoning processes in non-dominant languages with those generated in a dominant language (English).

MAPO Framework Methodology

The MAPO framework operates by generating a preference signal based on the alignment between reasoning chains produced in a non-dominant language and a corresponding chain in the dominant language for the same input query. This preference signal is then used to fine-tune the LLM using standard preference optimization algorithms like Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO). The core idea is that a non-dominant language reasoning process that is semantically closer or more "translatable" to the dominant language reasoning process (assumed to be more reliable) is preferred.
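
As a high-level illustration, one MAPO round can be sketched as the loop below. This is a minimal sketch, not the authors' implementation: the callables `policy_sample`, `dominant_generate`, `align_score`, and `preference_optimize` are hypothetical placeholders for the components described in the following subsections.

```python
# High-level sketch of one MAPO round; all callables are hypothetical placeholders.
def mapo_round(questions, policy_sample, dominant_generate, align_score,
               preference_optimize, k=4):
    """Sample reasoning chains, score cross-lingual alignment, run a preference update."""
    preference_pairs = []
    for x in questions:
        y_bar = dominant_generate(x)          # dominant-language (English) reasoning chain
        candidates = policy_sample(x, k)      # k sampled chains in the target language
        # align_score is assumed to return a larger value for better-aligned chains
        ranked = sorted(candidates, key=lambda y: align_score(y, y_bar), reverse=True)
        preference_pairs.append((x, ranked[0], ranked[-1]))  # (winner, loser) by alignment
    return preference_optimize(preference_pairs)  # e.g., a DPO update on these pairs
```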

Preference Estimation via Multilingual Alignment

  1. Data Generation: For a given input question $x$ (e.g., a mathematical problem), the LLM policy $\pi_\theta$ (initially an SFT model) generates a reasoning process $\bar{Y}$ in the dominant language (English) and multiple reasoning process samples $\{Y_1, ..., Y_k\}$ in a non-dominant target language.
  2. Alignment Scoring: An off-the-shelf multilingual translation model $M_{trans}$ is employed to assess the alignment between each non-dominant language reasoning process $Y_i$ and the dominant language reasoning process $\bar{Y}$. The alignment score is derived from the translation likelihood (equivalently, the perplexity) of translating $Y_i$ into $\bar{Y}$, i.e., $P_{M_{trans}}(\bar{Y} \mid Y_i)$. A higher probability (lower perplexity) signifies better alignment and is interpreted as higher preference.
  3. Preference Data Formulation (a minimal code sketch of steps 2-3 follows this list):
    • For PPO, the alignment score $P_{M_{trans}}(\bar{Y} \mid Y_i)$ serves directly as the reward $r(x, Y_i)$ for generating $Y_i$.
    • For DPO, the sampled non-dominant reasoning processes $\{Y_1, ..., Y_k\}$ are ranked by their alignment scores. Pairs $(Y_w, Y_l)$ are constructed where $Y_w$ has a higher alignment score (winner) than $Y_l$ (loser), forming the preference dataset $D = \{(x, Y_w, Y_l)\}$.
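
The scoring and pair construction above might look as follows with an off-the-shelf NLLB checkpoint from Hugging Face. This is a minimal sketch under assumptions: the checkpoint id `facebook/nllb-200-distilled-600M`, FLORES-200 language codes (Simplified Chinese to English shown as an example), and perplexity-based scoring; the authors' exact implementation may differ.

```python
# Minimal sketch of alignment scoring (step 2) and DPO pair construction (step 3).
# Assumes the Hugging Face checkpoint "facebook/nllb-200-distilled-600M" and
# FLORES-200 language codes; not the authors' exact implementation.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME, src_lang="zho_Hans", tgt_lang="eng_Latn"  # example language pair
)
trans_model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def alignment_perplexity(y_i: str, y_bar: str) -> float:
    """Perplexity of translating the non-dominant chain y_i into the English chain y_bar.
    Lower perplexity corresponds to higher P(y_bar | y_i), i.e., better alignment."""
    enc = tokenizer(y_i, return_tensors="pt", truncation=True)
    labels = tokenizer(text_target=y_bar, return_tensors="pt", truncation=True).input_ids
    # .loss is the mean per-token negative log-likelihood of y_bar given y_i
    nll = trans_model(**enc, labels=labels).loss
    return torch.exp(nll).item()

def build_dpo_pair(question: str, y_bar: str, candidates: list) -> tuple:
    """Rank sampled chains Y_1..Y_k by alignment and return one (winner, loser) pair."""
    ranked = sorted(candidates, key=lambda y: alignment_perplexity(y, y_bar))
    return (question, ranked[0], ranked[-1])  # best-aligned vs. worst-aligned chain
```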

Preference Optimization

The estimated preferences are used to fine-tune the LLM $\pi_\theta$:

  1. Using PPO: The objective is to maximize the expected reward obtained from the alignment score, while regularizing the policy shift using a KL divergence penalty against the initial SFT policy $\pi_{SFT}$:

    $$L_{PPO} = \mathbb{E}_{(x,Y) \sim \pi_\theta}\big[r(x,Y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(Y \mid x) \,\|\, \pi_{SFT}(Y \mid x)\big)$$

    Here, $r(x,Y) = P_{M_{trans}}(\bar{Y} \mid Y)$ is the alignment-based reward.

  2. Using DPO: The DPO loss function directly optimizes the policy $\pi_\theta$ to increase the likelihood of preferred sequences $Y_w$ over dispreferred sequences $Y_l$, relative to a reference policy $\pi_{ref}$ (typically the frozen SFT model):

    $$L_{DPO} = -\mathbb{E}_{(x, Y_w, Y_l) \sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(Y_w \mid x)}{\pi_{ref}(Y_w \mid x)} - \beta \log \frac{\pi_\theta(Y_l \mid x)}{\pi_{ref}(Y_l \mid x)} \right) \right]$$

    where $\sigma$ is the sigmoid function and $\beta$ is a hyperparameter controlling the deviation from the reference policy. A minimal code sketch of these objectives follows this list.

  3. Iterative DPO (iDPO): The paper also explores an iterative application of DPO, where the model optimized in one iteration ($\pi_{\theta_i}$) is used as the sampling policy to generate data for the next DPO iteration ($\pi_{\theta_{i+1}}$), potentially leading to progressive refinement.
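
The two objectives can be sketched in PyTorch as below. This is a minimal sketch, assuming sequence-level log-probabilities summed over response tokens; the KL-penalized reward shown for the PPO variant is a common RLHF-style shaping, not necessarily the paper's exact formulation.

```python
# Hedged PyTorch sketch of the DPO loss and a KL-penalized PPO-style reward.
# Sequence log-probs are assumed to be summed over response tokens; the exact
# reward shaping and beta values are assumptions, not the paper's settings.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """L_DPO = -E[ log sigma( beta*log-ratio(winner) - beta*log-ratio(loser) ) ]."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

def ppo_reward(align_prob, policy_logp, sft_logp, beta: float = 0.1):
    """Sequence reward: alignment probability P(Ybar|Y) minus a KL penalty estimate
    against the SFT policy (a standard RLHF-style shaping, assumed here)."""
    kl_estimate = policy_logp - sft_logp  # per-sequence log-ratio
    return align_prob - beta * kl_estimate
```

With batched tensors of shape `(batch,)`, `dpo_loss` returns a scalar suitable for a standard backward pass; the PPO-style reward would instead be passed to a PPO trainer as the terminal reward for each sampled sequence.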

Experimental Setup

  • Base Models: The experiments utilized LLaMA2 (7B, 13B) and Mistral (7B) models, specifically versions already fine-tuned for multilingual mathematical reasoning (MathOctopus, MetaMathOctopus, MistralMathOctopus) via supervised fine-tuning (SFT) on translated reasoning data (MGSM8KInstruct).
  • MAPO Training Data: Preference data for MAPO was generated using mathematical problems from a subset of NumGLUE (tasks 1, 4, 8), translated into 9 non-English languages (part of MNumGLUESub). This dataset was distinct from the initial SFT dataset.
  • Alignment Model: The NLLB-600M-distilled model was used as the default $M_{trans}$ for calculating alignment scores.
  • Evaluation Benchmarks: Performance was evaluated on three multilingual mathematical reasoning datasets covering 10 languages (English + 9 non-English):
    • MSVAMP: Out-of-domain multi-step arithmetic word problems.
    • MGSM: Multi-step grade school math problems (in-domain relative to SFT).
    • MNumGLUESub: Numerical reasoning problems (in-domain relative to MAPO preference data generation).
  • Metrics:
    • Accuracy: Percentage of correctly solved problems.
    • PPL-based Alignment Score: Average perplexity assigned by the NLLB model between non-English and English reasoning chains (lower is better), measuring reasoning process consistency.
    • Answer Consistency Ratio (ACR): Jaccard index between the sets of problems solved correctly in English versus a non-English language, measuring answer consistency (a one-line sketch follows this list).
  • Baselines: Performance was compared against the base SFT models and m-RFT (Rejection sampling Fine-Tuning based on final answer correctness).
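
For concreteness, the ACR above reduces to a plain Jaccard index over the sets of correctly solved problem ids; a one-line sketch, with set inputs assumed:

```python
def answer_consistency_ratio(solved_en: set, solved_xx: set) -> float:
    """Jaccard index of problems solved correctly in English vs. language xx."""
    union = solved_en | solved_xx
    return len(solved_en & solved_xx) / len(union) if union else 0.0
```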

Results and Findings

MAPO demonstrated significant improvements in multilingual reasoning performance across various base models and benchmarks:

  • Accuracy Gains: On average across 9 non-English languages, MAPO applied to MathOctopus 7B yielded substantial improvements:
    • +16.2% on MSVAMP (out-of-domain)
    • +6.1% on MGSM (in-domain SFT)
    • +13.3% on MNumGLUESub (in-domain MAPO)
    • Similar gains were observed for larger models (13B) and Mistral-based models. Notably, the largest gains were often seen in languages with lower baseline performance (e.g., Bengali, Thai, Swahili).
  • Improved Generalization: The strong performance increase on the out-of-domain MSVAMP dataset suggests MAPO fosters better generalization of reasoning skills compared to methods like m-RFT, which showed negligible improvement on MSVAMP.
  • Enhanced Consistency: MAPO led to improved reasoning consistency between non-dominant languages and English, as evidenced by:
    • Lower (better) PPL-based alignment scores.
    • Higher Answer Consistency Ratio (ACR).
    • This indicates that MAPO successfully aligns not just the final answers but also the intermediate reasoning steps.
  • PPO vs. DPO: Both PPO and DPO implementations of MAPO proved effective. DPO appeared slightly more sample-efficient in early training stages, while iterative DPO showed potential for further gains.
  • Ablation Studies: Ablations confirmed the importance of using alignment scores over simpler rewards (like the final answer correctness used in m-RFT) and the benefit of using translated dominant language reasoning ($\bar{Y}$) compared to reference solutions.

Significance and Implications

The MAPO framework offers a practical approach to enhance LLM reasoning in non-dominant languages without requiring costly human annotations of reasoning steps in multiple languages. By leveraging the stronger reasoning capabilities typically present in a dominant language like English and using automated translation models to create an alignment-based preference signal, MAPO effectively transfers reasoning proficiency.

The method's success, particularly on out-of-domain tasks, indicates that optimizing for cross-lingual reasoning alignment encourages the model to learn more fundamental, language-agnostic reasoning patterns rather than merely overfitting to specific language data. The explicit optimization towards consistency leads to more reliable and predictable reasoning behavior across the supported languages. This alignment strategy provides a scalable way to improve the equity of reasoning performance in multilingual LLMs.

In conclusion, MAPO presents a novel preference optimization strategy centered on cross-lingual reasoning alignment. By using translation models to generate preference signals comparing non-dominant language reasoning to dominant language reasoning, it substantially improves multilingual mathematical reasoning accuracy and consistency, demonstrating effectiveness across different base models and benchmarks, especially enhancing generalization to out-of-domain tasks.

Authors (7)
  1. Shuaijie She
  2. Shujian Huang
  3. Wei Zou
  4. Wenhao Zhu
  5. Xiang Liu
  6. Xiang Geng
  7. Jiajun Chen