The paper "Demystifying Multilingual Chain-of-Thought in Process Reward Modeling" investigates the extension of process reward models (PRMs) to multilingual settings for LLMs, focusing on complex tasks requiring multi-step reasoning. The authors address the limitations of current reward models, which are predominantly focused on English, by translating existing PRM datasets from English into six additional languages, creating a comprehensive multilingual dataset.
Key Components and Methodology:
- Multilingual Dataset Creation: The authors translate the existing PRM800K and Math-Shepherd datasets from English into six additional languages (seven languages in total), enabling the training and evaluation of multilingual PRMs.
- Experimental Setups: Three PRM setups are defined (see the configuration sketch after this list):
  - PRM-mono: Trained and evaluated on a single language.
  - PRM-cross: Trained on one language but evaluated across multiple languages.
  - PRM-multi: Trained on multiple languages and evaluated on a broader set of languages, including ones unseen during training.
- Evaluation: Two multilingual reasoning benchmarks spanning 11 languages (including both seen and unseen languages) are used to evaluate the multilingual PRMs against existing models.
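The dataset construction and the three setups can be pictured as a small data structure plus train/evaluate language configurations. The sketch below is illustrative only: the `PRMExample` class, the `translate_example` helper, and the language codes are assumptions made for exposition, not the authors' released code or the paper's exact language list.

```python
# Hedged sketch: step-annotated PRM data, machine translation that preserves
# step labels, and the three train/evaluate configurations. All names and
# language codes here are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class PRMExample:
    question: str
    steps: list[str]   # chain-of-thought split into individual reasoning steps
    labels: list[int]  # per-step correctness labels, e.g. 1 = correct, 0 = erroneous

def translate_example(ex: PRMExample,
                      translate: Callable[[str, str], str],
                      target_lang: str) -> PRMExample:
    """Translate the question and each step; step-level labels carry over unchanged."""
    return PRMExample(
        question=translate(ex.question, target_lang),
        steps=[translate(s, target_lang) for s in ex.steps],
        labels=list(ex.labels),
    )

# Placeholder language codes (the paper defines the actual seven training
# languages and eleven evaluation languages).
TRAIN_LANGS = ["en", "de", "es", "fr", "ru", "zh", "ja"]
EVAL_LANGS = TRAIN_LANGS + ["bn", "sw", "te", "th"]  # seen + unseen at evaluation

SETUPS = {
    "PRM-mono":  {"train": ["en"],      "eval": ["en"]},        # instantiated once per language
    "PRM-cross": {"train": ["en"],      "eval": EVAL_LANGS},
    "PRM-multi": {"train": TRAIN_LANGS, "eval": EVAL_LANGS},
}
```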
Findings:
- Performance Superiority: Multilingual PRMs consistently outperform monolingual and cross-lingual PRMs across various LLMs, improving the average accuracy by up to +1.5 points over PRM-mono.
- Sensitivity: The performance of multilingual PRMs is sensitive to the number of training languages and the volume of English data. Optimal performance is observed when leveraging a moderate number of languages and balancing English data representation.
- Error Reduction: Multilingual PRMs reduce early-stage reasoning errors, suggesting that diverse language training enhances reasoning reliability.
- Scaling Effects: Larger numbers of trainable parameters and more candidate responses amplify the benefits of PRMs in multilingual contexts (a best-of-N selection sketch follows this list).
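Candidate responses typically figure in PRM evaluation through best-of-N selection: the LLM samples N candidate solutions, the PRM scores every reasoning step, the step scores are aggregated into a solution-level score, and the highest-scoring solution is chosen. The sketch below assumes hypothetical `generate_candidates` and `prm_score_steps` interfaces; aggregating by the minimum step score is one common convention, not necessarily the one the paper uses.

```python
# Hedged best-of-N selection sketch. `generate_candidates` and `prm_score_steps`
# are assumed interfaces: the first samples N candidate solutions (each a list
# of reasoning steps) from an LLM, the second returns a per-step correctness
# probability from the trained PRM.
def best_of_n(question: str, n: int, generate_candidates, prm_score_steps) -> str:
    candidates = generate_candidates(question, n)

    def solution_score(steps: list[str]) -> float:
        # Score a full solution by its weakest step; the product of step
        # probabilities is another common aggregation.
        return min(prm_score_steps(question, steps))

    best_steps = max(candidates, key=solution_score)
    return "\n".join(best_steps)
```

Scoring a solution by its weakest step reflects the intuition that a single flawed step can invalidate the entire chain, which also connects to the finding above that multilingual PRMs reduce early-stage reasoning errors.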
Implications:
- Generalization Beyond English: The findings highlight the potential for training robust multilingual PRMs that generalize effectively across a wide spectrum of languages.
- Step-by-Step Reinforcement Learning: By leveraging PRMs in a reinforcement learning framework, LLMs can receive more granular, step-level feedback, potentially refining their reasoning further (sketched below).
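Concretely, step-level feedback can be realized by letting the PRM assign a dense reward to every reasoning step rather than a single outcome reward at the end of the solution. The sketch below is a hedged illustration of that idea, reusing the hypothetical `prm_score_steps` interface from the previous example; the threshold and penalty values are illustrative choices, not the paper's prescription.

```python
# Hedged sketch of PRM-derived, step-level rewards for an RL fine-tuning loop
# (e.g., PPO-style training). `prm_score_steps` is the same assumed interface
# as above; the threshold and penalty are illustrative.
def step_level_rewards(question: str, steps: list[str], prm_score_steps,
                       threshold: float = 0.5,
                       error_penalty: float = -1.0) -> list[float]:
    """Return one reward per reasoning step instead of a single final reward."""
    step_probs = prm_score_steps(question, steps)  # per-step correctness probabilities
    return [p if p >= threshold else error_penalty for p in step_probs]
```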
The paper provides substantial empirical evidence that diverse multilingual training data can overcome language-specific biases and improve cross-lingual transfer, ultimately enhancing the global applicability of LLMs. The work also opens avenues for developing universally applicable reasoning models, addressing key challenges in multilingual AI, and the authors release their code to encourage continued research and development in this area.