MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages (2509.26601v1)
Abstract: Ensuring native-like quality of LLM responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.
Explain it Like I'm 14
What is this paper about?
This paper introduces Menlo, a way to check how “native-like” AI chat responses sound across many languages. “Native-like” means the reply feels fluent, natural, and culturally right—so a local speaker wouldn’t think it was written by a machine or by someone who doesn’t really know the language or the place.
The authors build:
- A clear framework for what “native-like” means.
- A large dataset with human ratings in 47 language varieties.
- Better AI “judges” that can score responses more reliably.
- A way to use these judges to actually improve AI’s multilingual writing quality.
What questions did the researchers ask?
They wanted to know:
- How do we define and measure “native-like” quality in different languages and places?
- Can AI models act as fair judges of response quality, like teachers grading essays?
- What judging setup works best: scoring one answer at a time, or comparing two answers side-by-side?
- Can training these AI judges make them better than simple, untrained judges?
- Can these judges also help improve the AI’s writing, not just score it?
How did they do it?
First, they created the Menlo framework, which breaks “native-like” quality into four dimensions. Think of this like a grading rubric teachers use:
- Fluency: Is the language smooth and correct?
- Tone: Does the reply sound appropriate and natural for the situation?
- Localized tone: Does it match the style and culture of a specific place or variety (for example, Mexican Spanish vs. Spanish in Spain)?
- Localized factuality: Are the facts accurate for that local context (names, holidays, laws, places)?
Then they built a dataset:
- They wrote prompt templates in English that clearly define who the audience is (for example, talking to a local group), then translated and localized them for 47 language varieties.
- They asked native speakers to rate two AI-generated responses per prompt on a 1–5 scale (1 is poor, 5 is excellent).
- They used clear, step-by-step rubrics so different annotators would judge the same way. The raters agreed with each other a lot, which is a good sign that the rules were clear.
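To make the aggregation step concrete, here is a minimal sketch in Python. The exact tie-breaking and expert-reconciliation rules are the paper's own, so the `aggregate_ratings` helper below is a hypothetical simplification that assumes one 1–5 rating per annotator.

```python
from collections import Counter

def aggregate_ratings(ratings: list[int]) -> int:
    """Aggregate per-annotator 1-5 Likert ratings into a single label.

    Hypothetical simplification: majority vote, falling back to the rounded
    mean when no single grade wins outright (the paper instead reconciles
    such cases with expert annotators where available).
    """
    counts = Counter(ratings)
    grade, freq = counts.most_common(1)[0]
    if freq > len(ratings) / 2:
        return grade
    return round(sum(ratings) / len(ratings))

print(aggregate_ratings([4, 4, 5]))  # -> 4 (clear majority)
print(aggregate_ratings([3, 4, 5]))  # -> 4 (no majority; rounded mean)
```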
Next, they tested ways for AI models to judge response quality:
- Pointwise scoring: grade one response by itself.
- Pairwise scoring: grade two responses side-by-side for the same prompt. This is like comparing two essays at once, which makes judging easier.
- With and without rubrics: they checked if giving the AI judges the same grading rules helps.
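The first two setups above can be pictured as prompt builders. This is a minimal sketch, assuming a generic chat-completion judge; the rubric text and wording are illustrative placeholders, not the paper's actual templates.

```python
# Illustrative prompt builders for the two judging setups.
# The rubric text below is a placeholder, not the paper's rubric.

RUBRIC = (
    "Grade fluency on a 1-5 scale: 5 = reads like a careful native speaker, "
    "1 = frequent errors or unnatural phrasing."
)

def pointwise_prompt(prompt: str, response: str) -> str:
    """Judge a single response in isolation."""
    return (
        f"{RUBRIC}\n\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response:\n{response}\n\n"
        "Return a single grade from 1 to 5."
    )

def pairwise_prompt(prompt: str, response_a: str, response_b: str) -> str:
    """Judge two responses side-by-side; each still gets its own 1-5 grade."""
    return (
        f"{RUBRIC}\n\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Return two grades from 1 to 5, one for Response A and one for Response B."
    )
```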
Finally, they trained AI judges:
- They used reinforcement learning (RL), which is like practicing judging and giving the AI “points” when it judges correctly.
- They shaped the reward: rather than only giving full points for perfect scores, they also gave partial points when the judge was close and extra points when it correctly picked the better of two answers (a sketch of this shaped reward follows the list).
- They tested single-task training (one skill at a time) and multi-task training (learning all four dimensions together).
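Here is a sketch of the shaped reward under stated assumptions: exact grade matches earn full credit, off-by-one predictions earn partial credit, and a bonus is added when the predicted ordering of the two responses matches the human preference. The specific weights (0.5 partial credit, 0.5 bonus) are illustrative placeholders, not the paper's values.

```python
def shaped_reward(pred_a: int, pred_b: int, gold_a: int, gold_b: int) -> float:
    """Shaped reward for a pairwise judge prediction (illustrative weights).

    - Binary reward: 1.0 per response for an exact grade match.
    - Reward smoothing: partial credit when the prediction is off by one grade.
    - Preference bonus: extra credit when the sign of (pred_a - pred_b) matches
      the sign of (gold_a - gold_b), i.e. the judge picked the better response
      (or correctly called a tie).
    """
    def grade_reward(pred: int, gold: int) -> float:
        if pred == gold:
            return 1.0
        if abs(pred - gold) == 1:
            return 0.5  # smoothing weight: assumed, not from the paper
        return 0.0

    def sign(x: int) -> int:
        return (x > 0) - (x < 0)

    reward = grade_reward(pred_a, gold_a) + grade_reward(pred_b, gold_b)
    if sign(pred_a - pred_b) == sign(gold_a - gold_b):
        reward += 0.5  # preference bonus weight: assumed, not from the paper
    return reward
```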
After that, they used the trained judges to improve an AI writer:
- The judge scores pairs of AI responses so the model being trained can learn which writing is better (a sketch of this judge-as-reward setup follows the list).
- They checked improvements using other AI judges and human annotators.
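One plausible wiring of a trained pairwise judge into policy post-training is sketched below. It assumes a `judge` callable that returns two 1–5 grades, and it scores each sampled rollout relative to a reference response before centering rewards within the group, as group-relative RL methods (e.g., GRPO) typically do. This is a sketch under those assumptions, not the paper's exact pipeline.

```python
from statistics import mean
from typing import Callable

# A pairwise judge: (prompt, response_a, response_b) -> (grade_a, grade_b), both 1-5.
PairwiseJudge = Callable[[str, str, str], tuple[int, int]]

def rollout_rewards(
    judge: PairwiseJudge, prompt: str, rollouts: list[str], reference: str
) -> list[float]:
    """Score sampled rollouts with a pairwise judge, relative to a reference.

    Each rollout's reward is its judge grade minus the reference response's
    grade, so rollouts the judge prefers get positive reward. Rewards are then
    centered within the group before the RL update.
    """
    rewards = []
    for rollout in rollouts:
        grade_rollout, grade_ref = judge(prompt, rollout, reference)
        rewards.append(float(grade_rollout - grade_ref))
    baseline = mean(rewards)
    return [r - baseline for r in rewards]
```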
What did they find, and why does it matter?
- Comparing two answers side-by-side (pairwise) works much better than judging one at a time. It helps both simple and advanced models judge more accurately.
- Giving judges clear rubrics also helps, especially when they score one response at a time.
- Training judges with reinforcement learning, shaped rewards, and multi-task learning made them stronger—sometimes close to human-level agreement.
- These trained judges can be used to teach AI models to write better across languages. After training, the improved AI responses were preferred in most cases by both AI judges and humans.
- However, AI judges tend to be more optimistic than humans about how much quality improved. In short, the judges agree that things got better—but they often think the improvement is bigger than human reviewers do.
This matters because good, native-like AI replies help people worldwide use AI more comfortably. It’s not enough for answers to be grammatically correct; they also need to fit local culture, style, and facts.
What does this mean for the future?
- Scalable multilingual evaluation: The Menlo framework and dataset make it easier to judge AI quality across many languages consistently.
- Better AI alignment: Trained judges can guide AI models to produce more fluent, culturally sensitive, and locally accurate responses.
- Practical path forward: Pairwise judging, clear rubrics, and reinforcement learning together form a strong toolkit for improving multilingual AI.
- Remaining challenges: AI judges still don’t perfectly match human judgment, especially on local facts. Future work may combine judges with tools like search or retrieval to handle locally grounded information more reliably.
If you want to explore the dataset, it’s available here: https://huggingface.co/datasets/facebook/menlo
Overall, Menlo shows a clear recipe for making AI chat more truly “native-like” across the world: define what “good” looks like, collect high-quality human ratings, train AI judges to follow those rules, and use those judges to teach AI to write better.
Knowledge Gaps
Below is a concise, action-oriented list of the paper’s unresolved knowledge gaps, limitations, and open questions to guide future research.
- The four evaluation dimensions (fluency, tone, localized tone, localized factuality) omit key aspects of native-like communication such as pragmatics, politeness strategies, register shifting, idiomaticity, code-switching, dialectal variation within locales, and multi-modal or spoken prosody.
- Menlo focuses on single-turn, long-form responses; native-like quality in multi-turn conversations (e.g., turn-taking, repair, grounding, alignment over time) is not addressed.
- Romanized variants for non-Latin-script languages may distort authenticity; the impact of romanization on perceived “native-like” quality and judge reliability is unexplored.
- Parametric, English-authored templates localized by native speakers may introduce subtle translationese or template-induced style constraints; the effect of prompt templating on naturalness and diversity of language use is not analyzed.
- Locale instantiation (e.g., holidays, entities) is narrow and may not capture the broader range of sociolinguistic audience design phenomena (e.g., intergroup identity marking, slang, in-group humor, indirectness).
- The dataset’s topical coverage and domain breadth are not characterized; it is unclear how well judges generalize across topics (news, technical, humor, health) or genres (narrative vs. expository).
- Only long-form text is evaluated; native-like quality for short-form tasks, mixed-form outputs (lists, tables), and multi-modal contexts remains untested.
- Annotation uses 1–5 Likert ratings aggregated by majority vote, discarding rater heterogeneity and uncertainty; pluralistic alignment and label distribution modeling over multiple valid ratings per item are not explored.
- Cross-language equivalence of rubrics is assumed; there is no analysis of whether rubric criteria are semantically and culturally isomorphic across languages and locales (risk of cross-cultural construct drift).
- The frequency and handling of ties and near-ties are not analyzed; their relation to inter-annotator agreement and preference inference reliability is unclear.
- Expert “gold” labels are used when available and otherwise averaged annotations; the impact of this mixed labeling strategy on judge training and evaluation consistency is not quantified.
- Metrics emphasize Macro-F1 (5-way classification) and preference accuracy inferred from grades; calibration (e.g., ECE, Brier), rank correlation, and reliability under label noise are not evaluated.
- Pairwise evaluation improves results, but listwise (3+ responses) or tournament-style comparisons—potentially stronger anchors—are not tested.
- The source of cross-language performance variance (script, morphology, resource level, typological features, training data exposure) is not investigated; drivers of poor performance in certain locales remain unclear.
- Localized factuality is the weakest dimension; retrieval, search, tool-use, and evidence attribution are suggested but not implemented or evaluated within Menlo’s judge or policy training.
- The paper uses responses generated by a limited set of frontier models; generalization of judges to responses from unseen or weaker models, human-written responses, or adversarial/hard negatives is unknown.
- Judges are largely trained and evaluated on the same distribution induced by Menlo prompts; robustness to distribution shift (new prompt styles, domains, and audiences) is not assessed.
- RL judges may learn stylistic caricatures of “native-like quality,” as suggested by overestimation vs. human ratings; concrete methods to calibrate judges to human magnitude of improvement are not presented.
- Chain-of-thought quality and faithfulness of judge reasoning are not evaluated; the extent to which reasoning aligns with rubrics or introduces spurious justifications is unanswered.
- Positional bias mitigation is limited to AB/BA augmentation; no quantitative analysis of residual positional bias or more rigorous debiasing strategies is provided.
- Reward design targets final grades and relative preferences but not reasoning consistency, rubric adherence, or evidence use; richer reward shaping (e.g., rubric-anchored sub-rewards, factual verification) remains unexplored.
- Post-training as generative RM is demonstrated on a single small policy model (Qwen3-4B) and excludes localized factuality; scalability to larger models, other architectures, and all dimensions is untested.
- Human validation covers only 10 high-resource languages; the gap between LLM and human raters for mid/low-resource languages and non-Latin scripts is unknown.
- The absolute magnitude of LLM-judged improvements is systematically higher than human-judged improvements; methods for judge calibration, debiasing, and human-in-the-loop adjustment are not implemented.
- Using judges as reward models invites potential reward hacking; safeguards against policy gaming of rubric-like artifacts and superficial stylistic cues are not studied.
- Cost, latency, and scalability trade-offs of pairwise judging vs. pointwise (and potential listwise) in production settings are not measured.
- The effects of few-shot exemplars in pairwise setups are minimal for one model; the general role of demonstration selection, diversity, and locale-specific exemplars is not explored.
- Cross-lingual transfer is weak when training judges on English-only data; strategies for multilingual curriculum learning, language-adaptive pretraining, or parameter-efficient fine-tuning are not examined.
- Dataset size per language is modest; sample efficiency, data augmentation, and active learning to target underperforming languages or dimensions are not evaluated.
- Ethical and fairness aspects are not analyzed: possible judge biases against certain dialects, orthographies, or culturally marked styles are unaddressed; no fairness or harm audits are reported.
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that can be adopted now, leveraging Menlo’s dataset, framework, and judge training methods.
- Multilingual LLM quality gates for product releases
- Sector: Software, Platform AI
- What: Integrate Menlo-style pairwise judges (e.g., Qwen3-4B-RL-Judge, Llama4-Scout-RL-Judge) into CI/CD to block regressions on fluency, tone, localized tone, and localized factuality across locales.
- Tools/Workflow: Pairwise evaluation template; rubric-guided prompts; locale heatmaps; automated Win/Loss/Tie dashboards; periodic human spot checks (a minimal gate sketch follows this use case).
- Assumptions/Dependencies: Judges overestimate improvements vs humans; require human calibration loops; coverage limited to 47 language varieties.
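A minimal sketch of such a CI gate, assuming a pairwise judge callable that returns 1–5 grades for a candidate and a baseline response; the function name and loss-rate threshold are illustrative, not prescribed by the paper.

```python
from typing import Callable, Iterable

# (prompt, candidate, baseline) -> (grade_candidate, grade_baseline), both 1-5.
PairwiseJudge = Callable[[str, str, str], tuple[int, int]]

def release_gate(
    judge: PairwiseJudge,
    cases: Iterable[tuple[str, str, str]],  # (prompt, candidate, baseline)
    max_loss_rate: float = 0.10,            # illustrative threshold
) -> bool:
    """Block a release if the candidate model loses to the baseline too often."""
    wins = losses = ties = 0
    for prompt, candidate, baseline in cases:
        g_cand, g_base = judge(prompt, candidate, baseline)
        if g_cand > g_base:
            wins += 1
        elif g_cand < g_base:
            losses += 1
        else:
            ties += 1
    total = wins + losses + ties
    loss_rate = losses / total if total else 0.0
    print(f"Win/Loss/Tie: {wins}/{losses}/{ties} (loss rate {loss_rate:.1%})")
    return loss_rate <= max_loss_rate
```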
- Localization QA and cultural tone audits
- Sector: Marketing, Localization, Media
- What: Use Menlo rubrics and audience-design prompts to audit copy for cultural appropriateness (localized tone) and naturalness before launch.
- Tools/Workflow: Parametric prompt templates with locale placeholders; pairwise judge to compare candidate phrasings; reviewer-in-the-loop signoff.
- Assumptions/Dependencies: Localized factuality is harder; consider retrieval support or local SMEs for factual checks.
- Customer support chatbot tuning and monitoring
- Sector: Customer Service, E-commerce, Telecom, Utilities
- What: Post-train support bots with generative reward models (Menlo-trained judges) to improve polite, native-like tone and region-specific phrasing.
- Tools/Workflow: GRPO-based post-training with pairwise scoring; A/B evaluation via LLM judges; human QA on top markets.
- Assumptions/Dependencies: LLM judge optimism; need safety guardrails; ensure locale-specific policy compliance.
- Multilingual instruction-following evaluation for internal models
- Sector: Model Development, MLOps
- What: Replace pointwise scoring with pairwise evaluation to increase reliability when ranking generations during RLHF or DPO data curation.
- Tools/Workflow: Pairwise templates; reward shaping (binary + smoothing + preference bonus); listwise experiments as future enhancement.
- Assumptions/Dependencies: Longer contexts increase latency/cost; positional bias must be mitigated via order augmentation.
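The order-augmentation step can be sketched as follows: run the pairwise judge on both the (A, B) and (B, A) orderings and average the grades per response, which cancels a constant position preference. The judge interface is an assumption, matching the sketches above.

```python
from typing import Callable

# (prompt, first_response, second_response) -> (grade_first, grade_second)
PairwiseJudge = Callable[[str, str, str], tuple[int, int]]

def debias_pairwise(
    judge: PairwiseJudge, prompt: str, response_a: str, response_b: str
) -> tuple[float, float]:
    """Average judge grades over both presentation orders (AB and BA)."""
    a_first, b_second = judge(prompt, response_a, response_b)
    b_first, a_second = judge(prompt, response_b, response_a)
    grade_a = (a_first + a_second) / 2
    grade_b = (b_second + b_first) / 2
    return grade_a, grade_b
```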
- Reference-free translation/localization quality estimation (QE)
- Sector: Localization, Translation Tech
- What: Use judges to rank and filter outputs from MT/LLM systems without human references, focusing on naturalness and localized tone.
- Tools/Workflow: Pairwise judge scoring and preference aggregation; rubric prompts for fluency vs tone vs locale; batch QE pipelines.
- Assumptions/Dependencies: Not a substitute for factual accuracy in regulated content; add retrieval or human review for critical facts.
- Standardized rubric packs and prompt libraries for locale style calibration
- Sector: Content Ops, Education, Enterprise Communications
- What: Deploy Menlo rubrics and audience design templates to guide generation toward native-like style across locales in email, docs, and marketing.
- Tools/Workflow: Template library with placeholders (e.g., [locale_holiday], [locale_nationality]); pre-flight “Style & Tone” checker.
- Assumptions/Dependencies: Rubrics must be adapted for domain-specific tone (legal, medical); norms evolve over time.
- Human annotation operations playbook
- Sector: Data Labeling, Research Ops
- What: Reuse Menlo’s guidelines, screening tests, expert annotator model, and reconciliation workflow to increase IAA on multilingual labeling tasks.
- Tools/Workflow: Custom annotation interface with Likert + preference; language-specific expert feedback; gold-set spot checks.
- Assumptions/Dependencies: Requires native annotators by locale; budget for expert reviews.
- Safety, sensitivity, and reputation risk checks through localized tone
- Sector: Trust & Safety, PR, Policy
- What: Detect culturally insensitive phrasing, inappropriate politeness levels, or register mismatches before publication.
- Tools/Workflow: Pairwise judges focused on tone/localized tone; red-team prompts in local context; escalation to human reviewers.
- Assumptions/Dependencies: Judges are not safety models; combine with toxicity/harms classifiers and legal review.
- Marketing message A/B testing across regions
- Sector: Marketing Tech
- What: Rapidly down-select ad copy variants using pairwise judges that better capture native-like appeal than pointwise scoring.
- Tools/Workflow: Variant generator; pairwise judge re-ranking; human validation on finalists; uplift tracking after deployment.
- Assumptions/Dependencies: Alignment between “native-like tone” and conversion may vary; keep experimental controls.
- Multilingual pedagogy aids and writing assistants
- Sector: Education, Productivity
- What: Writing assistants that critique tone/fluency against Menlo rubrics and suggest locale-appropriate rewrites.
- Tools/Workflow: Inline feedback with rubric references; pairwise candidate comparisons; “why” rationales from judge reasoning.
- Assumptions/Dependencies: Avoid prescriptive monoculture; allow user-configurable style targets.
- Developer tooling for LLMOps
- Sector: Software Tooling
- What: Ship a “MenloJudge” plugin for evaluation pipelines (Hugging Face datasets/models) and a “Native-like Scorecard” CLI.
- Tools/Workflow: HF dataset: facebook/menlo; judge checkpoints; template pack; CI gate; locale performance heatmaps.
- Assumptions/Dependencies: License and data governance; compute costs for pairwise inference.
- Domain adaptation with generative reward models (small models)
- Sector: SMB/On-Device AI
- What: Improve 4–8B models in target locales via post-training with Menlo judges to reach acceptable tone/fluency when large APIs are costly.
- Tools/Workflow: GRPO post-training; 8-rollout sampling; averaging pairwise rewards; periodic human QA on priority markets.
- Assumptions/Dependencies: Gains are real but smaller when validated by humans; monitor for “style caricature.”
Long-Term Applications
Below are applications that are promising but need further research, scaling, or integration work.
- Multilingual AI quality standards and certification
- Sector: Policy, Standards Bodies, Procurement
- What: Institutionalize Menlo-like dimensions and rubrics as acceptance criteria for public-sector AI (e.g., citizen services chatbots).
- Dependencies: Broader language/variant coverage; standardized human calibration; legal/regulatory endorsement.
- Judge-in-the-loop public service agents across locales
- Sector: Government, Healthcare, Social Services
- What: Deploy real-time judges to gate responses for tone and localized factuality before end-user delivery.
- Dependencies: Low-latency inference; robust retrieval for local facts; audit logs; safety and privacy compliance.
- Retrieval-augmented localized factuality judges
- Sector: News, Finance, Health, Travel
- What: Combine Menlo judges with search/knowledge graphs to reliably check locale-specific facts (e.g., holidays, regulations, addresses).
- Dependencies: High-quality, locale-curated KBs; attribution standards; disinformation defenses.
- Dynamic rubric induction and adaptive evaluation
- Sector: Research, Tooling
- What: LLMs generate context-specific rubrics on the fly, narrowing the gap between pairwise and pointwise scoring and reducing human rubric authoring.
- Dependencies: Reliable rubric generation; stability across topics; human calibration workflows.
- Listwise and multi-criteria evaluation for generation ranking
- Sector: Model Training, Recommenders
- What: Extend pairwise to listwise judging to select from many candidates; jointly optimize fluency, tone, localized tone, and factuality.
- Dependencies: Efficient inference; robust aggregation; diminishing returns analysis.
- Personalized audience-design generation
- Sector: Consumer Apps, CRM, Education
- What: Move from locale-level “native-like” to persona-specific style control (age, profession, region), guided by audience-design principles.
- Dependencies: Privacy-preserving personalization; bias/fairness audits; consent management.
- Multilingual fairness and bias audits via tone profiling
- Sector: Trust & Safety, Compliance
- What: Systematically detect uneven politeness/register across demographic groups or dialects; remediate with targeted post-training.
- Dependencies: Representative datasets; sociolinguistic expertise; governance frameworks.
- Language learning and assessment tools
- Sector: EdTech
- What: Use rubric-grounded judges as “native-like” coaches that provide graded feedback on fluency and tone, with localized examples.
- Dependencies: Pedagogical validation; adaptation to CEFR/TOEFL-like scales; explainability standards.
- Regulated communications assistants
- Sector: Finance, Healthcare, Legal
- What: Draft client communications that satisfy local tone expectations and insert required disclaimers/phrasing per jurisdiction.
- Dependencies: Verified localized factuality; compliance approvals; audit trails.
- Voice agents and embodied systems with locale-appropriate pragmatics
- Sector: Robotics, Automotive, Smart Home
- What: Ensure spoken interactions respect local politeness norms and cultural cues; reduce user friction and complaints.
- Dependencies: Prosody and speech synthesis alignment with textual tone; cross-modal evaluation standards.
- Market research at scale across regions
- Sector: Consumer Insights
- What: Simulate and evaluate message resonance in different locales; rapidly iterate on culturally tuned campaigns.
- Dependencies: Correlation between “native-like” metrics and business KPIs; ongoing human panel validation.
- On-device locale-specialized small models
- Sector: Mobile, Edge AI
- What: Ship compact models post-trained with generative RMs to provide offline, native-like interactions in target markets.
- Dependencies: Efficient training/inference; energy constraints; continual learning without drift.
- Expansion to dialects, code-switching, and low-resource languages
- Sector: Global Access, Inclusion
- What: Extend Menlo to more varieties and code-switch contexts; active learning to acquire scarce annotations.
- Dependencies: Native annotator availability; ethical data sourcing; cross-lingual transfer methods beyond English-only.
Cross-Cutting Assumptions and Risks
- Human calibration remains necessary: LLM judges systematically overestimate improvements relative to humans, especially on style “caricature.”
- Localized factuality is the hardest dimension: expect to integrate retrieval, search, and curated local knowledge bases for reliable deployment.
- Cultural norms evolve: rubrics and templates require periodic updates, regional SME input, and versioning of style guidance.
- Compute and latency costs: pairwise/listwise evaluation increases inference budgets; plan for batching, caching, and sampling strategies.
- Bias and coverage: current 47 varieties do not cover all locales or dialects; fairness audits and expansion plans are needed.
- Reward hacking risks: monitor for metric gaming; rotate judges and maintain human-in-the-loop evaluations.
Resources to get started:
- Dataset: facebook/menlo (Hugging Face); a minimal loading snippet follows this list
- Building blocks: Menlo pairwise evaluation template, four-dimension rubrics, RL reward shaping (binary + smoothing + preference bonus), GRPO-based training
- Quick wins: Add a pairwise judge gate to your multilingual eval suite; adopt the rubric-guided prompt library for content teams; run a pilot post-training with judge-as-RM on a single priority locale and validate with human raters.
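A minimal starting point for exploring the released dataset with the Hugging Face `datasets` library. The split and column names are not specified on this page, so inspect the dataset card before building pipelines on them.

```python
from datasets import load_dataset

# Load the released MENLO preference data from the Hugging Face Hub.
# Split and column names are assumptions; check the dataset card at
# https://huggingface.co/datasets/facebook/menlo for the actual schema.
menlo = load_dataset("facebook/menlo")

print(menlo)                     # available splits and their sizes
first_split = next(iter(menlo.values()))
print(first_split.column_names)  # inspect fields (prompts, responses, ratings, locale, ...)
print(first_split[0])            # one annotated prompt-response example
```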
Glossary
- Accommodation: In sociolinguistics, adapting one’s style to align with the interlocutor. "Key mechanisms include accommodation, where speakers adapt their style to the addressee, and referee design, where speakers align with an absent reference group they wish to identify with."
- Audience design: Sociolinguistic principle that speakers tailor language to the audience’s identity and expectations. "Building on mechanisms from audience design \citep{bell1984language}, we propose creating tailored prompts..."
- Chain-of-Thought (CoT) reasoning: Explicit step-by-step reasoning generated before answers or judgments. "For inherently thinking models like Qwen3-4B, SFT without Chain-of-Thought (CoT) reasoning actually hurts performance"
- CoT supervision: Training signals that include Chain-of-Thought rationales to guide reasoning. "This suggests that for tasks requiring reasoning, it is preferable to start RL directly when the SFT target lacks CoT supervision."
- Cross-entropy loss: A standard loss function for classification that measures divergence between predicted and true distributions. "For SFT, models directly predict 5-point grades using cross-entropy loss under teacher forcing"
- Few-shot: Providing a small number of labeled examples in context to guide model behavior. "Few-shot pointwise improves Macro-F1 relative to zero-shot pointwise"
- Frontier API models: The most capable, cutting-edge proprietary models available via APIs. "a multi-task Llama4-Scout model trained with shaped rewards surpasses frontier API models with the strongest overall performance across 47 language varieties"
- Generative reward models (RMs): Models that produce reward signals (often via judgments) to train or improve generative policies. "RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency"
- GRPO: A reinforcement learning algorithm used to optimize policies via relative comparisons. "For RL, we use GRPO \citep{shao2024deepseekmath}"
- In-context examples: Labeled exemplars placed in the prompt to condition model outputs. "The unexpectedly large gains over few-shot in-context examples highlight pairwise evaluation"
- Instruction-tuned: Fine-tuned to follow instructions and align outputs with task directives. "Llama models are instruction-tuned and we omit the Instruct suffix for brevity."
- Inter-annotator agreement (IAA): A measure of consistency among different annotators’ labels. "IAA presents Krippendorff’s α measuring inter-annotator agreement."
- Krippendorff’s alpha: A reliability coefficient for assessing inter-annotator agreement, accounting for chance. "Krippendorff’s α"
- Likert scale: An ordinal rating scale, typically 1–5, used to measure judgments. "Responses are generated using state-of-the-art LLMs and annotated with ratings on a 1--5 Likert scale"
- Listwise evaluation: Assessing multiple candidates at once to determine relative quality across a set. "Future work may investigate whether extending pairwise comparisons to a listwise evaluation of multiple responses offers additional benefits."
- LLM judge: An LLM used to evaluate or grade the quality of other model outputs. "We thus evaluate the ability of LLMs to serve as judges of native-like quality responses."
- Macro-F1: The F1 score averaged over classes, treating each class equally. "Macro-F1 shows 5-way classification performance"
- Majority vote: Aggregating multiple annotations by selecting the label with the most votes. "with final scores aggregated via majority vote."
- Mixture-of-Experts: An architecture that routes inputs to specialized sub-models (“experts”). "Llama4-Scout (Mixture-of-Experts, non-reasoning)"
- Parametric prompt templates: Structured prompts with placeholders that are instantiated for specific contexts. "human-written parametric English prompt templates"
- Pairwise evaluation: Scoring two responses side-by-side for the same prompt. "pairwise evaluation---where models predict scores for two responses simultaneously (without explicitly predicting preference)---significantly outperforms its pointwise counterpart."
- Pairwise RL training: Reinforcement learning that uses pairwise comparisons to shape rewards and policies. "This demonstrates the promise of pairwise RL training across model families, scales, and reasoning capabilities."
- Pluralistic alignment: Modeling and aligning to diverse, potentially conflicting human preferences. "Multiple annotations can be used in future work on pluralistic alignment \citep{sorensen2024roadmap}."
- Policy model: The generative model being optimized via reinforcement learning. "to directly improve a policy model's proficiency."
- Pointwise evaluation: Scoring a single response independently of alternatives. "pointwise evaluation is in principle sufficient: a model could assign absolute scores to individual responses without needing comparisons."
- Positional bias: Systematic preference related to the ordering of presented options. "to mitigate positional bias."
- Preference accuracy: Accuracy computed over Win/Loss/Tie outcomes derived from assigned scores. "Preference accuracy over Win/Loss/Tie outcomes."
- Preference alignment: Training to make model outputs align with human or target audience preferences. "Our findings suggest promising directions for scalable multilingual evaluation and preference alignment."
- Preference bonus: An added reward component for correctly matching the relative ordering of two items. "Preference bonus: additional reward if the sign of the difference between the two predicted scores matches the label."
- Referee design: Aligning style with an absent group the speaker identifies with. "referee design, where speakers align with an absent reference group they wish to identify with."
- Reinforcement learning (RL): Optimizing behavior via rewards from interactions or evaluations. "We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches."
- Retrieval: Using external information sources to ground or verify content. "such as incorporating retrieval, search, or external tool use, may be necessary."
- Reward shaping: Modifying reward signals to guide learning more effectively. "we show that multi-task RL and reward shaping enables fine-tuning a judge that is on par with human annotators in 47 language varieties."
- Reward smoothing: Granting partial reward for near-correct predictions to stabilize training. "Reward smoothing: partial reward () if the prediction differs by exactly one grade."
- Romanized: Writing non-Latin script languages using Latin characters. "romanized versions of non-Latin script languages"
- Style Axiom: A sociolinguistic principle linking intraspeaker style variation to social variation. "the Style Axiom \citep{bell1984language} states that intraspeaker variation (style) reflects interspeaker variation (social)."
- Supervised fine-tuning (SFT): Training a model on labeled data to improve task performance. "We compare supervised fine-tuning (SFT) and reinforcement learning (RL)"
- Teacher forcing: Training method where the model is fed gold outputs to condition its next-step predictions. "using cross-entropy loss under teacher forcing"
- Typologically diverse: Spanning languages with different structural characteristics. "representing a typologically diverse set of widely used languages"
- Zero-shot: Using a model without task-specific training examples. "We find that in zero-shot setting, pairwise evaluation---where models predict scores for two responses simultaneously (without explicitly predicting preference)---significantly outperforms its pointwise counterpart."