Mobile-MMLU Benchmark for Mobile Language Models
- Mobile-MMLU is a benchmark designed to evaluate mobile language models by integrating accuracy with mobile-centric metrics like latency, energy, memory, privacy, and adaptability.
- It presents 16,186 multiple-choice questions across 80 real-world mobile topics, ensuring evaluations reflect practical, on-device AI constraints.
- The benchmark includes Mobile-MMLU-Pro, a challenging subset that stresses models on tight resource limits, rapid response, and real-time performance requirements.
Mobile-MMLU is a large-scale benchmark specifically designed to evaluate and advance the performance of LLMs under the operational constraints characteristic of mobile devices. It addresses the distinct requirements of on-device AI, capturing both the practical demands of real-world mobile use cases and the technical limitations imposed by mobile hardware. Mobile-MMLU and its harder subset, Mobile-MMLU-Pro, set a new standard for benchmarking language intelligence on edge devices by integrating mobile-centric metrics such as latency, energy usage, memory footprint, privacy, and adaptability, in addition to classical accuracy measures (Bsharat et al., 26 Mar 2025).
1. Motivation and Distinguishing Characteristics
Traditional LLM benchmarks such as MMLU, HELM, GLUE, and SuperGLUE primarily assess general or academic knowledge, focusing on tasks suited to server-side or desktop computation. These benchmarks assume plentiful hardware—large RAM, powerful CPUs/GPUs, and negligible latency/power constraints. In contrast, mobile scenarios feature:
- Distinct usage patterns: Situational and “in-the-moment” queries, streamlined for mobile screens, involve tasks such as travel planning, recipe suggestions, and digital troubleshooting that differ fundamentally from desktop-server usage.
- Resource constraints: Models are typically ≤1–3 GB due to tight storage and RAM limits. Most real-world deployments rely on aggressive quantization (4-/8-bit) and must deliver end-to-end inference within 100–500 ms to satisfy real-time user expectations.
- Power/energy sensitivity: Battery limits necessitate inference energy costs in the millijoule range.
- Privacy and personalization: Stringent requirements for on-device processing, prohibiting cloud-based data transmission, and prioritization of adaptive, user-specific answers.
Mobile-MMLU directly responds to these demands, measuring not only task accuracy but also core mobile-relevant properties. It spans 80 “everyday” mobile domains—far broader than prior benchmarks—and quantifies parameters critical to both researchers and practitioners optimizing LLMs for edge deployment (Bsharat et al., 26 Mar 2025).
2. Dataset Composition and Construction Process
Core Statistics
Mobile-MMLU consists of 16,186 multiple-choice, order-invariant questions covering 80 topics grouped into 9 categories: Academic Learning, Business Career, Technology Digital, Health Safety, Lifestyle Personal, Home Family, Culture Society, Environment, and Miscellaneous. Mobile-MMLU-Pro, the challenging subset, contains 9,497 questions calibrated to discriminate sharply among models operating under mobile constraints.
| Benchmark | #Topics | #Questions |
|---|---|---|
| MMLU | 57 | 15,573 |
| MMLU-Pro | 14 | 12,102 |
| Mobile-MMLU | 80 | 16,186 |
| Mobile-MMLU-Pro | 80 | 9,497 |
Construction Pipeline
- Field Selection: Topics were sourced from WikiHow, Stack Exchange, Reddit, and LLM-driven suggestions (GPT-4O, O1-preview), with emphasis on practical, mobile-relevant activities (e.g., “How to pair phone with car Bluetooth?”).
- Question Generation: Both straightforward (ordinary) and multi-step reasoning (complex) questions are drafted using LLMs and refined by human reviewers.
- Ground Truth and Distractors: Each question’s correct answer is paired with 3–5 length-matched incorrect choices to avoid selection bias toward option length.
- Similarity Filtering: Near-duplicate questions are removed using all-mpnet-base-v2 sentence embeddings; pairs whose cosine similarity reaches 0.98 or above are treated as duplicates, so all retained pairs fall below that threshold. The similarity metric is standard cosine similarity:

  $$\mathrm{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

  where $\mathbf{a}$ and $\mathbf{b}$ are the embedding vectors of two questions.
- Human-AI Verification: Relevance and option validity are collaboratively checked. Multi-correct cases are retained only if at least two expert LLMs (GPT-4O, Claude-3.5, Gemini-2.0) reach consensus.
- Order-Invariance Checks: Option sequences, including the placement of correct answers, are randomized to guarantee model robustness to permutation (verified by <3% accuracy variance).
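The similarity-filtering step above can be sketched as a greedy deduplication pass. This is an illustrative assumption: toy 2-D vectors stand in for real all-mpnet-base-v2 embeddings, and the keep-first policy is not specified by the paper.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # cos(a, b) = a·b / (||a|| ||b||)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def deduplicate(embeddings: np.ndarray, threshold: float = 0.98) -> list[int]:
    # Greedily keep a question only if its similarity to every
    # already-kept question stays below the 0.98 threshold.
    kept: list[int] = []
    for i, e in enumerate(embeddings):
        if all(cosine_sim(e, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy 2-D embeddings: items 0 and 2 are near-duplicates, so item 2 is dropped.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.999, 0.01]])
kept = deduplicate(emb)  # → [0, 1]
```

The greedy pass is quadratic in the number of questions; at the benchmark's scale a production pipeline would typically batch the similarity computation, but the filtering criterion is the same.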
Mobile-MMLU-Pro excludes questions that are either answered correctly by nearly all models or that yield inconsistent results across both an ensemble of compact, mobile-scale LLMs and the strongest available LLMs, resulting in a set designed for fine-grained model discrimination (Bsharat et al., 26 Mar 2025).
3. Evaluation Metrics and Quantitative Assessment
Mobile-MMLU introduces mobile-centric metrics alongside classical accuracy:
- Accuracy: Proportion of correct answers. For multi-correct items, any ground truth answer suffices. Order invariance ensures accuracy is stable under answer shuffling.
- Inference Latency: Wall-clock duration per query, measured in milliseconds.
- Energy Consumption: Integrated power usage per inference,

  $$E = \int_{0}^{T} P(t)\, dt$$

  measured in millijoules or joules, where $P(t)$ is the instantaneous power draw and $T$ is the inference duration.
- Memory Usage: Peak on-device RAM consumption (MB) during model load and inference.
- Privacy Score: Fraction of inferences executed completely on-device with no cloud calls.
- Adaptability: Impact on accuracy and latency after model personalization (e.g., via few-shot tuning on private user data).
- Mobile Relevance Score (MRScore): GPT-4O judges weight each question’s practical value, mobile-friendliness, and typical usage pattern:

  $$\mathrm{MRScore} = w_{p} s_{p} + w_{m} s_{m} + w_{u} s_{u}$$

  where $s_{p}$, $s_{m}$, $s_{u}$ encode the respective qualitative assessments and the $w$ terms their weights.
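The latency and energy metrics can be sketched as simple measurement helpers. The trapezoidal integration over milliwatt samples is an illustrative assumption, not the paper's actual instrumentation, and the function names are hypothetical:

```python
import time

def measure_latency_ms(infer, prompt: str) -> float:
    # Wall-clock duration of one query, in milliseconds.
    t0 = time.perf_counter()
    infer(prompt)
    return (time.perf_counter() - t0) * 1e3

def energy_mj(power_mw: list[float], dt_s: float) -> float:
    # E = ∫ P(t) dt, approximated by the trapezoidal rule over power
    # samples (milliwatts) taken every dt_s seconds → millijoules.
    return sum(0.5 * (p0 + p1) * dt_s
               for p0, p1 in zip(power_mw, power_mw[1:]))

lat = measure_latency_ms(lambda p: sum(range(10_000)), "sample query")
e = energy_mj([500.0] * 5, dt_s=0.1)  # constant 500 mW over 0.4 s → 200 mJ
```

On real hardware, power samples would come from the platform's battery telemetry rather than a Python list, but the integral is computed the same way.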
Combining these measures, Mobile-MMLU enables holistic assessment—rewarding models that are not only accurate but also fast, resource-efficient, privacy-preserving, and user-adaptive on mobile hardware (Bsharat et al., 26 Mar 2025).
4. Benchmarking Protocol and Reporting Standards
Experiments with Mobile-MMLU are conducted on commodity smartphones (8-core ARM CPUs, optional NPU, 6–8 GB RAM, Android/iOS platforms) using frameworks such as PyTorch Mobile and TensorFlow Lite with quantization. Standard protocol includes:
- Model cold start and peak RAM measurement.
- 10 dummy inference “warmup” iterations.
- Zero-shot evaluation across the full or Pro subset with accuracy recording.
- Simultaneous tracking of latency, energy, RAM per query.
- Option randomization for order invariance validation.
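The option-randomization step above can be sketched as follows; the &lt;3% accuracy-variance tolerance comes from the benchmark's order-invariance checks, while the function names are hypothetical:

```python
import random

def shuffle_options(options: list[str], answer_idx: int, seed: int):
    # Permute the options and return the correct answer's new position.
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order.index(answer_idx)

def order_invariant(per_permutation_acc: list[float], tol: float = 0.03) -> bool:
    # Accuracy spread across permutations must stay under ~3 points.
    return max(per_permutation_acc) - min(per_permutation_acc) < tol

opts, new_idx = shuffle_options(["A", "B", "C", "D"], answer_idx=2, seed=0)
ok = order_invariant([0.61, 0.62, 0.60, 0.615])
```

Tracking the correct index through the permutation, rather than matching on answer text, keeps the check robust when distractors are near-identical in wording.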
Comprehensive reporting comprises device specifications, model size, quantization scheme, threading, and measured values for all principal metrics (accuracy %, latency ms, energy mJ, memory MB, privacy %, and post-adaptation accuracy/latency) (Bsharat et al., 26 Mar 2025).
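The protocol steps above can be condensed into a minimal harness sketch. Everything here is illustrative: `run_benchmark`, `StubModel`, and the returned report fields are hypothetical names, not the benchmark's released tooling.

```python
import time

def run_benchmark(model, questions, warmup_iters: int = 10):
    # Minimal protocol sketch: cold start, warmup, then zero-shot
    # evaluation with per-query latency bookkeeping.
    peak_ram_mb = model.load()               # cold start; returns peak RAM (MB)
    for _ in range(warmup_iters):            # dummy warmup inferences
        model.infer("warmup")
    correct, latencies = 0, []
    for q in questions:
        t0 = time.perf_counter()
        pred = model.infer(q["prompt"])
        latencies.append((time.perf_counter() - t0) * 1e3)
        correct += pred in q["answers"]      # multi-correct: any ground truth
    return {"accuracy": correct / len(questions),
            "mean_latency_ms": sum(latencies) / len(latencies),
            "peak_ram_mb": peak_ram_mb}

class StubModel:                             # stand-in for an on-device model
    def load(self):
        return 512.0
    def infer(self, prompt):
        return "A"

report = run_benchmark(StubModel(),
                       [{"prompt": "q1", "answers": {"A"}},
                        {"prompt": "q2", "answers": {"B"}}],
                       warmup_iters=2)
```

A full run would additionally sample energy and RAM per query and log the device/quantization metadata listed above alongside the report.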
5. Comparison with Established Benchmarks
Mobile-MMLU is distinguished by both domain coverage and metric granularity relative to established alternatives:
| Aspect | MMLU / MMLU-Pro | Mobile-MMLU / Mobile-MMLU-Pro |
|---|---|---|
| Domains | Academic/STEM/humanities | 80 mobile-specific domains |
| #Topics | 57 / 14 | 80 / 80 |
| Question Length | 46.7 words (avg) | 30.8 words (avg; mobile-friendly) |
| Format | Multi-choice (4–10 options) | Multi-choice (4–5), order-invariant, multi-correct |
| Metrics Emphasis | Accuracy only | Accuracy, latency, energy, memory, privacy, adaptability |
| Difficulty Spread | Tighter, less discrimination | Wider, especially for small models |
Mobile-MMLU thus expands both the scope and discriminative power of LLM evaluation, particularly in settings prioritizing edge-device deployment (Bsharat et al., 26 Mar 2025).
6. Application Scenarios and Model Design Implications
Illustrative use cases drawn directly from the dataset include: troubleshooting Bluetooth connections, home appliance maintenance diagnostics (with multi-correct answers), and ergonomics (“What is the best way to hold a smartphone to reduce strain?”). These examples demonstrate Mobile-MMLU’s coverage of practical, task-critical queries aligned with end-user needs.
Designing models to excel under Mobile-MMLU constraints typically involves:
- Quantized, transformer-lite architectures to achieve sub-100 ms latency.
- Early-exit layers for rapid responses on simple queries.
- Privacy-preserving adapters supporting on-device personalization.
- Leveraging Mobile-MMLU-Pro to identify and remedy hard failure cases that uniquely challenge resource-constrained LLMs.
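An early-exit forward pass of the kind mentioned above can be sketched as follows; the confidence threshold and per-layer classifier heads are illustrative assumptions about one common early-exit design, not a prescribed architecture:

```python
def early_exit_forward(layers, heads, x, threshold: float = 0.9):
    # Run layers in sequence; after each, a lightweight head emits class
    # probabilities — exit as soon as the top probability clears threshold.
    probs, depth = None, 0
    for layer, head in zip(layers, heads):
        depth += 1
        x = layer(x)
        probs = head(x)
        if max(probs) >= threshold:
            break                            # easy query: stop early
    return probs, depth

# Toy example with identity layers; the second head is already confident.
layers = [lambda v: v] * 3
heads = [lambda v: [0.5, 0.5],
         lambda v: [0.95, 0.05],
         lambda v: [0.99, 0.01]]
probs, depth = early_exit_forward(layers, heads, [0.0])  # exits at depth 2
```

Skipping the remaining layers on confident queries is what converts easy, situational questions into sub-100 ms responses while hard queries still traverse the full network.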
The dataset’s qualitative and quantitative depth enables benchmarking of not only generic model proficiency but also real-world readiness for mobile intelligence deployments (Bsharat et al., 26 Mar 2025).
7. Connections to Efficient Inference and MoE Advances
Mobile-MMLU serves as a critical evaluation substrate for efficient inference techniques, including recent mixture-of-experts (MoE) architectures evaluated on resource-constrained devices (Skliar et al., 2024). Cache-aware expert routing and quantized inference, when assessed via Mobile-MMLU, demonstrate substantial gains in throughput and latency—halving per-token inference times with negligible accuracy degradation by increasing DRAM cache hit rates up to 4–8× and slashing memory transfers by 80%. This aligns the benchmark’s practical intent with ongoing architectural innovations in the mobile LLM ecosystem (Skliar et al., 2024, Bsharat et al., 26 Mar 2025).