JMMMU-Pro Benchmark Overview
- JMMMU-Pro is an image-based Japanese multimodal benchmark that integrates visual and textual data into a single composite image for realistic LMM evaluation.
- It employs a scalable Vibe Benchmark Construction methodology using generative models with minimal human intervention to ensure high quality and diversity.
- Evaluation reveals a large gap between open-source and closed-source LMMs, with open-source models suffering notable accuracy drops relative to the separate text-and-image format, exposing challenges in OCR fidelity and integrated reasoning.
JMMMU-Pro is an image-based Japanese Multi-discipline Multimodal Understanding benchmark designed to evaluate large multimodal models (LMMs) on integrated visual-textual reasoning in Japanese. Inheriting its task taxonomy from JMMMU, which supplied question text and images separately, JMMMU-Pro synthesizes them into a single composite image to emulate real-world settings requiring perception, text recognition, and reasoning in a unified visual stream. Its construction leverages a scalable methodology ("Vibe Benchmark Construction") that utilizes generative models to automate dataset creation with minimal human intervention while maintaining high quality and diversity. JMMMU-Pro provides a realistic and rigorous benchmark for assessing the multimodal capabilities of LMMs in Japanese and exposes critical bottlenecks, particularly in open-source models, compared to their closed-source counterparts (Miyai et al., 16 Dec 2025).
1. Motivation and Benchmark Evolution
JMMMU-Pro addresses key limitations of existing Japanese VQA benchmarks and extends the paradigm shift introduced by MMMU-Pro in English. The original JMMMU separated question text and image inputs, a format that does not capture the complexity introduced when all information (visuals, Japanese text, and candidate choices) is embedded into a single image. Such inputs mirror practical settings, such as screenshots, in which an LMM must simultaneously perceive and interpret visual and textual content. The benchmark contains 1,320 items (720 culture-agnostic, 600 culture-specific) spanning 28 college-level disciplines and is constructed so that text and graphics are truly inseparable.
JMMMU-Pro is the first large-scale Japanese benchmark embedding text and graphics into one image, offering a direct analog to the English MMMU-Pro but targeting Japanese-specialized LMM architectures and training regimes. Its goals are to:
- Evaluate open-source LMMs on complex, university-level Japanese questions where text and image are fused,
- Embed the same JMMMU items into visually and structurally diverse backgrounds and layouts,
- Provide scalable, reproducible dataset construction via prompt-driven generative pipelines.
2. Vibe Benchmark Construction Methodology
Dataset creation follows the Vibe Benchmark Construction methodology. A high-quality image generative model (Nano Banana Pro, accessed through the gemini-3-pro-image-preview API) composes composite images incorporating illustration, Japanese question text, and multiple-choice options. Human annotators intervene only to verify correspondence, layout, legibility, and content integrity, thus reducing manual labor compared to traditional approaches.
The prompt parameterization is formalized as uniform sampling over six discrete layout and visual variables:
- background ∈ {workbook, exam_sheet, whiteboard, blackboard, projector, iPad_notebook, webpage, Nintendo_Switch, TV_quiz_show}
- background_color ∈ {white, light_green, light_yellow, light_pink, light_gray, light_blue}
- font ∈ {handwritten, computer_thick, computer_thin, manga_style, …}
- margin ∈ {small, large}
- state ∈ {photo_smartphone, screenshot_PC, screenshot_smartphone}
- aspect_ratio ∈ {9:16, 16:9, 3:4, 1:1}
Each composite image is generated at 1024×1024 resolution, with the six layout variables sampled uniformly from their respective sets. Human verification ensures that (i) the text matches the original precisely, (ii) the original content is faithfully embedded, (iii) legibility exceeds 99%, and (iv) the layout is natural. On the first round, 71% of images are accepted; most failures are corrected by prompt tuning. Hard-to-generate cases (about 5%), such as items with complex formulas, are constructed manually outside the automated pipeline.
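The following is a minimal sketch of this parameter sampling, assuming the variable sets listed above; the prompt template, helper names, and example question are illustrative placeholders rather than the authors' released pipeline code.

```python
import random

# Discrete prompt parameter sets reported for JMMMU-Pro (values from the list above).
PROMPT_SPACE = {
    "background": ["workbook", "exam_sheet", "whiteboard", "blackboard", "projector",
                   "iPad_notebook", "webpage", "Nintendo_Switch", "TV_quiz_show"],
    "background_color": ["white", "light_green", "light_yellow", "light_pink",
                         "light_gray", "light_blue"],
    "font": ["handwritten", "computer_thick", "computer_thin", "manga_style"],
    "margin": ["small", "large"],
    "state": ["photo_smartphone", "screenshot_PC", "screenshot_smartphone"],
    "aspect_ratio": ["9:16", "16:9", "3:4", "1:1"],
}

def sample_layout(rng: random.Random) -> dict:
    """Sample one layout configuration uniformly from each discrete parameter set."""
    return {key: rng.choice(values) for key, values in PROMPT_SPACE.items()}

def build_prompt(question: str, choices: list[str], layout: dict) -> str:
    """Compose a (hypothetical) image-generation prompt embedding question and choices."""
    option_labels = ["ア", "イ", "ウ", "エ"]
    options = "\n".join(f"{label}. {c}" for label, c in zip(option_labels, choices))
    return (
        f"Render a {layout['state']} of a {layout['background']} "
        f"({layout['background_color']} background, {layout['font']} font, "
        f"{layout['margin']} margins, aspect ratio {layout['aspect_ratio']}) "
        f"containing the original illustration, the Japanese question below, "
        f"and its four choices:\n{question}\n{options}"
    )

rng = random.Random(0)
layout = sample_layout(rng)
prompt = build_prompt("sin(x) の周期はどれか。", ["π", "2π", "π/2", "4π"], layout)
# The resulting prompt would then be sent to the image model (e.g., the
# gemini-3-pro-image-preview API mentioned above) and the output passed to human verification.
```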
3. Dataset Composition and Structure
Each item in JMMMU-Pro is a single PNG containing:
- The original illustration or diagram,
- The Japanese question text (sometimes with explicit `<image>` tags),
- Four multiple-choice options labeled ア, イ, ウ, エ.
The layouts are derived from the uniformly sampled prompt variables, controlling background type, font, color, margin, screenshot or photo status, and aspect ratio. This produces substantial diversity:
| Statistic | Value/Distribution |
|---|---|
| Number of questions | 1,320 (600 culture-specific; 720 culture-agnostic) |
| Disciplines | 28 total (e.g., Japanese Art, Heritage, History, Science, Technology, Business, Health) |
| Backgrounds | workbook (21%), exam_sheet (20%), TV (18%), whiteboard (12%), iPad (7%), etc. |
| Colors | white (35%), light_green (16%), light_yellow (18%), others |
| Fonts | five main styles (incl. manga_style, handwritten) |
| Margin | small (45%), large (55%) |
| State | photo (50%), screenshot (50%) |
| Text legibility | >99% human-annotated reading accuracy |
This tabular summary organizes statistics explicitly defined in (Miyai et al., 16 Dec 2025).
Text embedding quality is confirmed to remain high across all visual configurations, with <1% distortion or error. Layouts preserve a natural Japanese reading order, with the image typically placed at the top or left and textual elements below or to the right, conditioned on aspect ratio and margin.
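Putting the composition together, a minimal sketch of how an individual benchmark item might be represented is given below; the field names are my own choice for illustration and may differ from the released data format.

```python
from dataclasses import dataclass

@dataclass
class JMMMUProItem:
    """One JMMMU-Pro question: a single composite PNG plus evaluation metadata.

    Field names are illustrative; the released benchmark may use a different schema.
    """
    image_path: str          # composite PNG embedding illustration, question text, and choices
    discipline: str          # one of the 28 college-level disciplines
    culture_specific: bool   # True for the 600 culture-specific items, False for the 720 culture-agnostic ones
    answer: str              # ground-truth choice label: "ア", "イ", "ウ", or "エ"
    layout: dict             # sampled prompt parameters (background, font, margin, state, aspect_ratio, ...)

item = JMMMUProItem(
    image_path="images/japanese_history_0123.png",   # hypothetical path
    discipline="Japanese History",
    culture_specific=True,
    answer="イ",
    layout={"background": "whiteboard", "font": "manga_style", "margin": "large",
            "state": "photo_smartphone", "aspect_ratio": "16:9"},
)
```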
4. Evaluation Protocol and Experimental Results
Evaluation is conducted in a strict zero-shot setting with temperature = 0 and sufficiently high max_token limits. Fourteen LMMs are compared, grouped by openness and language specialization: closed-source (GPT-5.2, Gemini3Pro), multilingual open-source (Qwen3VL-8B, Qwen2.5VL-7B, Phi-4-multimodal, AyaVision-8B, Pangea-7B), English-centric open-source (LLaVA-OV-1.5-8B, LLaVA-OV-7B, InternVL2.5-8B), and Japanese open-source (Sarashina2.2-Vision-3B, Sarashina2-Vision-14B, Sarashina2-Vision-8B, Heron-NVILA-Lite-15B).
The top-1 accuracy metric is employed:

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right],$$

where $\hat{y}_i$ denotes the model's answer to the $i$-th question and $y_i$ the ground truth.
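A minimal sketch of the zero-shot protocol and this metric is shown below, reusing the hypothetical item schema sketched in Section 3; `query_model` is an assumed wrapper around an LMM API (greedy decoding, temperature = 0), and the answer-extraction step is simplified.

```python
from typing import Callable

CHOICE_LABELS = ("ア", "イ", "ウ", "エ")

def evaluate(items: list, query_model: Callable[[str, float], str]) -> float:
    """Zero-shot top-1 accuracy: fraction of items whose predicted label matches ground truth."""
    correct = 0
    for item in items:
        # query_model is a hypothetical wrapper around the LMM API; it receives the
        # composite image path and decodes greedily (temperature = 0).
        response = query_model(item.image_path, 0.0)
        # Simplified answer extraction: take the first choice label appearing in the response.
        predicted = next((label for label in CHOICE_LABELS if label in response), None)
        correct += int(predicted == item.answer)
    return correct / len(items)
```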
Experimental findings demonstrate a substantial gap between closed-source and open-source models:
| Model | JMMMU-Pro Accuracy | JMMMU (text and image separate) | Relative Drop | OCR Accuracy Correlation |
|---|---|---|---|---|
| Gemini3Pro | 87.04% | — | — | — |
| GPT-5.2 | 83.33% | — | — | — |
| Qwen3VL-8B | 47.27% | up to 23 pts higher | see below | — |
| Qwen2.5VL-7B | 45.00% | up to 23 pts higher | — | — |
| Random baseline | 27%–28% | — | — | — |
Open-source models suffer up to a 23-point drop in accuracy vs. the traditional (text+image separate) format, underscoring the challenge of fully integrated perception.
OCR accuracy correlates with overall performance, but pairs of models (e.g., Heron-NVILA and Sarashina2.2-Vision) can have similar OCR accuracy yet diverge by more than 20% in VQA performance, indicating that text recognition is necessary but not sufficient for reasoning.
Chain-of-Thought prompting improved 7/12 open-source models on JMMMU-Pro; in contrast, only 3/12 benefited on JMMMU. Prompt effectiveness often varies on a per-model, per-dataset basis, suggesting the need for adaptive prompt tuning.
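As an illustration, the two prompting conditions compared here might look like the following; the wording is a placeholder of my own and not the paper's exact prompts.

```python
# Direct-answer prompt: ask only for the choice label (hypothetical wording).
DIRECT_PROMPT = (
    "画像内の問題を読み、ア・イ・ウ・エの中から正しい選択肢を一つだけ答えてください。"
)

# Chain-of-Thought prompt: ask the model to transcribe and reason before answering
# (hypothetical wording).
COT_PROMPT = (
    "画像内の問題文と選択肢をまず書き起こし、段階的に推論した上で、"
    "最後にア・イ・ウ・エのいずれかで答えてください。"
)
```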
5. Error Analysis and Qualitative Insights
Observed failure modes from experimental analysis fall into two major categories:
- Perceptual errors: Small or stylized text is misread, or key diagrams are incorrectly localized. For example, manga-style or handwritten text causes misrecognition, and overlapping visual elements hinder correct answer identification.
- Reasoning errors: Even after successful text recognition, models often fail to apply necessary domain knowledge, misinfer relationships between visual and textual cues, or succumb to layout confusions.
Cases reveal that contemporary open-source LMMs require improved Japanese OCR, robust layout understanding, and domain-transfer reasoning to match human or closed-model capabilities. These insights are consistent across different visual backgrounds, fonts, and state settings, confirming that JMMMU-Pro exposes latent brittleness not evident in simpler input modalities.
6. Comparative Position and Recommendations for Research
JMMMU-Pro advances the Japanese VQA landscape beyond JDocQA and MangaVQA, which either center on document images or comic frames without higher-level reasoning demands. In contrast, JMMMU-Pro enforces college-level knowledge combined with integrated multimodal perception.
Key directions and recommendations for open-source LMM development include:
- Integrate specialized Japanese OCR or pretrain on corpora comprising Japanese text-within-image contexts,
- Jointly finetune on synthetic composites sampled from the Vibe Benchmark Construction method to increase layout robustness,
- Deploy multimodal Chain-of-Thought modules to alternate between OCR and domain reasoning,
- Expand training data to match the observed diversity of backgrounds, fonts, and layouts sampled in JMMMU-Pro.
A plausible implication is that no single module (OCR, CoT, vision encoder) suffices; an end-to-end approach tightly coupling perception, recognition, and inference may be necessary for robust Japanese multimodal performance.
7. Illustrative Examples and Benchmark Impact
Typical JMMMU-Pro samples illustrate the benchmark’s demand for integrated reasoning:
- A whiteboard-style photo (manga font, large margin) shows a map of Edo-period Japan; the question queries the year Tokugawa Ieyasu was appointed as shōgun, with four year choices.
- A simulated TV quiz show layout presents a sinusoidal graph and the question “What is the period of sin(x)?” with standard mathematical choices (π, 2π, etc.).
These examples emphasize the richness of layouts, background variation, and the necessity for joint reading and visual inference. Consequently, JMMMU-Pro provides a challenging, realistic standard for LMM assessment in Japanese and offers a reproducible pathway—via the Vibe Benchmark Construction approach—for future large-scale multimodal VQA dataset creation (Miyai et al., 16 Dec 2025).