OpenCompass Multi-modal Academic Benchmark

Updated 1 July 2025
  • OpenCompass is a comprehensive benchmark that evaluates multi-modal models using unified instruction-tuned datasets for academic tasks.
  • It standardizes tasks by converting complex datasets into clear (instruction, input, response) triplets, covering recognition, generation, and document analysis.
  • The platform enhances research by enabling reproducible evaluation and robust multi-stage instruction tuning, advancing AI in scholarly domains.

The OpenCompass Multi-modal Academic Benchmark is a comprehensive, open-source evaluation platform designed to systematically assess multi-modal large language models (MLLMs) across a spectrum of academic and real-world vision-language tasks. Built around the Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark (LAMM) (arXiv:2306.06687), OpenCompass provides standardized datasets, instruction formats, and benchmark procedures, enabling reproducible and fair comparison of MLLM capabilities. Its primary focus is to facilitate rigorous training, extension, and evaluation of academic AI agents that bridge the gap between human ideas and machine execution, particularly in scientific and technical domains where multi-modal reasoning is essential.

1. Dataset Composition and Task Coverage

OpenCompass, through the LAMM benchmark, aggregates and reformulates a wide set of established academic vision benchmarks. The covered task types include:

  • Recognition and Understanding: Encompassing visual classification, detection, visual grounding, referring expression comprehension, and visual reasoning tasks.
  • Multi-modal Generation: Addressing tasks such as image captioning, multi-image storytelling, and visual dialog.
  • Document and Scientific Tasks: Integrating text and tabular understanding, OCR, chart and diagram interpretation, and science Q&A.

Key datasets instruction-tuned and unified within OpenCompass include VQA, OK-VQA, ScienceQA, TextVQA, ChartQA, RefCOCO, SEED, MMVet, and academic OCR tasks. Each sample is reformulated into an (instruction, input, response) triplet, with instructions explicitly crafted to simulate human-like, task-specific queries. Both 2D and 3D vision problems are represented, ensuring diverse coverage across the academic landscape.
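To make the unified format concrete, the following minimal sketch models one benchmark sample as a Python dataclass. The field names, task-category labels, and example values are illustrative assumptions, not the benchmark's actual schema.

from dataclasses import dataclass, field
from typing import List

# Illustrative task categories mirroring the coverage listed above (assumed labels).
TASK_TYPES = {"recognition", "generation", "document_science"}

@dataclass
class BenchmarkSample:
    """One unified (instruction, input, response) triplet (hypothetical schema)."""
    task_type: str                                   # one of TASK_TYPES
    source_dataset: str                              # e.g., "ScienceQA", "RefCOCO"
    instruction: str                                 # human-like, task-specific query
    inputs: List[str] = field(default_factory=list)  # modality tokens plus content references
    response: str = ""                               # expected answer or explanation

sample = BenchmarkSample(
    task_type="document_science",
    source_dataset="ChartQA",
    instruction="What trend does the chart show between 2019 and 2021?",
    inputs=["<image>", "chart_0142.png"],
    response="The plotted value rises steadily across the three years.",
)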

2. Instruction-Tuning Methodology and Benchmark Construction

The instruction-tuning process underlying the OpenCompass benchmark systematically converts samples from academic datasets as follows:

  • Task Paraphrasing: Each academic challenge is standardized using a multi-modal instruction template, harmonizing intent and structure.
  • Triplet Creation: For each example, an (instruction, input, response) set is generated:
    • Instruction mimics human prompts, specifying the information sought.
    • Input includes the multi-modal content (e.g., image and associated text).
    • Response is the expected answer or explanation.
  • Quality Assurance: Manual and automated validation ensures instructions are clear and responses are accurate, reducing label noise and ambiguity. Human inspection and heuristic filtering scripts are routinely applied.
  • Benchmark Assembly: The evaluation benchmark combines the test splits of these datasets into a unified prompt-based format, supporting direct evaluation of any compatible MLLM.

This methodology provides comprehensive coverage—from straightforward recognition to complex, open-ended reasoning—in an academically relevant, instruction-following format.
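A minimal sketch of this conversion pipeline is given below, assuming a simple VQA-style source record; the instruction template, field names, and filtering heuristics are illustrative, not the benchmark's actual scripts.

# Illustrative conversion of a raw VQA-style record into an
# (instruction, input, response) triplet, followed by a heuristic quality filter.

INSTRUCTION_TEMPLATE = "Answer the question about the given image: {question}"  # assumed template

def to_triplet(record: dict) -> dict:
    """Paraphrase a raw sample into the unified triplet format."""
    return {
        "instruction": INSTRUCTION_TEMPLATE.format(question=record["question"].strip()),
        "input": ["<image>", record["image_path"]],
        "response": record["answer"].strip(),
    }

def passes_heuristics(triplet: dict) -> bool:
    """Drop empty, trivially short, or placeholder responses (illustrative rules)."""
    response = triplet["response"]
    return len(response) > 1 and response.lower() not in {"n/a", "unknown"}

raw = {"question": "What color is the flask?", "image_path": "img_001.jpg", "answer": "blue"}
triplet = to_triplet(raw)
if passes_heuristics(triplet):
    print(triplet["instruction"], "->", triplet["response"])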

3. Framework and Model Training Infrastructure

The OpenCompass–LAMM framework offers a flexible training and evaluation infrastructure for MLLMs, designed with modality extension and ease of adoption in mind. Core architectural elements include:

  • Multi-stage Instruction Tuning: Pre-training on large, generic multi-modal instruction-tuning data, followed by task-specific fine-tuning on academic targets.
  • Flexible Input Pipeline: Support for arbitrary combinations of image, text, and tabular input, using modality-specific tokens (e.g., <image>, <ocr>).
  • Modality Fusion: Employs lightweight adapters or projectors for cross-modal alignment, with losses such as cross-entropy for text generation and contrastive loss for multi-modal representation fusion:

\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{CE}} + \lambda_2 \mathcal{L}_{\text{contrastive}}

This is extended to balance generation and alignment per sample (a code sketch follows this list):

\mathcal{L}_{\text{LAMM}} = \sum_{i=1}^{N} \left( \alpha \cdot \mathcal{L}_{\text{gen}}^{(i)} + (1-\alpha) \cdot \mathcal{L}_{\text{align}}^{(i)} \right)

  • Optimized Training: The framework supports vision-language backbone pairings (e.g., ViT+Llama, CLIP+Vicuna), with baseline models benchmarked on both A100 and competitive consumer GPUs.
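The loss combination above can be sketched as follows in PyTorch; the symmetric CLIP-style form of the contrastive term, the temperature, and the weighting values are assumptions for illustration, not the framework's documented implementation.

import torch
import torch.nn.functional as F

def combined_loss(text_logits, text_targets, image_emb, text_emb,
                  lambda_ce=1.0, lambda_con=0.5, temperature=0.07):
    """L_total = lambda_1 * L_CE + lambda_2 * L_contrastive (weights are illustrative)."""
    # Token-level cross-entropy for text generation.
    # text_logits: (batch, seq_len, vocab); text_targets: (batch, seq_len)
    l_ce = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())

    # Symmetric contrastive loss over L2-normalized image/text embeddings (assumed form).
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    l_con = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    return lambda_ce * l_ce + lambda_con * l_con

The per-sample LAMM variant can be obtained analogously by weighting each sample's generation and alignment terms with α and 1−α before summing.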

Typical training involves 8–16 A100 GPUs (40GB each) for 7B–13B LLM backbones, batch sizes of 32–128, and runs of 30–50K steps, using AdamW optimization and cosine learning rate scheduling.
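A hedged sketch of such a training setup, using AdamW with cosine learning-rate scheduling as stated above; the learning rate, weight decay, and the toy model standing in for a vision-language backbone are assumptions.

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(1024, 1024)       # placeholder for a ViT+LLM backbone pairing
total_steps = 40_000                      # within the 30-50K step range quoted above

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)  # assumed hyperparameters
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(3):                     # in practice, loop for total_steps
    batch = torch.randn(32, 1024)         # dummy batch; size within the quoted 32-128 range
    loss = model(batch).pow(2).mean()     # stand-in for the combined loss in the previous sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()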

4. Experimental Observations and Analytical Insights

Experimental validation using OpenCompass–LAMM demonstrates:

  • Performance Advantage: Models instruction-tuned on the benchmark consistently outperform those tuned on generic or synthetic conversational datasets. For example, accuracy improvements of 5–10% on tasks such as VQA and ScienceQA are observed.
  • Generalization and Robustness: The benchmarked MLLMs show superior robustness to prompt variance and improved transferability to previously unseen academic tasks.
  • Instruction Sensitivity: Clarity and precision of instructions have a measurable impact; ambiguous or poorly formulated prompts degrade model performance substantially.
  • Error Analysis: Principal failure modes include poor alignment between modalities and misinterpretation of complex visual-textual relationships, highlighting the need for more advanced multi-hop and cross-modal reasoning approaches.

An illustrative result table from the benchmark:

Task      | General MLLM accuracy | LAMM-tuned accuracy
VQA       | 66.5%                 | 72.8%
ScienceQA | 56.3%                 | 64.0%
TextVQA   | 65.8%                 | 71.2%

This demonstrates the discriminatory power and relevance of the OpenCompass evaluation approach.

5. Applications and Future Directions

OpenCompass is positioned as a universal, extensible benchmark for MLLM research and deployment in academic contexts. Key applications include:

  • Academic Model Benchmarking: Standardizing the evaluation of vision-LLMs on scientific and technical tasks.
  • Instruction Tuning Analysis: Enabling rigorous study of instruction format, clarity, and style effects in multi-modal learning.
  • Advanced Reasoning Research: Providing an experimental testbed for the study of complex multi-modal, multi-hop, and cross-modal scientific reasoning.
  • Ecosystem Extension: The framework is explicitly designed to facilitate rapid extension to additional modalities (audio, video, 3D), novel tasks, and richer dialogue agents.

Prospective directions include support for step-wise, chain-of-thought reasoning and real-world academic competition tasks ("Olympiad"-style), as well as richer, multi-turn, research-level dialogues.

6. Contribution to the Field and Benchmarking Best Practices

By standardizing dataset construction, evaluation prompts, and instruction-following protocols, OpenCompass through LAMM establishes a reproducible and extensible platform for academic MLLM evaluation. The introduction of unified multimodal instruction-tuning datasets and benchmarks has accelerated research progress, improved comparability of new models, and provided a focal point for community collaboration in academic AI. The benchmark's demonstrated ability to reflect model strengths and weaknesses across a diversity of real-world academic tasks has led to its adoption as a reference standard for multi-modal evaluation.

7. Example Data Structure and Evaluation Formula

A typical item in the benchmark adheres to the following structure:

Instruction: "Given the image and associated graph, what is the main scientific conclusion?"
Input: <image> <chart>
Response: "The chart shows a strong positive correlation between sunlight and plant growth rate."

Core formula for evaluation (accuracy):

\text{Accuracy} = \frac{\text{Number of Correctly Answered Questions}}{\text{Total Number of Questions}}

This structure supports direct, fair, and comprehensive evaluation of multi-modal academic models.
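A minimal sketch of this accuracy computation over parallel lists of model predictions and reference answers; exact-match, case-insensitive scoring is an assumption here, since individual tasks may apply their own matching rules.

def accuracy(predictions, references):
    """Fraction of questions answered correctly (exact-match, case-insensitive)."""
    assert len(predictions) == len(references)
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references) if references else 0.0

print(accuracy(["blue", "14"], ["Blue", "15"]))  # 0.5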


OpenCompass, as anchored by the LAMM dataset and framework, represents a principal open academic benchmark for advancing the training, evaluation, and deployment of MLLMs in scholarly and STEM-specific domains. Its rigorous methodology, extensibility, and empirical effectiveness underpin ongoing efforts to bridge the gap between human-level academic reasoning and artificial intelligence.

References
  1. LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark (arXiv:2306.06687).