SciCap Challenge 2023: Benchmarking Figure Captioning
- SciCap Challenge 2023 is a benchmark evaluating AI systems for generating descriptive and informative captions based on multimodal scientific data.
- The challenge integrates figure images, textual references, and OCR metadata to assess model performance using both automatic metrics and expert evaluations.
- Findings reveal that state-of-the-art LMMs, notably GPT-4V, can produce captions that expert evaluators rank above author-written ones, while highlighting open questions in domain generality and metric adequacy.
The SciCap Challenge 2023 is a landmark evaluation and benchmarking task aimed at advancing automated generation of descriptive and informative captions for scientific figures in scholarly documents. Initiated in response to the proliferation of large multimodal models (LMMs) and increased awareness of the importance of figure accessibility, the challenge mobilized global teams to address interdisciplinary requirements—spanning computer vision, natural language processing, and academic communication—with rigorous datasets and evaluation protocols. Its findings highlight both the substantial recent progress and the open research questions surrounding caption quality, domain generality, and metric adequacy.
1. Objectives and Scope
The core objective of the SciCap Challenge 2023 was to systematically evaluate the ability of automated systems to produce high-quality captions for scientific figures across varied domains and figure types. Models were required to synthesize visual content (figure images), associated textual context (referencing paragraphs), and in-figure metadata (OCR text). The challenge’s framing explicitly targeted the intersection of image captioning and document understanding, requiring outputs that not only describe but also interpret and contextualize figure content as it appears in scholarly articles.
An expanded version of the SciCap dataset was provided: 476,389 single-panel figures from arXiv articles (2010–2020), covering Computer Science, Mathematics, Physics, Statistics, Economics, Electrical Engineering and Systems Science, Quantitative Biology, and Quantitative Finance. Each instance comprised the figure image, author-written caption, paragraphs mentioning the figure, and OCR-extracted in-figure text, with figures classified into node diagrams, equations, graph plots, scatterplots, and bar charts using FigureSeer.
2. Datasets and Evaluation Metrics
The challenge dataset operationalized multimodal captioning: participants had access to figure images, author captions, OCR tokens, and referencing paragraphs. The dataset was filtered for quality: figures were restricted to single panels, train/validation/test splits were aligned at the document level, and caption texts were normalized.
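To make the multimodal inputs concrete, the sketch below models one challenge instance as a Python dataclass. The field names and helper are illustrative assumptions for exposition, not the dataset's actual schema or loader.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SciCapInstance:
    """Illustrative record for one single-panel figure (field names assumed, not official)."""
    figure_id: str                  # e.g., an arXiv paper ID plus a figure index
    image_path: str                 # path to the extracted figure image
    caption: str                    # author-written caption (the generation target)
    ocr_tokens: List[str] = field(default_factory=list)              # in-figure text from OCR
    referencing_paragraphs: List[str] = field(default_factory=list)  # body text mentioning the figure
    figure_type: str = "graph plot"   # e.g., node diagram, equation, scatterplot, bar chart
    domain: str = "cs"                # arXiv subject area, e.g., cs, math, physics

def build_text_context(inst: SciCapInstance) -> str:
    """Assemble the textual context a captioning model would condition on."""
    paragraphs = "\n".join(inst.referencing_paragraphs)
    ocr = " ".join(inst.ocr_tokens)
    return f"Referencing paragraphs:\n{paragraphs}\n\nIn-figure text (OCR): {ocr}"
```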
Evaluation metrics included:
- BLEU-4: Standard n-gram overlap.
- ROUGE-1, ROUGE-2, ROUGE-L: Measures capturing recall-oriented overlap with human captions; normalized to mitigate length differences.
- Human Ranking: Professional editors with technical writing experience directly compared outputs from each system under varied length constraints.
This dual strategy enabled explicit comparison between automatic metrics and expert judgment, with normalization methodologies (e.g., Random(length) adjustments for ROUGE) designed to reduce bias.
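As a concrete illustration of the automatic metrics, the snippet below scores a candidate caption against a reference using off-the-shelf implementations (nltk for BLEU-4, rouge_score for ROUGE). This is an independent sketch, not the challenge's official scoring pipeline, and the Random(length) normalization is omitted.

```python
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Accuracy of the proposed model versus the baseline as training data increases."
candidate = "The figure shows that the proposed model outperforms the baseline as more training data is added."

# BLEU-4: geometric mean of 1- to 4-gram precision, smoothed for short captions.
bleu4 = sentence_bleu(
    [reference.split()],
    candidate.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1/2/L: n-gram and longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU-4: {bleu4:.3f}")
for name, result in rouge.items():
    print(f"{name}: F1={result.fmeasure:.3f}")
```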
3. Model Performance and Findings
Teams employed state-of-the-art LMMs, fine-tuned summarization models (e.g., Pegasus), and custom architectures leveraging multimodal fusion. Notable systems included open LMMs such as UniChart and winning submissions from NJUST and USTC.
A key experimental outcome was the consistent expert preference for captions generated by GPT-4V, which utilized both figure images and referencing paragraphs. Both unconstrained and length-constrained generations from GPT-4V were rated higher than alternative model outputs and—even more strikingly—higher than the original human-written captions supplied by article authors. Reasons cited included sufficiency of detail and clarity of takeaway message, reflecting superior contextualization and informativeness.
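The sketch below shows one way a figure image and its referencing paragraph might be combined into a single request to an OpenAI-style vision model. The model name, prompt wording, and word-count constraint are assumptions for illustration; this does not reproduce any team's actual pipeline or the challenge's GPT-4V prompting setup.

```python
# pip install openai
import base64
from openai import OpenAI

def caption_figure(image_path: str, referencing_paragraph: str, max_words: int = 50) -> str:
    """Request a figure caption from a vision-capable chat model (illustrative only)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed placeholder; the challenge evaluated GPT-4V
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Write a caption for this scientific figure in at most "
                            f"{max_words} words. Use the paragraph that references it "
                            f"for context:\n\n{referencing_paragraph}"
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content
```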
Automatic metrics sometimes favored summarization-based systems but failed to fully capture the qualitative strengths identified by human reviewers. These findings suggest that standard overlap-based metrics remain limited in assessing nuanced quality, particularly informativeness and synthesis.
4. Analysis and Open Questions
The challenge’s multi-phase studies revealed several insights:
- Domain and Figure-Type Variance: Performance distributions varied across different academic fields and figure types. Some models excelled in domains like Computer Science but underperformed in fields with more diagrammatic or mathematical content.
- Metric Correlation: There was a disconnect between BLEU/ROUGE scores and human preferences, particularly when captions were verbose or lacked focus on figure takeaways.
- Generality of LMMs: LMMs maintained strong performance on unseen papers published after their training cut-offs, supporting claims of robust generalization while also raising questions about potential data contamination and about transfer to fields not represented on arXiv.
A plausible implication is that advanced LMMs, particularly GPT-4V, now set the practical state-of-the-art in scientific figure captioning. However, the task is not “solved”—progress in metrics, domain customization, and factual accuracy remains necessary.
5. Practical and Scientific Impact
The SciCap Challenge 2023 substantially advances the capabilities available to researchers, digital library curators, and academic publishers:
- Research Communication: Automated systems can now produce captions that, in expert rankings, match or exceed the clarity and informativeness of author-written ones, with the potential to improve manuscript accessibility and understanding.
- Indexing and Retrieval: Rich, context-aware captions facilitate semantic search and figure retrieval in scholarly databases.
- Accessibility: Enhanced captions directly support visually impaired readers and non-experts, making technical content more inclusive.
- Efficiency in Scholarly Writing: Caption suggestion systems can assist authors, reducing the cognitive burden associated with figure description and interpretation.
6. Future Directions
Research directions highlighted include:
- Improving Metric-Quality Alignment: New evaluation protocols are required to bridge the gap between expert judgment and automatic scores, capturing aspects such as clarity of inference and informativeness.
- Customization for Use Cases: Adaptive captioning systems that tailor content for different audiences are needed, balancing brevity and detail.
- Expanding Domain Coverage: Validation in non-arXiv domains (e.g., medicine, biology) and addressing data contamination risks for closed-source LMMs are critical.
- Integration with HPC and Data-Driven Methods: As discussed in the DOE/NSF Correctness workshop (Gokhale et al., 2023), attention to correctness and reproducibility in ML-augmented workflows is increasingly important for scientific validity.
7. Concluding Remarks
The SciCap Challenge 2023, and its accompanying benchmarking framework, mark a pivotal step in automated scientific figure captioning. The dominance of LMMs—particularly GPT-4V—in expert evaluation signals a rapid closing of the gap between human and machine performance. However, continuing challenges in evaluation, quality control, and domain extension ensure this remains a dynamic research area with ongoing need for interdisciplinary innovation and assessment.