
Scientific Figure Captioning

Updated 1 January 2026
  • Scientific figure captioning is the task of automatically generating precise descriptions for academic figures by integrating the figure image, body text, and OCR-extracted text.
  • Recent advances leverage multimodal transformers and personalized systems to enhance both factual accuracy and stylistic fidelity.
  • Robust datasets and new benchmarks drive progress while exposing the limitations of evaluation metrics such as BLEU and ROUGE relative to human-aligned ratings.

Scientific figure captioning is the task of automatically generating descriptive, informative, and contextually appropriate textual descriptions for scientific figures—such as plots, charts, diagrams, or photographic images—occurring in scholarly articles. Unlike generic image captioning, scientific figure captioning demands precise multimodal reasoning, domain-aware summarization, and stylistic fidelity to the conventions of scientific writing. The field has evolved rapidly since 2021, driven by the release of large-scale multimodal datasets, diverse model architectures, and community benchmarks that target both the factual accuracy and the communicative value of generated captions.

1. Foundations and Problem Formulation

The core objective of scientific figure captioning is to produce a caption $y$ for a given scientific figure, leveraging a range of available modalities:

  • The figure image $V$
  • Body text referencing or describing the figure ($P$)
  • Text extracted via OCR ($O$), e.g., axis labels or legends
  • In some formulations, additional metadata (paper title, author, abstract) or author-specific writing-style exemplars

This is commonly formalized as probabilistic sequence generation:

$$y = \operatorname{arg\,max}_{y}\; p(y \mid V, P, O)$$

The critical difference from vision-only captioning is the dense cross-modal alignment—up to 75% of caption tokens typically derive, verbatim or paraphrased, from body text (figure-mentioning paragraphs) or in-figure text (Huang et al., 2023, Huang et al., 25 Dec 2025).
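To make this formulation concrete, the sketch below gathers the three modalities into a single structure and shows how the textual context ($P$ and $O$) might be serialized into a conditioning prompt. The `FigureContext` fields and the commented-out `mllm.generate` call are illustrative assumptions, not a particular system's API.

```python
from dataclasses import dataclass

@dataclass
class FigureContext:
    """Inputs for scientific figure captioning: y = argmax_y p(y | V, P, O)."""
    image_path: str                 # V: the figure image
    mention_paragraphs: list[str]   # P: body-text paragraphs that reference the figure
    ocr_tokens: list[str]           # O: text extracted from the figure (axis labels, legends)

def build_prompt(ctx: FigureContext) -> str:
    """Flatten the textual modalities into a conditioning prompt.

    The image itself would be passed to the model separately as pixels;
    this only shows how P and O are serialized into the text context.
    """
    return (
        "Write a caption for the scientific figure.\n"
        f"Text inside the figure (OCR): {', '.join(ctx.ocr_tokens)}\n"
        "Paragraphs mentioning the figure:\n"
        + "\n".join(ctx.mention_paragraphs)
    )

# Hypothetical usage with any multimodal LLM client (placeholder, not a real API):
# caption = mllm.generate(image=ctx.image_path, prompt=build_prompt(ctx))
```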

2. Dataset Development and Benchmarks

The emergence of robust, large-scale datasets has been fundamental. Key corpora and their characteristics include:

| Dataset | # Figures | Modalities | Annotation | Domain Coverage | Reference |
|---|---|---|---|---|---|
| SciCap | 2M (full), 133.5k (release) | Image, caption, OCR, mention paragraph | Author-written | CS, ML (2021); multi-domain (2023+) | (Hsu et al., 2021; Huang et al., 25 Dec 2025) |
| SciCap+ | 414k | +OCR tokens, +mention paragraph | Author-written | CS | (Yang et al., 2023) |
| FigCaps-HF | 133.5k | Human-rated quality (400 pairs) | Human feedback | CS | (Singh et al., 2023) |
| LaMP-Cap | 110.8k | Multimodal profile (image, caption, paragraph) | Author + context profile | Multi-domain | (Ng et al., 6 Jun 2025) |
| MMSci | 514k captions, 742k figures | Long, graduate-level, peer-reviewed | Peer-reviewed | 72 disciplines (Nature Commun.) | (Li et al., 2024) |
| MSEarth | 44k | Refined, context-enriched | Extended by LLM + human | Earth sciences | (2505.20740) |

SciCap and its successors emphasize structured extraction of figure images, original captions, in-context paragraphs, and in-figure OCR text. LaMP-Cap establishes the first benchmark for personalized captioning, attaching up to three profile figures (with image, caption, mention-paragraph per profile) per target. The MMSci and MSEarth datasets specifically address domain diversity and caption depth, including multi-panel figures and demanding knowledge synthesis from entire articles (Li et al., 2024, 2505.20740).
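These corpora share a broadly similar record layout. The sketch below shows an assumed SciCap-style JSON record and a minimal loader; the field names and directory layout are illustrative, not the datasets' exact schemas.

```python
import json
from pathlib import Path

# Illustrative record layout for a SciCap-style corpus; actual field names
# differ between releases, so treat this schema as an assumption.
example_record = {
    "figure_id": "1234.5678v1-Figure2-1",
    "image_file": "1234.5678v1-Figure2-1.png",
    "caption": "Figure 2: Test accuracy versus number of training epochs.",
    "ocr_tokens": ["epochs", "accuracy", "baseline", "ours"],
    "mention_paragraphs": [
        "As shown in Figure 2, the proposed model converges faster than the baseline."
    ],
}

def load_records(split_dir: str):
    """Yield one dict per figure from a directory of per-figure JSON files."""
    for path in Path(split_dir).glob("*.json"):
        with open(path, encoding="utf-8") as f:
            yield json.load(f)

# for record in load_records("scicap/train"):
#     print(record["figure_id"], len(record["mention_paragraphs"]))
```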

3. Model Architectures and Approaches

Captioning models have evolved from attention-based ResNet+LSTM encoder-decoders to multimodal transformers that integrate state-of-the-art language and vision backbones.

Many top-performing systems treat captioning as context summarization, relying heavily on figure-mentioning paragraphs and OCR text, while modern multimodal LLMs enable direct, end-to-end cross-modal generation with in-context or few-shot demonstrations (Huang et al., 25 Dec 2025, Ng et al., 6 Jun 2025).
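To illustrate the context-summarization view, the sketch below builds a few-shot prompt that asks a text-only LLM to compress the figure-mentioning paragraphs and OCR tokens into a caption. The prompt wording and the optional `demonstrations` exemplars are assumptions for illustration, not a published system's exact prompt.

```python
def summarization_prompt(mention_paragraphs, ocr_tokens, demonstrations=()):
    """Frame captioning as summarization of figure-related text.

    `demonstrations` is an optional sequence of (context, caption) pairs
    used as in-context examples for few-shot prompting.
    """
    parts = ["Summarize the figure-related text below into a one-sentence caption."]
    for context, caption in demonstrations:
        parts.append(f"Text: {context}\nCaption: {caption}")
    target_context = " ".join(mention_paragraphs) + " OCR: " + ", ".join(ocr_tokens)
    parts.append(f"Text: {target_context}\nCaption:")
    return "\n\n".join(parts)
```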

4. Evaluation Methodologies

Automatic evaluation has primarily used reference-based lexical overlap metrics:

  • BLEU-n: $\mathrm{BLEU}_n = \mathrm{BP} \cdot \exp\left(\sum_{i=1}^{n} w_i \log p_i\right)$
  • ROUGE-L: Based on longest common subsequence between candidate and reference
  • METEOR, CIDEr, SPICE: Additional token/content overlap and scene-graph metrics
  • BERTScore / MoverScore: Embedding-based similarity measures

Empirical studies report extremely low raw metric values on scientific figures (BLEU-4 often in the 1–6 range or lower, ROUGE-L roughly 10–30, even for the best systems) due to the specialized vocabulary and the diversity of author styles (Hsu et al., 2021, Huang et al., 2023, Huang et al., 25 Dec 2025, 2505.20740). Correlations of BLEU/ROUGE with human expert ratings are weak ($\tau < 0.2$). LLM-based evaluators (GPT-4/3.5 zero-shot rating of helpfulness, informativeness, and detail, given the caption and the figure-mention paragraph) offer substantially higher human alignment (Kendall's $\tau \approx 0.40$ with Ph.D. raters) (Hsu et al., 2023, Huang et al., 25 Dec 2025).
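For reference, the sketch below computes corpus-level BLEU and ROUGE-L with the widely used sacrebleu and rouge-score packages; the example caption pair is invented for illustration.

```python
# pip install sacrebleu rouge-score
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["accuracy of the proposed model versus training epochs"]
references = ["Figure 2: Test accuracy versus number of training epochs."]

# Corpus-level BLEU (sacrebleu applies its own tokenization; score is on a 0-100 scale).
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")

# Sentence-level ROUGE-L F1, averaged over the corpus.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
) / len(predictions)
print(f"ROUGE-L: {rouge_l:.3f}")
```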

Human evaluation is central for high-quality assessment, focusing on accuracy, completeness, clarity, and helpfulness. Recent benchmarks have systematically incorporated rating panels of domain experts, and hybrid pipelines now integrate LLM-judges as screening/filtering stages to augment or replace costly human annotation (Kim et al., 5 Jan 2025, Hsu et al., 2023).
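A minimal LLM-judge sketch in the spirit of these hybrid pipelines is shown below. The rubric wording and the 1–5 scale follow the helpfulness-style criteria described above, while `client.chat(...)` is a placeholder for whatever LLM API is available.

```python
JUDGE_TEMPLATE = """You are grading a scientific figure caption.
Figure-mentioning paragraph:
{paragraph}

Candidate caption:
{caption}

Rate the caption from 1 (poor) to 5 (excellent) for helpfulness,
informativeness, and level of detail. Reply with a single integer."""

def judge_caption(client, paragraph: str, caption: str) -> int:
    """Ask an LLM to rate a caption; `client.chat` is a hypothetical placeholder API."""
    prompt = JUDGE_TEMPLATE.format(paragraph=paragraph, caption=caption)
    reply = client.chat(prompt)  # placeholder call, swap in any LLM client
    return int(reply.strip())
```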

5. Personalization and Multimodal Profiling

Personalization is an emerging focus, addressing the need for captions to match an author’s stylistic and domain conventions. LaMP-Cap defines a rigorous experimental protocol for personalized captioning with multimodal figure profiles—triplets of (image, caption, mention) from the same paper (Ng et al., 6 Jun 2025). Empirical results show:

  • Addition of a single profile figure (image+caption+mention) boosts BLEU-4 by 4–6 points (best case: Gemini 2.5, 0.160 → 0.209); all three profiles boost BLEU-4 by +0.074 (to 0.234).
  • Ablations reveal the most critical personalization signal is caption text (drop from BLEU-4 0.110→0.048 if removed), followed by image, then mention paragraph.
  • Gains are largest for “profile–target aligned” pairs (contextually and figure-type similar), with BLEU-4 increases exceeding +0.20 in the aligned subset.

Competing personalizing systems further refine style via prompt chaining, category-specific instructions, and few-shot style adaptation, but also observe a trade-off: optimizing for stylistic match (n-gram overlap) can reduce factual accuracy (Kim et al., 30 Sep 2025, Timklaypachara et al., 9 Oct 2025). Human preference, as measured in challenge studies, is highest for personalized outputs incorporating only a small number of style anchors (Huang et al., 25 Dec 2025).
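The sketch below assembles a LaMP-Cap-style personalized prompt from up to three profile figures drawn from the same paper. Each profile contributes its caption and mention paragraph as style anchors (the profile images would be attached to the multimodal request separately); the prompt wording is an assumption, not the benchmark's official template.

```python
from dataclasses import dataclass

@dataclass
class ProfileFigure:
    image_path: str          # attached to the multimodal request separately
    caption: str
    mention_paragraph: str

def personalized_prompt(profiles, target_mention: str, target_ocr: str) -> str:
    """Condition caption generation on same-paper profile figures (style anchors)."""
    parts = ["Caption the target figure in the same style as these figures from the same paper."]
    for i, p in enumerate(profiles[:3], start=1):
        parts.append(
            f"Profile figure {i}:\n"
            f"Mention: {p.mention_paragraph}\n"
            f"Caption: {p.caption}"
        )
    parts.append(
        "Target figure:\n"
        f"Mention: {target_mention}\n"
        f"OCR text: {target_ocr}\n"
        "Caption:"
    )
    return "\n\n".join(parts)
```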

6. System Applications and Authoring Tools

Research has begun to address the end-user needs of scientists engaged in real writing workflows. The SciCapenter system provides a practical authoring interface: it analyzes draft PDFs, runs multi-aspect quality checks (helpfulness, key takeaway, OCR mention, visual reference, etc.), generates multiple draft captions, and supplies LLM-based star ratings and rationales to guide revision. User studies with Ph.D. authors confirm that such tooling can reduce cognitive workload, especially under time constraints, and that actionable feedback (e.g., which content elements are missing) is more valuable than directly generated candidate texts (Hsu et al., 2024).
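As a rough illustration of such multi-aspect checks (not SciCapenter's actual implementation), the sketch below represents one caption's check results and the aggregate star rating an authoring interface might display; the aspect names follow those listed above, and the scoring scheme is an assumption.

```python
# Illustrative per-caption quality checks; aspect names follow the list above,
# the boolean scoring and 5-star aggregation are assumptions.
caption_checks = {
    "helpfulness": True,        # does the caption help a reader interpret the figure?
    "key_takeaway": False,      # does it state the main finding shown?
    "ocr_mention": True,        # does it reference in-figure text (axes, legends)?
    "visual_reference": True,   # does it point to visual elements (colors, panels)?
}

star_rating = round(5 * sum(caption_checks.values()) / len(caption_checks))
missing = [aspect for aspect, passed in caption_checks.items() if not passed]
print(f"{star_rating}/5 stars; missing: {', '.join(missing) or 'nothing'}")
```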

Key insights from these deployments include:

  • Structured, analytical feedback (checklists, star ratings, rationales) supports decision-making and revision better than generic AI-generated texts.
  • User interfaces should expose missing content elements and encourage active author engagement; overly rigid checklists or missing actionable suggestions can frustrate users.
  • Domain-adapted templates and example galleries may further improve adoption.

7. Limitations, Open Challenges, and Future Directions

Scientific figure captioning remains an unsolved, open problem in several respects:

  • Contextual Incompleteness: ~19% of figures lack detectable mention paragraphs; current models are brittle without such context (Huang et al., 25 Dec 2025).
  • Evaluation: Traditional overlap-based metrics underrepresent human quality; future work should systematize LLM-based evaluation and factuality verification (Hsu et al., 2023, Kim et al., 5 Jan 2025).
  • Grounded Reasoning: Captions must blend detailed visual, textual, and sometimes knowledge-graph or article-level contextual signals—current architectures often omit mechanistic explanations or fail to resolve multi-panel distinctions (e.g., MMSci benchmark) (Li et al., 2024).
  • Personalization vs. Accuracy: Incorporating author style can trade off against factual informativeness; explicit constraints or multi-task objectives remain underexplored (Kim et al., 30 Sep 2025).
  • Author-Centric Workflows: Most studies use rewrite or proxy writing tasks; the real impact on active scientific writing is unmeasured. Longitudinal field studies and instrumented authoring environments are needed (Huang et al., 25 Dec 2025, Hsu et al., 2024).

Future directions emphasize deeper integration of multimodal context retrieval, controllable captioning for diverse user profiles (expert vs. student), hybrid editing pipelines, and new benchmarks spanning broader scientific domains and figure types.

