Scientific Figure Captioning
- Scientific figure captioning is the task of automatically generating precise descriptions for academic figures by integrating the figure image, body text, and OCR data.
- Recent advances leverage multimodal transformers and personalized systems to enhance both factual accuracy and stylistic fidelity.
- Robust datasets and new benchmarks drive progress while stress-testing evaluation metrics such as BLEU, ROUGE, and human-aligned ratings.
Scientific figure captioning is the task of automatically generating descriptive, informative, and contextually appropriate textual descriptions for scientific figures—such as plots, charts, diagrams, or photographic images—occurring in scholarly articles. Unlike generic image captioning, scientific figure captioning demands precise multimodal reasoning, domain-aware summarization, and stylistic fidelity to the conventions of scientific writing. The field has evolved rapidly since 2021, driven by the release of large-scale multimodal datasets, diverse model architectures, and community benchmarks that target both the factual accuracy and the communicative value of generated captions.
1. Foundations and Problem Formulation
The core objective of scientific figure captioning is to produce a caption for a given scientific figure, leveraging a range of available modalities:
- The figure image $I$
- Body text referencing or describing the figure ($P$, the figure-mentioning paragraphs)
- In-figure text extracted via OCR ($O$), e.g., axis labels or legends
- In some formulations, additional metadata (paper title, authors, abstract) or author-specific writing-style exemplars
This is commonly formalized as probabilistic sequence generation of a caption $y = (y_1, \dots, y_T)$ conditioned on the available modalities:

$$p(y \mid I, P, O) = \prod_{t=1}^{T} p\big(y_t \mid y_{<t}, I, P, O\big)$$
The critical difference from vision-only captioning is the dense cross-modal alignment—up to 75% of caption tokens typically derive, verbatim or paraphrased, from body text (figure-mentioning paragraphs) or in-figure text (Huang et al., 2023, Huang et al., 25 Dec 2025).
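As a concrete illustration of this formulation, the sketch below bundles the conditioning signals into one structure and renders the textual parts as a prompt; the field names and prompt wording are illustrative assumptions, not a specific system's interface.

```python
from dataclasses import dataclass, field

@dataclass
class FigureContext:
    """Conditioning signals for caption generation (field names are illustrative)."""
    image_path: str
    mention_paragraphs: list[str]                 # body text referencing the figure (P)
    ocr_tokens: list[str]                         # in-figure text: axis labels, legends (O)
    metadata: dict = field(default_factory=dict)  # optional: title, abstract, style exemplars

def build_prompt(ctx: FigureContext) -> str:
    """Render the textual conditioning signals; the image itself would be passed
    to a multimodal model alongside this prompt."""
    return "\n".join([
        "Generate a caption for the scientific figure.",
        "Paragraphs mentioning the figure: " + " ".join(ctx.mention_paragraphs),
        "In-figure text (OCR): " + ", ".join(ctx.ocr_tokens),
        "Caption:",
    ])
```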
2. Dataset Development and Benchmarks
The emergence of robust, large-scale datasets has been fundamental. Key corpora and their characteristics include:
| Dataset | # Figures | Modalities / Caption Properties | Annotation | Domain Coverage | Reference |
|---|---|---|---|---|---|
| SciCap | 2M (full), 133.5k (release) | Image, Caption, OCR, Mention-Paragraph | Author-written | CS, ML (2021); Multi-domain (2023+) | (Hsu et al., 2021, Huang et al., 25 Dec 2025) |
| SciCap+ | 414k | +OCR tokens, +Mention paragraph | Author-written | CS | (Yang et al., 2023) |
| FigCaps-HF | 133.5k | Human-rated quality (400 pairs) | Human feedback | CS | (Singh et al., 2023) |
| LaMP-Cap | 110.8k | Multimodal profile (image, caption, paragraph) | Author + context profile | Multi-domain | (Ng et al., 6 Jun 2025) |
| MMSci | 514k captions, 742k figures | Long, graduate-level, peer-reviewed | Peer-reviewed | 72 disciplines (Nature Commun.) | (Li et al., 2024) |
| MSEarth | 44k | Refined, context-enriched | Extended by LLM + human | Earth sciences | (2505.20740) |
SciCap and its successors emphasize structured extraction of figure images, original captions, in-context paragraphs, and in-figure OCR text. LaMP-Cap establishes the first benchmark for personalized captioning, attaching up to three profile figures (with image, caption, mention-paragraph per profile) per target. The MMSci and MSEarth datasets specifically address domain diversity and caption depth, including multi-panel figures and demanding knowledge synthesis from entire articles (Li et al., 2024, 2505.20740).
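For orientation, a SciCap-style entry might be represented as below; the field names and values are illustrative assumptions rather than the dataset's actual schema.

```python
# Hypothetical record layout for a figure-caption corpus (illustrative, not the
# published SciCap schema): each entry pairs a figure with its caption and context.
example_record = {
    "figure_id": "paper-0001-Figure3",
    "image_path": "figures/paper-0001_fig3.png",
    "caption": "Figure 3: Validation accuracy vs. training epochs ...",
    "ocr_tokens": ["epochs", "accuracy", "baseline", "ours"],
    "mention_paragraphs": ["As shown in Figure 3, the proposed model converges faster ..."],
}

# Keep only figures whose surrounding text actually references them; roughly 19%
# of figures lack such a mention paragraph (see Section 7).
usable = [r for r in [example_record] if r["mention_paragraphs"]]
```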
3. Model Architectures and Approaches
Captioning models have evolved from attention-based ResNet+LSTM encoders to multimodal transformers integrating state-of-the-art language and vision backbones. Principal approaches include:
- Vision-to-Language Baselines: ResNet (image) encoder with LSTM decoder; global attention over image regions (Hsu et al., 2021, Huang et al., 25 Dec 2025).
- Text Summarization Pipelines: Abstractive models (e.g., PEGASUS, GPT-2, BART) trained to map figure-mentioning paragraphs—optionally with OCR tokens—directly to captions (Huang et al., 2023, Huang et al., 25 Dec 2025).
- Multimodal Transformers: M4C-Captioner and BLIP, fusing visual and text streams using multimodal attention and pointer networks for handling in-figure OCR (Yang et al., 2023, Singh et al., 2023).
- Large Multimodal Models (MLLMs): Off-the-shelf or fine-tuned variants (e.g., GPT-4V, LLaVA, Qwen-VL, Gemini 2.5, InternVL3, Claude-3) applied in zero-shot or prompt-engineered modes (Li et al., 2024, 2505.20740, Ng et al., 6 Jun 2025).
- Personalized Captioning: Profile-conditioned architectures leveraging in-paper figure-caption exemplars or author-style data with prompt layouts or fine-tuned models for style transfer (Ng et al., 6 Jun 2025, Kim et al., 30 Sep 2025, Timklaypachara et al., 9 Oct 2025).
Many top-performing systems treat captioning as context summarization, heavily reliant on figure-mentioning paragraphs and OCR, while state-of-the-art MLLMs have enabled direct, end-to-end, cross-modal generation with in-context or few-shot demonstrations (Huang et al., 25 Dec 2025, Ng et al., 6 Jun 2025).
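As a concrete illustration of the summarization-as-captioning idea, the sketch below feeds a figure-mentioning paragraph plus OCR tokens to an off-the-shelf abstractive summarizer via Hugging Face transformers; the specific checkpoint, prompt layout, and generation settings are assumptions rather than the exact setup of Huang et al. (2023).

```python
from transformers import pipeline

# Off-the-shelf abstractive summarizer as a stand-in for the fine-tuned
# PEGASUS/BART captioners described above.
summarizer = pipeline("summarization", model="google/pegasus-xsum")

def caption_from_context(mention_paragraph: str, ocr_tokens: list[str]) -> str:
    """Map figure-mentioning text (plus OCR tokens) to a caption-like summary."""
    context = mention_paragraph + "\nIn-figure text: " + ", ".join(ocr_tokens)
    result = summarizer(context, max_length=60, min_length=10, do_sample=False)
    return result[0]["summary_text"]
```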
4. Evaluation Methodologies
Automatic evaluation has primarily used reference-based lexical overlap metrics:
- BLEU-n: Modified n-gram precision combined with a brevity penalty, $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\big(\sum_{n=1}^{N} w_n \log p_n\big)$
- ROUGE-L: Based on longest common subsequence between candidate and reference
- METEOR, CIDEr, SPICE: Additional token/content overlap and scene-graph metrics
- BERTScore / MoverScore: Embedding-based similarity measures
Empirical studies report extremely low raw metric values on scientific figures (BLEU-4 often ranging from below 1 to about 6, ROUGE-L ≈ 10–30, even for the best systems) due to the specialized vocabulary and the diversity of author styles (Hsu et al., 2021, Huang et al., 2023, Huang et al., 25 Dec 2025, 2505.20740). Correlations of BLEU/ROUGE with human expert ratings are weak. LLM-based evaluators (GPT-4/3.5 rating helpfulness, informativeness, and detail zero-shot, given the caption and its figure-mention paragraph) achieve substantially higher human alignment, as measured by Kendall's $\tau$ against Ph.D. raters (Hsu et al., 2023, Huang et al., 25 Dec 2025).
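The overlap metrics above can be computed with standard open-source packages; the snippet below is a minimal sketch using sacrebleu and rouge-score (package choices and example strings are ours, not those of the cited benchmarks), and embedding-based metrics such as BERTScore would be added analogously.

```python
import sacrebleu
from rouge_score import rouge_scorer

def lexical_scores(candidate: str, reference: str) -> dict[str, float]:
    """Reference-based overlap metrics for a single caption pair."""
    # sacreBLEU reports BLEU on a 0-100 scale; sentence-level BLEU is noisy for
    # short captions, which partly explains the low raw values reported above.
    bleu = sacrebleu.sentence_bleu(candidate, [reference]).score
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, candidate)["rougeL"].fmeasure
    return {"bleu": bleu, "rougeL": rouge_l}

print(lexical_scores(
    "Validation accuracy of the proposed model versus baselines over training epochs.",
    "Figure 3: Accuracy on the validation set as a function of training epochs."))
```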
Human evaluation is central for high-quality assessment, focusing on accuracy, completeness, clarity, and helpfulness. Recent benchmarks have systematically incorporated rating panels of domain experts, and hybrid pipelines now integrate LLM-judges as screening/filtering stages to augment or replace costly human annotation (Kim et al., 5 Jan 2025, Hsu et al., 2023).
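A minimal sketch of such an LLM judge, in the spirit of the zero-shot evaluation of Hsu et al. (2023): the model name, rating-scale wording, and prompt below are illustrative assumptions, not the published protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are rating a figure caption from a scientific paper.
Paragraph mentioning the figure:
{paragraph}

Candidate caption:
{caption}

Rate the caption from 1 (poor) to 5 (excellent) on helpfulness,
informativeness, and level of detail. Reply with a single integer."""

def rate_caption(paragraph: str, caption: str, model: str = "gpt-4o") -> int:
    """Zero-shot LLM-judge rating of a caption given its mention paragraph."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(paragraph=paragraph, caption=caption)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```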
5. Personalization and Multimodal Profiling
Personalization is an emerging focus, addressing the need for captions to match an author’s stylistic and domain conventions. LaMP-Cap defines a rigorous experimental protocol for personalized captioning with multimodal figure profiles—triplets of (image, caption, mention) from the same paper (Ng et al., 6 Jun 2025). Empirical results show:
- Adding a single profile figure (image + caption + mention) boosts BLEU-4 by roughly 0.04–0.06 (best case: Gemini 2.5, 0.160 → 0.209); using all three profiles yields a gain of +0.074 (to 0.234).
- Ablations reveal the most critical personalization signal is caption text (drop from BLEU-4 0.110→0.048 if removed), followed by image, then mention paragraph.
- Gains are largest for “profile–target aligned” pairs (contextually and figure-type similar), with BLEU-4 increases exceeding +0.20 in the aligned subset.
Competing personalization systems further refine style via prompt chaining, category-specific instructions, and few-shot style adaptation, but they also observe a trade-off: optimizing for stylistic match (n-gram overlap) can reduce factual accuracy (Kim et al., 30 Sep 2025, Timklaypachara et al., 9 Oct 2025). Human preference, as measured in challenge studies, is highest for personalized outputs that incorporate only a small number of style anchors (Huang et al., 25 Dec 2025).
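A minimal sketch of profile-conditioned prompting with LaMP-Cap-style inputs follows; the prompt layout and field names are assumptions, and in a full multimodal setup the profile and target images would be attached to the model call alongside this text.

```python
from dataclasses import dataclass

@dataclass
class ProfileFigure:
    """One profile entry from the same paper: image path, its caption, and its mention paragraph."""
    image_path: str
    caption: str
    mention: str

def build_personalized_prompt(profiles: list[ProfileFigure],
                              target_mention: str,
                              target_ocr: list[str]) -> str:
    """Assemble a profile-conditioned prompt from up to three profile figures."""
    parts = ["Write a caption for the target figure, matching the style of the examples."]
    for i, p in enumerate(profiles[:3], start=1):  # LaMP-Cap attaches up to three profiles
        parts.append(f"Example {i} mention: {p.mention}")
        parts.append(f"Example {i} caption: {p.caption}")
    parts.append(f"Target figure mention: {target_mention}")
    parts.append("Target in-figure text: " + ", ".join(target_ocr))
    parts.append("Target caption:")
    return "\n".join(parts)
```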
6. System Applications and Authoring Tools
Research has begun to address the end-user needs of scientists engaged in real writing workflows. The SciCapenter system provides a practical authoring interface: it analyzes draft PDFs, runs multi-aspect quality checks (helpfulness, key takeaway, OCR mention, visual reference, etc.), generates multiple draft captions, and supplies LLM-based star ratings and rationales to guide revision. User studies with Ph.D. authors confirm that such tooling can reduce cognitive workload, especially under time constraints, and that actionable feedback (e.g., which content elements are missing) is more valuable than directly generated candidate texts (Hsu et al., 2024).
Key insights from these deployments include:
- Structured, analytical feedback (checklists, star ratings, rationales) supports decision-making and revision better than generic AI-generated texts.
- User interfaces should expose missing content elements and encourage active author engagement; overly rigid checklists or missing actionable suggestions can frustrate users.
- Domain-adapted templates and example galleries may further improve adoption.
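To make the multi-aspect checks concrete, the sketch below implements simple keyword heuristics for a few of the aspects named above; SciCapenter itself relies on LLM-based ratings and rationales rather than these hypothetical rules.

```python
def checklist_report(caption: str, ocr_tokens: list[str]) -> dict[str, bool]:
    """Heuristic stand-ins for multi-aspect caption checks (aspect names follow
    SciCapenter; the rules themselves are illustrative, not the deployed system)."""
    lower = caption.lower()
    return {
        "mentions_ocr_text": any(tok.lower() in lower for tok in ocr_tokens),
        "visual_reference": any(w in lower for w in ("axis", "curve", "bar", "line", "color")),
        "states_takeaway": any(w in lower for w in ("shows", "demonstrates", "indicates", "compared")),
        "non_trivial_length": len(caption.split()) >= 10,
    }
```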
7. Limitations, Open Challenges, and Future Directions
Scientific figure captioning remains an unsolved, open problem in several respects:
- Contextual Incompleteness: ~19% of figures lack detectable mention paragraphs; current models are brittle without such context (Huang et al., 25 Dec 2025).
- Evaluation: Traditional overlap-based metrics underrepresent human quality; future work should systematize LLM-based evaluation and factuality verification (Hsu et al., 2023, Kim et al., 5 Jan 2025).
- Grounded Reasoning: Captions must blend detailed visual, textual, and sometimes knowledge-graph or article-level contextual signals—current architectures often omit mechanistic explanations or fail to resolve multi-panel distinctions (e.g., MMSci benchmark) (Li et al., 2024).
- Personalization vs. Accuracy: Incorporating author style can trade off against factual informativeness; explicit constraints or multi-task objectives remain underexplored (Kim et al., 30 Sep 2025).
- Author-Centric Workflows: Most studies use rewrite or proxy writing tasks; the real impact on active scientific writing is unmeasured. Longitudinal field studies and instrumented authoring environments are needed (Huang et al., 25 Dec 2025, Hsu et al., 2024).
Future directions emphasize deeper integration of multimodal context retrieval, controllable captioning for diverse user profiles (expert vs. student), hybrid editing pipelines, and new benchmarks spanning broader scientific domains and figure types.
References:
- LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles (Ng et al., 6 Jun 2025)
- Five Years of SciCap (Huang et al., 25 Dec 2025)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding (Li et al., 2024)
- MSEarth: A Benchmark for Multimodal Scientific Comprehension of Earth Science (2505.20740)
- FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback (Singh et al., 2023)
- SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning (Yang et al., 2023)
- GPT-4 as an Effective Zero-Shot Evaluator for Scientific Figure Captions (Hsu et al., 2023)
- SciCapenter: Supporting Caption Composition for Scientific Figures (Hsu et al., 2024)
- Personalized Scientific Figure Caption Generation: An Empirical Study on Author-Specific Writing Style Transfer (Kim et al., 30 Sep 2025)
- Leveraging Author-Specific Context for Scientific Figure Caption Generation (Timklaypachara et al., 9 Oct 2025)
- Summaries as Captions: Generating Figure Captions for Scientific Documents with Automated Text Summarization (Huang et al., 2023)
- SciCap: Generating Captions for Scientific Figures (Hsu et al., 2021)
- Do Large Multimodal Models Solve Caption Generation for Scientific Figures? (Hsu et al., 31 Jan 2025)