
SciCap Project: Advancing Figure Captioning

Updated 1 January 2026
  • SciCap Project is a multiyear initiative that automates caption generation for scientific figures, tables, and equations using context-aware, multimodal models.
  • It leverages large-scale real-world dataset extraction, rule-based and CNN-powered figure separation, and transformer-based techniques to enhance caption quality.
  • The project drives annual competitions and benchmarking that have accelerated advancements in accessible, context-rich scientific communication.

The SciCap Project is a multiyear initiative focused on automating the generation of high-quality, contextually informative captions for scientific figures, tables, equations, and related visual objects in scholarly papers. Building on large-scale dataset creation, rigorous modeling, and evaluation—including annual open competitions—the project has shaped technical progress at the intersection of natural language processing, computer vision, and human-centered scientific communication.

1. Origins, Goals, and Evolution

The SciCap Project was conceived to address the prevalence of low-quality, often generic or uninformative captions in scientific literature, which diminish both accessibility (notably for visually impaired readers) and the overall interpretability of scientific findings (Hsu et al., 2021, Huang et al., 25 Dec 2025). Its foundational ambitions were to:

  • Release the first large-scale, real-world dataset of figure–caption pairs, primarily culled from arXiv preprints in computer science and related fields.
  • Test the hypothesis that domain-specific, context-aware models, beyond standard vision–language models, can more accurately capture the semantics and communicative function of scientific captions.
  • Create automated tools to support scientists, with generated captions approaching or exceeding the quality of those written by domain experts.

The project rapidly evolved. SciCap v1.0 (2021) released over 133,000 single-panel graph plots with captions, followed by the incorporation of contextual data (mentioning paragraphs, OCR of in-figure text), culminating in datasets such as SciCap+ (2023) and multimodal LaMP-Cap. By 2023–2025, annual SciCap Challenges attracted the global community, and the corpus grew to include over 476,000 figures spanning eight scientific domains, five principal figure types, and diverse metadata (Huang et al., 25 Dec 2025, Hsu et al., 31 Jan 2025).

2. Dataset Construction and Annotation

The SciCap family of datasets is defined by scale, real-world heterogeneity, and multimodal annotation.

  • Source and Filtering: Initial datasets used arXiv bulk PDF downloads, filtering computer science and machine learning papers (2010–2020), with PDFFigures 2.0 extracting over two million figures. Compound figures were separated using rule-based cues and CNN-based tools (e.g., Compound FigureSeparator).
  • Figure Typing: Classifiers (notably FigureSeer) assigned figures to categories: graph plots, tables, equations, flowcharts, scatter plots, bar charts, and others (Hsu et al., 2021).
  • Caption Selection: Captions were normalized (number/bracket substitutions) and split into "first-sentence," "single-sentence," or "≤100-word" sets to facilitate training and benchmarking; a preprocessing sketch follows the dataset statistics below.
  • Multimodal Context: SciCap+ (Yang et al., 2023) and subsequent datasets paired figures not only with captions, but also:
    • "Mention paragraphs"—first running-text occurrences citing the figure.
    • OCR tokens and bounding boxes for all in-figure text.
    • Paper- and domain-level metadata (e.g., scientific subfield, figure category, profile figures for style transfer).

Dataset statistics for SciCap+ include ~414k training figures and ~10k figures each for validation and test, with caption lengths typically ≤66 words and mention paragraphs capped at 192 subwords (Yang et al., 2023). The LaMP-Cap corpus (2025) features 110,828 articles, each with one target figure and up to three profile figures with eight modalities of context (Timklaypachara et al., 9 Oct 2025).
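
The caption normalization and splitting described above can be approximated with a short preprocessing routine. The sketch below is illustrative only: the placeholder tokens and the sentence splitter are assumptions, not the exact rules used to build SciCap.

```python
import re

def normalize_caption(caption: str) -> str:
    """Illustrative normalization: mask bracketed references and bare numbers.

    The placeholder tokens below are assumptions; SciCap describes
    number/bracket substitution without fixing the token names.
    """
    caption = re.sub(r"\[[^\]]*\]", "[BRACKET]", caption)                     # "[12]" -> "[BRACKET]"
    caption = re.sub(r"(?<![\w.])\d+(\.\d+)?(?![\w.])", "[NUM]", caption)     # bare numbers -> "[NUM]"
    return caption.strip()

def split_caption(caption: str, max_words: int = 100) -> dict:
    """Build the three caption views used for training and benchmarking."""
    sentences = re.split(r"(?<=[.!?])\s+", caption)
    words = caption.split()
    return {
        "first_sentence": sentences[0],
        "single_sentence": caption if len(sentences) == 1 else None,
        "le_100_words": caption if len(words) <= max_words else None,
    }

example = "Accuracy of model A vs. B on 5 datasets [12]. Higher is better."
print(split_caption(normalize_caption(example)))
```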

3. Modeling Approaches and System Architectures

The project has served as a testbed for diverse captioning methods, ranging from classic vision–language models to large-scale transformer- and LLM-based approaches.

  • Vision and Text Baselines: Early work benchmarked ResNet–LSTM architectures with attention, using either image features, OCR text, or both (Hsu et al., 2021). These approaches achieved low BLEU-4 scores (<0.03), indicating difficulty in capturing caption semantics from visual or shallow textual cues alone.
  • Multimodal Transformers: M4C-Captioner (Yang et al., 2023) introduced multimodal, pointer-equipped transformers encoding images (ResNet-152), text (SciBERT for mention paragraphs), and OCR tokens (FastText+PHOC embeddings, Faster R-CNN features, and spatial coordinates). Generation used a weighted sum of vocabulary and copy distributions.
  • Summarization-Based Models: The central insight of the SciCap Project is that captions most often function as concise contextual summaries of figure-mentioning paragraphs; approximately 75% of caption words appear in the associated text (Hsu et al., 31 Jan 2025). Fine-tuned summarizers (Pegasus, Flan-T5-XL, LLaMA) proved highly effective (Huang et al., 25 Dec 2025, Li et al., 2024); a minimal sketch of this recipe follows this list.
  • Auxiliary Information Integration: Leading entries in SciCap 2024 (Li et al., 2024) incorporated:
    • Accurate OCR (via PaddleOCR)
    • Filtered, informativeness-scored text chunks
    • Object mention spans, with fusion by learned gating in the transformer decoder
    • Ensemble checkpoint ranking for increased robustness
  • Prompt Optimization and Personalization: The 3rd SciCap Challenge (Timklaypachara et al., 9 Oct 2025) utilized a two-stage approach:
    • Category-specific prompt optimization (DSPy MIPROv2 and SIMBA) for domain fidelity
    • Few-shot author-style prompting (profile figure-caption pairs) for stylistic consistency, mediated via LLMs
  • Large Multimodal Models (LMMs): GPT-4V demonstrated state-of-the-art human-judged captions, outperforming all open and proprietary models as well as original author-written captions (Hsu et al., 31 Jan 2025). Notably, reference-based automatic metrics (BLEU, ROUGE) did not correlate with these human preferences.
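
Because the project's central finding is that good captions behave like short summaries of the mentioning paragraph, the dominant recipe can be sketched compactly. The snippet below is a minimal illustration using a generic public Pegasus checkpoint as a stand-in; the actual challenge systems fine-tune such models on SciCap mention-paragraph/caption pairs and add the OCR and filtering stages described above.

```python
# Summarization-as-captioning sketch: feed the figure-mentioning paragraph
# (optionally concatenated with OCR tokens) to a seq2seq summarizer and
# decode a caption-length summary.
# "google/pegasus-xsum" is a generic public checkpoint, not a SciCap release.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"  # stand-in checkpoint for illustration
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

mention_paragraph = (
    "Figure 3 compares the convergence of the three optimizers. "
    "AdamW reaches the plateau after roughly 40 epochs, while SGD requires 120."
)
ocr_tokens = "epochs loss AdamW SGD RMSProp"  # in-figure text, if available

inputs = tokenizer(
    mention_paragraph + " " + ocr_tokens,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
summary_ids = model.generate(**inputs, max_new_tokens=50, num_beams=4)
caption = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(caption)
```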

4. Evaluation Frameworks and Benchmarking

Captioning systems are assessed across both automatic, reference-based metrics and human evaluation protocols.

Automatic Metrics

These include:

  • BLEU-N: N-gram precision with brevity penalty
  • ROUGE-1/2/L: Recall and precision of unigrams, bigrams, and longest common subsequences. Normalized variants account for length bias
  • METEOR and CIDEr: Additional n-gram similarity and consensus metrics (used in human–vs–machine annotation studies) (Yang et al., 2023, Hsu et al., 2021)
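
A minimal sketch of how these reference-based scores can be computed with the sacrebleu and rouge_score packages; the length adjustment at the end is only an illustrative stand-in for the challenge's normalized variants, whose exact formula is not reproduced here.

```python
import sacrebleu
from rouge_score import rouge_scorer

reference = "Accuracy of the three models on the CIFAR-10 test set."
candidate = "Test accuracy of three models on CIFAR-10."

# Sentence-level BLEU with sacrebleu's default smoothing and brevity penalty.
bleu = sacrebleu.sentence_bleu(candidate, [reference])
print("BLEU:", bleu.score)

# ROUGE-1/2/L precision, recall, and F1.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print("ROUGE-2 F1:", scores["rouge2"].fmeasure)

# Illustrative length normalization (assumption): damp the advantage of very
# short candidates by scaling recall with a candidate/reference length ratio.
len_ratio = min(1.0, len(candidate.split()) / len(reference.split()))
print("Length-adjusted ROUGE-2 recall:", scores["rouge2"].recall * len_ratio)
```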

Human Evaluation

Protocols have included:

  • Expert rater (Ph.D.-level) scales of “helpfulness” (1–6) given context (Huang et al., 25 Dec 2025).
  • Undergraduate and professional editor rankings; editors overwhelmingly preferred GPT-4V outputs to author or model baselines by large margins (Hsu et al., 31 Jan 2025).
  • LLMs (GPT-4) as zero-shot judges with Kendall's τ correlation up to 0.40 with Ph.D. raters, outperforming non-expert human scorers.
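
Agreement between an LLM judge and expert raters can be quantified as reported above with Kendall's τ; the sketch below uses fabricated placeholder scores purely for illustration.

```python
# Rank correlation between expert helpfulness ratings and LLM-judge scores.
# The score arrays are fabricated placeholders, not data from the studies.
from scipy.stats import kendalltau

expert_scores = [5, 3, 4, 2, 6, 1, 4, 3]      # Ph.D.-rater helpfulness (1-6)
llm_judge_scores = [5, 2, 4, 3, 6, 1, 3, 3]   # zero-shot GPT-4 judge scores

tau, p_value = kendalltau(expert_scores, llm_judge_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```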

Results have revealed:

  • Vision-only or even OCR+vision approaches lag far behind context-summarization baselines.
  • Summarization models (Pegasus, Flan-T5-XL, LLaMA) fine-tuned on mentioning paragraphs boost BLEU and ROUGE scores by up to 30% (Huang et al., 25 Dec 2025).
  • GPT-4V (image+paragraph) captions, despite lower reference metric scores, rank first in human preference in 70% of editor trials, challenging the relevance of n-gram metrics for this task (Hsu et al., 31 Jan 2025).

5. Competition Structure, Signal Approaches, and Results

Annual SciCap Challenges (2023–2025) have catalyzed rapid methodological progress by providing common evaluation scaffolds, new datasets, and human-in-the-loop rankings (Hsu et al., 31 Jan 2025, Timklaypachara et al., 9 Oct 2025, Li et al., 2024).

2023–2024: System Approaches

| Year | Winner(s) | Key Methods | Notable Techniques | Score / Rank |
|------|-----------|-------------|--------------------|--------------|
| 2023 | NJUST-KMG | Pegasus summarization + PaddleOCR | BRIO contrastive ranking | Norm. ROUGE-2 (4.489) |
| 2023 | USTC | Pegasus summarization + BLIP-2 | Caption enrichment | Norm. ROUGE-2 (2.418) |
| 2023 | GPT-4V | Proprietary image+text LLM | Zero-shot | #1 choice in ≥70% of editor trials (human evaluation) |
| 2024 | Ours* | Pegasus/LLaMA2 + OCR + filtering | Ensemble, gating | Human: 4.33 (long), 4.66 (short) |

*(Li et al., 2024), reporting first place in both human-rated short and long caption tracks.

Competition Protocols

  • Large-scale, open-access train/test splits (~400k train, ~48k test; 2023 and 2024).
  • Two tracks: short captions (≤20 words), long captions (≤50 words), with human and automatic metric leaderboards (Li et al., 2024).
  • Evaluation metrics prioritize normalized recall for length fairness.
  • “Quality Subset” of reference captions (scored by GPT-4, then manually filtered) as an oracle comparison (Hsu et al., 31 Jan 2025).

Key findings:

  • OCR augmentation improves bigram recall where object labels are mission-critical.
  • Paragraph filtering (informativeness thresholding, λ≈1.05–1.2) reduces distractor context and sharpens content alignment; an illustrative sketch follows this list.
  • Incorporating paper-specific style profiles yields 40–48% BLEU and 25–27% ROUGE improvements (Timklaypachara et al., 9 Oct 2025).
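
The following sketch illustrates the general idea of informativeness-thresholded paragraph filtering; the actual scoring function and the meaning of λ in Li et al. (2024) are not specified here, so the overlap-ratio score below is an assumption for demonstration only.

```python
# Hypothetical paragraph filter: score each text chunk by its lexical overlap
# with in-figure OCR tokens, relative to the document-average overlap, and
# keep chunks whose ratio exceeds a threshold lambda.
def filter_chunks(chunks, ocr_tokens, lam=1.1):
    ocr = set(t.lower() for t in ocr_tokens)

    def overlap(chunk):
        words = chunk.lower().split()
        return sum(w in ocr for w in words) / max(len(words), 1)

    scores = [overlap(c) for c in chunks]
    baseline = sum(scores) / max(len(scores), 1) or 1e-9  # avoid division by zero
    return [c for c, s in zip(chunks, scores) if s / baseline >= lam]

chunks = [
    "Figure 2 shows validation loss for AdamW and SGD over 100 epochs.",
    "We thank the anonymous reviewers for their feedback.",
]
print(filter_chunks(chunks, ["loss", "epochs", "AdamW", "SGD"]))
```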

6. Key Technical Insights, Limitations, and Open Challenges

Lessons

  • Scientific captions function largely as contextual summaries of the paragraphs that mention the figure; summarization-based models therefore outperform vision-only baselines by a wide margin (Hsu et al., 31 Jan 2025).
  • Reference-based n-gram metrics correlate poorly with expert human preference, which favored GPT-4V outputs despite their lower BLEU and ROUGE scores.
  • Auxiliary signals (accurate OCR, filtered context, author-style profiles) yield measurable gains when carefully fused.

Outstanding Challenges

  • Captioning with incomplete or absent textual context remains unresolved; figures lacking mentioning paragraphs require new pretraining or reasoning resources (Huang et al., 25 Dec 2025).
  • Adapting output to different levels of reader expertise or communication norms remains open, motivating future research on controllable caption generation.
  • Unified benchmarks for both captioning and deep figure content understanding (e.g., parsing, question answering) are lacking (Huang et al., 25 Dec 2025).
  • Improving image-text grounding to avoid hallucinated facts or missed core insights.
  • Real-world, in-process writer studies to measure AI tool impact on authoring workflows.
  • Robust evaluation schemes that penalize hallucinations and reward domain-precise recall, particularly with future multimodal, author-style, or audience-adapted models.

Current Limitations

  • Most high-performing models still depend heavily on external or filtered context; free-standing image captioning remains far from solved (Hsu et al., 31 Jan 2025).
  • Pure text generation neglects visual signals that are not described in text or OCR (Li et al., 2024).
  • OCR errors, visually dense diagrams, and equation-heavy objects introduce cascading noise and performance bottlenecks (Huang et al., 25 Dec 2025, Yang et al., 2023).

7. Resources, Community Impact, and Future Directions

The SciCap initiative has released all major datasets (SciCap, SciCap+, LaMP-Cap), benchmarking code, and interactive caption drafting platforms (SciCapenter), facilitating broad adoption and reproducibility (Yang et al., 2023, Huang et al., 25 Dec 2025). The annual challenges have attracted global participation, raising the quality bar each year.

Proposed future avenues include:

  • Pretraining multimodal transformers directly on figure–mention pairs with figure-centric vision encoders.
  • Incorporating structured external knowledge (e.g., ontologies) for improved interpretability (Yang et al., 2023).
  • Adaptive, graph-based context selection and dynamic prompt refinement for more coherent, factual, and style-aligned output (Timklaypachara et al., 9 Oct 2025).
  • Semi-automatic “human-in-the-loop” captioning workflows utilizing LMM output as assistive drafts for expert refinement (Hsu et al., 31 Jan 2025).
  • Audience-adaptive and style-controllable models that tailor captions to reader profiles (Huang et al., 25 Dec 2025).
  • Benchmarks and metrics sensitive to factuality, visual–text consistency, and real-world scientific informativeness.

In summary, SciCap has systematically transformed figure captioning from a vision-task analog into a deeply contextual, multimodal, and human-centered challenge, anchoring research at the junction of robust AI, scientific communication, and accessibility (Huang et al., 25 Dec 2025, Li et al., 2024, Hsu et al., 31 Jan 2025, Timklaypachara et al., 9 Oct 2025, Yang et al., 2023, Hsu et al., 2021).
