MLBCAP: Multi-LLM Figure Captioning

Updated 11 August 2025
  • The paper introduces a modular, multi-LLM pipeline that filters, generates, and refines figure captions to enhance contextual accuracy.
  • The framework integrates multimodal inputs—images, OCR tokens, and textual references—to produce informative and style-aligned scientific captions.
  • Empirical evaluations show MLBCAP outperforms single-model baselines by significantly improving caption quality and clarity.

Multi-LLM Collaborative Figure Caption Generation (MLBCAP) is a modular paradigm for generating high-quality, context-aware, and domain-relevant captions for scientific figures by orchestrating multiple LLMs, each specializing in a specific subtask. The approach arises from the recognition that figure captioning in academic documents is a complex, multimodal challenge: it requires reasoning over heterogeneous data (image, OCR tokens, textual mentions), mitigating the low-quality annotations prevalent in paper corpora, and aligning outputs with the distinct style and communicative goals of scientific writing.

1. Motivation and Foundational Challenges

The scientific figure captioning problem is defined by the need to synthesize image content, document context, and domain conventions into coherent and informative descriptions. Prior approaches, which frame the problem either as vision-to-language generation or as text summarization, often underperform due to (a) limited integration of the available modalities and (b) the intrinsically low and variable quality of author-written captions in resources like arXiv (Huang et al., 2023, Kim et al., 5 Jan 2025).

Typical captioning corpora are encumbered by incomplete figure–caption alignment, absence of standardized "helpfulness" criteria, and coverage issues (roughly half the author-written captions are unhelpful for human understanding) (Huang et al., 2023, Hsu et al., 31 Jan 2025). These limitations hinder supervised training regimes and expose models to exposure bias and hallucination risks, especially for long or complex captions.

These challenges motivate the MLBCAP schema: a pipeline in which responsibility for data filtering, candidate generation, and final refinement is distributed among specialized agents or models, leveraging their combined competencies to produce higher-quality captions (Kim et al., 5 Jan 2025, Yang et al., 3 Jan 2025).

2. Framework Decomposition and Model Roles

MLBCAP instantiates a workflow composed of three coordinated modules, each powered by distinct LLMs or multimodal agents (Kim et al., 5 Jan 2025):

(i) Quality Assessment Module

A multimodal LLM (e.g., LLaVA, GPT-4o) assesses and filters candidate training samples based on the informativeness, alignment, and correctness of each caption, judged against multiple sources: the figure image, mention paragraphs, and OCR text. The assessor is fine-tuned on a synthetically quality-annotated subset (scored 1–6), and only samples it rates 5 or 6 are retained, improving the reliability of subsequent model training.
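
As a concrete illustration of this filtering step, the following is a minimal sketch; the 1–6 scale and the keep-threshold of 5 follow the paper, while the `Sample` fields and the `score_caption` wrapper are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    figure_image: bytes   # rendered figure
    ocr_text: str         # OCR tokens extracted from the figure
    paragraphs: str       # body paragraphs surrounding the figure
    mentions: str         # sentences that explicitly reference the figure
    caption: str          # author-written caption

def score_caption(sample: Sample) -> int:
    """Hypothetical wrapper around the fine-tuned multimodal assessor
    (e.g., LLaVA or GPT-4o); returns an integer on the paper's 1-6 scale."""
    raise NotImplementedError

def filter_training_set(samples: list[Sample], keep_threshold: int = 5) -> list[Sample]:
    # Retain only samples the assessor rates 5 or 6.
    return [s for s in samples if score_caption(s) >= keep_threshold]
```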

(ii) Diverse Caption Generation Module

Multiple LLMs (specialized for domain adaptation, summarization, or generative diversity) generate candidate captions. These include:

  • Foundation models (e.g., GPT-4o) used in few-shot or prompt-based settings.
  • Models fine-tuned on domain data (e.g., LLaMA-3-8B, Yi-1.5-9B), adapted to the filtered dataset using multi-source prompts ([Figure], [Paragraphs], [Mentions], [OCR], etc.).
  • Specialized summarization models (e.g., Pegasus), optimized for extracting key content from textual mentions and OCR snippets.

This ensemble ensures coverage across different reasoning styles, presentation forms, and granularity levels.
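
To make the multi-source prompting concrete, here is a hedged sketch of prompt assembly; the placeholder names ([Figure], [Paragraphs], [Mentions], [OCR]) come from the paper, but the template wording is an assumption.

```python
PROMPT_TEMPLATE = """You are writing a caption for a scientific figure.
[Figure]: {figure_description}
[Paragraphs]: {paragraphs}
[Mentions]: {mentions}
[OCR]: {ocr}
Write an informative caption in the style of the source paper."""

def build_prompt(figure_description: str, paragraphs: str,
                 mentions: str, ocr: str) -> str:
    # Fill the multi-source placeholders that condition each candidate generator.
    return PROMPT_TEMPLATE.format(
        figure_description=figure_description,
        paragraphs=paragraphs,
        mentions=mentions,
        ocr=ocr,
    )
```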

(iii) Judgment and Refinement Module

A strong LLM (typically GPT-4o or comparable) then aggregates, ranks, and edits the candidate captions. It acts both as a critic, screening for factual, stylistic, and structural quality, and as a final editor, correcting inaccuracies in numerics, nomenclature, or linguistic expression. MLBCAP supports generation of both “long” and “short” captions through controlled word limits, enhancing flexibility for varied editorial use cases.
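
A minimal sketch of the judge step, assuming a generic `llm_complete` text-completion callable; the word-limit control reflects the paper's long/short caption modes, while the instruction wording is illustrative.

```python
from typing import Callable

def judge_and_refine(candidates: list[str], context: str, max_words: int,
                     llm_complete: Callable[[str], str]) -> str:
    """Ask the judge LLM (e.g., GPT-4o) to rank candidates, fix factual or
    stylistic errors, and return one caption within the word budget."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    instruction = (
        f"Given the figure context below, select the best candidate caption, "
        f"correct any numeric, nomenclature, or language errors, and output a "
        f"final caption of at most {max_words} words.\n\n"
        f"Context:\n{context}\n\nCandidates:\n{numbered}"
    )
    return llm_complete(instruction)

# "Long" vs. "short" captions differ only in the word budget, e.g.:
#   judge_and_refine(candidates, context, max_words=100, llm_complete=model)
#   judge_and_refine(candidates, context, max_words=30, llm_complete=model)
```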

Table: MLBCAP Modular Workflow

| Module | Typical Model | Purpose |
| --- | --- | --- |
| Quality Assessment | LLaVA / GPT-4o | Filter low-quality training data |
| Diverse Caption Gen. | GPT-4o, LLaMA, Pegasus | Generate candidate captions |
| Judgment/Refinement | GPT-4o | Select and edit final caption |
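
Putting the modules together, a minimal end-to-end inference sketch that reuses the hypothetical helpers above (how the figure image reaches each generator depends on its modality and is elided here):

```python
from typing import Callable

def mlbcap_generate(sample: Sample, generators: list[Callable[[str], str]],
                    llm_complete: Callable[[str], str],
                    max_words: int = 100) -> str:
    # Every candidate generator conditions on the same multi-source prompt.
    prompt = build_prompt(
        figure_description="(figure supplied per generator's modality)",
        paragraphs=sample.paragraphs,
        mentions=sample.mentions,
        ocr=sample.ocr_text,
    )
    candidates = [generate(prompt) for generate in generators]
    # The judge LLM ranks, corrects, and emits the final caption.
    return judge_and_refine(candidates, context=prompt,
                            max_words=max_words, llm_complete=llm_complete)
```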

3. Evaluation Protocols and Empirical Effectiveness

MLBCAP is validated via both automatic and human-centric metrics. While traditional metrics (ROUGE, BLEU) are reported, their alignment with human preferences is weak—especially in academic domains where detail and informativeness outweigh mere n-gram overlap (Huang et al., 2023, Hsu et al., 31 Jan 2025). As such, the efficacy of MLBCAP is primarily supported by human evaluation experiments (Kim et al., 5 Jan 2025).

Computer vision and academic writing experts consistently rank MLBCAP’s long captions above single-model baselines and even above author-written captions, particularly in terms of informativeness, clarity, and completeness. Ablation studies demonstrate that collaborative fusion (i.e., combining multiple LLM outputs with post-selection editing) contributes the largest improvements (e.g., raising the SciCap-Eval score from 5.390 to 5.440).

4. Model Specialization, Collaborative Coordination, and Technical Aspects

MLBCAP’s design encourages modular specialization:

  • Some models are tuned for detail retrieval from visual data, using label- and relation-map attention mechanisms analogous to those proposed in earlier work (Chen et al., 2019).
  • Others, such as summarization LMs fine-tuned on mention paragraphs and OCR tokens, are optimized for extracting document context (Huang et al., 2023, Yang et al., 2023).
  • Ensemble models resolve tasks via cross-candidate consensus, rank aggregation, or reward-based selection (mirroring self-critical RL objectives and ensemble voting observed in vision–language research) (Bianco et al., 2023, Lee et al., 20 Dec 2024).

Judgment modules can use scoring metrics, including reward models learned from human feedback (Singh et al., 2023), or human-crafted rating scales with high inter-rater reliability.
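
As one generic illustration of rank aggregation (not the paper's exact procedure), a Borda-count consensus over per-judge rankings might look like:

```python
from collections import defaultdict

def borda_aggregate(rankings: list[list[str]]) -> str:
    """Aggregate per-judge candidate rankings (best first) with Borda counts
    and return the consensus winner."""
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - position  # higher rank earns more points
    return max(scores, key=scores.get)
```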

The technical specification for each step includes explicit prompt structures with placeholders, controlled vocabularies for scientific nomenclature, and JSON-formatted output structures for interoperability. Training and inference rely on transformer-based models, with learning rates, batch sizes, and token limits optimized for each submodule (Kim et al., 5 Jan 2025).
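
The paper specifies JSON-formatted outputs but not an exact schema; a plausible record shape, with every field name an assumption, might be:

```python
import json

# Hypothetical output record; all field names are assumptions.
record = {
    "figure_id": "fig3",
    "long_caption": "...",
    "short_caption": "...",
    "quality_score": 6,
    "source_model": "gpt-4o",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```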

5. Extensibility, Personalization, and Multiagent Variants

MLBCAP is extensible to various scientific domains and supports personalized captioning by conditioning on multimodal figure profiles, as shown in the LaMP-Cap framework (Ng et al., 6 Jun 2025). By incorporating historical context (images, captions, related paragraphs from the same author or paper), MLBCAP can learn stylistic and content conventions, which enhances the alignment of model generations with intended authorial voice.

Agent-based variants (e.g., MoColl (Yang et al., 3 Jan 2025)) can further decompose the collaborative process as a series of question–answering steps: a VQA agent extracts domain-specific facts, while a generalist LLM synthesizes these facts into coherent, context-rich captions.
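
A schematic of this question-answering decomposition, with `ask_vqa` and `llm_complete` as hypothetical callables; the loop structure is a generic rendering of the MoColl idea, not its exact protocol:

```python
from typing import Callable

def agent_captioning_loop(figure: bytes,
                          ask_vqa: Callable[[bytes, str], str],
                          llm_complete: Callable[[str], str],
                          max_rounds: int = 3) -> str:
    # The generalist LLM poses questions; the VQA agent answers from the figure.
    facts: list[str] = []
    for _ in range(max_rounds):
        question = llm_complete(
            "Known facts:\n" + "\n".join(facts) +
            "\nAsk one question about the figure that would improve the "
            "caption, or reply DONE if enough is known."
        )
        if question.strip() == "DONE":
            break
        facts.append(f"Q: {question} A: {ask_vqa(figure, question)}")
    # The LLM synthesizes the gathered facts into the final caption.
    return llm_complete("Write a caption using these facts:\n" + "\n".join(facts))
```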

MLBCAP’s pipeline also supports personalized or domain-specific configuration. Models can be assigned roles based on known competencies (e.g., visual extraction, linguistic polish, factual validation), and final outputs can be selectively assembled according to user-defined criteria.

6. Limitations and Future Research Directions

Several open challenges remain for MLBCAP:

  • Quality of Reference Data: Even after filtering, author-written captions in training data may still lack coverage or precision, necessitating further advances in synthetic quality scoring, human feedback integration, and iterative refinement protocols (Yin et al., 10 Jan 2025, Singh et al., 2023).
  • Evaluation Bias: Human-preferred outputs do not always align with standard metrics; the adoption of dual-metric evaluation for factuality and coverage (as in (Lee et al., 20 Dec 2024)) provides a more robust assessment framework but requires domain-specific benchmarks.
  • Coordination Overhead: The orchestration of multiple LLMs imposes computational and logistical burdens, particularly when integrating external models or human-in-the-loop feedback.
  • Personalization: Representing and leveraging author, domain, or venue profiles in a scalable, data-efficient manner is a developing area (Ng et al., 6 Jun 2025).
  • Data Privacy and Generalizability: Personalized or context-aware systems must respect the privacy of research data while ensuring generalization across diverse scientific fields.
  • Automation vs. Human-in-the-Loop: Studies reveal that authors rarely accept AI-generated captions verbatim—instead, they iteratively adapt, correct, or merge multiple suggestions (Yin et al., 10 Jan 2025). MLBCAP systems that interleave AI generation and human review are thus best positioned for robust scientific communication.

7. Significance for the Future of Figure Captioning

The MLBCAP paradigm marks a transition from monolithic captioning pipelines toward compositional, collaborative systems that are robust to noisy training data, flexible in their semantic coverage, and amenable to iterative human interaction. This architecture generalizes well to multilingual, personalized, and multimodal figure captioning tasks, facilitating advances in automation, human–AI collaboration, and reproducible evaluation (Kim et al., 5 Jan 2025, Huang et al., 2023, Ng et al., 6 Jun 2025).

By foregrounding specialized LLMs for assessment, content generation, and critical revision, and by leveraging multi-source information (figure, OCR, mention text, prior figures), MLBCAP systems are positioned to meet the rising standards and nuanced demands of scientific figure communication.