CapFlow: Hierarchical Captioning Framework
- CapFlow is a hierarchical multi-agent workflow framework for visual captioning that decomposes tasks into domain-specific sub-tasks via specialized agents.
- It employs a two-stage process with domain routing via multimodal models and a hierarchical captioning workflow where agents handle tasks like perception, reasoning, and tool operations.
- By integrating open-source models with reject sampling, CapFlow achieves human-comparable caption quality at a fraction of the computational cost.
CapFlow is a hierarchical, multi-agent workflow framework designed to orchestrate open-source models for the complex task of generalist visual captioning across diverse visual domains. The approach decomposes visual captioning into manageable sub-tasks via functional specialization and domain-aware routing, enabling human-comparable caption quality at a reduced computational cost. CapFlow is the data synthesis foundation for MetaCaptioner, facilitating large-scale, high-quality caption production for images and videos by leveraging collaborative multi-agent outputs and scalable open-source model suites.
1. Architectural Overview
CapFlow integrates two main computational stages: Domain Routing and a Hierarchical Captioning Workflow. The Domain Routing component employs an open-source multimodal LLM (MLLM), such as Qwen2.5-VL, to classify the input (image or video) into a pre-defined domain (e.g., natural scenes, mathematical figures, documents). Formally, for any input visual sample $I$,

$$d^{*} = \arg\max_{d \in \mathcal{D}} \mathrm{Router}(I, d),$$

where $\mathrm{Router}$ is an MLLM function evaluating visual domain membership over the domain set $\mathcal{D}$. This step determines the optimal processing pathway; if the input domain is externally specified, routing is skipped.
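To make the routing step concrete, below is a minimal sketch assuming an OpenAI-compatible endpoint (e.g., vLLM) serving Qwen2.5-VL. The prompt wording, the `DOMAINS` list, and the `route()` helper are illustrative assumptions, not CapFlow's published implementation.

```python
# Minimal domain-routing sketch. Assumes an OpenAI-compatible server
# (e.g., vLLM) hosting Qwen2.5-VL; prompt text and the domain list are
# illustrative, not CapFlow's exact configuration.
import base64
from openai import OpenAI

DOMAINS = ["natural_scene", "mathematical_figure", "document", "chart"]

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def route(image_path: str) -> str:
    """Ask the MLLM to pick the single best-matching domain for the input."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-7B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text",
                 "text": "Classify this image into exactly one domain: "
                         + ", ".join(DOMAINS)
                         + ". Reply with the domain name only."},
            ],
        }],
        temperature=0.0,
    )
    answer = resp.choices[0].message.content.strip().lower()
    # Fall back to a default domain if the reply is off-vocabulary.
    return answer if answer in DOMAINS else DOMAINS[0]
```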
Once the domain is established, CapFlow activates a two-layer hierarchical workflow:
- Task-Solving Layer: Multiple specialized functional agents execute domain-relevant sub-tasks in parallel. Functional roles include guideline agents (style/structure/geography), perception agents (fine-grained details such as texture and color), reasoning agents (semantics/logical relations), and tool agents (OCR, code parsers for specialized tasks).
- Information Summarization Layer: Outputs from all functional agents are aggregated by a summary LLM into a single, coherent caption integrating factual, logical, and stylistic elements.
This pipeline is formalized algorithmically as follows:
```
Algorithm: CapFlow Caption Generation
Input: Image I
1. d ← argmax_{d ∈ D} Router(I, d)
2. for each functional role j in domain d:
       S_j ← Agent_j(I)
3. Caption ← Summary({S₁, S₂, …, Sₙ})
Output: Caption
```
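As a compact illustration, the algorithm above can be rendered in Python as follows. The agent registry, per-role callables, and `summarize()` placeholder are hypothetical stand-ins for CapFlow's open-source model suites; the `route()` stub mirrors the routing sketch in Section 1.

```python
# Compact Python rendering of the algorithm above. All names here are
# illustrative stand-ins, not CapFlow's actual interfaces.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

def route(image_path: str) -> str:   # see the routing sketch in Section 1
    return "natural_scene"           # trivial stub for self-containment

# Hypothetical registry: domain -> {functional role -> agent callable}.
AGENT_REGISTRY: Dict[str, Dict[str, Callable[[str], str]]] = {
    "natural_scene": {
        "guideline": lambda img: "style: descriptive, present tense",
        "perception": lambda img: "fine-grained colors, textures, objects",
        "reasoning": lambda img: "scene semantics and object relations",
    },
    # ... other domains carry their own agent rosters (tool agents, etc.)
}

def summarize(notes: List[str]) -> str:
    """Placeholder for the summary LLM that fuses agent outputs."""
    return " ".join(notes)

def capflow_caption(image_path: str, domain: Optional[str] = None) -> str:
    domain = domain or route(image_path)   # routing skipped if domain given
    agents = AGENT_REGISTRY[domain]
    # Task-solving layer: run functional agents in parallel.
    with ThreadPoolExecutor() as pool:
        notes = list(pool.map(lambda agent: agent(image_path),
                              agents.values()))
    # Information-summarization layer: fuse into one coherent caption.
    return summarize(notes)
```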
2. Functional Agent Specialization and Collaboration
Each Agentⱼ in the workflow is instantiated with a domain-relevant open-source model suite. Agents are functionally modular, specializing in tasks such as:
| Agent Role | Task Domain | Typical Model Suite |
|---|---|---|
| Guideline | Structure, style | Qwen2.5-VL, LLaVA |
| Perception | Fine-grained details | Qwen2.5-VL, BLIP-2 |
| Reasoning | Semantic/logical inference | Vicuna-VLM |
| Tool | OCR, code parsing | PP-OCR, code parsers |
Hierarchical orchestration is adaptive: For an input classified as a mathematical diagram, agents may invoke OCR and code parsing tools in conjunction with spatial reasoning capability; for a natural scene, agents prioritize perceptual granularity.
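To illustrate the tool-agent path on the mathematical-diagram branch, here is a minimal OCR tool-agent sketch using PaddleOCR (PP-OCR). The wrapping, confidence threshold, and output format are assumptions; CapFlow's actual tool interface may differ, and the PaddleOCR result layout varies slightly across library versions.

```python
# Minimal OCR tool-agent sketch for the mathematical-diagram branch,
# using PaddleOCR (PP-OCR). Output formatting is an assumption.
from paddleocr import PaddleOCR

_ocr = PaddleOCR(lang="en")  # loads PP-OCR detection + recognition models

def ocr_tool_agent(image_path: str) -> str:
    """Extract text regions (axis labels, formulas, annotations) for
    downstream reasoning agents."""
    result = _ocr.ocr(image_path)
    lines = []
    for page in result:               # one entry per image/page
        for box, (text, conf) in page:
            if conf > 0.5:            # drop low-confidence detections
                lines.append(text)
    return "Detected text: " + "; ".join(lines)
```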
The outputs from each agent are subsequently integrated via the Summary function:

$$\text{Caption} = \mathrm{Summary}\big(\{S_1, S_2, \dots, S_n\}\big),$$

where $S_j = \mathrm{Agent}_j(I)$ denotes the output of the $j$-th functional agent. This summary agent is generally an advanced LLM capable of information fusion across modalities and perspectives.
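A minimal sketch of this summarization step follows, with agent outputs fused into a single prompt for a text LLM. The prompt wording and the `llm` callable interface are illustrative assumptions.

```python
# Sketch of the information-summarization layer: agent notes are fused
# into one caption by a text LLM. Prompt wording is illustrative.
from typing import Callable, Dict

def build_summary_prompt(agent_outputs: Dict[str, str]) -> str:
    sections = "\n".join(f"[{role}] {text}"
                         for role, text in agent_outputs.items())
    return (
        "You are a captioning summarizer. Combine the observations below "
        "into a single coherent caption that preserves factual details, "
        "logical relations, and the requested style. Do not invent "
        "content.\n\n" + sections
    )

def summary_agent(agent_outputs: Dict[str, str],
                  llm: Callable[[str], str]) -> str:
    """`llm` is any text-completion callable, e.g., a local Qwen2.5
    instruct model behind an OpenAI-compatible API."""
    return llm(build_summary_prompt(agent_outputs))
```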
3. Data Synthesis and MetaCaptioner Training
Employing CapFlow as a data synthesizer, high-quality, diverse captions are generated for large-scale visual datasets. Quality control is enforced through a strict reject sampling pipeline that filters low-quality outputs. The resultant dataset, e.g., MetaCaption-4.1M, enables effective fine-tuning of generalist visual captioners.
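One plausible shape for such a quality gate is sketched below: regenerate a caption up to a fixed budget and accept only if a judge model scores it above a threshold, otherwise discard the sample. The scoring criteria, threshold, and retry budget are assumptions, not the paper's published pipeline.

```python
# Hypothetical reject-sampling gate for caption quality control.
# Criteria, scorer, and threshold are assumptions.
from typing import Callable, Optional

def reject_sample(
    generate: Callable[[], str],      # e.g., lambda: capflow_caption(path)
    score: Callable[[str], float],    # judge model returning quality in [0, 1]
    threshold: float = 0.8,
    max_tries: int = 4,
) -> Optional[str]:
    for _ in range(max_tries):
        caption = generate()
        if score(caption) >= threshold:
            return caption            # accept into the training corpus
    return None                       # reject: drop the sample entirely
```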
MetaCaptioner, a fine-tuned multimodal model (e.g., MetaCaptioner-8B), achieves state-of-the-art performance among open-source captioners, with benchmark results that rival GPT-4.1. This performance gain is attributable to both data quality and diversity delivered by CapFlow’s collaborative and domain-aware workflow.
A plausible implication is that rigorous multi-agent synthesis under domain routing constraints produces richer caption variety, improving generalization on tasks such as MMMU, MMVet, MathVista, and multi-domain video captioning.
4. Efficiency, Scalability, and Cost Analysis
CapFlow attains notable efficiency and scalability advantages due to its modular agent-based design. In comparative experiments, caption generation using CapFlow costs approximately \$0.14 per image, while achieving quality comparable to GPT-4.1 (which incurs a cost of about \$1.47 per image). This is made possible by parallelization of sub-tasks, hierarchical workflow structuring, and leveraging cost-effective open-source models.
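Back-of-envelope arithmetic makes the gap concrete, using the per-image figures above and the MetaCaption-4.1M corpus size; the totals are illustrative extrapolations, not reported numbers.

```python
# Cost comparison from the reported per-image figures, extrapolated to
# the MetaCaption-4.1M corpus. Illustrative arithmetic only.
CAPFLOW_COST = 0.14   # USD per image (reported)
GPT41_COST   = 1.47   # USD per image (reported)
N_IMAGES     = 4_100_000

print(f"cost ratio: {GPT41_COST / CAPFLOW_COST:.1f}x")          # ~10.5x
print(f"CapFlow corpus cost: ${CAPFLOW_COST * N_IMAGES:,.0f}")  # ~$574,000
print(f"GPT-4.1 corpus cost: ${GPT41_COST * N_IMAGES:,.0f}")    # ~$6,027,000
```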
Experimental ablations demonstrate that introducing the Hierarchical Workflow and Domain Routing mechanisms leads to tangible improvements in captioning performance metrics across visual domains. Scaling model size (e.g., 8B to 72B parameters) brings open-source performance near the commercial standard while maintaining reduced costs.
5. Generalization, Adaptability, and Benchmarking
CapFlow is designed to be robust across heterogeneous visual inputs, from natural scenes to structured diagrams and dynamic videos. Its adaptability is reflected in consistently high captioning quality across domains, as evidenced by benchmarks. The strict reject sampling and agent summarization mechanisms enable logical and stylistic consistency, irrespective of input complexity.
The generalist capability of MetaCaptioner, owing to CapFlow’s high-quality synthesis, is reinforced by performance on a range of domain benchmarks (MMMU, MMVet, MathVista, and video captioning), with models achieving top-tier results in the open-source ecosystem.
6. Implications and Future Directions
The demonstration of CapFlow affirms that coordinated open-source multi-agent workflows can match commercial captioning models in quality and generalization while yielding substantial cost and scalability benefits (Lei et al., 14 Oct 2025). This suggests avenues for further research in task decomposition, agent specialization, dynamic domain routing, and information fusion in multimodal AI systems.
Open questions remain regarding domain scaling (addition of new visual domains), integration with newer agent models, and optimization of the reject sampling filter. A plausible implication is that more fine-grained agent roles and conditional routing could further enhance caption quality and adaptability for emerging domains in multimodal AI.