- The paper presents COMPACT, a novel data recipe that trains multimodal models to combine 10 atomic visual capabilities into complex ones.
- It demonstrates that mixing controlled compositional data with a small visual instruction tuning (VIT) subset can yield up to a 94% improvement on multi-capability tasks.
- The approach is highly data-efficient, achieving comparable performance with less than 10% of the standard VIT data while maintaining instruction-following abilities.
The paper "COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning" (2504.21850) addresses the challenge that while Multimodal LLMs (MLLMs) excel at simple vision-language tasks, they often struggle with complex tasks requiring the combination of multiple visual capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. The authors hypothesize that this limitation stems partly from traditional Visual Instruction Tuning (VIT) datasets, like LLaVA-665K, which are primarily composed of simple queries requiring only one or two capabilities, lacking sufficient compositional complexity.
To tackle this, the paper proposes COMPACT, a novel data recipe for visual instruction tuning. COMPACT aims to explicitly train MLLMs on examples designed to improve their ability to combine atomic visual capabilities into composite ones. The core idea is to generate a training dataset with controlled compositional complexity, enabling models to learn complex capabilities more efficiently and with less data.
The methodology of COMPACT revolves around defining 10 atomic visual capabilities, categorized into Attribution (Color, Shape), Recognition (Object Recognition, Action Recognition, Text Recognition, Spatial Recognition, Counting), and Relation (Spatial Relationship, Object Interaction, Scene Understanding). The data generation pipeline involves four steps (a code sketch follows the list):
- Capability Sampling: For randomly sampled images, the process selects a combination of k atomic capabilities, typically k∈{1,2,3}, ensuring diversity and avoiding duplicate combinations for the same image.
- Conversation Generation: A vision-LLM (the paper uses Gemini-2.0-Flash) is prompted to generate a natural question-answer pair for the selected image that requires exactly the k sampled capabilities. Constraints ensure that questions are visually grounded, unambiguous, concise, and naturally integrated rather than simply conjoined single-capability questions.
- Quality Verification: A verification step, also using Gemini-2.0-Flash, filters out low-quality conversations, checks for word overlap with previous examples, and, crucially, verifies that the generated question indeed requires exactly the specified k capabilities and no others.
- Dataset Assembly: The final training dataset combines the generated compositional tuning data with a small subset (5% is found to be optimal) of a standard VIT dataset like LLaVA-665K. This mixture ensures the model retains general instruction-following capabilities while learning compositional skills.
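Below is a minimal sketch of the sampling, generation, and verification steps. It assumes a hypothetical `query_vlm(prompt, image_path)` helper wrapping whichever vision-LLM is available (the paper uses Gemini-2.0-Flash); the prompts and JSON output format are illustrative, not the paper's exact ones.

```python
import json
import random

# The 10 atomic capabilities defined in the paper, grouped by category.
ATOMIC_CAPABILITIES = [
    "color", "shape",                                        # Attribution
    "object recognition", "action recognition",
    "text recognition", "spatial recognition", "counting",   # Recognition
    "spatial relationship", "object interaction", "scene understanding",  # Relation
]


def query_vlm(prompt: str, image_path: str) -> str:
    """Placeholder for a call to the generator/verifier vision-LLM
    (e.g. Gemini-2.0-Flash); returns the model's raw text reply."""
    raise NotImplementedError("wire this up to your VLM API of choice")


def sample_capabilities(k: int, used: set) -> frozenset:
    """Step 1: draw a combination of k atomic capabilities not yet used for this image."""
    while True:
        combo = frozenset(random.sample(ATOMIC_CAPABILITIES, k))
        if combo not in used:
            used.add(combo)
            return combo


def generate_conversation(image_path: str, combo: frozenset) -> dict:
    """Step 2: ask the vision-LLM for a QA pair that needs exactly these capabilities."""
    prompt = (
        "Write one natural, visually grounded, concise question about this image that "
        f"requires exactly these capabilities: {', '.join(sorted(combo))}. "
        "Do not simply conjoin single-capability questions. "
        'Return JSON of the form {"question": "...", "answer": "..."}.'
    )
    return json.loads(query_vlm(prompt, image_path))


def verify(image_path: str, qa: dict, combo: frozenset) -> bool:
    """Step 3: check that the question requires exactly `combo`, no more and no fewer."""
    prompt = (
        f"Question: {qa['question']}\nAnswer: {qa['answer']}\n"
        "Does answering this question require exactly these capabilities and no others: "
        f"{', '.join(sorted(combo))}? Reply YES or NO."
    )
    return query_vlm(prompt, image_path).strip().upper().startswith("YES")
```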
The paper demonstrates that training LLaVA-v1.5-7B-LoRA on the COMPACT dataset, which is less than 10% the size of the full LLaVA-665K VIT data, achieves performance comparable to or better than training on the full LLaVA-665K dataset across various multimodal benchmarks. Notably, COMPACT shows substantial improvements on complex, multi-capability tasks: for questions requiring four or more atomic capabilities, it achieves an 83.3% improvement on MMStar (Chen et al., 2024) and a 94.0% improvement on MM-Vet (Yu et al., 2023) over the full LLaVA-665K baseline.
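For context on the training side, here is a minimal LoRA fine-tuning setup sketch using Hugging Face transformers and peft; the llava-hf/llava-1.5-7b-hf checkpoint and the hyperparameters shown are assumptions for illustration, not the paper's exact configuration.

```python
from transformers import LlavaForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model

# Illustrative base checkpoint; the paper fine-tunes LLaVA-v1.5-7B with LoRA.
model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Assumed LoRA hyperparameters; adjust rank, alpha, and target modules as needed.
lora_cfg = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```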
Key findings from the experiments highlight the data efficiency of COMPACT, demonstrating that training with significantly less data can match or surpass larger standard datasets, especially for complex tasks. Ablation studies confirm that a balanced distribution of compositional complexity (k) across the training data is critical for compositional generalization. Training models on datasets with a wider range of complexities (k=1,2,3) leads to better performance on higher-k tasks compared to training only on k=1 or k=1,2. The analysis also revealed that capabilities like scene understanding and spatial relationship are particularly important for overall performance, while the inclusion of a small amount of original VIT data is crucial for maintaining general instruction-following ability.
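To make the balanced-complexity finding concrete, the hypothetical helper below downsamples a compositional pool so that k = 1, 2, 3 examples contribute equal shares; the equal split and the "capabilities" field are assumptions for illustration, consistent with but not quoted from the paper's ablation.

```python
import random
from collections import defaultdict

def balance_by_complexity(examples: list, ks=(1, 2, 3), seed: int = 0) -> list:
    """Downsample so each compositional complexity k contributes an equal share.

    Each example is assumed to carry a "capabilities" field listing the atomic
    capabilities its question requires, so k = len(example["capabilities"]).
    """
    rng = random.Random(seed)
    by_k = defaultdict(list)
    for ex in examples:
        by_k[len(ex["capabilities"])].append(ex)

    per_k = min(len(by_k[k]) for k in ks)  # size of the smallest bucket
    balanced = []
    for k in ks:
        balanced.extend(rng.sample(by_k[k], per_k))
    rng.shuffle(balanced)
    return balanced
```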
Practical Implementation Considerations:
- Data Generation Cost: The data generation process relies on querying large, potentially closed-source models (like Gemini-2.0-Flash), which can be computationally expensive and may require significant API costs. The authors plan to release the generated dataset to mitigate reproducibility challenges.
- Dependency on VLMs: The quality and biases of the generated compositional data depend on the VLM used for generation and verification. If that model struggles with certain compositions or interpretations, those limitations may be reflected in the generated dataset.
- Balancing Instruction Following and Compositional Tuning: The paper shows that mixing the compositional data with a small subset of traditional VIT data is necessary. Finding the right balance (e.g., the 5% found in the paper) is crucial for optimal performance; too little VIT data impairs instruction following, while too much dilutes the compositional learning signal (a mixing sketch follows this list).
- Extending to Higher Complexities: The current approach is limited to k=3 due to the decreasing reliability of current VLMs in generating and verifying questions with higher compositional complexity. Implementing COMPACT for more complex scenarios would require more sophisticated data generation and verification strategies.
- Hardware Requirements: While COMPACT is data-efficient in terms of data volume, fine-tuning MLLMs still requires substantial computational resources (GPUs) for training, similar to standard VIT.
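As referenced in the balancing consideration above, here is a hypothetical assembly function with the VIT mixing fraction exposed as a parameter (5% of LLaVA-665K being the value the paper reports as optimal); the JSON file format and record layout are assumptions.

```python
import json
import random

def assemble_training_set(compact_path: str, vit_path: str,
                          vit_fraction: float = 0.05, seed: int = 0) -> list:
    """Mix COMPACT compositional data with a small slice of a standard VIT set
    (e.g. LLaVA-665K), sized as `vit_fraction` of that VIT set.

    Both files are assumed to be JSON lists of conversation records.
    """
    rng = random.Random(seed)
    with open(compact_path) as f:
        compact = json.load(f)
    with open(vit_path) as f:
        vit = json.load(f)

    n_vit = int(len(vit) * vit_fraction)   # e.g. 5% of LLaVA-665K
    mixed = compact + rng.sample(vit, n_vit)
    rng.shuffle(mixed)
    return mixed
```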
In essence, COMPACT provides a recipe for generating structured training data that explicitly targets compositional visual reasoning, offering a data-efficient alternative to simply scaling data volume for improving MLLM performance on complex tasks.