VILA$^2$: VILA Augmented VILA (2407.17453v2)
Abstract: While visual LLM architectures and training infrastructures advance rapidly, data curation remains under-explored, and data quantity and quality are becoming a bottleneck. Existing work either crawls extra Internet data with only loose quality guarantees or distills from black-box proprietary models (e.g., GPT-4V, Gemini) that are bounded by API rate limits and by the teacher's own performance. This work enables a VLM to improve itself via data enhancement, exploiting its generative nature. We introduce a simple yet effective VLM augmentation scheme that includes a self-augment step and a specialist-augment step to iteratively improve data quality and, hence, model performance. In the self-augment step, the instruction-finetuned VLM recaptions its pretraining caption datasets and then retrains from scratch on the refined data. Without any expensive human-in-the-loop annotation, we observe improvements in data quality and downstream accuracy boosts over three self-augmentation rounds -- a viable free lunch for the current VLM training recipe. When self-augmentation saturates, we augment caption diversity by leveraging specialty skills picked up during instruction finetuning. We finetune VLM specialists from the self-augmented VLM on domain-specific data, including spatial, grounding, and OCR tasks, to fuse task-aware synthetic data into the pretraining stage. Data quality improvements and hallucination reductions are cross-checked by VLM judges (GPT-4V, Gemini) and human judges. Combining self-augmentation and specialist-augmented training, VILA$^2$ consistently improves accuracy over the prior art on a wide range of benchmarks, producing a reusable pretraining dataset that is 300x more cost-efficient than human labeling.
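The self-augment step described above amounts to a short loop: pretrain, instruction-finetune, recaption, repeat; the specialist-augment step then folds task-aware captions back into the pretraining set. The Python sketch below illustrates that loop under stated assumptions; every function name (pretrain_from_scratch, instruction_finetune, merge_captions, generate_caption) is a hypothetical placeholder, not the authors' released code.

```python
# Minimal sketch of the VILA^2 augmentation loop described in the abstract.
# All helper functions are hypothetical placeholders for illustration only.

def self_augment(images, base_captions, num_rounds=3):
    """Iteratively recaption the pretraining set with the latest VLM
    and retrain from scratch on the refined captions."""
    captions = base_captions
    vlm = None
    for _ in range(num_rounds):
        # 1. Pretrain a fresh VLM from scratch on the current caption set.
        vlm = pretrain_from_scratch(images, captions)
        # 2. Instruction-finetune so the model can follow recaptioning prompts.
        vlm = instruction_finetune(vlm)
        # 3. Use the finetuned VLM to rewrite its own pretraining captions.
        captions = [vlm.generate_caption(img) for img in images]
    return vlm, captions


def specialist_augment(self_augmented_vlm, images, captions, specialist_data):
    """Once self-augmentation saturates, finetune specialists (e.g., spatial,
    grounding, OCR) and merge their task-aware captions into pretraining."""
    for task_name, task_data in specialist_data.items():
        specialist = instruction_finetune(self_augmented_vlm, task_data)
        task_captions = [specialist.generate_caption(img) for img in images]
        captions = merge_captions(captions, task_captions)
    # Retrain from scratch on the specialist-augmented caption pool.
    return pretrain_from_scratch(images, captions)
```

The key design choice the sketch tries to convey is that refined captions feed a full retraining from scratch in each round, rather than continued finetuning of the previous checkpoint.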
- Yunhao Fang
- Ligeng Zhu
- Yao Lu
- Yan Wang
- Pavlo Molchanov
- Jang Hyun Cho
- Marco Pavone
- Song Han
- Hongxu Yin
- Jan Kautz