
Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion (2505.18115v1)

Published 23 May 2025 in cs.CV

Abstract: Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many VisIT datasets are available, most are constructed using ad-hoc techniques developed independently by different groups. They are often poorly documented, lack reproducible code, and rely on paid, closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) into VisIT instructions. This leads to high costs and makes it challenging to scale, enhance quality, or generate VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach, Instructify, for converting available metadata to VisIT instructions using open LLMs. Our multi-stage Instructify features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or enhance the data quality of available VisIT datasets when applied to the same image data and metadata sources, improving GPT-4 generated VisIT instructions by ~3% on average and up to 12% on individual benchmarks using open models, such as Gemma 2 27B and LLaMA 3.1 70B. Additionally, our approach enables effective performance scaling - both in quantity and quality - by enhancing the resulting LMM performance across a wide range of benchmarks. We also analyze the impact of various factors, including conversation format, base model selection, and resampling strategies. Our code, which supports the reproduction of equal or higher-quality VisIT datasets and facilitates future metadata-to-VisIT data conversion for niche domains, is released at https://github.com/jacob-hansen/Instructify.

Summary

Overview of "Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion"

The paper "Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion" presents a comprehensive framework aimed at improving the conversion of metadata into visual instruction tuning for large multimodal models (LMMs). The authors propose a significant shift from proprietary, costly methods to open-source solutions, thereby enhancing scalability and transparency within the field of visual instruction tuning. The methodology is centered on the unification of diverse metadata into structured, high-quality instruction data, with the ultimate goal of better aligning LLMs to process and understand visual inputs effectively.

Methodology

The approach adopted by the authors is divided into four principal components which guide the conversion process:

  1. Data Loading and Organization: This component involves the aggregation of diverse datasets, efficiently organizing them by image source and metadata type. This step is crucial for enabling the subsequent merging of information from different datasets to create a unified context for each image.
  2. Metadata Formatting and Conversion: A key novelty is the use of a hierarchical ASCII tree representation to convert complex bounding box data into structured text. This representation captures the spatial and semantic relationships within an image, ensuring that contextual details, such as object attributes and inter-object relationships, are preserved in text form (a simplified sketch of this idea follows the list).
  3. Information Management and Quality Control: The framework includes mechanisms for maintaining high data integrity and maximizing the utility of available metadata. Automated filtering and fact-checking processes are implemented to mitigate redundancy and prevent the generation of incorrect or biased instructions.
  4. Prompt Management: This system dynamically applies varied prompt strategies to generate different instruction styles and task intents, further enhancing the diversity and relevance of the generated instructions (a prompt-sampling sketch is also shown below).
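
To make step 2 concrete, here is a minimal, hypothetical Python sketch of how labeled bounding boxes could be rendered as a hierarchical ASCII tree. The containment heuristic, output format, and example objects are illustrative assumptions, not the paper's exact implementation.

```python
def contains(outer, inner):
    """Return True if box `outer` (x1, y1, x2, y2) fully encloses `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def boxes_to_ascii_tree(objects):
    """Render labeled boxes as an indented tree based on spatial containment.

    `objects` is a list of dicts: {"label": str, "box": (x1, y1, x2, y2)}.
    Larger boxes become parents of the boxes they enclose.
    """
    def area(obj):
        x1, y1, x2, y2 = obj["box"]
        return (x2 - x1) * (y2 - y1)

    # Largest boxes first, so potential parents precede their children.
    objects = sorted(objects, key=area, reverse=True)

    # For each box, the parent is the smallest earlier box that encloses it.
    parent = {
        i: next((j for j in range(i - 1, -1, -1)
                 if contains(objects[j]["box"], obj["box"])), None)
        for i, obj in enumerate(objects)
    }

    def render(idx, depth):
        lines = [f"{'  ' * depth}- {objects[idx]['label']} {objects[idx]['box']}"]
        for child, par in parent.items():
            if par == idx:
                lines += render(child, depth + 1)
        return lines

    roots = [i for i, par in parent.items() if par is None]
    return "\n".join(line for r in roots for line in render(r, 0))


print(boxes_to_ascii_tree([
    {"label": "dining table", "box": (0, 120, 640, 480)},
    {"label": "plate", "box": (200, 250, 380, 360)},
    {"label": "fork", "box": (150, 260, 195, 350)},
]))
```

Similarly, the prompt-management step (item 4) can be pictured as sampling a task type and template before filling it with the formatted metadata. The task names, template texts, and weights below are assumptions for illustration rather than Instructify's actual prompts.

```python
import random

# Hypothetical task types and templates; the real prompt library and its
# weighting are part of the released code, not reproduced here.
PROMPT_TEMPLATES = {
    "detailed_description": [
        "Describe the image in detail, grounded in this metadata:\n{metadata}",
    ],
    "conversation": [
        "Write a multi-turn human-assistant Q&A about the image using:\n{metadata}",
    ],
    "complex_reasoning": [
        "Pose and answer a question that requires reasoning over these objects:\n{metadata}",
    ],
}


def sample_prompt(metadata_text, weights=None, rng=random):
    """Sample a task type (optionally weighted), then one of its templates."""
    tasks = list(PROMPT_TEMPLATES)
    task = rng.choices(tasks, weights=weights or [1] * len(tasks), k=1)[0]
    template = rng.choice(PROMPT_TEMPLATES[task])
    return task, template.format(metadata=metadata_text)


task, prompt = sample_prompt("- dining table (0, 120, 640, 480)\n  - plate\n  - fork")
print(f"[{task}]\n{prompt}")
```

Both sketches assume the formatted metadata is already plain text; in the actual pipeline, the quality-control stage filters and fact-checks this material before conversations are generated.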

Results

The authors evaluate their framework by replicating several well-known instruction-tuning datasets originally developed using proprietary models like GPT-4. The results from training the LLaVA-Next model on these reproduced datasets demonstrate consistent improvements in performance across a range of benchmarks. Notably, the reproduced datasets generally surpass the original data in instructional quality, as evidenced by improvements of up to 12% on certain benchmarks when using open LLMs, such as Gemma 2 27B and LLaMA 3.1 70B, for data generation.

Implications and Future Directions

The implications of this research are twofold. Practically, it paves the way for more accessible and scalable methods for creating visual instruction tuning datasets without reliance on closed-source models, thus reducing costs and increasing reproducibility in the field. Theoretically, the work suggests future paths for refining multimodal models through improved data curation and conversion techniques.

The successes of open models in generating high-quality instruction sets indicate a promising direction for AI development that emphasizes collaboration and transparency. Furthermore, the advancements in metadata conversion and hierarchical representation can inspire future models to incorporate more intricate and detailed data processing methods for improved visual comprehension.

In conclusion, "Instructify" not only addresses the current challenges in metadata conversion but also provides a robust foundation for future developments in multimodal data processing and integration within the AI community. The insights offered in this paper are expected to catalyze further innovations in visual instruction tuning methodologies.
