Overview of "Planting a SEED of Vision in LLM"
This paper introduces SEED, an image tokenizer designed to equip LLMs with the ability to both comprehend and generate visual content. Earlier visual-token approaches have struggled to support multimodal comprehension and generation within a single framework. SEED addresses these limitations by requiring that image tokens satisfy two principles: they should exhibit a one-dimensional causal dependency, and they should capture high-level semantic information.
By design, SEED aligns visual and textual representations, enabling scalable multimodal training without departing from standard LLM training methodology. The authors test this by combining SEED with an off-the-shelf LLM to perform both image-to-text and text-to-image generation, demonstrating the potential of discrete visual tokens for building versatile multimodal LLMs.
Key Insights
- Training and Architectural Design: SEED produces image tokens with a one-dimensional causal dependency, mirroring the left-to-right, next-token prediction mechanism of LLMs. This allows visual tokens to be trained and generated alongside text within a single autoregressive sequence (see the first sketch after this list). The tokens also encode high-level semantics, which keeps them semantically consistent and eases alignment with the language representations inside the LLM.
- Tokenization Process: The SEED tokenizer comprises a ViT encoder, a Causal Q-Former, a VQ codebook, a Reverse Q-Former, and a UNet decoder, allowing it to serve both visual comprehension and generation tasks (the data flow is outlined in the second sketch after this list). Training combines an image-text contrastive loss with dual reconstruction objectives, which keep the visual tokens semantically accurate and aligned with the textual space used by LLMs.
- Implementation and Results: The SEED tokenizer was trained in just 5.7 days on 64 V100 GPUs using 5 million publicly available image-text pairs, and it exhibits competitive performance in tasks like text-image retrieval and image generation. With efficient LoRA tuning of the LLM (a minimal LoRA sketch closes the examples after this list), the model shows promising zero-shot image captioning and visual question answering results, despite using a smaller dataset than previous multimodal models such as BLIP-2.
- Potential Applications: The research also highlights the promise of SEED across multimodal comprehension tasks and generative scenarios within LLM frameworks. Unifying visual and textual generation in a single LLM framework opens substantial possibilities for AI-driven approaches to multimodal understanding and representation.
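To make the causal-dependency point concrete, the following is a minimal sketch (not the authors' code) of how discrete visual tokens with a one-dimensional causal dependency can be folded into a standard left-to-right language model: the SEED codes are treated as extra vocabulary entries, and the usual next-token cross-entropy is applied to the interleaved text-and-image sequence. All model sizes and token counts below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 32000        # assumed text vocabulary size
NUM_VISUAL_CODES = 8192   # assumed SEED codebook size
VOCAB = TEXT_VOCAB + NUM_VISUAL_CODES
D_MODEL = 256

class TinyCausalLM(nn.Module):
    """Toy decoder-only LM standing in for an off-the-shelf LLM."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, ids):
        # Standard left-to-right causal mask, shared by text and visual tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.blocks(self.embed(ids), mask=mask)
        return self.lm_head(h)

# Interleave a text prefix with discrete visual tokens, offset into the extended vocabulary.
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
visual_ids = torch.randint(0, NUM_VISUAL_CODES, (1, 32)) + TEXT_VOCAB
seq = torch.cat([text_ids, visual_ids], dim=1)

model = TinyCausalLM()
logits = model(seq)
# One next-token prediction objective over the whole multimodal sequence.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
print(float(loss))
```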
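The tokenization pipeline itself can be summarized as a simple data flow. The sketch below follows the component names from the paper (ViT encoder, Causal Q-Former, VQ codebook), but the module internals, shapes, and sizes are placeholder assumptions rather than the authors' implementation; the Reverse Q-Former and UNet decoder are only indicated in comments.

```python
import torch
import torch.nn as nn

class CausalQFormer(nn.Module):
    """Placeholder: maps ViT patch features to a short causal sequence of query embeddings."""
    def __init__(self, dim=768, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_feats):
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        # Each query attends only to earlier queries -> one-dimensional causal dependency.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(q.size(1))
        q, _ = self.self_attn(q, q, q, attn_mask=causal_mask)
        out, _ = self.cross_attn(q, patch_feats, patch_feats)  # then cross-attend to image features
        return out                                             # (B, num_queries, dim) causal embeddings

class VQCodebook(nn.Module):
    """Placeholder nearest-neighbour quantizer producing discrete visual token ids."""
    def __init__(self, num_codes=8192, dim=768):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, x):
        codebook = self.codes.weight.unsqueeze(0).expand(x.size(0), -1, -1)
        ids = torch.cdist(x, codebook).argmin(-1)        # nearest code per embedding
        return ids, self.codes(ids)                      # discrete ids + quantized embeddings

# Data flow: image -> ViT -> Causal Q-Former -> VQ codebook -> Reverse Q-Former -> UNet decoder.
# The ViT, Reverse Q-Former, and UNet decoder are omitted here; training would combine an
# image-text contrastive loss on the causal embeddings with reconstruction objectives on the
# decoder side, as summarized in the list above.
vit_features = torch.randn(2, 257, 768)                  # fake ViT output (batch, patches, dim)
causal_embeddings = CausalQFormer()(vit_features)
token_ids, quantized = VQCodebook()(causal_embeddings)
print(token_ids.shape, quantized.shape)                  # (2, 32) and (2, 32, 768)
```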
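Finally, the efficient LoRA tuning mentioned above can be illustrated with a short sketch. It assumes the Hugging Face transformers and peft libraries; the base model (OPT-2.7B, the LLM paired with SEED in the paper's experiments) and the adapter hyperparameters are illustrative choices, not values reported in the paper.

```python
# Minimal LoRA sketch, assuming Hugging Face transformers + peft are installed.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b")  # assumed base LLM
lora_cfg = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections in OPT blocks
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # only the small adapter matrices are trainable
```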
Practical and Theoretical Implications
SEED’s approach could reduce the cost and complexity of training multimodal LLMs, chiefly by aligning image and text representations at the semantic level. The paper suggests that discrete visual tokens can support less resource-intensive, and therefore more environmentally sustainable, large-scale training. On the theoretical side, SEED is a foundational step toward giving autoregressive LLMs native image-processing capabilities, potentially paving the way for emergent multimodal LLMs with sophisticated visual interaction abilities.
Overall, SEED represents a significant contribution to the field of AI and machine learning, charting a path toward LLM frameworks that integrate visual processing in a comprehensive, semantically rich manner. The authors suggest that future work scale this approach to more powerful LLMs, such as LLaMA, to realize the full potential of SEED in multimodal comprehension and generation tasks.