InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (2407.03320v1)

Published 3 Jul 2024 in cs.CV and cs.CL

Abstract: We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. The InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

The paper presents InternLM-XComposer-2.5 (IXC-2.5), a Large Vision-Language Model (LVLM) that emphasizes long-contextual input and output, enabling a range of sophisticated applications across text-image comprehension and composition. The model marks substantial progress over its predecessor, IXC-2.0, owing primarily to its enhanced architecture and expanded capabilities.

Key Model Enhancements

IXC-2.5 incorporates three primary enhancements aimed at advancing vision-language comprehension:

  1. Ultra-High Resolution Understanding: Utilizing a 560 × 560 Vision Transformer (ViT) encoder, IXC-2.5 processes high-resolution images of arbitrary aspect ratio (a rough tiling sketch follows this list).
  2. Fine-Grained Video Understanding: Videos are treated as high-resolution composite images comprising numerous frames, capturing fine details via dense sampling of each frame.
  3. Multi-Turn Multi-Image Dialogue: The model supports extended, complex interactions involving multiple images over many turns, improving the fluidity of human-like conversations.
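
The first two upgrades can be pictured with a short sketch. The grid-selection rule and frame-sampling parameters below are illustrative assumptions drawn from the description above, not the exact recipe used by IXC-2.5:

```python
from PIL import Image

TILE = 560  # the ViT input resolution named above

def split_into_tiles(img: Image.Image, tile: int = TILE, max_tiles: int = 12):
    """Illustrative dynamic tiling: resize the image to a grid of tile-sized
    crops that roughly preserves its aspect ratio, then cut it up. The grid
    bounds and selection rule here are assumptions, not IXC-2.5's exact ones."""
    w, h = img.size
    cols = max(1, min(max_tiles, round(w / tile)))
    rows = max(1, min(max_tiles // cols, round(h / tile)))
    resized = img.resize((cols * tile, rows * tile))
    tiles = [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows) for c in range(cols)
    ]
    # A downscaled global view is commonly kept alongside the local tiles.
    return [img.resize((tile, tile))] + tiles

def video_as_composite(frames: list[Image.Image], max_frames: int = 64):
    """Treat a video as a 'high-resolution composite image': sample frames
    densely and feed each one through the same tiling/encoding path."""
    step = max(1, len(frames) // max_frames)
    return [f.resize((TILE, TILE)) for f in frames[::step]]
```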

Furthermore, IXC-2.5 extends its capabilities to two critical text-image composition applications:

  1. Crafting Webpages: Leveraging additional LoRA parameters, IXC-2.5 can generate webpage source code from text-image instructions (a PEFT-style sketch of this adapter setup follows the list).
  2. Composing High-Quality Text-Image Articles: Implementing Chain-of-Thought (CoT) and Direct Preference Optimization (DPO) techniques, IXC-2.5 produces high-quality written content with corresponding images.
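
Both composition applications reuse the same 7B backbone and add only small task-specific LoRA weights. The sketch below shows the general pattern with the HuggingFace PEFT library; the adapter paths and names are hypothetical, and IXC-2.5 distributes its own adapter weights through its repository:

```python
# Minimal sketch of the "extra LoRA parameters" idea using HuggingFace PEFT.
# Adapter paths/names are hypothetical placeholders, not IXC-2.5 release artifacts.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm-xcomposer2d5-7b", trust_remote_code=True
)

# Attach a task-specific LoRA adapter for webpage generation ...
model = PeftModel.from_pretrained(base, "path/to/webpage-lora", adapter_name="webpage")
# ... and another for text-image article composition.
model.load_adapter("path/to/article-lora", adapter_name="article")

# Only the small LoRA matrices differ between tasks; the 7B backbone is shared.
model.set_adapter("webpage")   # route webpage-crafting requests here
model.set_adapter("article")   # or switch to article composition
```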

Training and Model Architecture

The IXC-2.5 model emphasizes long-contextual interaction: it is trained with 24K-token interleaved image-text contexts and can extend to 96K tokens via RoPE extrapolation. The architecture comprises the OpenAI ViT-L/14 vision encoder, the InternLM2-7B language model, and Partial LoRA modules that align visual features with the LLM.
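
One common way to run a RoPE-based model beyond its 24K training length is to rescale positions before computing the rotary angles (position interpolation); the sketch below illustrates that idea, though the exact extrapolation scheme used by IXC-2.5 may differ:

```python
import torch

def rope_frequencies(head_dim: int, max_pos: int, base: float = 10000.0,
                     scale: float = 1.0):
    """Rotary position embedding angles. A model trained at 24K tokens can be
    stretched toward 96K by rescaling positions (scale=4.0), one standard way
    to extend RoPE; the paper's precise recipe is not reproduced here."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float() / scale
    angles = torch.outer(positions, inv_freq)      # shape: (max_pos, head_dim / 2)
    return torch.cos(angles), torch.sin(angles)

# Trained at 24K tokens, run at 96K by interpolating positions 4x.
cos_24k, sin_24k = rope_frequencies(128, 24_000)
cos_96k, sin_96k = rope_frequencies(128, 96_000, scale=96_000 / 24_000)
```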

The pre-training phase targets three objectives drawn from a diverse dataset: General Semantic Alignment, World Knowledge Alignment, and Vision Capability Enhancement. This stage prepares the model to handle a wide range of vision-language inputs.
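
As a rough illustration only, the three objectives can be thought of as a weighted data mixture; the dataset kinds and sampling weights below are hypothetical placeholders, not the paper's actual recipe:

```python
import random

# Hypothetical data-mixture config for the three pre-training objectives.
PRETRAIN_MIXTURE = {
    "general_semantic_alignment":    {"kind": "image-caption pairs",              "weight": 0.4},
    "world_knowledge_alignment":     {"kind": "interleaved image-text documents", "weight": 0.3},
    "vision_capability_enhancement": {"kind": "OCR / chart / grounding data",     "weight": 0.3},
}

def sample_task(rng: random.Random) -> str:
    """Pick a pre-training objective in proportion to its mixture weight."""
    names = list(PRETRAIN_MIXTURE)
    weights = [PRETRAIN_MIXTURE[n]["weight"] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

print(sample_task(random.Random(0)))  # e.g. 'general_semantic_alignment'
```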

Benchmark Performance

IXC-2.5 demonstrates state-of-the-art performance across a wide range of benchmarks:

  • Video Understanding: Outperformed existing models on four out of five benchmarks, including MVBench and MME-Video, underscoring its proficiency in fine-grained video tasks.
  • Structural High-Resolution Benchmarks: Achieved notable results on DocVQA, ChartQA, and TextVQA, demonstrating its capacity to handle complex, text-rich visual information.
  • General Visual QA Benchmarks: Excelled in MMStar, RealWorldQA, and others, showcasing its versatility.
  • Multi-Image Multi-Turn Dialogue: Surpassed prior models in MMDU, highlighting advanced conversational abilities.

Webpage Generation and Article Composition

For webpage generation, IXC-2.5 covers three settings:

  1. Screenshot-to-code: Achieved high scores on the Design2Code benchmark, demonstrating near GPT-4V-level performance in translating visual designs into code (a usage sketch follows this list).
  2. Instruction-Aware Webpage Generation: Trained on synthetic and real-world datasets to convert textual instructions into webpage designs, including interactive elements like JavaScript.
  3. Resume-to-homepage: Created personal homepages from resumes, showcasing practical applicability.
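
A hypothetical usage sketch for the screenshot-to-code setting is shown below, assuming the chat-style interface exposed by the public checkpoint; the exact method names and arguments should be checked against https://github.com/InternLM/InternLM-XComposer:

```python
# Hypothetical usage sketch; the API shape is an assumption, not the documented interface.
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "internlm/internlm-xcomposer2d5-7b"
model = AutoModel.from_pretrained(ckpt, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

query = "Reproduce this page as a single HTML file with embedded CSS."
images = ["./screenshot.png"]  # design screenshot to translate into code

with torch.no_grad():
    html, _history = model.chat(tokenizer, query, images, do_sample=False)
print(html)
```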

For article composition, the model uses a multi-step pipeline of supervised fine-tuning, reward modeling, preference-data collection, and DPO alignment, resulting in stable, high-quality text-image articles.
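
The final alignment step can be summarized by the standard DPO objective; the sketch below is a generic implementation of that loss, not the paper's training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard Direct Preference Optimization loss: push the policy to prefer
    the 'chosen' article over the 'rejected' one relative to a frozen reference
    model. Inputs are summed log-probabilities of each completion."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy call with made-up log-probabilities, for illustration only.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                torch.tensor([-13.0]), torch.tensor([-15.0]))
```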

Implications and Future Directions

The enhancements and comprehensive capabilities of IXC-2.5 make it a robust tool for a variety of practical applications, from webpage design to intricate visual QA tasks. The model’s ability to handle long-contextual interactions positions it as a pivotal advancement for future AI developments. Future research can extend IXC-2.5’s long-context capabilities to more complex and extended multi-modal environments, such as continuous video streams or prolonged dialogue histories, thus broadening its applicability in real-world scenarios.

Authors (27)
  1. Pan Zhang (153 papers)
  2. Xiaoyi Dong (73 papers)
  3. Yuhang Zang (54 papers)
  4. Yuhang Cao (41 papers)
  5. Rui Qian (50 papers)
  6. Lin Chen (384 papers)
  7. Qipeng Guo (72 papers)
  8. Haodong Duan (55 papers)
  9. Bin Wang (750 papers)
  10. Linke Ouyang (12 papers)
  11. Songyang Zhang (116 papers)
  12. Wenwei Zhang (77 papers)
  13. Yining Li (29 papers)
  14. Yang Gao (761 papers)
  15. Peng Sun (210 papers)
  16. Xinyue Zhang (63 papers)
  17. Wei Li (1121 papers)
  18. Jingwen Li (29 papers)
  19. Wenhai Wang (123 papers)
  20. Hang Yan (86 papers)
Citations (51)