
Seed1.5-VL Technical Report (2505.07062v1)

Published 11 May 2025 in cs.CV and cs.AI

Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

Summary

  • The paper introduces Seed1.5-VL, a multimodal model that pairs a 532M-parameter vision encoder with an MoE LLM using 20B active parameters, achieving state-of-the-art results on 38 of 60 public benchmarks.
  • It details innovative training stages and methods, including dynamic frame-resolution sampling, hybrid parallelism, and reinforcement learning for improved visual reasoning.
  • The report highlights practical applications in GUI control, gameplay, and video tasks while discussing challenges in fine-grained perception and complex spatial reasoning.

Seed1.5-VL is a vision-language foundation model developed by ByteDance Seed, designed for general-purpose multimodal understanding and reasoning. The model features a relatively compact architecture, consisting of a 532-million-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20 billion active parameters. The technical report details the experiences in building this model, covering its design, data construction, training methodologies, and infrastructure innovations, aiming to inspire further research in the field. Seed1.5-VL has demonstrated strong performance across numerous public and internal benchmarks, achieving state-of-the-art (SOTA) results on 38 out of 60 public benchmarks, and excelling in agent-centric tasks like GUI control and gameplay. The model is accessible on Volcano Engine.

The architecture of Seed1.5-VL comprises three main components: the Seed-ViT vision encoder, an MLP adapter, and a pre-trained LLM. Seed-ViT is specifically designed for native-resolution feature extraction and incorporates 2D RoPE for flexible positional encoding across varying image dimensions. It processes images by first resizing to a multiple of 28×28, segmenting into 14×14 patches, projecting to tokens, and applying attention masks for batched images, followed by a 2×2 average pooling. Encoder-free architectures were not pursued due to the efficiency of the vision encoder in image compression. For video inputs, Seed1.5-VL uses a Dynamic Frame-Resolution Sampling strategy that adjusts frame rate (1, 2, or 5 FPS) and spatial resolution (from six predefined levels) dynamically to balance semantic detail and computational cost, while staying within a maximum token budget of 81,920 tokens per video. Timestamp tokens are prepended to each frame to enhance temporal awareness.
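
The sampling policy can be thought of as a small search over frame rate and resolution under the token budget. Below is a minimal sketch of such a policy; the FPS candidates and the 81,920-token budget come from the description above, while the resolution levels, the per-frame token estimate, and the preference for frame rate over resolution are illustrative assumptions rather than the report's actual algorithm.

```python
# Minimal sketch of budget-constrained dynamic frame-resolution sampling.
# FPS candidates and the 81,920-token budget follow the report; the resolution
# levels and per-frame token estimate below are illustrative assumptions.

MAX_VIDEO_TOKENS = 81_920
FPS_CANDIDATES = [5, 2, 1]                           # prefer denser temporal sampling first
RESOLUTION_LEVELS = [640, 512, 448, 384, 320, 256]   # hypothetical predefined levels (px)

def tokens_per_frame(resolution: int, patch: int = 14, pool: int = 2) -> int:
    """Rough token count for one square frame: 14x14 patches, then 2x2 average pooling."""
    patches_per_side = resolution // patch
    return (patches_per_side // pool) ** 2

def choose_sampling(duration_sec: float) -> tuple[int, int]:
    """Pick the densest (fps, resolution) pair whose total token count fits the budget."""
    for fps in FPS_CANDIDATES:
        for res in RESOLUTION_LEVELS:
            n_frames = int(duration_sec * fps)
            if n_frames * tokens_per_frame(res) <= MAX_VIDEO_TOKENS:
                return fps, res
    # Fall back to the sparsest setting; very long videos may still need frame truncation.
    return FPS_CANDIDATES[-1], RESOLUTION_LEVELS[-1]

print(choose_sampling(120.0))  # -> (5, 320) under these assumed resolution levels
```

Under a fixed budget, longer videos trade spatial resolution for temporal coverage, which mirrors the balance between semantic detail and computational cost described above.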

The pre-training of Seed1.5-VL is conducted on a large corpus of 3 trillion multimodal tokens. This data is meticulously curated across various categories to build specific capabilities:

  • Generic Image-Text Pairs: Web-sourced data is filtered for noise and re-balanced to address long-tail distributions of visual concepts by using a precursor VLM for semantic domain/entity annotation and duplicating data from underrepresented domains.
  • OCR: An in-house dataset of over 1 billion samples covers diverse formats like documents, scene text, tables, and charts. Synthetic data is generated using tools like SynthDog and LaTeX, and an LLM-based pipeline for charts. Data augmentation techniques like blurring and distortion enhance robustness. A VQA dataset complements structured data to improve textual content comprehension.
  • Visual Grounding & Counting: Training uses bounding box and center point representations. Data sources include public datasets (Objects365, OpenImages, RefCOCO/+/g), filtered and transformed into multi-task training data (2D grounding, spatial QA, visual prompt QA). An automatic annotation pipeline leveraging Grounding DINO is used for scaling up. Point data comes from PixMo-Points, Molmo, and CountGD. Counting data is derived from box and point annotations. All coordinate values are normalized to [0, 999] (a sketch of this normalization follows the list).
  • 3D Spatial Understanding: Data is constructed for relative depth sorting (using DepthAnything V2), absolute depth estimation (from public datasets), and 3D grounding (reformulated into QA pairs).
  • Video: Data includes general understanding (captioning, QA, action recognition/grounding), temporal grounding/moment retrieval (explicit timestamp prediction), and video streaming data (interleaved caption/QA, proactive reasoning, realtime commentary).
  • STEM: Data integrates image comprehension (grounding samples, synthetic tables/diagrams, captions, VQA) and problem-solving data (K12 exercises, adult education problems, English image-associated questions), using hybrid acquisition strategies (manual, automated synthesis, quality control).
  • GUI: Data primarily from UI-TARS covers web, app, and desktop environments with structured metadata. Tasks include element description, dense captioning, state transition captioning (perception), coordinate prediction (grounding), and multi-step trajectories (reasoning).
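
As the grounding and counting item notes, coordinate targets are normalized to [0, 999]. A minimal sketch of that normalization, assuming pixel-space boxes in (x_min, y_min, x_max, y_max) order and hypothetical helper names:

```python
def normalize_box(box, img_w, img_h, scale=999):
    """Map a pixel-space box (x_min, y_min, x_max, y_max) to integers in [0, scale]."""
    x_min, y_min, x_max, y_max = box
    return (
        round(x_min / img_w * scale),
        round(y_min / img_h * scale),
        round(x_max / img_w * scale),
        round(y_max / img_h * scale),
    )

def denormalize_point(point, img_w, img_h, scale=999):
    """Inverse mapping for a predicted center point back to pixel coordinates."""
    x, y = point
    return x / scale * img_w, y / scale * img_h

# Example: a box in a 1000x1000 image maps almost one-to-one onto the [0, 999] grid.
print(normalize_box((100, 200, 300, 400), 1000, 1000))  # -> (100, 200, 300, 400)
```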

The VLM pre-training process is structured in three stages, building upon a pre-trained Seed-ViT and an internal 20B active parameter MoE LLM:

  1. Stage 0: Align vision encoder and LLM by training only the MLP adapter (vision encoder and LLM frozen).
  2. Stage 1: All parameters trainable, focusing on knowledge accumulation and visual grounding/OCR on a 3 trillion token multimodal corpus. Small amounts of text-only and instruction-following data are added.
  3. Stage 2: Balanced data mixture across tasks, including new domains (video, coding, 3D), and increased sequence length (up to 131,072 tokens). All parameters trainable.
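
A minimal sketch of the per-stage parameter freezing described above, using PyTorch-style stand-in modules (the module names and sizes are assumptions, not the actual Seed1.5-VL training code):

```python
import torch.nn as nn

# Stand-in modules; the real Seed-ViT, MLP adapter, and MoE LLM are far larger.
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(1024, 1024),
    "adapter": nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096)),
    "llm": nn.Linear(4096, 4096),
})

# Which components are trainable in each pre-training stage.
STAGE_TRAINABLE = {
    0: {"adapter"},                             # Stage 0: align ViT and LLM via the adapter only
    1: {"vision_encoder", "adapter", "llm"},    # Stage 1: all parameters trainable
    2: {"vision_encoder", "adapter", "llm"},    # Stage 2: all parameters trainable, longer sequences
}

def configure_stage(stage: int):
    """Freeze or unfreeze each component and return the list of trainable parameters."""
    trainable = []
    for name, module in model.items():
        requires_grad = name in STAGE_TRAINABLE[stage]
        for p in module.parameters():
            p.requires_grad = requires_grad
        if requires_grad:
            trainable.extend(module.parameters())
    return trainable

optimizer_params = configure_stage(0)  # Stage 0: only the adapter's parameters are returned
```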

Scaling analysis indicates that the training loss for data sub-categories follows power laws, and there is a log-linear relationship between training loss and downstream evaluation metrics, suggesting that performance can likely improve with increased model size and training compute.
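
Written out, these fitted relationships take the standard functional forms below; the symbols and constants are generic placeholders, since the report's specific fits are not reproduced here.

```latex
% Power-law fit of the training loss L for a data sub-category as a function of
% training compute (or tokens) C, with fitted constants a, b and an irreducible
% loss term L_inf:
L(C) = a\,C^{-b} + L_{\infty}

% Log-linear relationship between a downstream evaluation metric M and the
% training loss, with per-benchmark fitted constants alpha and beta:
M \approx \alpha - \beta \log L(C)
```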

Post-training enhances instruction-following and reasoning through Supervised Fine-tuning (SFT) and Reinforcement Learning (RL).

  • SFT: Uses curated cold-start data, comprising General Instruction data (concise responses) and Long Chain-of-Thought (LongCoT) data (step-by-step reasoning). Data is constructed through crowdsourcing, curated open-source data, prompt engineering, and rejection sampling using an LLM-as-a-judge and Reward Model. A self-instruct methodology synthesizes complex prompts. Training is done on a combined dataset, freezing the vision encoder but training other parameters.
  • RLHF & RLVR: Employs Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR). Preference data for RLHF is collected through human annotation (list-wise comparison, 5-scale rating) and synthetic data (ground-truth based correctness evaluation). A VLM is trained as a generative classifier Reward Model. RLVR is used for tasks with verifiable solutions like Visual STEM (math problems verified by matching symbolic expressions), Grounding (IoU of predicted boxes/points), Visual Instruction Following (regex verification), and Visual Puzzles/Games (regex or outcome verification). The "Spot the Differences" game is used as a testbed, training the model to output both natural language descriptions and bounding boxes for differences, using synthetic data generated via inpainting or SVG modification.
  • Hybrid RL: Utilizes a variant of the PPO algorithm combining RM and verifier rewards. Key implementations include a format reward, hybrid reward signal mixing, a shared critic model initialized from the RM, distinct KL coefficients for different prompt types, and a detailed training recipe involving specific sequence lengths, batch sizes, and sampling strategies. An iterative update strategy uses rejection sampling fine-tuning, where correct responses from the latest RL model on challenging prompts are incorporated into the SFT data for subsequent SFT releases.
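
A minimal sketch of how a hybrid reward might combine a verifier signal with a reward-model score, using the grounding IoU check as the verifiable case; the mixing weights, threshold, and format penalty are illustrative assumptions rather than the report's actual recipe.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_verifier(pred_box, gt_box, threshold=0.5):
    """Verifiable reward for grounding prompts: 1.0 if the IoU clears a threshold."""
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0

def hybrid_reward(rm_score, verifier_score, format_ok, w_rm=0.5, w_verifier=1.0):
    """Mix reward-model and verifier signals, with a simple format reward term."""
    reward = w_rm * rm_score + w_verifier * verifier_score
    if not format_ok:
        reward -= 1.0  # penalize malformed outputs regardless of content
    return reward

# Example: a well-formatted answer whose predicted box overlaps the ground truth.
score = hybrid_reward(
    rm_score=0.7,
    verifier_score=grounding_verifier((10, 10, 50, 50), (12, 8, 52, 48)),
    format_ok=True,
)
```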

The large-scale training requires significant infrastructure innovations. A novel hybrid parallelism approach uses ZeRO data parallelism for the vision encoder/adapter and 4-D parallelism (expert, interleaved pipeline, ZeRO-1, context) for the LLM, addressing the architectural asymmetry. Workload balancing redistributes vision data greedily based on computation intensity, using group-wise balancing. A parallelism-aware data loader minimizes I/O overhead by having only one GPU per pipeline parallelism group load data and filtering unnecessary images before GPU transfer, using prefetching for overlap. Fault tolerance is achieved using the MegaScale framework and ByteCheckpoint for efficient checkpointing. The post-training framework uses a verl-based system with single and multi-controllers, deploying verifiers as isolated services and leveraging efficient techniques for actor/critic updates, rollout generation (vLLM), and reward model inference.
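
A minimal sketch of greedy workload balancing across data-parallel ranks, assigning each sample to the currently least-loaded rank by an estimated vision-compute cost; using image token count as the cost proxy is an assumption, and the report's group-wise scheme is not reproduced here.

```python
import heapq

def balance_workload(samples, num_ranks):
    """Greedily assign samples to ranks so the estimated vision compute stays even.

    samples: list of (sample_id, image_token_count) pairs
    Returns one list of sample_ids per rank.
    """
    # Place heavy samples first so small ones can fill the remaining gaps.
    ordered = sorted(samples, key=lambda s: s[1], reverse=True)
    heap = [(0, rank) for rank in range(num_ranks)]   # (current load, rank index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    for sample_id, cost in ordered:
        load, rank = heapq.heappop(heap)              # least-loaded rank so far
        assignment[rank].append(sample_id)
        heapq.heappush(heap, (load + cost, rank))
    return assignment

# Example: spread eight samples with uneven token counts over four ranks.
batches = balance_workload(list(enumerate([900, 120, 640, 300, 75, 480, 210, 60])), num_ranks=4)
```

Keeping per-rank compute roughly even matters because the slowest rank gates each step; the same idea extends to the group-wise balancing the report describes.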

Evaluation on public benchmarks shows Seed1.5-VL's strength across various capabilities. The pre-trained Seed-ViT performs comparably to larger models on zero-shot classification. On image tasks, Seed1.5-VL achieves SOTA in several reasoning (MathVista, VLM are Blind, ZeroBench (sub), VisuLogic), general VQA (RealWorldQA, SimpleVQA, MMStar), document/chart (TextVQA, InfographicVQA, DocVQA), grounding/counting (BLINK, LVIS-MG, VisualWebBench, RefCOCO-avg, CountBench, FSC-147), and 3D spatial understanding benchmarks (DA-2K, NYU-Depth V2, All-Angles Bench), often leading or ranking second to top proprietary models like Gemini 2.5 Pro. Its performance on video tasks is also strong, achieving SOTA in short video (MotionBench, TVBench, Dream-1K, TempCompass) and streaming video (OVBench, OVOBench, StreamBench, StreamingBench proactive) benchmarks, and video grounding (Charades-STA, TACoS). It is competitive in long video but trails some models in video reasoning.

In agent-centric tasks, Seed1.5-VL demonstrates exceptional performance. On GUI grounding benchmarks (ScreenSpot-V2, ScreenSpot-Pro), it outperforms OpenAI CUA and Claude 3.7 Sonnet. In GUI agent tasks across computer, browser, and phone use (OSWorld, Windows Agent Arena, WebVoyager, Online-Mind2Web, AndroidWorld), it consistently leads or ranks second, significantly surpassing other foundation VLMs. For gameplay (14 Poki games), Seed1.5-VL achieves higher scores/levels across multiple games compared to UI-TARS, OpenAI CUA, and Claude 3.7 Sonnet, and shows robust inference-time scaling as interaction rounds increase.

Internal benchmarks, designed to be more challenging and address limitations of public benchmarks (language bias, saturation, evaluation methods), confirm Seed1.5-VL's strong performance, ranking second overall among compared models including Gemini 2.5 Pro and OpenAI models. It excels in OOD, Agent, and Atomic Instruction Following categories, and demonstrates OOD generalization abilities through examples like solving Rebus puzzles, debugging code from images, and generating Mermaid code from diagrams. Its usefulness rate in an internal chatbot is also competitive.

Despite its strengths, Seed1.5-VL has limitations. It struggles with fine-grained visual perception tasks like counting irregularly arranged/occluded objects, identifying subtle differences, and interpreting complex spatial relationships. Higher-level reasoning challenges persist, particularly in tasks trivial for humans but requiring combinatorial search or complex 3D spatial imagination. Temporal and multi-image reasoning also show limitations. Hallucination remains a challenge, with the model sometimes prioritizing learned knowledge over visual input.

In conclusion, Seed1.5-VL is a powerful multimodal model demonstrating SOTA or highly competitive performance across a wide range of vision, video, and agent tasks. Scaling analysis suggests potential for further improvement with increased parameters and compute. Future work will focus on addressing current limitations, including improving robust 3D spatial reasoning, mitigating hallucination, enabling complex combinatorial search via tool use, and exploring unification with image generation for visual Chain-of-Thought.
