Skywork-R1V4: 30B Agentic Multimodal Model

Updated 6 December 2025
  • Skywork-R1V4 is a state-of-the-art 30 billion parameter multimodal model that integrates planning, visual manipulation, and deep web search to achieve advanced long-horizon reasoning.
  • The model features a modular architecture with a frozen visual encoder, a large Transformer language model, and a lightweight planning head for dynamic alternation between image operations and search queries.
  • Supervised fine-tuning on fewer than 30,000 high-quality, execution-validated planning trajectories enables precise stepwise execution and emergent compositional reasoning.

Skywork-R1V4 is a 30 billion parameter (30B, A3B) agentic multimodal model that unifies multimodal planning, active visual thinking, deep web search, and interleaved stepwise reasoning. Unlike prior approaches that rely on large-scale reinforcement learning, Skywork-R1V4 is trained exclusively via supervised fine-tuning on fewer than 30,000 high-quality, execution-validated planning trajectories. The paradigm orchestrates “thinking with images” (code-driven visual manipulation) with external “DeepSearch” (multimodal retrieval), alternating dynamically between visual and search-based reasoning. The model achieves state-of-the-art results on major benchmarks, exhibits emergent long-horizon planning, and demonstrates agentic multimodal intelligence without reinforcement learning (Zhang et al., 2 Dec 2025).

1. Model Architecture and Design

Skywork-R1V4 is built upon a 30B parameter A3B backbone, comparable in scale to Qwen3-VL 30B, and features an integrated, lightweight planning head. The architecture consists of the following modules:

  • Visual Encoder: ViT-style encoder, 12 layers, hidden size $H_v = 1024$, with $N_v = 16 \times 16$ image patches. The encoder is frozen, projecting each patch to a 128-dimensional embedding.
  • Transformer LLM (LM): 64 layers, hidden size $H_t = 12288$, 96 attention heads, and rotary position embeddings. Total LM parameters: $P_{\rm LM} \approx L\cdot(4H_t^2 + 2H_t^2/A) \approx 28\times10^9$.
  • Cross-modal Fusion: Inserted every 4 Transformer blocks; computes

$$Z = \mathrm{softmax}\!\left(\frac{XW_Q\,(YW_K)^\top}{\sqrt{H_t}}\right) YW_V$$

where $X$ and $Y$ are the visual and text embeddings, $W_Q, W_K, W_V \in \mathbb{R}^{H_t \times H_t}$; $Z$ is injected via residual addition (a minimal sketch appears at the end of this section).

  • Planner Head: 12-layer Transformer, hidden size $H_p = 2048$, 32 heads, mapping context into structured (tool, parameters) sequences. Planner parameters $P_{\rm plan} \approx 0.8\times10^9$.
  • Parameter Summary:

| Component | Layers / Heads | Hidden Size | Parameter Count |
|---------------|------------------|------------------|---------------------|
| Visual Encoder | 12 | 1024 | $P_{\rm enc}$ (fixed) |
| LM | 64 / 96 | 12288 | $\sim 28\times10^9$ |
| Planner | 12 / 32 | 2048 | $0.8\times10^9$ |
| Total | - | - | $30\times10^9$ |

This modular structure enables tight cross-modal integration and hierarchical planning.
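
For concreteness, the fusion step can be written as a minimal single-head PyTorch sketch. This is an illustrative reading of the equation above under simplifying assumptions (single head, unbatched 2-D tensors, no normalization), not the released implementation, which applies the fusion every 4 Transformer blocks with batched, multi-head projections.

import torch

def cross_modal_fusion(X, Y, W_Q, W_K, W_V):
    # X: (N_vis, H_t) visual embeddings (queries); Y: (N_txt, H_t) text embeddings
    # (keys/values); W_Q, W_K, W_V: (H_t, H_t) projections, as in the equation above.
    H_t = X.shape[-1]
    scores = (X @ W_Q) @ (Y @ W_K).T / (H_t ** 0.5)   # (N_vis, N_txt)
    Z = torch.softmax(scores, dim=-1) @ (Y @ W_V)     # (N_vis, H_t)
    return Z                                          # added residually into the LM stream

# Toy shapes, far smaller than the real H_t = 12288.
H_t, N_vis, N_txt = 64, 16, 8
X, Y = torch.randn(N_vis, H_t), torch.randn(N_txt, H_t)
W_Q, W_K, W_V = (0.02 * torch.randn(H_t, H_t) for _ in range(3))
Z = cross_modal_fusion(X, Y, W_Q, W_K, W_V)           # Z.shape == (16, 64)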

2. Training Dataset and Supervised Fine-Tuning

The model was trained on fewer than 30,000 curated multimodal agentic trajectories, partitioned as follows:

  • Think-With-Image rollouts: ≈8 K samples
  • Basic Search (FVQA/MMSearch) trajectories: ≈7 K samples
  • Enhanced Search (graph-walk queries): ≈5 K samples
  • Interleaved Image + Search (LiveVQA): ≈3 K samples
  • Planner demonstrations: ≈6 K samples

All training samples undergo step-wise consistency filtering: each trajectory is accepted only if every action executes cleanly (e.g., code runs without error and produces the intended observation), and the final answer matches ground truth as assessed by an external judge (o3-mini). The filtering is formalized as:

def FilterTrajectory(traj, ground_truth):
    # Reject if any step fails to execute or yields an inconsistent observation.
    for step in traj:
        if not StepExecutesCleanly(step.code):
            return False
        if not ConsistentObservation(step, step.next_think):
            return False
    # Reject if the external judge (o3-mini) disagrees with ground truth.
    if JudgeModel(traj.final_answer) != ground_truth:
        return False
    return True
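
Applied over raw rollouts, this filter yields the execution-validated SFT set; the loop below is a hypothetical usage sketch (the data layout and ground_truth field are assumptions, not specified in the source).

def build_sft_set(candidate_trajs):
    # Keep only trajectories in which every step executed cleanly and the
    # judged final answer matches ground truth (hypothetical data layout).
    return [t for t in candidate_trajs if FilterTrajectory(t, t.ground_truth)]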

Supervised fine-tuning is performed using standard cross-entropy over the joint token stream (text and tool calls):

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(w_t \mid w_{<t},\, I\right)$$

with $w_{1:T}$ as targets and $I$ as the input image embedding.
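
A minimal PyTorch sketch of this objective over the joint text + tool-call stream is shown below. The loss-masking of injected observation tokens is an assumption; the source does not specify which positions are supervised.

import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, loss_mask):
    # logits:     (B, T, V) predictions for w_t given w_{<t} and image I
    # target_ids: (B, T)    gold tokens (text, <think>, and <tool_call> markup)
    # loss_mask:  (B, T)    1.0 for supervised positions, 0.0 for injected
    #                       <observation> tokens (masking is an assumption)
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        reduction="none",
    ).reshape(target_ids.shape)
    return (per_token * loss_mask).sum() / loss_mask.sum().clamp(min=1.0)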

3. Interleaved Reasoning and Controller Design

Interleaved reasoning is the process by which Skywork-R1V4 alternates between visual manipulation (“thinking with images”) and various deep web-based retrieval operations. At each reasoning step, the planner head determines, via learned binary gating, whether to execute a visual operation or initiate a search query. The process can be represented as:

Algorithm InterleaveController:
  Input: image I, question Q
  State ← {I, Q}, step = 1
  while not Done(State):
    if NeedsImageOp(State):
      action ← PlanImageOp(State)
      observation ← ExecuteCode(action.code)
    else if NeedsSearch(State):
      action ← PlanSearch(State)
      observation ← CallTool(action.tool, action.args)
    else:
      break
    State ← UpdateState(State, action, observation)
    step += 1
  answer ← GenerateAnswer(State)
  return trajectory, answer

This dynamic alternation is key to solving long-horizon, complex agentic tasks.
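
The NeedsImageOp / NeedsSearch / Done tests above correspond to the planner head's learned gating. A hypothetical sketch of such a gate as a small classification head over the planner's pooled hidden state is given below; the three-way formulation and the pooling are assumptions, not the released design.

import torch
import torch.nn as nn

class PlannerGate(nn.Module):
    # Hypothetical gate: maps a pooled planner state (H_p = 2048) to a choice
    # among {image_op, search, stop}, standing in for NeedsImageOp / NeedsSearch / Done.
    ACTIONS = ("image_op", "search", "stop")

    def __init__(self, hidden_size: int = 2048):
        super().__init__()
        self.proj = nn.Linear(hidden_size, len(self.ACTIONS))

    def forward(self, planner_state: torch.Tensor) -> str:
        # planner_state: (hidden_size,) summary of the action-observation history
        logits = self.proj(planner_state)
        return self.ACTIONS[int(torch.argmax(logits))]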

4. Tool Library and Execution Traces

The integrated toolset includes:

  • <code>: Python-based image operations (crop, zoom, contrast, rotate) executed in a sandbox.
  • <image_search>: Google Lens reverse-image search.
  • <text_search>: Web search via Serper API.
  • <web_visit>: Full-page retrieval and summarization using Qwen3-32B.

Execution is organized as a stepwise trajectory of <think> and <tool_call> pairs, with observations updating the model state at each step. An abridged version of a representative 10+ step trace is shown below:

<think>Focus on whole image to locate title card</think>
<tool_call>{"name":"code","arguments":{"code":"crop(I,0,0,1024,256)"}}</tool_call>
<observation>…cropped image…</observation>
<think>Text is too small—enhance contrast</think>
<tool_call>{"name":"code","arguments":{"code":"contrast(crop1,1.5)"}}</tool_call>
<observation>…enhanced image…</observation>
<think>Identify object via reverse image search</think>
<tool_call>{"name":"image_search","arguments":{"image_paths":["enhanced1.png"]}}</tool_call>
<observation>…search results…</observation>
<think>Extract candidate keywords</think>
<tool_call>{"name":"text_search","arguments":{"queries":["\"Fall Detection\" smartwatch crown"]}}</tool_call>
<observation>…web snippets…</observation>
<think>Confirm with authoritative site</think>
<tool_call>{"name":"web_visit","arguments":{"urls":["https://example.com/sos-feature"]}}</tool_call>
<observation>…page summary…</observation>
<think>Combine visual and web info to answer</think>
<tool_call>{"name":"none","arguments":{}}</tool_call>
<answer>Pressing the crown five times launches Emergency SOS.</answer>

Each <think> acts as a gating point for the planner, incorporating cumulative action-observation history.
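
The trace format implies a thin dispatcher that parses each <tool_call> JSON payload and routes it to the corresponding backend. The sketch below is hypothetical: the backend stubs stand in for the sandboxed Python executor, Google Lens, Serper, and the Qwen3-32B summarizer, none of whose APIs are shown in the source.

import json

# Hypothetical backend stubs (the real sandbox and search APIs are not shown).
def run_in_sandbox(code):
    return f"<observation for {code!r}>"

def reverse_image_search(image_paths):
    return f"<Google Lens results for {image_paths}>"

def serper_search(queries):
    return f"<web snippets for {queries}>"

def visit_and_summarize(urls):
    return f"<Qwen3-32B page summaries for {urls}>"

TOOL_REGISTRY = {
    "code":         lambda a: run_in_sandbox(a["code"]),
    "image_search": lambda a: reverse_image_search(a["image_paths"]),
    "text_search":  lambda a: serper_search(a["queries"]),
    "web_visit":    lambda a: visit_and_summarize(a["urls"]),
}

def dispatch_tool_call(payload_json):
    # Parse the JSON inside a <tool_call> tag and route it to its backend.
    payload = json.loads(payload_json)
    name, args = payload["name"], payload.get("arguments", {})
    if name == "none":  # terminal step: the model emits <answer> directly
        return None
    return TOOL_REGISTRY[name](args)

# e.g. dispatch_tool_call('{"name":"code","arguments":{"code":"crop(I,0,0,1024,256)"}}')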

5. Empirical Results and Benchmarks

Skywork-R1V4 achieves state-of-the-art results on multimodal search and reasoning benchmarks, improving substantially over the Qwen3-VL 30B baseline (deltas shown in parentheses) and surpassing Gemini 2.5 Flash on MMSearch and FVQA:

| Task | Skywork-R1V4 | Qwen3-VL 30B | Gemini 2.5 Flash |
|---------------|------------------|------------------|---------------------|
| MMSearch | 66.1 (+47.4) | 18.7 | 64.9 |
| FVQA | 67.2 (+13.9) | 53.3 | 60.7 |
| BrowseComp-VL | 38.4 (+8.4) | 30.0 | 40.8 |

Ablation results indicate:

  • Excluding interleaved trajectories (using only single-mode search or image reasoning) degrades MMSearch scores by approximately 20 points.
  • Omission of planning-execution grounded training reduces FVQA performance by 15 points.

These results underscore the importance of joint, tool-grounded, and interleaved training data for achieving robust multimodal reasoning.

6. Emergent Long-Horizon Reasoning

Despite being trained on trajectories of only 2–6 steps, the model demonstrates emergent capacity for >10-step planning and execution during inference. In LiveVQA, the model sequentially orchestrates image crops, central zoom, contrast enhancement, multiple rounds of image and text search, sub-region operations, full web page access, and reflection, culminating in answer generation. This multi-step tool invocation behavior is not directly seen in any single training trajectory, suggesting compositional generalization emerges from the agentic supervised framework.

7. Limitations and Future Directions

Skywork-R1V4 demonstrates that supervised, tool-grounded trajectory learning can rival or surpass RL-trained agentic models in both perception and multimodal search, offering lower computational requirements, more stable training, and improved reproducibility. However, several constraints persist:

  • The reliance on curated trajectories restricts tool diversity (e.g., semantic segmentation, depth reasoning).
  • The model exhibits no online adaptation; error correction is reliant on learned patterns, not environment feedback.
  • Real-time deployment is hindered by tool invocation latency and external API availability.

Future research avenues proposed include:

  • Extension of the toolset to cover more complex actions (e.g., DOM navigation, semantic segmentation, vector-DB interfaces).
  • Hybrid training frameworks that combine supervised fine-tuning with on-policy RL for refinement of rare or long-horizon capabilities.
  • Integration of episodic memory or belief state modules to enhance coherence and accuracy over extended action sequences exceeding 20 steps.

Skywork-R1V4’s agentic framework, integrating transformer-based planning, cross-modal fusion, and sequenced tool use, provides evidence for the efficacy of carefully filtered supervised data in constructing sophisticated agentic multimodal models without reinforcement learning (Zhang et al., 2 Dec 2025).
