Skywork R1V: 38B-Parameter Multimodal VLM

Updated 9 November 2025
  • Skywork R1V is a 38-billion parameter vision-language model that uses a fixed vision encoder and an LLM, bridged by an efficient two-layer MLP adapter.
  • It employs a hybrid training strategy combining iterative supervised fine-tuning and Group Relative Policy Optimization to ensure robust multimodal alignment.
  • Its adaptive Chain-of-Thought distillation pipeline dynamically controls reasoning depth, enabling state-of-the-art performance on advanced reasoning benchmarks.

Skywork R1V is a 38-billion-parameter vision-language model (VLM) designed for advanced multimodal reasoning, extending the DeepSeek-R1-distill LLM backbone to visual modalities through an efficient multimodal transfer methodology. The model integrates a lightweight visual projector (a multi-layer perceptron, MLP) to inject visual semantics into the LLM embedding space, supporting seamless adaptation without retraining the foundational LLM or the vision encoder. Skywork R1V introduces a hybrid optimization regime combining iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), as well as an adaptive-length Chain-of-Thought (CoT) distillation pipeline for dynamic control over reasoning depth. Experimental results demonstrate strong performance on both textual and multimodal reasoning benchmarks, achieving results previously out of reach for open models at this parameter count. The complete model weights, codebase, and training procedures have been publicly released.

1. Model Architecture and Multimodal Adapter Integration

Skywork R1V builds on modular fusion of pretrained vision and language backbones using an MLP-based cross-modal adapter. The design involves:

  • Textual Backbone: $f_{\ell}$, a DeepSeek-R1-distill-Qwen2.5-32B transformer (32B parameters, context window: 16,384 tokens) optimized for chain-of-thought (CoT) reasoning.
  • Vision Encoder: $f_v$, a standard Vision Transformer (ViT) with fixed weights, producing patch-wise visual embeddings $h \in \mathbb{R}^{d_v}$ (e.g., $d_v = 1024$).
  • Visual Projector: A lightweight, two-layer MLP $\theta$ realizes

$$u = \mathrm{GELU}(W_1 h + b_1) \in \mathbb{R}^{d_m}, \qquad p = W_2 u + b_2 \in \mathbb{R}^{d_{\ell}},$$

mapping vision encoder outputs into the LLM embedding space (e.g., $d_m = 2048$, $d_{\ell} = 4096$).

  • Multimodal Integration: The network $M' = f_{\ell}^{s}(\theta(f_v(x)))$, with $f_{\ell}^{s}$ structurally consistent with $f_{\ell}$, supports swapping the adapter in and out as needed.

This modular design allows the model to be reassembled with minimal impact on the LLM's extensive text-reasoning capabilities, and avoids retraining either the vision encoder or the LLM during adaptation.
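
As a concrete illustration, the following is a minimal PyTorch sketch of the two-layer projector, assuming the dimensions quoted above ($d_v = 1024$, $d_m = 2048$, $d_{\ell} = 4096$); class and attribute names are illustrative rather than taken from the released codebase:

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Two-layer MLP that maps frozen ViT patch embeddings into the LLM embedding space."""

    def __init__(self, d_v: int = 1024, d_m: int = 2048, d_l: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(d_v, d_m)  # W_1, b_1
        self.act = nn.GELU()
        self.fc2 = nn.Linear(d_m, d_l)  # W_2, b_2

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, num_patches, d_v) patch-wise visual embeddings from f_v
        u = self.act(self.fc1(h))       # u = GELU(W_1 h + b_1)
        p = self.fc2(u)                 # p = W_2 u + b_2, consumed by the LLM as soft tokens
        return p

# Usage: project frozen ViT features, then prepend them to the text token embeddings.
projector = VisualProjector()
visual_tokens = projector(torch.randn(1, 256, 1024))  # -> (1, 256, 4096)
```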

2. Multimodal Transfer and Alignment Procedures

Skywork R1V’s adapter and multimodal alignment are accomplished in three primary stages:

  1. MLP Adapter Initialization: Both $f_v$ and $f_{\ell}^{s}$ are frozen. The MLP adapter $\theta$ is initially trained on 2M multimodal samples (supervised cross-entropy, learning rate $2 \times 10^{-4}$), then fine-tuned on 200K GPT-4-vetted samples ($4 \times 10^{-5}$), and again on 40K CoT examples with the same reduced learning rate.
  2. Model Re-Assembly: The pretrained MLP ($\theta_0$) is inserted between $f_v$ and $f_{\ell}$, producing $M = f_{\ell} \circ \theta_0 \circ f_v$. This exchange ensures preservation (≥98%) of the LLM's inherent text-reasoning ability.
  3. Modality Alignment: With the LLM and vision encoder frozen, only the adapter is further fine-tuned on chain-of-thought reasoning data. The alignment procedure leverages the hybrid optimization framework described below to consolidate visual-textual reasoning without catastrophic forgetting.

These steps collectively enable efficient large-scale multimodal transfer with parameter economy and minimal risk of degrading pretrained capabilities.
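
The staged freezing pattern can be made concrete with a short PyTorch sketch; the `projector` attribute is a hypothetical stand-in for the adapter $\theta$, and only the learning rates and weight decay are taken from the text above:

```python
import torch

def freeze_for_adapter_training(model: torch.nn.Module) -> None:
    """Freeze the vision encoder and the LLM; leave only the MLP adapter trainable.

    Mirrors the recipe above: f_v and the language backbone stay fixed, and only
    theta (here exposed as `model.projector`, a hypothetical attribute) gets gradients.
    """
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True

def make_optimizer(model: torch.nn.Module, lr: float) -> torch.optim.Optimizer:
    """AdamW over whatever is currently trainable, e.g. lr=2e-4 for adapter
    initialization and 4e-5 for the later fine-tuning passes (values from the text)."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=0.05)
```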

3. Training Regime: Hybrid Supervised and Reinforcement Optimization

Skywork R1V’s hybrid optimization regime spans both supervised and policy-gradient-based methods:

  • Iterative Supervised Fine-Tuning (SFT): Model iterates ($M_0, \dots, M_T$, $T = 4$) are successively optimized on growing, reward-thresholded datasets. For each iteration $t$, data is selected by a reward-model threshold $\tau_t$ and by disagreement with the previous iterate's outputs. The SFT loss is:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = - \mathbb{E}_{(x,y) \in \mathcal{D}_t} \sum_{i=1}^{|y|} \log P_\theta(y_i \mid x, y_{<i})$$

Hyperparameters: context length 16,384, batch size 512, warmup ratio 0.03, weight decay 0.05, learning rates $1 \times 10^{-4}$ (first iteration), $2 \times 10^{-5}$ (subsequent iterations).

  • Group Relative Policy Optimization (GRPO): Final training involves RL with grouped candidates:
    • State: image embedding + prompt
    • Actions: next token
    • Reward: per-sample sum of accuracy (1 if correct, else 0) and format compliance (1 if the answer is boxed, else 0); total $r \in \{0, 1, 2\}$.
    • Group Baseline: $b_g = \mathbb{E}[r \mid g]$ per RM-score group, forming advantages $A^{g}(s,a) = r(s,a) - b_g$.
    • Parameter Update:

    $$\theta \leftarrow \theta + \eta \sum_{t=1}^{T} A_+^{g}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t),$$

    with $A_+^{g}(s,a) = \max(0, A^{g}(s,a))$.
    • Hyperparameters: learning rate $1 \times 10^{-6}$, temperature 1.0, batch of 8 samples per prompt, max sequence length 8k tokens.
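
A minimal PyTorch sketch of this clipped-advantage update, assuming per-token log-probabilities for a group of sampled candidates are already available; function and argument names are illustrative:

```python
import torch

def reward(is_correct: bool, is_boxed: bool) -> float:
    """Accuracy (0/1) plus boxed-format compliance (0/1); total r in {0, 1, 2}."""
    return float(is_correct) + float(is_boxed)

def grpo_update(optimizer: torch.optim.Optimizer,
                token_log_probs: torch.Tensor,
                rewards: torch.Tensor) -> float:
    """One policy-gradient step with a group-relative baseline and clipped advantages.

    token_log_probs: (G, T) log pi_theta(a_t | s_t) for G sampled candidates of the
                     same prompt (the group), still attached to the autograd graph.
    rewards:         (G,) scalar reward per candidate, e.g. from reward() above.
    """
    baseline = rewards.mean()                          # b_g = E[r | g]
    advantages = (rewards - baseline).clamp(min=0.0)   # A^g_+ = max(0, A^g)

    # Maximize sum_t A^g_+ * log pi(a_t | s_t) by minimizing its negation.
    loss = -(advantages.unsqueeze(1) * token_log_probs).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # gradient step with the configured learning rate (eta)
    return loss.item()
```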

This hybrid optimization notably enables robust cross-modal grounding and efficient policy improvement without overfitting to reward idiosyncrasies.

4. Chain-of-Thought Distillation and Reasoning Depth Control

Skywork R1V's multimodal reasoning strength derives not only from its architecture but also from an adaptive-length Chain-of-Thought (CoT) distillation corpus:

  • Quality and Difficulty Assessment (QDAM): GPT-4o assigns each image-text pair a vision score ($S_v$), a text score ($S_t$), and an integration score ($S_I$), all normalized to $[0,1]$.

  • Dynamic Reasoning Length Controller (DRLC): The repetition penalty $P$ is set as

$$P = \min\left[2,\ \exp\left(\alpha \left(1 - \frac{\hat{S}_v + \beta \hat{S}_t + \gamma \hat{S}_I}{1 + \beta + \gamma}\right)\right)\right]$$

Complex multimodal questions thus elicit longer, more elaborate CoT traces; straightforward items yield shorter solutions.

  • CoT Self-Distillation: Samples failing GPT-4o verification are refined and regenerated in an iterative pipeline, filtering for correctness, clarity, and reasoning depth.

This yields a corpus with controlled reasoning complexity, which is central to the model's inference efficiency and mitigates overthinking on simple inputs.
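
A minimal Python sketch of the DRLC penalty, assuming the normalized scores $\hat{S}_v$, $\hat{S}_t$, $\hat{S}_I$ come from the QDAM stage; the weights $\alpha$, $\beta$, $\gamma$ are unspecified in the source, so the defaults below are placeholders:

```python
import math

def repetition_penalty(s_v: float, s_t: float, s_i: float,
                       alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """DRLC penalty P, capped at 2: easy samples get a stronger repetition penalty
    (shorter CoT), hard multimodal samples get P near 1 (longer CoT allowed)."""
    weighted = (s_v + beta * s_t + gamma * s_i) / (1.0 + beta + gamma)
    return min(2.0, math.exp(alpha * (1.0 - weighted)))

# Example (with placeholder weights alpha = beta = gamma = 1):
print(repetition_penalty(0.2, 0.3, 0.2))  # 2.0  (capped; short reasoning encouraged)
print(repetition_penalty(0.9, 0.9, 0.9))  # ~1.11 (longer reasoning allowed)
```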

5. Benchmark Performance

Skywork R1V achieves strong empirical results:

| Benchmark | QwQ-32B | InternVL2.5-38B | VILA-40B | Skywork R1V (38B) |
|---|---|---|---|---|
| MATH-500 | 90.6 | – | – | 94.0 |
| AIME 2024 | 50.0 | – | – | 72.0 |
| GPQA | 54.5 | – | – | 61.6 |
| MathVista (mini) | 70.5 | 71.9 | 49.5 | 67.5 |
| MMMU (Val) | 64.5 | 63.9 | 55.1 | 69.0 |

(– : not reported.)

Skywork R1V matches or surpasses prominent closed models, e.g., equaling GPT-4o (69.1) and exceeding Claude-3.5-Sonnet (66.4) on MMMU. On MathVista, R1V (67.5) outpaces Claude-3.5-Sonnet (65.3). These observations confirm that substantial multimodal reasoning can be attained at the 38B parameter scale when paired with the correct architectural and optimization choices.

6. Implementation Characteristics

Key properties and resource requirements of Skywork R1V include:

  • Training Regimen:

    • SFT: learning rates $2 \times 10^{-4} \rightarrow 4 \times 10^{-5}$, batch size 512, context length 16,384, 1 epoch, warmup ratio 0.03, weight decay 0.05.
    • RL/GRPO: learning rate $1 \times 10^{-6}$, temperature 1.0, batch of 8 samples per prompt, max sequence length 8k tokens.
  • Inference Scalability:
    • Sequences up to 64k tokens are supported.
    • Inference on full context requires ≈150 GB GPU memory.
  • Open Availability: The complete model weights, codebase, and training procedures are publicly released.

These aspects facilitate reproducibility and deployment both in academic and production settings.
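
For quick reference, a minimal sketch that consolidates the settings listed above into plain Python dictionaries; the key names are illustrative and do not correspond to the released training scripts:

```python
# Illustrative consolidation of the reported settings; key names are not from the released code.
SFT_CONFIG = {
    "learning_rate_start": 2e-4,   # adapter initialization
    "learning_rate_final": 4e-5,   # later fine-tuning passes
    "batch_size": 512,
    "context_length": 16_384,
    "epochs": 1,
    "warmup_ratio": 0.03,
    "weight_decay": 0.05,
}

GRPO_CONFIG = {
    "learning_rate": 1e-6,
    "temperature": 1.0,
    "samples_per_prompt": 8,
    "max_sequence_length": 8_192,
}

INFERENCE = {
    "max_context_tokens": 65_536,  # sequences up to 64k tokens supported
    "approx_gpu_memory_gb": 150,   # full-context inference footprint
}
```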

7. Context, Limitations, and Evolution

Skywork R1V establishes a design and training paradigm—MLP adapter-based multimodal transfer, hybrid SFT/RL optimization, and adaptive CoT distillation—that is retained and elaborated upon in later successors such as Skywork R1V2 and R1V3. While Skywork R1V achieves strong cross-modal reasoning and OOD performance with efficient parameter usage, subsequent versions address phenomena such as vanishing advantages (via Selective Sample Buffer), integrate hybrid reward-model and rule-based feedback, and more assertively calibrate visual rewards to suppress visual hallucination-induced failures. The introduction of connector-module tuning and entropy-based checkpointing in R1V3 further demonstrates the pivotal role of cross-modal alignment layers in open VLMs.

A plausible implication is that the R1V adapter/RL pipeline, with its modular transfer approach and tightly controlled reasoning-trace generation, will remain a reference point for mid-scale open-source multimodal reasoning systems.
