Skywork R1V: 38B-Parameter Multimodal VLM
- Skywork R1V is a 38-billion parameter vision-language model that uses a fixed vision encoder and an LLM, bridged by an efficient two-layer MLP adapter.
- It employs a hybrid training strategy combining iterative supervised fine-tuning and Group Relative Policy Optimization to ensure robust multimodal alignment.
- Its adaptive Chain-of-Thought distillation pipeline dynamically controls reasoning depth, enabling state-of-the-art performance on advanced reasoning benchmarks.
Skywork R1V is a 38-billion-parameter vision-language model (VLM) designed for advanced multimodal reasoning, extending the DeepSeek-R1-distill LLM backbone to visual modalities through an efficient multimodal transfer methodology. The model integrates a lightweight visual projector (multi-layer perceptron, MLP) to inject visual semantics into the LLM embedding space, supporting seamless adaptation without retraining the foundational LLM or the vision encoder. Skywork R1V introduces a hybrid optimization regime combining iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), as well as an adaptive-length Chain-of-Thought (CoT) distillation pipeline for dynamic control over reasoning depth. Experimental results demonstrate Skywork R1V’s strong performance on both textual and multimodal reasoning benchmarks, achieving results previously out of reach for open models with this parameter count. The complete model weights, codebase, and training procedures have been publicly released.
1. Model Architecture and Multimodal Adapter Integration
Skywork R1V builds on modular fusion of pretrained vision and language backbones using an MLP-based cross-modal adapter. The design involves:
- Textual Backbone: a DeepSeek-R1-distill-Qwen2.5-32B transformer (32B parameters, context window 16,384 tokens) optimized for chain-of-thought (CoT) reasoning.
- Vision Encoder: a standard Vision Transformer (ViT) with frozen weights, producing patch-wise visual embeddings.
- Visual Projector: a lightweight two-layer MLP that maps vision-encoder outputs into the LLM embedding space.
- Multimodal Integration: the assembled network chains the frozen vision encoder, the MLP projector, and the LLM; because the projector's outputs are structurally consistent with the LLM's token embeddings, the adapter can be swapped in and out as needed.
This modular design enables reassembly with minimal impact on the extensive text-reasoning capabilities of the LLM, and obviates vision encoder/LLM retraining during adaptation.
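A minimal PyTorch-style sketch of this modular composition is shown below. Module names, hidden sizes, and the exact projector shape are illustrative assumptions rather than the released implementation; only the overall pattern (frozen backbones bridged by a trainable two-layer MLP) reflects the description above.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Two-layer MLP mapping frozen ViT patch embeddings into the LLM embedding space."""
    def __init__(self, vit_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        return self.mlp(patch_embeds)

class SkyworkR1VLike(nn.Module):
    """Frozen vision encoder + frozen reasoning LLM, bridged by a trainable MLP adapter (illustrative)."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module, vit_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        self.projector = VisualProjector(vit_dim, llm_dim)
        # Only the adapter is trainable; both backbones stay frozen.
        for module in (self.vision_encoder, self.llm):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        visual_tokens = self.projector(self.vision_encoder(pixel_values))  # (B, N_patches, llm_dim)
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)            # prepend visual tokens
        return self.llm(inputs_embeds=inputs)  # assumes an HF-style LLM accepting inputs_embeds
```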
2. Multimodal Transfer and Alignment Procedures
Skywork R1V’s adapter and multimodal alignment are accomplished in three primary stages:
- MLP Adapter Initialization: both the vision encoder and the LLM are frozen. The MLP adapter is first trained on 2M multimodal samples with a supervised cross-entropy objective, then fine-tuned on 200K GPT-4-vetted samples at a reduced learning rate, and finally on 40K CoT examples at the same reduced rate.
- Model Re-Assembly: the pretrained MLP adapter is inserted between the frozen vision encoder and the reasoning LLM to form the assembled Skywork R1V. This exchange preserves at least 98% of the LLM’s inherent text-reasoning ability.
- Modality Alignment: With the LLM and vision encoder frozen, only the adapter is further fine-tuned on chain-of-thought reasoning chains. The alignment procedure leverages the hybrid optimization framework described below to consolidate visual-textual reasoning without catastrophic forgetting.
These steps collectively enable efficient large-scale multimodal transfer with parameter economy and minimal risk of degrading pretrained capabilities.
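Under the same illustrative assumptions as the architecture sketch above (the `SkyworkR1VLike` class, pre-loaded `vision_encoder`, `base_llm`, and `reasoning_llm` placeholder modules, and a placeholder learning rate), the staged transfer can be summarized as adapter-only optimization followed by re-assembly:

```python
import torch

def adapter_only_optimizer(model: SkyworkR1VLike, lr: float = 1e-4) -> torch.optim.Optimizer:
    """Collect only trainable (adapter) parameters; frozen backbones contribute none.
    The learning rate is a placeholder, not the published schedule."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=0.05)

# Stage 1: initialize the MLP adapter against frozen backbones.
stage1 = SkyworkR1VLike(vision_encoder, base_llm, vit_dim=1024, llm_dim=5120)
optimizer = adapter_only_optimizer(stage1)

# Stage 2 (re-assembly): transplant the trained adapter into the reasoning-LLM stack,
# leaving the reasoning LLM's text capabilities untouched.
r1v = SkyworkR1VLike(vision_encoder, reasoning_llm, vit_dim=1024, llm_dim=5120)
r1v.projector.load_state_dict(stage1.projector.state_dict())
```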
3. Training Regime: Hybrid Supervised and Reinforcement Optimization
Skywork R1V’s hybrid optimization regime spans both supervised and policy-gradient-based methods:
- Iterative Supervised Fine-Tuning (SFT): successive model iterates are optimized on growing, reward-thresholded datasets. For each iteration $i$, data are selected by a reward-model score threshold and by disagreement with the prior iterate's outputs (a data-selection sketch follows this list). The SFT objective is the standard token-level cross-entropy over the selected reasoning traces:
$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{(x, y) \in \mathcal{D}_i} \sum_{t} \log p_\theta\left(y_t \mid y_{<t}, x\right)$
Hyperparameters: context length 16,384, batch size 512, warmup ratio 0.03, weight decay 0.05; a higher learning rate is used for the first iteration and a reduced rate for later iterations.
- Group Relative Policy Optimization (GRPO): Final training involves RL with grouped candidates:
- State: image embedding + prompt
- Actions: next token
- Reward: per-sample sum of accuracy (1 if the answer is correct, else 0) and format compliance (1 if the answer is boxed, else 0), giving a total reward in {0, 1, 2}.
- Group Baseline: the baseline is the mean reward within each group of sampled candidates, giving group-relative advantages $A_i = \dfrac{r_i - \mathrm{mean}(\{r_j\})}{\mathrm{std}(\{r_j\})}$.
- Parameter Update: the policy is updated with the clipped GRPO surrogate objective plus a KL penalty toward the reference policy,
$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\left(\rho_i A_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i\right)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$
with importance ratio $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q)$ over groups of $G$ sampled candidates.
- Hyperparameters: sampling temperature 1.0, 8 candidates per prompt, maximum sequence length 8K tokens.
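As a rough illustration of the iterative SFT data-selection rule above, the sketch below keeps only samples that clear a reward-model score threshold and on which the previous iterate still errs; the function names, data schema, threshold value, and the reading of "disagreement" as the prior iterate failing on the reference answer are all assumptions.

```python
from typing import Callable, Dict, List

def select_iteration_data(samples: List[Dict],
                          rm_score: Callable[[Dict], float],
                          prev_model_answer: Callable[[Dict], str],
                          threshold: float = 0.5) -> List[Dict]:
    """Illustrative selection for one SFT iteration: high reward-model score plus
    disagreement of the previous model iterate with the reference answer."""
    selected = []
    for sample in samples:
        high_quality = rm_score(sample) >= threshold                   # reward-model threshold
        still_wrong = prev_model_answer(sample) != sample["answer"]    # prior-iterate disagreement
        if high_quality and still_wrong:
            selected.append(sample)
    return selected
```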
This hybrid optimization notably enables robust cross-modal grounding and efficient policy improvement without overfitting to reward idiosyncrasies.
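The rule-based reward and group-relative advantage described above can be sketched as follows; the boxed-answer parsing, helper names, and the small normalization constant are assumptions, not the released implementation.

```python
import re
from typing import List, Optional

import torch

def extract_boxed(answer: str) -> Optional[str]:
    """Return the content of the first \\boxed{...} span, if any (simplified parser)."""
    match = re.search(r"\\boxed\{([^}]*)\}", answer)
    return match.group(1).strip() if match else None

def grpo_rewards(candidates: List[str], ground_truth: str) -> torch.Tensor:
    """Per-candidate reward: accuracy (1 if the boxed answer matches, else 0)
    plus format compliance (1 if a boxed answer is present, else 0)."""
    rewards = []
    for answer in candidates:
        boxed = extract_boxed(answer)
        fmt = 1.0 if boxed is not None else 0.0
        acc = 1.0 if boxed == ground_truth else 0.0
        rewards.append(acc + fmt)
    return torch.tensor(rewards)

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: normalize each reward against its group's mean and std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 candidates sampled for one prompt (batch of 8 per prompt).
candidates = ["... \\boxed{42}", "... \\boxed{41}", "no boxed answer"] + ["... \\boxed{42}"] * 5
advantages = grpo_advantages(grpo_rewards(candidates, "42"))
```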
4. Chain-of-Thought Distillation and Reasoning Depth Control
Skywork R1V's strength in multimodal reasoning derives not only from its architecture but also from an adaptive-length Chain-of-Thought (CoT) distillation corpus:
- Quality and Difficulty Assessment (QDAM): GPT-4o assigns each image-text pair a vision score, a text score, and an integration score, all normalized to a common scale.
- Dynamic Reasoning Length Controller (DRLC): the repetition penalty is adjusted per sample, weakened for high-difficulty items to permit longer reasoning and strengthened for simple items to curb redundancy (a minimal sketch follows at the end of this section).
Complex multimodal questions thus elicit longer, more elaborate CoT traces; straightforward items yield shorter solutions.
- CoT Self-Distillation: Samples failing GPT-4o verification are refined and regenerated in an iterative pipeline, filtering for correctness, clarity, and reasoning depth.
This yields a dynamic corpus exhibiting controlled reasoning complexity, central to the model’s high inference efficiency and mitigated overthinking.
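A minimal sketch of the DRLC idea referenced above is given below, assuming the difficulty score is normalized to [0, 1]; the linear interpolation and the endpoint penalty values are illustrative assumptions, since the published mapping is not reproduced here.

```python
def dynamic_repetition_penalty(integrated_score: float,
                               easy_penalty: float = 1.2,
                               hard_penalty: float = 1.0) -> float:
    """Map a difficulty score to a generation-time repetition penalty: harder samples
    (higher score) get a weaker penalty, permitting longer CoT traces; easy samples
    get a stronger penalty to curb redundant reasoning. Endpoints are placeholders."""
    score = min(max(integrated_score, 0.0), 1.0)  # assume the score is normalized to [0, 1]
    return easy_penalty + score * (hard_penalty - easy_penalty)

# Example: pass the resulting value as the decoder's repetition_penalty setting.
penalty_for_hard_item = dynamic_repetition_penalty(0.9)   # ≈ 1.02, near-neutral penalty
penalty_for_easy_item = dynamic_repetition_penalty(0.1)   # ≈ 1.18, stronger penalty
```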
5. Benchmark Performance
Skywork R1V achieves strong empirical results:
| Benchmark | QwQ-32B | InternVL2.5-38B | VILA-40B | Skywork R1V (38B) |
|---|---|---|---|---|
| MATH-500 | 90.6 | – | – | 94.0 |
| AIME 2024 | 50.0 | – | – | 72.0 |
| GPQA | 54.5 | – | – | 61.6 |
| MathVista (mini) | 70.5 | 71.9 | 49.5 | 67.5 |
| MMMU (Val) | 64.5 | 63.9 | 55.1 | 69.0 |
Skywork R1V matches or surpasses prominent proprietary models, effectively matching GPT-4o (69.1) and exceeding Claude-3.5-Sonnet (66.4) on MMMU. On MathVista, R1V (67.5) outpaces Claude-3.5-Sonnet (65.3). These results indicate that substantial multimodal reasoning can be attained at the 38B-parameter scale when paired with appropriate architectural and optimization choices.
6. Implementation Characteristics
Key properties and resource requirements of Skywork R1V include:
- Training Regimen:
- SFT: batch size 512, context length 16,384, 1 epoch, warmup ratio 0.03, weight decay 0.05, with the learning rate reduced after the first iteration.
- RL/GRPO: sampling temperature 1.0, 8 candidates per prompt, maximum sequence length 8K tokens.
- Inference Scalability:
- Sequences up to 64k tokens are supported.
- Inference on full context requires ≈150 GB GPU memory.
- Open Availability:
- All model weights (MLP adapter + full checkpoints) are openly released under Apache 2.0.
- Accompanying training and inference code is available under MIT license at https://huggingface.co/Skywork/Skywork-R1V-38B.
These aspects facilitate reproducibility and deployment both in academic and production settings.
7. Context, Limitations, and Evolution
Skywork R1V establishes a design and training paradigm—MLP adapter-based multimodal transfer, hybrid SFT/RL optimization, and adaptive CoT distillation—that is retained and elaborated upon in later successors such as Skywork R1V2 and R1V3. While Skywork R1V achieves strong cross-modal reasoning and out-of-distribution (OOD) performance with efficient parameter usage, subsequent versions address phenomena such as vanishing advantages (via Selective Sample Buffer), integrate hybrid reward-model and rule-based feedback, and more assertively calibrate visual rewards to suppress visual hallucination-induced failures. The introduction of connector-module tuning and entropy-based checkpointing in R1V3 further demonstrates the pivotal role of cross-modal alignment layers in open VLMs.
A plausible implication is that the R1V adapter/RL pipeline framework, with its modular transfer approach and highly controlled reasoning trace generation, is likely to remain a reference standard for mid-scale open-source multimodal reasoning systems.