Pelican-VL 1.0: Embodied Vision-Language Models

Updated 3 July 2026
  • Pelican-VL 1.0 is a family of open-source embodied vision-language models that combine perception, reasoning, planning, action, and interaction in physical environments.
  • It employs a DPPO training loop that alternates reinforcement learning with supervised fine-tuning to diagnose weaknesses and enhance embodied capabilities.
  • The model achieves significant performance gains on embodied benchmarks, with up to a 20.3% improvement over base models and parity with leading systems.

Pelican-VL 1.0 is a family of open-source vision-language foundation models for embodied intelligence, spanning 7B to 72B parameters and presented as a general-purpose “brain” for systems that must couple perception, reasoning, planning, action, and interaction in the physical world. Its stated mission is “to embed powerful intelligence into various embodiments,” and its training is organized around Deliberate Practice Policy Optimization (DPPO), a metacognitive metaloop that alternates reinforcement learning with supervised fine-tuning to diagnose weaknesses and remediate them under sparse, finite embodied data (Zhang et al., 30 Oct 2025, Zhang et al., 20 Nov 2025).

1. Definition and intended scope

Pelican-VL 1.0 is not presented as a narrow manipulation policy or a single robotic controller. It is framed as an embodied multimodal model family intended to serve as a general cognitive substrate for embodied systems, including settings that require perception, scene understanding, cross-modal reasoning, task planning, physical interaction, and human instruction following (Zhang et al., 30 Oct 2025). In the method description, the family is built on top of a vision-language backbone; the report states that it trains Qwen2.5-VL and experimentally highlights a 72B Pelican-VL 1.0 model, while the release covers Pelican-VL 1.0 models from 7B to 72B together with the complete DPPO pipeline (Zhang et al., 20 Nov 2025).

The scope of “embodied intelligence” in this context is broader than static image-question answering. The target capabilities include physical-world reasoning and effective action under multimodal inputs. The family is therefore positioned at the intersection of VLM post-training, embodied reasoning, and policy improvement, rather than as a conventional offline imitation model.

A recurrent theme in the description is that embodied competence should not be reduced to scaling internet-style multimodal understanding. Pelican-VL 1.0 is proposed as a response to gaps in spatial reasoning, temporal-causal reasoning, affordance understanding, and decision/planning that remain visible in general VLMs.

2. Data regime, curation logic, and capability coverage

The training data pipeline begins from a raw multimodal pool of 4+ billion tokens comprising 231M images, 29k hours of video, 231M open-ended QA pairs, 9M grounding annotations, and 2M multiple-choice questions. From this pool, the reported sampling selects 1.3M instances for SFT and 0.5M instances for RL (Zhang et al., 30 Oct 2025). The data are organized into four embodied capability areas: physical, spatial, and numerical reasoning; perception, grounding, and multi-object consistency; temporal, functional, and scene understanding; and decision making and task planning.

A complementary benchmark-facing capability taxonomy defines nine embodied dimensions: Physical Causal Reasoning, Perception Object Grounding, Quantitative Numerical Reasoning, Spatial Geometric Reasoning, Temporal Sequential Reasoning, Affordance Function Reasoning, Multi-Object Scene Consistency, Scene Action Understanding, and Decision Task Planning. Using this taxonomy, 27,667 samples from public embodied datasets are re-annotated to align evaluation and training analysis with embodied capability structure (Zhang et al., 20 Nov 2025).

The curation pipeline is adaptive rather than static. The reported metaloop data-selection process includes rollout logging, difficulty-aware sampling, rule-based filtering, format unification, model-based scoring using Qwen3VL-Plus and InternVL3.5-38B, voting-based selection, and random human review (Zhang et al., 30 Oct 2025). This is intended to address the claim that embodied data are scarce, expensive, hard to collect, and often imbalanced toward easier or more common skills.

A concrete example is the processing of SpatialVID from raw video rather than retained expert annotations. The procedure uses Qwen3VL-Plus to generate 24 spatial QA pairs for each of 75k videos, filters them with InternVL3.5-38B by dual answering, retains or replaces answers according to agreement patterns, obtains 14k QAs, then augments them with a 19k QA subset from InternSpatial, yielding a final 33k-QA curated set across tasks such as object count, object size, relative distance, absolute distance, appearance order, room size, relative direction, and route plan (Zhang et al., 30 Oct 2025). This suggests that Pelican-VL 1.0 treats data quality control as a core component of embodied training rather than an auxiliary preprocessing step.
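The dual-answer filtering step can be sketched as follows. This is a minimal illustration, not the report's procedure: the text only says answers are retained or replaced "according to agreement patterns", so the specific rule below (retain on three-way agreement, replace when the verifier agrees with itself but not with the generator, drop otherwise) is an assumption, and `resolve_answer` is a hypothetical helper name.

```python
from typing import Optional


def _norm(answer: str) -> str:
    """Normalize an answer string for comparison (case and whitespace)."""
    return answer.strip().lower()


def resolve_answer(generated: str, verifier_a: str, verifier_b: str) -> Optional[str]:
    """Decide the final answer for one QA pair.

    ASSUMED rule (not stated in the source):
      - verifier answers agree with each other and with the generator: retain
      - verifier answers agree with each other only: replace with verifier answer
      - verifier answers disagree with each other: drop the QA pair (None)
    """
    g, a, b = _norm(generated), _norm(verifier_a), _norm(verifier_b)
    if a == b == g:
        return generated.strip()
    if a == b:
        return verifier_a.strip()
    return None


# Arithmetic implied by the reported figures: 24 QA pairs for each of 75k videos.
candidate_qas = 24 * 75_000          # 1,800,000 generated candidates
final_set_size = 14_000 + 19_000     # 14k retained + 19k from InternSpatial
```

Under the reported counts, the 14k retained QAs are a small fraction of the 1.8M generated candidates, which is consistent with the filtering being strict.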

3. DPPO and the metaloop training mechanism

DPPO, or Deliberate Practice Policy Optimization, is the central training framework. It is described as a metacognitive “Metaloop” inspired by human deliberate practice and operationalized as an RL-Refine-Diagnose-SFT cycle, or more generally as alternating reinforcement learning for weakness revelation and supervised fine-tuning for weakness refinement (Zhang et al., 20 Nov 2025). The stated motivation is that RL alone can over-explore, collapse, or become unstable, while SFT alone cannot discover new weaknesses.

The training loop proceeds through rollout generation, difficulty identification, and targeted remediation. The practical mechanics are reported as: start with initial model parameters; use RL rollouts to diagnose weaknesses; select difficult or failed cases; rebalance the RL dataset; train RL until stagnation; build the SFT dataset from weak cases, related embodied samples, and general replay data; fine-tune with SFT; reset the buffer; and repeat (Zhang et al., 20 Nov 2025). Within the RL phase, the reported steps include generating rollouts with the current policy, scoring each sample by SuccessRate, building a difficulty-aware RL dataset, training with GRPO-style RL, tracking task stagnation, and stopping the RL phase when progress plateaus.
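The alternation described above can be sketched as a toy simulation. This is only an illustration of the control flow (RL until stagnation, diagnose weak tasks, targeted SFT, repeat): the model is replaced by a dictionary of per-task success rates, and the update rules, `rl_gain`, `sft_gain`, `weak_threshold`, and `min_progress` are all hypothetical stand-ins rather than values from the report.

```python
from typing import Dict, List


def metaloop(
    skill: Dict[str, float],
    n_loops: int = 3,
    weak_threshold: float = 0.5,   # assumed cut-off for a "weak" task
    rl_gain: float = 0.05,         # assumed toy RL step size
    sft_gain: float = 0.2,         # assumed toy SFT step size
    max_rl_steps: int = 10,
    min_progress: float = 0.01,    # assumed stagnation tolerance
) -> List[dict]:
    """Toy RL -> diagnose -> SFT alternation over per-task success rates."""
    skill = dict(skill)  # do not mutate the caller's dictionary
    history = []
    for loop in range(n_loops):
        # RL phase: keep training until mean success rate stops improving.
        rl_steps = 0
        for _ in range(max_rl_steps):
            before = sum(skill.values()) / len(skill)
            for task in skill:
                # Toy update: gain scales with current skill, so RL mostly
                # helps tasks that are already partially solved.
                skill[task] = min(1.0, skill[task] + rl_gain * skill[task])
            after = sum(skill.values()) / len(skill)
            rl_steps += 1
            if after - before < min_progress:
                break  # progress has plateaued; hand over to SFT

        # Diagnose: tasks whose success rate is still low after RL.
        weak = sorted(t for t, s in skill.items() if s < weak_threshold)

        # SFT phase: targeted remediation on the weak tasks only.
        for task in weak:
            skill[task] = min(1.0, skill[task] + sft_gain)

        history.append(
            {"loop": loop, "rl_steps": rl_steps, "weak": weak, "skill": dict(skill)}
        )
    return history


if __name__ == "__main__":
    for record in metaloop({"spatial": 0.8, "planning": 0.2}):
        print(record)
```

In this toy setting the RL phase plateaus while `planning` is still weak, and the SFT phase then lifts it enough for the next RL phase to make progress, which mirrors the motivation given for alternating the two.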

The generic multimodal formulation is

$$\hat{y} = f_\theta(x_v, x_t),$$

with training objective

$$\mathcal{L}(\theta) = \mathbb{E}_{(x_v, x_t, y)} \Big[ \ell(f_\theta(x_v, x_t), y) \Big].$$

The metaloop alternates phase-wise between RL and SFT, and the embodied task outputs may include reasoning traces, action plans, or tool calls (Zhang et al., 30 Oct 2025).

Difficulty awareness is made explicit. One reported definition is

$$D(\tau) = 1 - \text{SuccessRate}(\tau),$$

so higher difficulty corresponds to lower task success. The RL stopping rule is tied to a saturation measure,

$$\mathrm{TS}(t) = \frac{1}{|\mathcal{T}|}\sum_{i \in \mathcal{T}} \mathrm{TS}_i(t),$$

with RL stopping when

$$\mathrm{TS}(t) \ge 0.7.$$

A related formulation uses task stagnation and likewise moves the system from RL to SFT once progress has plateaued (Zhang et al., 30 Oct 2025, Zhang et al., 20 Nov 2025).
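The difficulty score and the stopping rule can be sketched as below. The source defines $D(\tau)$ and the aggregate $\mathrm{TS}(t)$ with its 0.7 threshold, but does not spell out the per-task term $\mathrm{TS}_i(t)$ here, so `task_saturation` is a hypothetical definition (a task counts as saturated when its success rate has moved by less than `eps` over the last `window` evaluations); `window` and `eps` are likewise assumptions.

```python
from typing import Dict, List


def difficulty(success_rate: float) -> float:
    """D(tau) = 1 - SuccessRate(tau): lower success means higher difficulty."""
    return 1.0 - success_rate


def task_saturation(history: List[float], window: int = 3, eps: float = 0.01) -> float:
    """ASSUMED per-task TS_i: 1.0 if the success rate is flat over the last
    `window` evaluations, else 0.0. Not defined this way in the source."""
    if len(history) < window:
        return 0.0
    recent = history[-window:]
    return 1.0 if max(recent) - min(recent) < eps else 0.0


def should_stop_rl(histories: Dict[str, List[float]], threshold: float = 0.7) -> bool:
    """Average TS_i over all tasks and stop RL once TS(t) >= threshold."""
    ts = sum(task_saturation(h) for h in histories.values()) / len(histories)
    return ts >= threshold
```

With four tasks, for example, three flat success-rate curves give an aggregate of 0.75 and trigger the switch to SFT, while two flat curves give 0.5 and do not.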

4. Unified preference-learning interpretation

A major theoretical claim is that SFT and RL can be unified under a preference-learning objective:

$$\theta^* = \arg \max_{\theta} \mathbb{E}_{c\sim D_{pref}}[\log P(c|\pi_\theta)].$$

Here, $D_{pref}$ is a preference dataset and $c$ is a preference sample. The distinction between SFT and GRPO-style RL is then placed in the form of the preference sample: for SFT, it is a positive expert trajectory; for RL, it is a ranked set of trajectories (Zhang et al., 20 Nov 2025).

The supervised special case is written as

$$\log P(\tau^*|\pi_\theta) = \sum_{(c,a^*) \in \tau^*} \log \pi_\theta(a^*|c),$$

which recovers standard maximum-likelihood training over expert demonstrations. The report also gives the SFT-stage objective as

$$\mathcal{L}_{SFT}(\theta) = -\mathbb{E}_{(x,y)\sim D_{SFT}}\left[\sum_{(x,y)\in \tau^*} \log \pi_\theta(y|x)\right],$$

The curated supervision set $D_{SFT}$ is decomposed into weak cases, associated embodied samples, and broader generated data. This decomposition formalizes the deliberate-practice idea that these sources should be combined rather than learned in isolation (Zhang et al., 30 Oct 2025).
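The supervised special case, a sum of log-probabilities over the (context, action) pairs of an expert trajectory, can be made concrete with a tiny tabular policy. This is a sketch only: the dictionary-based policy, the context strings, and the function name `sft_nll` are illustrative assumptions, not part of the report.

```python
import math
from typing import Dict, List, Tuple

# A tabular policy: for each context, a probability for each action.
Policy = Dict[str, Dict[str, float]]


def sft_nll(trajectory: List[Tuple[str, str]], policy: Policy) -> float:
    """Negative log-likelihood of an expert trajectory under the policy:
    -sum over (c, a*) in tau* of log pi(a* | c)."""
    return -sum(math.log(policy[context][action]) for context, action in trajectory)


# Hypothetical two-step expert demonstration.
toy_policy: Policy = {
    "see cup": {"grasp": 0.5, "push": 0.5},
    "holding cup": {"place": 0.25, "drop": 0.75},
}
expert = [("see cup", "grasp"), ("holding cup", "place")]
loss = sft_nll(expert, toy_policy)  # equals -(log 0.5 + log 0.25) = log 8
```

Minimizing this quantity over expert trajectories is ordinary maximum-likelihood training, which is why SFT fits the preference-learning objective as the case where the preference sample is a single positive trajectory.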

The RL side is expressed through a GRPO-based policy-gradient objective in which each rollout is weighted by a normalized reward weight derived from rule-based scoring against a reference policy. The composite rollout reward is reported as a weighted combination of structure or format validity and task-specific embodied correctness, and the task rewards cover affordance reasoning, counting and distance estimation, causal and temporal reasoning, task success evaluation, task planning, and task prediction (Zhang et al., 30 Oct 2025, Zhang et al., 20 Nov 2025).
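A composite reward and a group-normalized weight of the kind described above can be sketched as follows. The mixing weights `w_format` and `w_task` are hypothetical, since the source gives no values, and the normalization shown is the group mean and standard deviation scheme commonly associated with GRPO, assumed here rather than taken from the report.

```python
from statistics import mean, pstdev
from typing import List


def composite_reward(
    format_ok: bool,
    task_score: float,
    w_format: float = 0.2,  # assumed weight on format validity
    w_task: float = 0.8,    # assumed weight on task correctness
) -> float:
    """Weighted combination of format validity and task-specific correctness."""
    return w_format * float(format_ok) + w_task * task_score


def group_normalized_weights(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one rollout group: (r - mean) / (std + eps).

    Rollouts better than the group average get positive weight and worse
    ones get negative weight. Population std is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

One consequence of normalizing within a group is that a prompt on which every rollout earns the same reward contributes zero weight, which is one reason difficulty-aware sampling matters for keeping the RL signal informative.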

For ranked trajectories, the preference model uses a Plackett-Luce-style form with an implicit reward.

This theoretical framing is used to justify alternation rather than separation of RL and SFT: RL exposes weak modes, while SFT consolidates competence on those modes.
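For readers unfamiliar with the Plackett-Luce model, the standard log-likelihood of a ranking can be computed as below. This is the textbook form, given as a sketch: the scores stand in for whatever implicit reward the report assigns to each trajectory, and the function name and list-based interface are assumptions.

```python
import math
from typing import List


def plackett_luce_log_likelihood(scores_in_rank_order: List[float]) -> float:
    """Log-probability of a ranking under the standard Plackett-Luce model.

    scores_in_rank_order[0] is the score of the top-ranked item. At each
    position k, the item is chosen from those not yet ranked with
    probability softmax(score_k) over the remaining scores.
    """
    total = 0.0
    for k in range(len(scores_in_rank_order)):
        remaining = scores_in_rank_order[k:]
        m = max(remaining)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(s - m) for s in remaining))
        total += scores_in_rank_order[k] - log_denominator
    return total
```

With two items the expression reduces to a logistic comparison of the two scores, and a ranking that agrees with the score order is always more likely than its reverse.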

5. Training setup and systems-scale implementation

The reported training setup is explicitly large scale: 1000+ A800 GPUs and 50k+ A800 GPU-hours per checkpoint (Zhang et al., 30 Oct 2025). The system uses Context Parallelism for long-context multimodal training, a modified VERL-style training pipeline, and mixed-batch multimodal training that includes text-only, image-text, and video-text data within the same system.

Training is organized into three metaloops, each with an RL phase followed by an SFT phase. The first loop uses video segments shorter than 32 seconds, and the second relaxes the temporal horizon to segments shorter than 64 seconds. Up to 32 frames are sampled from each clip per episode, and RL rollouts can contain up to 16 time steps (Zhang et al., 30 Oct 2025). The temporal context therefore increases progressively as the policy improves.
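The temporal curriculum can be written down as a small configuration sketch. The class and field names are hypothetical, and the clip-length limit for the third metaloop is left as `None` because the text does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetaloopStage:
    """Per-metaloop limits as reported; names are illustrative only."""
    index: int
    max_clip_seconds: Optional[float]  # None means "not stated in the text"
    max_frames_per_clip: int = 32      # up to 32 frames sampled per clip
    max_rollout_steps: int = 16        # up to 16 time steps per RL rollout


STAGES = (
    MetaloopStage(index=1, max_clip_seconds=32.0),
    MetaloopStage(index=2, max_clip_seconds=64.0),
    MetaloopStage(index=3, max_clip_seconds=None),  # limit not given in the source
)


def clip_allowed(stage: MetaloopStage, clip_seconds: float) -> bool:
    """True if a clip is strictly shorter than the stage's duration limit."""
    if stage.max_clip_seconds is None:
        raise ValueError(f"no clip-length limit is known for metaloop {stage.index}")
    return clip_seconds < stage.max_clip_seconds
```

A 40-second clip, for instance, would be excluded in the first metaloop and admitted in the second.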

The computational design supports the broader claim that Pelican-VL 1.0 is not merely a fine-tuned static VLM. Its post-training loop is intended to make better use of limited embodied data by dynamically reallocating compute toward hard or weakly mastered cases. A plausible implication is that the reported systems design is part of the algorithmic story, not only an engineering detail, because the method explicitly couples rollout diagnostics, curated replay, and phase switching.

6. Benchmark suite and reported results

Pelican-VL 1.0 is evaluated on a broad suite of embodied and reasoning benchmarks, including MVBench, EgoSchema, RoboSpatial, BLINK, PhyX, OmniSpatial, Where2Place, EmbSpatialBench, RefSpatialBench, ERQA, COSMOS, and VSI-Bench (Zhang et al., 20 Nov 2025). These benchmarks cover general video understanding, spatial reasoning, physical reasoning, affordance and grounding, temporal reasoning, and task planning.

The headline empirical claims are a 20.3% performance improvement over the base model and a 10.6% advantage over open-source models at the 100B-parameter scale (Zhang et al., 20 Nov 2025). A related report describes Pelican-VL 1.0 as achieving a 20.3% uplift from its base model, outperforming 100B-level open-source counterparts by 10.6%, and reaching parity with leading proprietary systems on well-known embodied benchmarks (Zhang et al., 30 Oct 2025).

The same-budget 7B comparison reports averages of 33.5 for the base model, 40.7 for RL, 39.9 for SFT, and 51.0 for the DPPO model (labeled “Our”), with DPPO achieving +15.8% over RL in a forgetting comparison setting while reducing the performance drop on unseen datasets (Zhang et al., 20 Nov 2025). This is the clearest ablation evidence that alternating RL and SFT is stronger than either component alone under a matched budget.
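The size of the gap in the same-budget comparison is easier to see as point differences. The snippet below only does arithmetic on the four averages quoted above; the dictionary keys are shorthand labels. The +15.8% figure comes from a separate forgetting comparison and is not derived here.

```python
# Reported same-budget 7B averages.
averages = {"Base": 33.5, "RL": 40.7, "SFT": 39.9, "DPPO": 51.0}

# Absolute gain of each variant over the base model, in points.
gain_over_base = {
    name: round(score - averages["Base"], 1)
    for name, score in averages.items()
    if name != "Base"
}

# Gap between DPPO and the stronger single-method baseline, in points.
best_single = max(averages["RL"], averages["SFT"])
dppo_margin = round(averages["DPPO"] - best_single, 1)

print(gain_over_base)  # {'RL': 7.2, 'SFT': 6.4, 'DPPO': 17.5}
print(dppo_margin)     # 10.3
```

On these averages, DPPO gains 17.5 points over the base model, more than the RL and SFT gains combined (7.2 + 6.4 = 13.6).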

Among models with at most 100B parameters, Pelican-VL 72B reports an overall average of 63.8. Reported benchmark scores include MVBench 69.7, RoboSpatial 61.1, PhyX 86.4, Where2Place 64.0, EgoSchema 79.3, RefSpatialBench 49.5, COSMOS 68.5, and VSI-Bench 57.3 (Zhang et al., 20 Nov 2025). The broader training discussion also highlights +25.7% spatial understanding and +15.1% temporal reasoning relative to the backbone model (Zhang et al., 30 Oct 2025).

The model is also evaluated on the Berkeley Function-Calling Leaderboard, where Pelican-VL 1.0 achieves 46.0 overall accuracy (Zhang et al., 30 Oct 2025). Within the reported framing, this is used to argue that embodied post-training does not preclude tool-use competence.

7. Significance, limitations, and naming distinctions

The open-source release is emphasized as part of the contribution: Pelican-VL 1.0 models from 7B to 72B, the complete DPPO code and pipeline, inference code, SFT/LoRA training code, and base model checkpoints are all described as public artifacts (Zhang et al., 30 Oct 2025, Zhang et al., 20 Nov 2025). The release is presented as the first systematic framework that addresses both data and resource bottlenecks while enabling the community to build versatile embodied agents efficiently.

The stated strengths are the combination of massive multimodal scale, high-quality curated embodied data, and an adaptive post-training loop that discovers weaknesses and remediates them. Reported application demonstrations include tactile grasping, zero-shot manipulation, multi-robot collaboration, and long-horizon planning (Zhang et al., 30 Oct 2025). At the same time, the limitations are also explicit: embodied benchmarks are still coarse-grained; many datasets are imbalanced toward spatial reasoning and underrepresent physical causality, affordance reasoning, and long-horizon planning; and progress remains dependent on large compute and heavy engineering (Zhang et al., 30 Oct 2025).

A common source of confusion is nomenclature. Pelican-VL 1.0 should be distinguished from “Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification,” which is a post-hoc hallucination detection and correction framework for LVLM outputs rather than an embodied brain model (Sahu et al., 2024). That earlier Pelican system operates by transforming an LVLM answer into a visual claim, decomposing it into predicate-based sub-claims, verifying them with tool-augmented Program-of-Thought code, and then using an LLM-based synthesis step to judge correctness; it addresses hallucination in visual instruction following, not embodied policy improvement (Sahu et al., 2024).

Taken together, Pelican-VL 1.0 is best understood as both a model family and a training doctrine. The model family supplies an open-source embodied multimodal backbone at 7B–72B scale, while DPPO supplies the deliberate-practice mechanism by which weak embodied capabilities are exposed, curated, and refined under sparse embodied data (Zhang et al., 20 Nov 2025).
