
JARVIS-VLA: Vision-Language-Action Agent

Updated 2 January 2026
  • JARVIS-VLA is a vision-language-action paradigm that integrates visual perception, linguistic understanding, and control to follow human instructions in open-world settings.
  • It employs a three-stage pipeline—language post-training, vision-language alignment, and imitation learning—to enhance decision-making and spatial reasoning.
  • Evaluations in Minecraft demonstrate a ~40% improvement over baselines, setting new benchmarks across over 1,000 atomic tasks.

JARVIS-VLA is a Vision-Language-Action agent paradigm that advances decision-making in open-world environments by post-training large-scale vision-language models (VLMs) with visual and linguistic supervision prior to imitation learning. The result is agents that integrate perception, world knowledge, and action at a level beyond that achieved by pure trajectory-based policy learning. JARVIS-VLA is demonstrated within the Minecraft environment, where it sets state-of-the-art benchmarks for following natural human instructions across over 1,000 distinct atomic tasks, including crafting, smelting, cooking, mining, and entity interaction (Li et al., 20 Mar 2025).

1. Training Pipeline: Visual Language Post-Training

JARVIS-VLA implements a three-stage pipeline in which sequential off-trajectory post-training precedes action imitation:

  1. Stage I – Language Post-Training: The language transformer is unfrozen and trained on domain-specific text (e.g., Minecraft wiki, Reddit posts) using supervised next-token prediction:

\mathcal{L}_\mathrm{SFT} = -\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i}, x_v, x_\text{ins})

where x_v encodes the visual context, x_\text{ins} the textual instruction, and x_i the i-th token of the response or action sequence. The ViT and image adapter remain frozen at this stage.

  2. Stage II – Vision–Language Alignment & Spatial Grounding: All model parameters (ViT, image adapter, language transformer) are trainable. The dataset combines image captioning, visual question answering (VQA), and spatial grounding (point and bounding-box annotation) tasks using the same \mathcal{L}_\mathrm{SFT} objective. This stage enhances both semantic alignment and localization capacity.
  3. Stage III – Imitation Learning on Trajectories: The agent is exposed to human and agent-generated action trajectories. For control, the tokenizer’s reserved tokens are repurposed: 22 for mouse actions, 29 for keyboard inputs (mu-law discretization for continuous actions, discrete tokens for key presses and GUI slots). Only the language transformer is fine-tuned:

\mathcal{L}_\mathrm{IL} = -\sum_{t=1}^{T} \log \pi_\theta(a_{t:t+\tau} \mid o_t, x_\text{ins})

This modular approach sequentially enhances world knowledge, vision–language alignment, and action capacity (Li et al., 20 Mar 2025).
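
A minimal sketch of this stage-wise recipe is given below. It assumes a generic VLM object exposing `vit`, `image_adapter`, and `language_model` submodules (hypothetical attribute names; actual backbones such as Qwen2-VL organize their modules differently) and shows the shared next-token objective used for both \mathcal{L}_\mathrm{SFT} and \mathcal{L}_\mathrm{IL}.

```python
import torch
import torch.nn.functional as F

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a submodule."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(vlm, stage: int) -> None:
    """Stage-wise freezing described above (hypothetical attribute names)."""
    if stage == 1:    # Stage I: language post-training, only the LM is updated
        set_trainable(vlm.vit, False)
        set_trainable(vlm.image_adapter, False)
        set_trainable(vlm.language_model, True)
    elif stage == 2:  # Stage II: vision-language alignment, everything trainable
        set_trainable(vlm.vit, True)
        set_trainable(vlm.image_adapter, True)
        set_trainable(vlm.language_model, True)
    elif stage == 3:  # Stage III: imitation learning, again only the LM is updated
        set_trainable(vlm.vit, False)
        set_trainable(vlm.image_adapter, False)
        set_trainable(vlm.language_model, True)

def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Shared next-token cross-entropy used for L_SFT and L_IL.

    Prompt/observation positions are masked with label -100 so only
    response or action tokens contribute to the loss.
    """
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```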

2. Model Architecture and Tokenization for Action Decoding

The core architecture builds on open-source VLMs such as Llava-Next or Qwen2-VL:

  • Vision Transformer (ViT): Processes raw RGB frames or UI snapshots into patch embeddings.
  • Image Adapter: A lightweight two-layer MLP projects ViT output into the LLM’s embedding space.
  • Autoregressive Transformer LLM: Text tokens and image tokens (via special "<img>" tokens) are jointly processed. For long-horizon or partially observable settings, multiple historical frames are concatenated in the prompt.
  • Action Decoder: Discrete actions (keyboard, mouse) and camera motions are retokenized. Mu-law discretization translates continuous movements into reserved language tokens (21 bins/axis).

No specialized attention or grounding heads are introduced. The unified next-token objective enables the VLM to output both descriptions (narrative, world knowledge) and low-level action commands.
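
To make the reserved-token action coding concrete, the sketch below mu-law compresses a continuous camera delta into one of 21 bins and maps it onto a reserved token ID. The companding constant, maximum delta, and token-ID offset are illustrative assumptions; the source specifies only the bin count per axis.

```python
import numpy as np

N_BINS = 21                      # bins per camera axis, as described above
MU = 10.0                        # companding strength (assumed, not specified in the source)
MAX_DELTA = 10.0                 # max camera movement per step in degrees (assumed)
RESERVED_TOKEN_BASE = 151_650    # hypothetical start of the reserved-token ID range

def camera_to_token(delta: float) -> int:
    """Map a continuous camera delta to a discrete reserved-token ID via mu-law."""
    x = np.clip(delta / MAX_DELTA, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(MU * abs(x)) / np.log1p(MU)  # in [-1, 1]
    bin_idx = int(round((compressed + 1.0) / 2.0 * (N_BINS - 1)))   # in [0, N_BINS-1]
    return RESERVED_TOKEN_BASE + bin_idx

def token_to_camera(token_id: int) -> float:
    """Inverse mapping: reserved token ID back to an approximate camera delta."""
    bin_idx = token_id - RESERVED_TOKEN_BASE
    compressed = bin_idx / (N_BINS - 1) * 2.0 - 1.0
    x = np.sign(compressed) * np.expm1(abs(compressed) * np.log1p(MU)) / MU
    return float(x * MAX_DELTA)
```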

3. Pretraining and Data Augmentation Protocols

The post-training corpus combines three sources:

  • World Knowledge QA: 277K entries from the Minecraft Wiki and community forums (e.g., Reddit), formatted as multi-turn dialogs (1.5K–2K tokens per item) that supplement standard LLM pretraining data.
  • Vision–Language Alignment: 35K annotated keyframes from YouTube agent rollouts, with 15K captions and 20K VQA pairs, curated via advanced VLMs (GPT-4o, Claude 3.5, Molmo) and filtered by Llama 3.1.
  • Spatial Grounding: 404K instances, including 236K embodied frames (localized by SAM2) and 168K GUI inventory slot examples.
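
The source does not publish the record schema for these corpora; a plausible, purely hypothetical layout for a single spatial-grounding example might look like the following.

```python
# Hypothetical record layout for a spatial-grounding training example.
# Field names and values are illustrative; the source does not publish the schema.
grounding_example = {
    "image": "frames/youtube_rollout_000123.png",
    "instruction": "Point to the iron ore block.",
    "answer_point": [412, 287],            # pixel (x, y), e.g. from SAM2-assisted labeling
    "answer_bbox": [388, 260, 441, 315],   # optional bounding box [x1, y1, x2, y2]
    "task_type": "spatial_grounding",
}
```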

Extensive image data augmentations are applied, including random hue/saturation/brightness/contrast shifts, translation, rotation, scaling, shearing, and flipping (notably during Stage I/II). Training utilized a cosine learning-rate schedule, AdamW optimizer, bfloat16 precision, and large-scale distributed GPU resources (32×A800-80GB, total 640 GPU-hours).
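
These augmentations correspond to standard image transforms. A minimal torchvision-based sketch is shown below; the jitter and affine ranges are illustrative assumptions, not values reported in the paper.

```python
from torchvision import transforms

# Illustrative augmentation pipeline; parameter ranges are assumptions.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1),
                            scale=(0.9, 1.1), shear=5),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```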

4. Evaluation in Minecraft: Task Success and Benchmarks

Evaluation is performed using the MCU benchmark on over 1,000 atomic subtasks:

  • Mining: e.g., "Mine iron ore with stone pickaxe"
  • Killing: e.g., "Kill a sheep/spider"
  • Crafting: e.g., "Craft diamond sword"
  • Smelting: e.g., "Smelt iron ingot"
  • Cooking: e.g., "Cook beef"
  • GUI Tasks: Crafting and inventory manipulation via discrete slot selection

Each task is evaluated over at least 30 runs, averaging success rates. The results are summarized as follows:

| Model | Mine Blocks | Kill Entities | Craft Items | Smelt Items |
|---|---|---|---|---|
| VPT-BC (248M) | 0.33 | 0.44 | 0.41 | 0.05 |
| STEVE-1 (248M) | 0.54 | 0.38 | 0.57 | 0.33 |
| MineDreamer | 0.55 | 0.39 | 0.42 | 0.30 |
| Qwen2-VL (IL only) | 0.75 | 0.86 | 0.65 | 0.29 |
| JARVIS-VLA (Qwen2-VL, post-trained) | 0.88 | 0.95 | 0.77 | 0.70 |

JARVIS-VLA (the post-trained Qwen2-VL variant in the table above) achieves a ~40% improvement over the strongest agent baselines in key categories. The benefits of vision-language post-training are especially pronounced in GUI-intensive tasks.
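
The evaluation protocol amounts to repeated rollouts per atomic task with success averaging. The loop below is a schematic; `make_env`, the observation/instruction interface, and `agent.act` are hypothetical placeholders rather than the actual MCU API.

```python
def evaluate_task(agent, make_env, task_id: str,
                  n_runs: int = 30, max_steps: int = 1200) -> float:
    """Average success rate of `agent` on one atomic task over `n_runs` rollouts.

    The env and agent interfaces here are illustrative placeholders.
    """
    successes = 0
    for _ in range(n_runs):
        env = make_env(task_id)
        obs, instruction = env.reset()
        for _ in range(max_steps):
            action = agent.act(obs, instruction)   # VLM decodes reserved action tokens
            obs, done, success = env.step(action)
            if done:
                successes += int(success)
                break
    return successes / n_runs
```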

5. Ablation Studies and Scaling Laws

To disentangle the effects of each modality, three ablations are performed using Qwen2-VL as the base:

  • World Knowledge Only: Improves reasoning, but has less impact than the visual supervision signals.
  • Vision–Language Alignment Only: Major gains in grounding and captioning.
  • Spatial Grounding Only: Yields the largest improvement in long-horizon embodied tasks.

Increasing the number of imitation trajectories reduces \mathcal{L}_\mathrm{IL} below ≈0.22; once the loss falls below this threshold, task success rates rise sharply. Lower evaluation losses in Stage II are linearly correlated with higher downstream success rates, establishing scaling laws for vision-language post-training analogous to those found in pure-text LLMs.
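
The reported loss-to-success relationship can be quantified with an ordinary least-squares fit. The helper below is a sketch: its inputs would be per-checkpoint Stage II evaluation losses and the corresponding measured success rates; no values from the paper are embedded.

```python
import numpy as np

def fit_loss_to_success(stage2_eval_loss: np.ndarray, success_rate: np.ndarray):
    """Linear fit relating Stage II evaluation loss to downstream task success.

    Each array holds one entry per checkpoint: the held-out Stage II loss and
    the averaged success rate measured in the environment for that checkpoint.
    """
    slope, intercept = np.polyfit(stage2_eval_loss, success_rate, deg=1)
    pearson_r = np.corrcoef(stage2_eval_loss, success_rate)[0, 1]
    return slope, intercept, pearson_r
```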

6. Contributions, Limitations, and Prospects

Contributions:

  • Introduction of the first VLA agents in Minecraft that follow free-form human instructions across >1,000 atomic subtasks.
  • Definition and empirical validation of the "Visual Language Post-Training" paradigm; self-supervised vision-language signals boost decision-making by ~40% over the best imitation-learning baselines.
  • Scaling laws for post-training dataset size and model size are demonstrated across Llava-Next and Qwen2-VL backbones.
  • Resources (code, checkpoints, datasets) publicly available at https://craftjarvis.github.io/JarvisVLA.

Limitations:

  • Inference throughput remains costly: reaching ≈55 FPS requires 4×A800 GPUs, leaving limited headroom over the ~40 Hz rate targeted for real-time control.
  • A performance gap persists relative to top human players, who reach ≈90% success.
  • Further work is needed to optimize inference with sparse mixture-of-experts (MoE) architectures, scale non-trajectory corpora, and generalize to other open-world settings.

A plausible implication is that this post-training paradigm can be generalized to other domains where active decision-making under vision-language guidance is critical. The staged training scheme improves sample efficiency, reduces trajectory-annotation requirements, and yields versatile agents with strong world knowledge and spatial reasoning (Li et al., 20 Mar 2025).

References

  • Li et al., "JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse," 20 Mar 2025. Project page: https://craftjarvis.github.io/JarvisVLA
