OpenVLA-7B Vision-Language-Action Framework
- OpenVLA-7B is an open-source framework that integrates a 7B-parameter Llama 2 backbone with dual-stream visual encoders for unified vision-language-action tasks.
- It employs token-level fusion by appending visual features as tokens alongside language inputs, enabling seamless multi-modal processing without extra cross-modal attention.
- The framework supports comprehensive fine-tuning methods and scalable online reinforcement learning via VLA-RL, advancing efficient and reproducible robotic manipulation.
OpenVLA-7B is an open-source vision-language-action (VLA) model framework that integrates high-capacity transformer-based language modeling with dual-stream visual encoders for general robotic manipulation. The model is trained on extensive robot demonstration data and designed for robust multi-robot, multi-task, and multi-modal control, with subsequent enhancements using scalable online reinforcement learning via the VLA-RL methodology. The framework advances accessible, efficient, and reproducible VLA research for visuomotor policy deployment and adaptation (Kim et al., 2024; Lu et al., 24 May 2025).
1. Model Composition and Multi-Modal Fusion
OpenVLA-7B utilizes a Llama 2 LLM backbone with 7 billion parameters (32 layers, hidden dimension 4096, 32 attention heads). The standard Llama byte-pair-encoding tokenizer is modified: the last 256 unused tokens are repurposed to encode discretized action bins for the robot control signal. The dual-stream visual module consists of DINOv2 (600M parameters, specializing in spatial representation) and SigLIP (600M, focusing on semantic encoding). For each input, the image is partitioned into patches, each embedded by both networks, and the two embeddings are concatenated. A 2-layer MLP projects the concatenated vision features to match the LLM’s token embedding dimension.
Fusion is achieved by directly treating the projected visual features as tokens appended to the natural language instruction tokens, allowing unmodified transformer self-attention across the multi-modal input. This “token-level fusion” obviates the need for explicit cross-modal attention heads or gating structures.
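As a concrete illustration, the sketch below shows this token-level fusion in PyTorch; the feature dimensions, module name, and MLP activation are assumptions for readability rather than the exact OpenVLA implementation.

```python
import torch
import torch.nn as nn

class TokenLevelFusion(nn.Module):
    """Project concatenated DINOv2 + SigLIP patch features and place them
    alongside the language token embeddings (illustrative dimensions)."""

    def __init__(self, vis_dim=2048, llm_dim=4096):
        super().__init__()
        # 2-layer MLP projector from the concatenated vision feature
        # to the LLM token embedding dimension.
        self.projector = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, dino_patches, siglip_patches, text_embeds):
        # dino_patches, siglip_patches: (B, N_patches, d_enc) per encoder
        # text_embeds: (B, N_text, llm_dim) instruction token embeddings
        vis = torch.cat([dino_patches, siglip_patches], dim=-1)   # (B, N, 2*d_enc)
        vis_tokens = self.projector(vis)                          # (B, N, llm_dim)
        # Projected visual tokens are appended to the instruction tokens;
        # the LLM's unmodified self-attention then mixes the two modalities.
        return torch.cat([text_embeds, vis_tokens], dim=1)
```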
Action decoding is handled as follows: the continuous 7-dimensional end-effector command is discretized via quantile binning (256 bins per dimension). The corresponding 256 tokens overwrite the end of the Llama BPE vocabulary, so an action is decoded naturally as a sequence of seven next-token predictions. The joint policy factorization is

$$\pi(a_t \mid o_t, \ell) = \prod_{i=1}^{7} \pi(a_{t,i} \mid o_t, \ell, a_{t,<i}),$$

where $o_t$ is the visual observation and $\ell$ the language instruction, with $a_{t,i} \in \{1, \dots, 256\}$ and each factor given by the softmax over the corresponding action-token logits.
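A minimal sketch of this action tokenization is given below, assuming 256 uniform bins per dimension between the 1st and 99th per-dimension quantiles and the 32,000-token Llama vocabulary; the exact bin construction and token bookkeeping in the released code may differ.

```python
import numpy as np

N_BINS = 256          # bins per action dimension
VOCAB_SIZE = 32000    # Llama BPE vocabulary; the last 256 ids are repurposed

def make_bins(actions):
    """Per-dimension bin edges from training actions of shape (T, 7),
    spanning the 1st-99th quantiles so that outliers are ignored."""
    lo, hi = np.quantile(actions, [0.01, 0.99], axis=0)   # each (7,)
    return np.linspace(lo, hi, N_BINS + 1)                # (N_BINS + 1, 7)

def encode_action(a, edges):
    """Continuous 7-DoF action -> 7 token ids at the end of the vocabulary."""
    bins = np.array([
        np.clip(np.digitize(a[d], edges[:, d]) - 1, 0, N_BINS - 1)
        for d in range(7)
    ])
    return VOCAB_SIZE - N_BINS + bins                      # ids in [31744, 31999]

def decode_action(token_ids, edges):
    """7 predicted token ids -> continuous action via bin centers."""
    bins = np.asarray(token_ids) - (VOCAB_SIZE - N_BINS)
    centers = (edges[:-1] + edges[1:]) / 2                 # (N_BINS, 7)
    return np.array([centers[bins[d], d] for d in range(7)])
```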
2. Training Data, Imitation Objective, and Preprocessing
OpenVLA-7B is pretrained on 970,000 robot episodes distilled from the Open X-Embodiment collection, with an original pool of 2 million demonstrations. Data selection prioritizes single-arm manipulator datasets recorded in third-person RGB, normalizing the action space to 7-DoF end-effector representations. Mixture weights from Octo and other policies are applied to control curriculum balance, with certain sources (e.g., DROID) being selectively sampled or removed for optimization efficiency.
For preprocessing, transitions with trivial (zero) actions are excised to avoid learning idleness, particularly at demonstration initializations. Each action dimension is mapped into 256 quantile bins, ignoring outliers for effective discretization. These bins directly overwrite BPE tokens in the LLM’s vocabulary to enable natural sequence modeling. The sole objective is standard next-token cross-entropy over the action tokens,

$$\mathcal{L}(\theta) = -\sum_{i=1}^{7} \log \pi_\theta(a_{t,i} \mid o_t, \ell, a_{t,<i}).$$
No auxiliary loss is required, as all task and language grounding emerges from this imitation learning.
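Under the assumptions above, this objective can be written as a masked cross-entropy over the action-token positions; the tensor names in the sketch below are hypothetical.

```python
import torch
import torch.nn.functional as F

def action_imitation_loss(logits, labels, action_mask):
    """Next-token cross-entropy restricted to action-token targets.

    logits:      (B, T, V) LLM output logits
    labels:      (B, T)    target token ids (action tokens at the sequence end)
    action_mask: (B, T)    True where the target is an action token
    """
    # Shift so position t predicts token t+1, as in standard LM training.
    logits = logits[:, :-1]
    labels = labels[:, 1:]
    mask = action_mask[:, 1:].float()
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    )
    return (loss * mask.reshape(-1)).sum() / mask.sum().clamp(min=1)
```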
3. Adaptation and Fine-Tuning Techniques
The framework supports comprehensive fine-tuning modalities:
- Full Fine-Tuning: All 7B parameters updated (on the order of 5–15 hours on 8 A100 GPUs for 10–150 demonstrations per task).
- LoRA Adaptation: Low-rank adapters (rank r=32) injected into each transformer linear projection; only 1.4% of parameters (~98M) are updated. LoRA matches full fine-tuning performance (e.g., 68.2% vs. 69.7% on select Franka robotic tasks) while substantially reducing resource requirements, fitting batch sizes up to 16 on a single A100 within 60GB of GPU memory (see the configuration sketch after this list).
- Ablations: Strategies such as “last-layer-only,” “frozen vision,” or “sandwich” finetuning generally underperform LoRA in both success rate and compute efficiency.
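A minimal LoRA configuration sketch using the HuggingFace peft library is given below; the model identifier, adapter hyperparameters, and the "all-linear" target-module shorthand are assumptions for illustration, not the repository's exact recipe.

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Load the released checkpoint (model id assumed from the project page).
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Rank-32 adapters on the transformer linear projections; only ~1.4% of
# parameters become trainable.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=16,                 # illustrative value
    lora_dropout=0.0,
    target_modules="all-linear",   # peft shorthand for all linear layers
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```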
4. Reinforcement Learning with VLA-RL
VLA-RL extends OpenVLA-7B to online reinforcement learning, improving out-of-distribution performance and exploration by recasting manipulation as trajectory-level, token-based multi-turn dialogue.
- Formulation: Robotic manipulation is modeled as a POMDP; the agent’s state at time $t$ is $s_t = (o_t, \ell)$, where $o_t$ is the visual observation and $\ell$ the language instruction. The autoregressive policy $\pi_\theta$ emits an action token sequence $a_t = (a_{t,1}, \dots, a_{t,7})$, from which the 7-DoF action is decoded.
- Objective: The RL objective is to maximize the expected discounted return over trajectories,

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} \gamma^t r_t\Big].$$

Gradient optimization uses REINFORCE-style policy gradients and PPO with a learned value function and advantage estimation,

$$\mathcal{L}^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the importance ratio and $\hat{A}_t$ is the GAE advantage estimate (see the sketch after this list).
- Reward Model: Sparse episode-level rewards are densified by the Robotic Process Reward Model (RPRM)—a fine-tuned vision-LLM with the same backbone as OpenVLA-7B. RPRM leverages milestone segmentation and velocity-based pseudo-reward labeling, enabling dense feedback at each timestep.
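The sketch below illustrates the GAE and clipped-PPO pieces referenced above under common simplifying assumptions (single rollout, zero terminal value); it is not the VLA-RL implementation itself.

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T.
    rewards, values, dones: (T,) tensors; terminal value assumed zero."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        last = delta + gamma * lam * (1 - dones[t]) * last
        adv[t] = last
    return adv

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped PPO surrogate on action-token log-probabilities."""
    ratio = torch.exp(logp_new - logp_old)            # importance ratio r_t(theta)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()      # minimize negative surrogate
```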
5. Infrastructure, Quantization, and Efficiency
OpenVLA-7B is designed for training and inference efficiency across a range of hardware targets.
- Training Compute: Pretraining utilizes 64 A100 GPUs for 14 days (21,500 A100-hours), batch size 2048 over 27 epochs. Fine-tuning typically requires 8 A100s (full) or a single A100 (LoRA) for 5–15 hours.
- Inference Strategies:
- bfloat16 precision: 15GB VRAM, real-time control at 6 Hz (RTX 4090) or 10 Hz (H100).
- 4-bit quantization: Reduces VRAM to 7GB, maintains success rates (e.g., 71.9% ± 4.7% over 80 rollouts).
- 8-bit quantization: Further compresses to 10GB but dequantization overhead degrades speed (1–2 Hz).
- Inference server is implemented as a gRPC streaming API; supports both precision modes and speculative decoding.
- Batch Decoding and Parallelization: In RL, vLLM and LoRA-merged weights allow batch action generation up to 64, reducing latency (15ms per step). Vectorized environments (64 simulators per GPU) enable data-efficient rollouts; curriculum sampling targets the 50% success frontier for improved exploration.
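As an illustration of the curriculum idea, the sketch below biases task sampling toward the 50% success frontier; the exponential weighting and temperature are assumptions, not the exact VLA-RL rule.

```python
import numpy as np

def sample_tasks(success_rates, batch_size, temperature=0.1):
    """Sample tasks, preferring those whose running success rate is near 50%.

    success_rates: dict mapping task id -> running success rate in [0, 1]
    """
    ids = list(success_rates)
    rates = np.array([success_rates[t] for t in ids])
    # Weight peaks at 0.5 success and decays toward 0% and 100%.
    weights = np.exp(-np.abs(rates - 0.5) / temperature)
    probs = weights / weights.sum()
    return list(np.random.choice(ids, size=batch_size, p=probs))
```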
6. Empirical Evaluation and Benchmarking
OpenVLA-7B exhibits strong performance across zero-shot, multi-task, and fine-tuning benchmarks:
Multi-Robot Manipulation (BridgeData V2, 170 rollouts, 17 tasks):
- OpenVLA: 70.6% ± 3.2%
- RT-2-X (closed, 55B): 50.6% ± 3.5%
- Octo (open, 93M): 20.0% ± 2.6%
Generalization (Google Robot, 12 tasks, 60 rollouts):
- OpenVLA: 85.0% ± 4.6%
- RT-2-X: 78.3% ± 5.4%
- Octo: 26.7%
- RT-1-X: 33.3%
Data-Efficient Fine-Tuning (Franka arms, 7 tasks, 129 rollouts):
- OpenVLA (full): 67.2% ± 4.0%
- Diffusion Policy: 48.5% ± 4.9%
- Octo (fine-tuned): 43.4% ± 4.4%
Parameter-Efficient Adaptation:
- LoRA-r=32: 68.2% ± 7.5%, 97.6M params
- int4 quantization: matches full VLA success, halves VRAM
LIBERO Tasks (VLA-RL):
Across 40 benchmark tasks (Spatial, Object, Goal, Long), the VLA-RL finetuned OpenVLA-7B achieves:
- Average success rate: 81.0%, average rank 1.5 (outperforming all open-source baselines and approaching the commercial π₀-FAST at 85.5%)
- Ablations show RPRM reward model, curriculum, and critic warmup all substantially contribute to final success.
| Method | Average Success Rate | Average Rank |
|---|---|---|
| Diffusion Policy | 72.4% | 4.0 |
| Octo (SFT) | 75.1% | 3.5 |
| OpenVLA (SFT) | 76.5% | 3.5 |
| GRAPE (DPO) | 79.2% | 2.3 |
| π₀-FAST | 85.5% | — |
| VLA-RL (Ours) | 81.0% | 1.5 |
A plausible implication is that scaling inference compute (batch size, search) for autoregressive VLA policies yields monotonic gains (“inference scaling laws”), analogous to phenomena in LLMs (Lu et al., 24 May 2025).
7. Openness, Code Availability, and Reproducibility
All model weights (bfloat16 and 4-bit), the PyTorch codebase, fine-tuning notebooks, and instruction-data pipelines are publicly released at https://openvla.github.io and on the HuggingFace model hub. The repository provides built-in support for HuggingFace AutoModel classes, FSDP, FlashAttention, AMP, LoRA, and bitsandbytes quantization. Example notebooks facilitate reproducible fine-tuning and deployment, including a gRPC server/client for real-time streaming control (Kim et al., 2024).
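A minimal loading sketch with 4-bit bitsandbytes quantization via the HuggingFace APIs listed above is shown below; the model identifier and the predict_action helper are assumed from the public release and should be verified against the repository.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# 4-bit weight loading via bitsandbytes.
quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=quant_cfg,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Example rollout step (helper name and unnorm_key are assumptions; check the repo):
# inputs = processor(f"In: What action should the robot take to {instruction}?\nOut:",
#                    image).to("cuda", dtype=torch.bfloat16)
# action = model.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```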
The OpenVLA-7B framework establishes a reproducible, extensible, and scalable baseline for vision-language-action research, uniting large-scale robot demonstration data, multi-modal transformer architectures, parameter-efficient adaptation, and high-throughput deployment.