ProgressLM-3B: Structured Vision-Language Reasoning
- ProgressLM-3B is a 3B-parameter multimodal transformer that estimates task progress by retrieving a reference step and simulating state changes.
- It integrates visual and textual modalities using a structured two-stage paradigm—episodic retrieval followed by mental simulation—to deliver interpretable predictions.
- Empirical results show significant performance gains with reduced error and enhanced robustness compared to baseline vision-language models.
ProgressLM-3B is a 3-billion-parameter multimodal transformer designed for progress reasoning in vision-language contexts, particularly in robotic manipulation and activity monitoring. Unlike conventional vision-language models (VLMs) that focus on static content recognition, ProgressLM-3B systematically estimates how far a given task has progressed based on episodic visual or textual inputs. The model incorporates a structured, human-inspired two-stage schema that anchors its predictions in interpretable reasoning steps, enabling robust handling of diverse modalities, viewpoints, and ambiguous cases (Zhang et al., 21 Jan 2026).
1. Model Architecture
ProgressLM-3B is built on the Qwen2.5-VL-3B backbone, a 3B-parameter transformer:
- Visual Encoder: A Vision Transformer (ViT)-style stack processes input images into sequences of 768-dimensional patch feature vectors. These are linearly projected and interfaced with the LLM via cross-attention layers.
- Language Backbone: The language component consists of an autoregressive transformer decoder (~24 layers, 32 attention heads), interleaving self-attention over text tokens with cross-attention to visual tokens.
- Cross-Modal Fusion: Each decoder layer has a cross-attention block that integrates image embeddings and text prefix embeddings, thereby aligning visual observations and textual demonstrations with generated output tokens.
- Structured Output Head: The terminal transformer layer feeds a classification head that generates tokens according to a dedicated schema: ref_think, ref, score_think, score. Fine-tuning is performed via low-rank adapters (LoRA, rank = 8) so that only a small subset of parameters is updated during training.
This architecture enforces modular cross-modal information exchange and structured reasoning output, differentiating it from purely end-to-end regression-based VLMs (Zhang et al., 21 Jan 2026).
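The cross-modal fusion step can be illustrated with a toy, dependency-free sketch (single head, single query; the real model uses multi-head attention over projected 768-dimensional patch features, so all dimensions and names below are illustrative):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(text_query, image_keys, image_values):
    """Single-head, single-query cross-attention: a text-token query
    attends over visual patch embeddings (toy-sized vectors)."""
    d = len(text_query)
    scores = [sum(q * k for q, k in zip(text_query, key)) / math.sqrt(d)
              for key in image_keys]
    weights = softmax(scores)
    # Weighted sum of value vectors, component by component.
    return [sum(w * v[i] for w, v in zip(weights, image_values))
            for i in range(len(image_values[0]))]
```

With identical keys the attention weights are uniform, so the output is simply the mean of the value vectors — a quick sanity check on the mechanism.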
2. Two-Stage Progress Reasoning Paradigm
ProgressLM-3B operationalizes a two-stage progress reasoning process inspired by cognitive science:
- Episodic Retrieval: The model first localizes the current observation relative to a demonstration sequence, identifying the most semantically similar reference step (frame or action).
- Mental Simulation: Conditioned on this anchor, the model reasons about how the task state has changed, estimating normalized progress from the retrieved reference to the current observation.
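The two stages can be sketched as follows, assuming per-step embeddings are already available; the function names, the cosine-similarity retrieval, and the interpolation scheme are illustrative, not the paper's exact procedure:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_anchor(obs_emb, demo_embs):
    """Stage 1 (episodic retrieval): index of the demonstration step
    most similar to the current observation."""
    sims = [cosine(obs_emb, e) for e in demo_embs]
    return max(range(len(sims)), key=sims.__getitem__)

def estimate_progress(anchor_idx, n_steps, local_offset=0.0):
    """Stage 2 (mental simulation), reduced to a toy interpolation:
    normalized progress of the anchor plus a within-step offset
    representing how far the state has advanced past the anchor."""
    return min(1.0, (anchor_idx + local_offset) / max(1, n_steps - 1))
```

The real model performs both stages in language (ref_think/score_think) rather than in embedding space; the sketch only mirrors the retrieval-then-simulation decomposition.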
During both prompting and training, the model outputs the following four fields in mandatory sequence:
| Field | Description |
|---|---|
| ref_think | Natural-language justification for the retrieved demonstration step |
| ref | Index of the chosen reference step |
| score_think | Reasoning about state changes between anchor and observation |
| score | Final numeric progress estimate on the normalized progress scale |
Ground-truth chains of thought (CoT) provide supervised learning signals for both retrieval and simulation. At inference, this schema prohibits direct regression, enforcing explicit articulation of intermediate reasoning steps. This configuration encourages interpretability and decomposability of errors (Zhang et al., 21 Jan 2026).
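A minimal validator for the mandatory four-field sequence might look like this; the tag-style serialization is an assumption (the paper fixes the field order, not necessarily this syntax):

```python
import re

# Hypothetical serialization: the schema mandates the field ORDER
# (ref_think, ref, score_think, score), enforced here via one regex.
SCHEMA = re.compile(
    r"<ref_think>(?P<ref_think>.*?)</ref_think>\s*"
    r"<ref>(?P<ref>\d+)</ref>\s*"
    r"<score_think>(?P<score_think>.*?)</score_think>\s*"
    r"<score>(?P<score>\d+(?:\.\d+)?)</score>\s*$",
    re.DOTALL,
)

def parse_output(text):
    """Return the four fields if the output follows the schema, else None.
    A bare number (direct regression) fails validation by construction."""
    m = SCHEMA.match(text.strip())
    if m is None:
        return None
    d = m.groupdict()
    return {"ref_think": d["ref_think"].strip(),
            "ref": int(d["ref"]),
            "score_think": d["score_think"].strip(),
            "score": float(d["score"])}
```

Because the regex anchors all four fields in order, a model cannot skip straight to a score: the validator rejects any output that omits or reorders the reasoning fields.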
3. Training Data, Objectives, and Hyperparameters
ProgressLM-45K Dataset
- Composition: 45K samples; 25K for supervised fine-tuning (SFT) and 20K for reinforcement learning (RL).
- Sources: Demonstrations are sampled from 240 robot-manipulation trajectories (distinct from Progress-Bench), with both vision key-frame and text action formats.
- Annotations: Each sample contains a normalized progress value (interpolated between demonstration steps) and an explicit retrieval anchor index.
Training Objectives
- Supervised Fine-Tuning (SFT): next-token cross-entropy over the ground-truth reasoning trace,
  $\mathcal{L}_{\text{SFT}} = -\sum_{t} \log p_\theta\!\left(y_t \mid y_{<t},\, d,\, o\right),$
  where $y$ is the ground-truth CoT trace for demonstration $d$ and observation $o$.
- Reinforcement Learning (RL): Group Relative Policy Optimization (GRPO) employs a reward function combining adherence to the four-field schema, retrieval accuracy, and progress calibration,
  $R = \lambda_{\text{fmt}} R_{\text{fmt}} + \lambda_{\text{ref}} R_{\text{ref}} + \lambda_{\text{score}} R_{\text{score}},$
  with non-negative scalar weights $\lambda_{\text{fmt}}$, $\lambda_{\text{ref}}$, and $\lambda_{\text{score}}$.
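A hedged sketch of such a composite reward; the component definitions and default weights below are placeholders, not the paper's exact values:

```python
def progress_reward(pred_ref, pred_score, gt_ref, gt_score,
                    schema_ok=True, w_fmt=1.0, w_ref=1.0, w_score=1.0):
    """Composite GRPO-style reward sketch:
    - format term: 1 if the output followed the four-field schema,
    - retrieval term: 1 if the anchor index matches the ground truth,
    - calibration term: 1 minus absolute progress error, clipped at 0.
    Weights w_* are illustrative defaults."""
    if not schema_ok:           # schema violation forfeits all reward
        return 0.0
    r_fmt = 1.0
    r_ref = 1.0 if pred_ref == gt_ref else 0.0
    r_score = max(0.0, 1.0 - abs(pred_score - gt_score))
    return w_fmt * r_fmt + w_ref * r_ref + w_score * r_score
```

Separating the three terms lets GRPO credit a rollout that retrieves the right anchor even when its final score is miscalibrated, which matches the decomposable-error goal of the schema.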
Optimization and Hyperparameters
- SFT: LoRA rank 8, cosine learning-rate schedule with 10% warmup, effective batch size 64 (with gradient accumulation), 2 epochs.
- RL: actor optimized with KL penalty 0.01, batch size 64, 16 rollouts per prompt, 2 epochs on the 20K RL samples (23 hours on 16 H100 GPUs).
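The LoRA update used for fine-tuning can be sketched in a few lines; the `alpha` scaling factor is a common LoRA convention assumed here, not a value reported in the source:

```python
def lora_forward(x, W, A, B, alpha=16.0, rank=8):
    """y = W x + (alpha / rank) * B (A x).
    W is the frozen pretrained weight; only the low-rank factors
    A (rank x d_in) and B (d_out x rank) are trained, so the number
    of updated parameters is a small fraction of the backbone."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)             # frozen path
    low = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + (alpha / rank) * l for b, l in zip(base, low)]
```

With A and B initialized so their product is zero, the adapted model starts exactly at the pretrained behavior, which is the standard LoRA initialization rationale.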
4. Benchmark Design and Empirical Results
Progress-Bench
Progress-Bench evaluates progress reasoning using 240 robotic trajectories and 3,325 sampled observations. Three key factors are systematically controlled:
- Demonstration Modality: Vision (key-frames) vs Text (action lists).
- Viewpoint: Same-view vs Cross-view within vision-based inputs.
- Answerability: Differentiation between well-defined progress and unanswerable (semantic mismatch) observations.
Evaluation Metrics
- Normalized Score Error (NSE): absolute difference between predicted and ground-truth normalized progress, reported as a percentage (lower = better).
- Progress Rank Correlation (PRC): Mean Spearman rank correlation between predicted and ground-truth scores (higher = better).
- Answerable False Rejection Rate (AFRR): Fraction of answerable samples wrongly judged as unanswerable (lower = better).
- Unanswerable Detection Accuracy (UDA): Fraction of unanswerable samples correctly detected (higher = better).
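The four metrics can be sketched in plain Python; the exact NSE normalization and the Spearman tie handling are assumptions (library implementations average tied ranks):

```python
def nse(pred, gt):
    """Normalized Score Error: |prediction - ground truth| on the
    normalized progress scale, expressed as a percentage."""
    return abs(pred - gt) * 100.0

def _ranks(xs):
    """0-based ranks; ties broken by position (simplification)."""
    order = sorted(range(len(xs)), key=xs.__getitem__)
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(pred, gt):
    """Spearman rank correlation via the classic sum-of-squared
    rank-differences formula."""
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(_ranks(pred), _ranks(gt)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def afrr(judged_unanswerable, answerable_mask):
    """Fraction of answerable samples wrongly rejected as unanswerable."""
    ans = [j for j, a in zip(judged_unanswerable, answerable_mask) if a]
    return sum(ans) / len(ans)

def uda(judged_unanswerable, answerable_mask):
    """Fraction of unanswerable samples correctly detected."""
    un = [j for j, a in zip(judged_unanswerable, answerable_mask) if not a]
    return sum(un) / len(un)
```

AFRR and UDA are complementary: a model that calls everything unanswerable maximizes UDA but also maximizes AFRR, so the pair penalizes both over- and under-rejection.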
Core Results
For answerable samples (vision/text modalities):
| Model | Vision NSE | Vision PRC | Text NSE | Text PRC |
|---|---|---|---|---|
| Qwen2.5-VL-3B (base) | 35.0 | 27.6 | 45.9 | 7.5 |
| ProgressLM-3B-SFT | 19.0 | 72.4 | 29.1 | 46.3 |
| ProgressLM-3B-RL | 13.8 | 90.1 | 21.2 | 63.9 |
ProgressLM-3B-RL reduces vision NSE from 35.0 to 13.8, increases vision PRC from 27.6 to 90.1, and more than halves text NSE (45.9 to 21.2). Under cross-view settings, PRC remains high (dropping only from 93.5 to 88.8), with moderate NSE degradation (from 10.3 to 15.2), highlighting robust viewpoint invariance.
For unanswerable cases, ProgressLM-3B-RL achieves high UDA under both modalities, while AFRR remains moderate (Zhang et al., 21 Jan 2026).
5. Diagnostic Analyses and Failure Modes
- Predicted Score Distributions: Base models often produce outputs concentrated at discrete progress values (e.g., 0%, 50%, 100%), indicating limited sensitivity to intermediate progress states—referred to as multi-peak clustering. ProgressLM-3B-RL achieves a near-uniform output distribution, reflecting genuine progress discernment.
- Per-Sample Error: Smaller models display heavy-tailed NSE distributions with significant outliers; ProgressLM-3B-RL centralizes error near zero, indicating substantial robustness.
- Stage Coupling: Joint analysis of retrieval (anchor) and simulation (score) reveals coupling: retrieved anchor steps tightly correspond with optimal progress indices across all conditions (same-view, cross-view, textual).
- Failure Modes:
  - Viewpoint Sensitivity: Cross-view conditions degrade performance in base models (NSE up 5–15 points, PRC down by up to 30 points). ProgressLM-3B-RL limits this error inflation.
  - Textual Demonstrations: Implicit state tracking is brittle; subtly different actions are often confused without adequate history integration.
  - Out-of-Domain Generalization: On human-activity data, ProgressLM-3B-RL keeps vision NSE below 15% and PRC above 70%, whereas base models deteriorate (NSE 30–60%, PRC sometimes negative).
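The multi-peak clustering diagnostic described above can be quantified with a simple peak-mass statistic; the peak set and tolerance here are illustrative choices, not values from the paper:

```python
def peak_mass(preds, peaks=(0.0, 0.5, 1.0), tol=0.02):
    """Fraction of predicted progress values lying within +/- tol of a
    small set of 'default' answers (e.g. 0%, 50%, 100%). High mass
    indicates the model collapses to coarse heuristic answers; a model
    with genuine progress discernment spreads mass between the peaks."""
    hit = sum(any(abs(p - c) <= tol for c in peaks) for p in preds)
    return hit / len(preds)
```

A base model exhibiting multi-peak clustering would score near 1.0 on this statistic, while a near-uniform distribution like ProgressLM-3B-RL's would score close to the peak set's natural coverage of the interval.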
6. Implications, Limitations, and Future Directions
Explicit training with the two-stage schema enables ProgressLM-3B to consistently retrieve relevant anchors and simulate task progress, reducing reliance on collapsed heuristics and fostering interpretability. RL-based fine-tuning further tightens performance through separate format, retrieval, and score-specific rewards.
Persisting limitations include:
- Text-only demonstration brittleness due to imprecise implicit state composition.
- Viewpoint invariance remains susceptible under extreme shifts.
- Answerability detection, though improved, can be conservative under ambiguous real-world cases.
Anticipated directions include integrating explicit 3D scene parsing or object-centric state tracking to fortify modality and viewpoint robustness, applying structured progress schemas in multi-stage planning tasks, and exploring self-supervised progress annotation from unconstrained video sources (Zhang et al., 21 Jan 2026).
ProgressLM-3B serves as the first 3B-parameter VLM exhibiting robust, interpretable progress reasoning by decomposing estimation into retrieval and simulation steps and validating the effectiveness of human-inspired structured schemas for temporal reasoning in multimodal models.