
ProgressLM-3B: Structured Vision-Language Reasoning

Updated 28 January 2026
  • ProgressLM-3B is a 3B-parameter multimodal transformer that estimates task progress by retrieving a reference step and simulating state changes.
  • It integrates visual and textual modalities using a structured two-stage paradigm—episodic retrieval followed by mental simulation—to deliver interpretable predictions.
  • Empirical results show significant performance gains with reduced error and enhanced robustness compared to baseline vision-language models.

ProgressLM-3B is a 3-billion-parameter multimodal transformer designed for progress reasoning in vision-language contexts, particularly robotic manipulation and activity monitoring. Unlike conventional vision-language models (VLMs) that focus on static content recognition, ProgressLM-3B estimates how far a given task has progressed based on episodic visual or textual inputs. The model incorporates a structured, human-inspired two-stage schema to anchor its predictions in interpretable reasoning steps, enabling robust handling of diverse modalities, viewpoints, and ambiguous cases (Zhang et al., 21 Jan 2026).

1. Model Architecture

ProgressLM-3B is built on the Qwen2.5-VL-3B backbone, featuring a 3B parameter transformer:

  • Visual Encoder: A Vision Transformer (ViT)-style stack processes input images into sequences of 768-dimensional patch feature vectors. These are linearly projected and interfaced with the LLM via cross-attention layers.
  • Language Backbone: The language component consists of an autoregressive transformer decoder (~24 layers, 32 attention heads), interleaving self-attention over text tokens with cross-attention to visual tokens.
  • Cross-Modal Fusion: Each decoder layer has a cross-attention block that integrates both image embeddings (E_img) and text prefix embeddings (E_txt), thereby aligning visual observations and textual demonstrations with generated output tokens.
  • Structured Output Head: The terminal transformer layer feeds a classification head that generates tokens according to a dedicated schema: <ref_think>, <ref>, <score_think>, <score>. Fine-tuning is performed via low-rank adapters (LoRA, rank = 8), so only a small subset of parameters is updated during training.

This architecture enforces modular cross-modal information exchange and structured reasoning output, differentiating it from purely end-to-end regression-based VLMs (Zhang et al., 21 Jan 2026).
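
The LoRA fine-tuning mentioned above keeps the base weights frozen and trains only two low-rank factors per adapted layer. Below is a minimal pure-Python sketch of the effective-weight computation; the shapes and the alpha/rank scaling convention are illustrative assumptions, not the model's actual configuration.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def lora_weight(w, a, b, alpha=16.0, rank=8):
    """Return the effective weight W' = W + (alpha / rank) * B @ A.

    Only A (rank x d_in) and B (d_out x rank) are trained; the frozen base
    weight W (d_out x d_in) is left untouched.
    """
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update, shape (d_out x d_in)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]
```

Because the update has rank at most 8, the number of trainable parameters per layer is rank × (d_in + d_out) rather than d_in × d_out.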

2. Two-Stage Progress Reasoning Paradigm

ProgressLM-3B operationalizes a two-stage progress reasoning process inspired by cognitive science:

  1. Episodic Retrieval: The model first localizes the current observation relative to a demonstration sequence, identifying the most semantically similar reference step (frame or action).
  2. Mental Simulation: Conditioned on this anchor, the model reasons about how the task state has changed, estimating normalized progress from the retrieved reference to the current observation.

During both prompting and training, the model outputs the following four fields in mandatory sequence:

| Field | Description |
| --- | --- |
| <ref_think> | Natural-language justification for the retrieved demonstration step |
| <ref> | Index of the chosen reference step |
| <score_think> | Reasoning about state changes between anchor and observation |
| <score> | Final numeric progress estimate in [0%, 100%] |

Ground-truth chains of thought (CoT) provide supervised learning signals for both retrieval and simulation. At inference, this schema prohibits direct regression, enforcing explicit articulation of intermediate reasoning steps. This configuration encourages interpretability and decomposability of errors (Zhang et al., 21 Jan 2026).
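
The mandatory four-field sequence can be enforced at parse time. A hedged sketch of a validator follows, assuming an XML-style tag serialization; the paper specifies the field names and their order, not the exact markup.

```python
import re

# Pattern requiring all four fields in the mandatory order. The tag syntax
# (<field>...</field>) is an illustrative assumption.
PATTERN = re.compile(
    r"<ref_think>(?P<ref_think>.*?)</ref_think>\s*"
    r"<ref>(?P<ref>\d+)</ref>\s*"
    r"<score_think>(?P<score_think>.*?)</score_think>\s*"
    r"<score>(?P<score>\d+(?:\.\d+)?)</score>",
    re.DOTALL,
)

def parse_progress_output(text):
    """Return a dict with the four fields, or None if the schema is violated."""
    m = PATTERN.search(text)
    if m is None:
        return None
    out = m.groupdict()
    out["ref"] = int(out["ref"])        # retrieved anchor index
    out["score"] = float(out["score"])  # progress estimate in [0, 100]
    return out
```

A schema violation (missing or reordered fields) yields None, which is exactly the kind of signal the format reward in Section 3 can penalize.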

3. Training Data, Objectives, and Hyperparameters

ProgressLM-45K Dataset

  • Composition: 45K samples; 25K for supervised fine-tuning (SFT) and 20K for reinforcement learning (RL).
  • Sources: Demonstrations are sampled from 240 robot-manipulation trajectories (distinct from Progress-Bench), with both vision key-frame and text action formats.
  • Annotations: Each sample contains a normalized progress value (p*, interpolated between steps) and an explicit retrieval-anchor index (j*).
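
The interpolated progress label can be sketched as follows; the exact convention (0-indexed anchors, linear interpolation toward the next step) is an assumption on our part.

```python
def interpolated_progress(anchor_idx, frac, num_steps):
    """Normalized progress in [0, 1] for an observation lying a fraction
    `frac` of the way between demonstration step `anchor_idx` and the next
    step, in a demonstration with `num_steps` steps (0-indexed anchors).
    """
    assert 0.0 <= frac <= 1.0 and num_steps > 1
    return (anchor_idx + frac) / (num_steps - 1)
```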

Training Functions

  1. Supervised Fine-Tuning (SFT):

L_\text{SFT} = -\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(r^{i*} \mid D^i, o^i)

where r^{i*} is the ground-truth CoT trace for demonstration D^i and observation o^i.
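
Since each trace probability factorizes autoregressively over its tokens, L_SFT reduces to an average negative log-likelihood. A minimal sketch, assuming per-token ground-truth probabilities are already extracted from the model:

```python
import math

def sft_loss(trace_token_probs):
    """trace_token_probs[i][t] is the model probability assigned to the t-th
    ground-truth token of trace i. Returns the mean negative log-likelihood
    -(1/N) * sum_i log P(r*_i | D_i, o_i), with each trace's log-probability
    accumulated as a sum of per-token logs.
    """
    n = len(trace_token_probs)
    total = 0.0
    for probs in trace_token_probs:
        total += sum(math.log(p) for p in probs)  # log of the trace probability
    return -total / n
```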

  2. Reinforcement Learning (RL): Group Relative Policy Optimization (GRPO) employs a reward function combining adherence to the four-field schema, retrieval accuracy, and progress calibration:

L_\text{RL} = -\mathbb{E}_{r \sim P_\theta}\left[\alpha R_\text{format}(r) + \beta R_\text{ref}(r) + \gamma R_\text{score}(r)\right]

with weights α : β : γ = 1 : 6 : 3.
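
A sketch of the composite reward with the stated 1:6:3 weighting; the individual terms here (binary format and retrieval rewards, a linearly decaying score-calibration term) are illustrative stand-ins for the paper's definitions.

```python
def composite_reward(format_ok, ref_pred, ref_gt, score_pred, score_gt,
                     alpha=1.0, beta=6.0, gamma=3.0):
    """Weighted sum of format, retrieval, and score-calibration rewards."""
    r_format = 1.0 if format_ok else 0.0          # schema adherence
    r_ref = 1.0 if ref_pred == ref_gt else 0.0    # anchor retrieval accuracy
    # Assumed calibration form: 1 at an exact match, decaying linearly with
    # the absolute progress error (scores in [0, 100]).
    r_score = max(0.0, 1.0 - abs(score_pred - score_gt) / 100.0)
    return alpha * r_format + beta * r_ref + gamma * r_score
```

The heavy β weighting makes retrieval correctness the dominant learning signal, consistent with the two-stage design: a wrong anchor undermines any subsequent simulation.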

Optimization and Hyperparameters

  • SFT: LoRA rank 8, learning rate 1 × 10⁻⁴, 10% warmup, cosine scheduler, effective batch size 64 (with gradient accumulation), 2 epochs.
  • RL: Actor learning rate 1 × 10⁻⁶, KL penalty 0.01, batch size 64, 16 rollouts per prompt, 2 epochs on ~20K samples (~23 hours on 16 H100 GPUs).
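
The SFT schedule (10% linear warmup, then cosine decay) can be sketched as follows; decay all the way to zero is an assumption, since the floor value is not stated.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_frac=0.1):
    """Learning rate at `step`: linear warmup to peak_lr over the first
    warmup_frac of training, then cosine decay toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```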

4. Benchmark Design and Empirical Results

Progress-Bench

Progress-Bench evaluates progress reasoning using 240 robotic trajectories and 3,325 sampled observations. Three key factors are systematically controlled:

  • Demonstration Modality: Vision (key-frames) vs Text (action lists).
  • Viewpoint: Same-view vs Cross-view within vision-based inputs.
  • Answerability: Differentiation between well-defined progress and unanswerable (semantic mismatch) observations.

Evaluation Metrics

  • Normalized Score Error (NSE): |p̂ − p| / max(p, 1 − p) (lower = better).
  • Progress Rank Correlation (PRC): Mean Spearman ρ between predicted and ground-truth scores (higher = better).
  • Answerable False Rejection Rate (AFRR): Fraction of answerable samples wrongly judged as unanswerable (lower = better).
  • Unanswerable Detection Accuracy (UDA): Fraction of unanswerable samples correctly detected (higher = better).
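
The first two metrics can be written directly from their definitions. This Spearman computation uses the classic rank-difference formula and assumes no ties among scores, since the paper's tie handling is not specified here.

```python
def nse(p_hat, p):
    """Normalized Score Error for progress values in [0, 1]."""
    return abs(p_hat - p) / max(p, 1.0 - p)

def spearman_rho(xs, ys):
    """Spearman rank correlation (no-ties case): rho = 1 - 6*sum(d^2)/(n(n^2-1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Note the max(p, 1 − p) denominator: NSE scales the raw error by the largest error that was possible given the ground truth, so a fixed absolute error is penalized most near the midpoint of a task.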

Core Results

For answerable samples (vision/text modalities):

| Model | Vision NSE | Vision PRC | Text NSE | Text PRC |
| --- | --- | --- | --- | --- |
| Qwen2.5-VL-3B (base) | 35.0 | 27.6 | 45.9 | 7.5 |
| ProgressLM-3B-SFT | 19.0 | 72.4 | 29.1 | 46.3 |
| ProgressLM-3B-RL | 13.8 | 90.1 | 21.2 | 63.9 |

ProgressLM-3B-RL reduces vision NSE by ~60%, raises vision PRC from 27.6 to 90.1, and roughly halves text NSE. Under cross-view settings, PRC remains high (dropping from 93.5 to 88.8) with moderate NSE degradation (from 10.3 to 15.2), indicating robust viewpoint invariance.
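
The relative improvements quoted above follow directly from the table:

```python
# Headline numbers from the results table (base vs. RL model).
base_vision_nse, rl_vision_nse = 35.0, 13.8
base_text_nse, rl_text_nse = 45.9, 21.2

vision_reduction = (base_vision_nse - rl_vision_nse) / base_vision_nse  # ~0.61
text_ratio = rl_text_nse / base_text_nse  # ~0.46, i.e. text NSE roughly halved
```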

For unanswerable cases, ProgressLM-3B-RL achieves UDA > 90% under both modalities, while AFRR remains moderate (~8%) (Zhang et al., 21 Jan 2026).

5. Diagnostic Analyses and Failure Modes

  • Predicted Score Distributions: Base models often produce outputs concentrated at discrete progress values (e.g., 0%, 50%, 100%), indicating limited sensitivity to intermediate progress states—referred to as multi-peak clustering. ProgressLM-3B-RL achieves a near-uniform output distribution, reflecting genuine progress discernment.
  • Per-Sample Error: Smaller models display heavy-tailed NSE distributions with significant outliers; ProgressLM-3B-RL centralizes error near zero, indicating substantial robustness.
  • Stage Coupling: Joint analysis of retrieval (anchor) and simulation (score) reveals coupling: retrieved anchor steps tightly correspond with optimal progress indices across all conditions (same-view, cross-view, textual).
  • Failure Modes:
    • Viewpoint Sensitivity: Cross-view conditions degrade performance in base models (NSE up 5–15 points, PRC down up to 30). ProgressLM-3B-RL maintains reduced error inflation.
    • Textual Demonstrations: Implicit state tracking is brittle; actions differing subtly are often confused without adequate history integration.
    • Out-of-Domain Generalization: On human-activity data, ProgressLM-3B-RL keeps vision NSE ≈ 15% and PRC ≈ 70%, whereas base models deteriorate (NSE 30–60%, PRC sometimes negative).

6. Implications, Limitations, and Future Directions

Explicit training with the two-stage schema enables ProgressLM-3B to consistently retrieve relevant anchors and simulate task progress, reducing reliance on collapsed heuristics and keeping predictions interpretable. RL-based fine-tuning further tightens performance through separate format, retrieval, and score rewards.

Persisting limitations include:

  • Brittleness with text-only demonstrations, owing to imprecise implicit state tracking.
  • Viewpoint invariance that remains fragile under extreme viewpoint shifts.
  • Answerability detection that, although improved, can be conservative in ambiguous real-world cases.

Anticipated directions include integrating explicit 3D scene parsing or object-centric state tracking to fortify modality and viewpoint robustness, applying structured progress schemas in multi-stage planning tasks, and exploring self-supervised progress annotation from unconstrained video sources (Zhang et al., 21 Jan 2026).

ProgressLM-3B is presented as the first 3B-parameter VLM to exhibit robust, interpretable progress reasoning; by decomposing estimation into retrieval and simulation steps, it validates the effectiveness of human-inspired structured schemas for temporal reasoning in multimodal models.
