
Relative Progress Estimation

Updated 11 November 2025
  • Relative Progress Estimation is a method that maps process evolution onto a normalized [0,1] scale, facilitating consistent monitoring and phase segmentation.
  • It leverages multimodal data fusion, deep regression with spatio-temporal models, and probabilistic phase inference to yield robust progress predictions.
  • RPE integrates self-supervised meta-learning and test-time adaptation, ensuring domain-agnostic performance across variable process speeds and durations.

Relative Progress Estimation (RPE) refers to the mapping of a process's evolution onto a normalized, unitless scale, most commonly $[0,1]$, where $0$ indicates initiation and $1$ indicates completion, irrespective of the absolute duration or speed of the unfolding process. RPE enables process understanding, online monitoring, and adaptive control by estimating the fraction of process completion, supporting downstream tasks such as phase segmentation, remaining time prediction, and generalization to variable process lengths or speeds. RPE methodologies encompass multimodal deep regression frameworks with spatio-temporal modeling, robust post-hoc phase inference via probabilistic modeling, and self-supervised meta-learning strategies for domain-agnostic adaptation.

1. Core Methodologies of Relative Progress Estimation

Early RPE systems use deep regression frameworks to directly predict a scalar progress value, $\hat y \in [0, 1]$, from multimodal sensory observations. In "Progress Estimation and Phase Detection for Sequential Processes" (Li et al., 2017), the pipeline consists of:

  • Multimodal Input: Sensor streams include visual (Kinect depth or RGB video) and auditory modalities (MFSC audio features).
  • Feature Extraction: Each modality is processed by a pretrained CNN backbone (e.g., AlexNet for depth, VGG for RGB), followed by stacked LSTM layers to capture temporal dependencies.
  • Fusion and Regression: Modality-specific features are fused, then fed through fully connected layers, with a scalar progress score output by a linear head followed by a rectified hyperbolic tangent ($\mathrm{rtanh}$).

The regression target is the human-labeled normalized completeness at each frame. The $\mathrm{rtanh}$ activation, defined as $\mathrm{rtanh}(x) = \max(0, \tanh(x))$, ensures outputs in $[0, 1]$ with steeper gradients than a sigmoid, empirically resulting in 30% faster convergence for this task.
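As a minimal sketch of the activation defined above, the following Python compares $\mathrm{rtanh}$ with a logistic sigmoid and their derivatives. The helper names are illustrative and not taken from the paper, and the comparison point $x = 0.5$ is an arbitrary choice.

```python
import math

def rtanh(x: float) -> float:
    """Rectified hyperbolic tangent: max(0, tanh(x)), so outputs stay in [0, 1)."""
    return max(0.0, math.tanh(x))

def rtanh_grad(x: float) -> float:
    """Derivative of rtanh: 1 - tanh(x)^2 for x > 0, and 0 for x <= 0."""
    return 1.0 - math.tanh(x) ** 2 if x > 0 else 0.0

def sigmoid(x: float) -> float:
    """Logistic sigmoid, shown only for comparison."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Near the origin on the positive side, rtanh has the larger slope.
print(rtanh_grad(0.5) > sigmoid_grad(0.5))  # True
```

Note that the slope advantage holds for small positive inputs; for large inputs $\tanh$ saturates faster than the sigmoid, and for negative inputs the rectified unit passes no gradient at all.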

Modern approaches, as illustrated in "Test-Time Adaptation for Generalizable Task Progress Estimation" (Ziakas et al., 11 Jun 2025), formulate RPE as a goal-conditioned value function:

$$V: O \times G \rightarrow [0,1]$$

where $o_t \in O$ is the current observation and $g \in G$ is the natural-language task description. Relative progress is linearly assigned based on trajectory step index, $y_t = t / T$, with $T$ the trajectory length, and learned via mean-squared error over demonstration sequences.
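The linear labeling scheme and its mean-squared-error objective can be illustrated with a short Python sketch. The function names are hypothetical, and indexing steps as $t = 1, \dots, T$ is an assumption; the source does not state whether indexing starts at 0 or 1.

```python
def linear_progress_labels(num_steps: int) -> list[float]:
    """Return y_t = t / T for t = 1..T (1-based indexing is an assumption)."""
    return [t / num_steps for t in range(1, num_steps + 1)]

def mse(predictions: list[float], targets: list[float]) -> float:
    """Mean-squared error between predicted and target progress."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)

# A short and a long demonstration map onto the same [0, 1] scale.
short_demo = linear_progress_labels(4)    # [0.25, 0.5, 0.75, 1.0]
long_demo = linear_progress_labels(100)   # ends at 1.0 as well
print(short_demo[-1] == long_demo[-1])    # True
```

Because the labels depend only on the fraction $t/T$, trajectories of different lengths or speeds share one target scale, which is what makes the estimate relative rather than absolute.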

These frameworks uniformly target relative, not absolute, progress, supporting cross-process generality and robustness to speed or duration variability.

2. Spatio-Temporal Architectures and Model Components

RPE models integrate spatial, temporal, and semantic information:

| Component | Implementation in (Li et al., 2017) | Implementation in (Ziakas et al., 11 Jun 2025) |
| --- | --- | --- |
| Visual Backbone | Pretrained AlexNet (depth), VGG (RGB) | Frozen OpenCLIP ViT-B/32 |
| Audio/Text Backbone | MFSC audio via CNN/LSTM stack | CLIP text encoder (task description) |
| Temporal Modeling | Multiple stacked LSTM layers | Implicit in meta-learned adaptation MLP |
| Fusion | Concatenation and dense layers (FC1, FC2) | Concatenation of CLIP vision + language |
| Regression/Head | Linear + $\mathrm{rtanh}$ activation | Residual MLP, projection, final MLP head |

Notably, (Ziakas et al., 11 Jun 2025) adapts progress estimation to the semantic context of each demonstration via a frozen contrastive vision-language model (OpenCLIP) feeding a small adaptation MLP, with self-supervised adaptation steps occurring at test time. This decouples progress semantics from raw temporal cues, allowing for generalization across domains and process variants.

3. Loss Functions and Adaptation Strategies

Supervised learning for RPE typically employs a regression loss such as mean absolute error (MAE) or mean-squared error (MSE) between predicted and ground-truth progress, $\hat y$ and $y$. In (Li et al., 2017), the completeness loss is the MAE:

$$\mathrm{Loss}_c(\theta) = \frac{1}{|D|}\sum_{i=1}^{|D|} |R(\theta, D_i) - p_i|$$

where $p_i$ is the manually labeled completeness of sample $D_i$, and $R$ denotes the regression model output.

  • Conditional Consistency Loss: When the predicted phase $\hat p$ differs from the true phase $p$, an auxiliary penalty proportional to the offset from the phase-mean completeness $\mu_p$ is added:

$$\mathrm{Loss}_p = \begin{cases} 0 & \hat p = p \\ |R(\theta, D) - \mu_p| & \hat p \neq p \end{cases}$$

The total loss is the weighted sum $\mathrm{Loss} = \alpha\,\mathrm{Loss}_c + \beta\,\mathrm{Loss}_p$, with $\alpha = 0.6$ and $\beta = 0.4$.
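A minimal Python sketch of this weighted objective follows. It assumes the consistency penalty is averaged over samples in the same way as the completeness loss, which the text does not specify; the function names and the toy numbers are hypothetical.

```python
def completeness_loss(preds, labels):
    """Loss_c: mean absolute error between predicted and labeled completeness."""
    return sum(abs(r - p) for r, p in zip(preds, labels)) / len(preds)

def consistency_loss(preds, pred_phases, true_phases, phase_means):
    """Loss_p: |R - mu_p| where the predicted phase is wrong, 0 otherwise.

    Averaging over samples is an assumption of this sketch.
    """
    total = 0.0
    for r, p_hat, p in zip(preds, pred_phases, true_phases):
        if p_hat != p:
            total += abs(r - phase_means[p])  # offset from the true phase's mean
    return total / len(preds)

def total_loss(preds, labels, pred_phases, true_phases, phase_means,
               alpha=0.6, beta=0.4):
    """Weighted sum alpha * Loss_c + beta * Loss_p."""
    return (alpha * completeness_loss(preds, labels)
            + beta * consistency_loss(preds, pred_phases, true_phases, phase_means))

# Toy example: the second sample is assigned the wrong phase.
loss = total_loss(
    preds=[0.2, 0.6], labels=[0.1, 0.5],
    pred_phases=[0, 0], true_phases=[0, 1],
    phase_means={0: 0.15, 1: 0.55},
)
print(round(loss, 4))  # 0.07
```

In the toy example, $\mathrm{Loss}_c = 0.1$ and $\mathrm{Loss}_p = 0.025$, so the total is $0.6 \cdot 0.1 + 0.4 \cdot 0.025 = 0.07$.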

In (Ziakas et al., 11 Jun 2025), RPE models are meta-trained so that a small number of gradient steps on a self-supervised loss,

$$L_\text{self}(x; \theta) = \|f_\text{adapt}(P_K x; \theta) - P_v x\|_2^2$$

improve the prediction loss $L_\text{pred}$ after adaptation. Here $x$ is the concatenated vision-text embedding and $f_\text{adapt}$ is a two-layer MLP. Meta-training uses a MAML-style outer loop, optimizing for fast, effective in-situ test-time adaptation. At inference, only one gradient step per frame on $L_\text{self}$ is required when operating in the implicit memory regime.

Test-Time Adaptation Variants:

  • Explicit-memory (EX): Reset adaptation parameters at each frame and adapt over a local window.
  • Implicit-memory (IM): Carry adaptation state forward, updating incrementally.
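The difference between the two variants can be sketched with NumPy. This is a deliberately simplified illustration, not the paper's implementation: the adapter is a single linear map `W` rather than a two-layer MLP, the projections `P_K` and `P_V` are random stand-ins, the embeddings are synthetic, and the MAML-style meta-training is omitted entirely.

```python
import numpy as np

def self_loss_and_grad(W, k, v):
    """Squared error ||W k - v||^2 and its gradient with respect to W."""
    r = W @ k - v
    return float(r @ r), 2.0 * np.outer(r, k)

def adapt_implicit(W0, keys, values, lr):
    """Implicit memory: carry W forward, taking one gradient step per frame."""
    W = W0.copy()
    states = []
    for k, v in zip(keys, values):
        _, grad = self_loss_and_grad(W, k, v)
        W = W - lr * grad
        states.append(W.copy())
    return states

def adapt_explicit(W0, keys, values, lr, window):
    """Explicit memory: reset to W0 at each frame, then adapt over a local window."""
    states = []
    for t in range(len(keys)):
        W = W0.copy()
        for s in range(max(0, t - window + 1), t + 1):
            _, grad = self_loss_and_grad(W, keys[s], values[s])
            W = W - lr * grad
        states.append(W)
    return states

rng = np.random.default_rng(0)
d, T = 8, 20
x = rng.normal(size=(T, d))                  # synthetic stand-in embeddings
P_K = rng.normal(size=(d, d)) / np.sqrt(d)   # hypothetical key projection
P_V = rng.normal(size=(d, d)) / np.sqrt(d)   # hypothetical value projection
keys, values = x @ P_K.T, x @ P_V.T
W0 = np.zeros((d, d))

im_states = adapt_implicit(W0, keys, values, lr=0.01)
ex_states = adapt_explicit(W0, keys, values, lr=0.01, window=3)
```

In the implicit-memory loop, the state at frame $t$ reflects every earlier frame, whereas the explicit-memory state only ever reflects the last `window` frames. A progress head applied on top of the adapted features would be needed to turn either state into a prediction.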

Value Order Correlation (VOC), the Spearman ρ between predicted progress and time, shows that implicit-memory adaptation yields substantially higher out-of-distribution (OOD) performance (e.g., $0.8203$ vs $0.0423$ for DeepThought tk_pnp), demonstrating the advantage of incremental, stateful test-time adaptation for RPE (Ziakas et al., 11 Jun 2025).
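To make the metric concrete, here is a small dependency-free Python sketch of a Spearman rank correlation between a sequence of predicted progress values and their frame order. The function names are illustrative, and returning 0.0 for a constant prediction is a convention chosen for this sketch (the correlation is undefined in that case).

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = shared
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def value_order_correlation(predicted_progress):
    """Spearman rho between predicted progress and frame index."""
    time_ranks = [float(t) for t in range(1, len(predicted_progress) + 1)]
    return pearson(average_ranks(predicted_progress), time_ranks)

print(round(value_order_correlation([0.1, 0.3, 0.2, 0.4]), 4))  # 0.8
```

A perfectly monotone prediction scores 1, a reversed one scores -1, and the example with one swapped pair scores 0.8, so the metric rewards correct ordering regardless of the absolute progress values.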

4. Phase Segmentation and Remaining-Time Estimation

For processes with well-defined sequential phases, RPE supports phase inference and remaining-time estimation:

  • Phase Segmentation (Li et al., 2017):
    • Probabilistic Modeling: Each phase $k$ is modeled as a Gaussian over completeness values, with parameters $(\mu_k, \Sigma_k, w_k)$ fit to ground-truth data.
    • Phase Prediction: At test time, given estimated completeness $x$, select the phase with the highest log-likelihood among the phase Gaussians:

$$\hat{p} = \arg\max_{1 \le k \le K} \ell_k(x)$$

where

$$\ell_k(x) = \log(w_k) - \tfrac{1}{2} \log(\det(2\pi\Sigma_k)) - \tfrac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k)$$

  • Remaining-Time Estimation:

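$$t = \frac{\tau}{\rho}(1 - \rho)$$

with $\rho$ the current estimated completeness and $\tau$ the elapsed time. The formula extrapolates the average rate observed so far: the projected total duration is $\tau/\rho$, and subtracting the elapsed time $\tau$ gives the remaining time $t$. This enables consistent real-time remaining-time prediction that stays usable in settings where process durations vary.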
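Both steps in this section can be sketched in a few lines of Python. The sketch assumes scalar completeness, so each $\Sigma_k$ reduces to a variance and $\det(2\pi\Sigma_k) = 2\pi\sigma_k^2$. The phase parameters below are made-up illustrative values, not fitted ones, and the function names are hypothetical.

```python
import math

def phase_log_likelihood(x, weight, mean, variance):
    """Weighted 1-D Gaussian log-likelihood of completeness x for one phase."""
    return (math.log(weight)
            - 0.5 * math.log(2.0 * math.pi * variance)
            - 0.5 * (x - mean) ** 2 / variance)

def predict_phase(x, phases):
    """Return the 0-based index of the phase with the highest log-likelihood."""
    scores = [phase_log_likelihood(x, w, mu, var) for (w, mu, var) in phases]
    return max(range(len(scores)), key=lambda k: scores[k])

def remaining_time(elapsed, completeness):
    """t = (tau / rho) * (1 - rho); undefined at rho = 0, so that case is rejected."""
    if not 0.0 < completeness <= 1.0:
        raise ValueError("completeness must lie in (0, 1]")
    return elapsed / completeness * (1.0 - completeness)

# Hypothetical three-phase process: (weight, mean, variance) per phase.
PHASES = [(1 / 3, 0.15, 0.01), (1 / 3, 0.50, 0.02), (1 / 3, 0.85, 0.01)]

print(predict_phase(0.10, PHASES))   # 0
print(remaining_time(6.0, 0.75))     # 2.0
```

In the second call, 6 minutes elapsed at 75% completeness projects a total of 8 minutes, leaving 2. Early in a process, when $\rho$ is close to 0, small errors in the completeness estimate translate into large swings in the remaining-time estimate, which is worth keeping in mind when using the formula online.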

A plausible implication is that with phase-inferred time estimation anchored in completeness, these models offer robust temporal reasoning even under pronounced process speed fluctuations or irregular event timings.

5. Evaluation Metrics and Empirical Performance

Evaluation schemes consistently employ both regression and classification metrics:

| Dataset/Task | Completeness MAE | Phase Accuracy | F1-score | Remaining-Time MAE |
| --- | --- | --- | --- | --- |
| Trauma Resuscitation | 12.65% | 86.06% | 0.67 | 7.5 min (14%) |
| Olympic Swimming | 6.32% | 87.99% | 0.58 | 2.2 min (18%) |

(Li et al., 2017) further applies precision, recall, 2SET metrics, and Matthews correlation coefficient for segmentation analysis. Empirical comparisons to SVM, Random Forest, and previous CNN/classical systems (e.g., EndoVis, LapChole) validate the state-of-the-art performance achieved by the fusion of CNN-LSTM spatio-temporal modeling with GMM-based phase decoding.

(Ziakas et al., 11 Jun 2025) reports VOC as the primary metric. Implicit-memory adaptation (TTT-IM) achieves VOC in the range $0.60$–$0.82$ across out-of-distribution domain/embodiment shifts, substantially outperforming CLIP zero-shot regression, in-context Gemini 1.5 Pro, and non-adaptive baselines.

6. Domain Generalization and Robustness

RPE methods leveraging relative, rather than absolute, progress support automatic adaptation to variable process durations and speeds. In (Li et al., 2017), normalized output ensures logical ordering even when absolute speed differs, and phase segmentation via learned GMMs mitigates noise and sensor irregularities.

(Ziakas et al., 11 Jun 2025) further extends domain robustness using semantic conditioning and test-time adaptation. By constructing embeddings that jointly encode visual observations and language descriptions, and by enabling self-supervised fine-tuning during inference, the method generalizes from a single training environment to unseen tasks, robots, and contexts, outperforming in-context learning approaches such as Gemini-based vision-language models.

Retention of adaptation state over time, rather than per-frame resets, is crucial for robust long-horizon RPE. Ablations demonstrate that meta-learned test-time updates (implicit memory) are necessary for high correlation with true process order across OOD domains and robot embodiments.

7. Limitations and Directions for Future Work

Current RPE systems largely operationalize time-based normalization for completeness, i.e., $t/T$. While this is effective in domains where process progress aligns with elapsed time, it may be suboptimal in scenarios where progress is better represented by the completion of specific activities or semantic subgoals. The authors note that alternative, semantically grounded labeling strategies (e.g., activity-based completeness) remain an open research direction (Li et al., 2017).

Further, the efficacy of self-supervised adaptation in (Ziakas et al., 11 Jun 2025) depends on the quality of the learned embedding space and the ability of $L_\text{self}$ to encode semantic task progression across diverse domains. This suggests that joint optimization of embedding and adaptation heads, or integration with more advanced temporal architectures, could yield further improvements.

Broadly, RPE frameworks that combine spatio-temporal multimodal modeling, normalization to relative progress, and online self-supervised adaptation constitute the state of the art for process progress understanding in multi-domain sequential tasks.
