Relative Progress Estimation

Updated 11 November 2025

Relative Progress Estimation is a method that maps process evolution onto a normalized [0,1] scale, facilitating consistent monitoring and phase segmentation.
It leverages multimodal data fusion, deep regression with spatio-temporal models, and probabilistic phase inference to yield robust progress predictions.
RPE integrates self-supervised meta-learning and test-time adaptation, ensuring domain-agnostic performance across variable process speeds and durations.

Relative Progress Estimation (RPE) refers to the mapping of a process's evolution onto a normalized, unitless scale, most commonly $[0,1]$, where $0$ indicates initiation and $1$ indicates completion, irrespective of the absolute duration or speed of the unfolding process. RPE enables process understanding, online monitoring, and adaptive control by estimating the fraction of process completion, supporting downstream tasks such as phase segmentation, remaining time prediction, and generalization to variable process lengths or speeds. RPE methodologies encompass multimodal deep regression frameworks with spatio-temporal modeling, robust post-hoc phase inference via probabilistic modeling, and self-supervised meta-learning strategies for domain-agnostic adaptation.

1. Core Methodologies of Relative Progress Estimation

Early RPE systems use deep regression frameworks to directly predict a scalar progress value, $\hat y \in [0, 1]$, from multimodal sensory observations. In "Progress Estimation and Phase Detection for Sequential Processes" (Li et al., 2017), the pipeline consists of:

- Multimodal Input: Sensor streams include visual (Kinect depth or RGB video) and auditory modalities (MFSC audio features).
- Feature Extraction: Each modality is processed by a pretrained CNN backbone (e.g., AlexNet for depth, VGG for RGB), followed by stacked LSTM layers to capture temporal dependencies.
- Fusion and Regression: Modality-specific features are fused, then fed through fully connected layers, with a scalar progress score output by a linear head followed by a rectified hyperbolic tangent ($\mathrm{rtanh}$).

The regression target is the human-labeled normalized completeness at each frame. The $\mathrm{rtanh}$ activation, defined as $\mathrm{rtanh}(x) = \max(0, \tanh(x))$, ensures outputs in $[0, 1]$ with steeper gradients than a sigmoid, which empirically gave 30% faster convergence on this task.
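The activation and its gradient can be sketched in a few lines of plain Python. This is only an illustration of the definition above, not code from (Li et al., 2017); the helper names `rtanh_grad`, `sigmoid`, and `sigmoid_grad` are introduced here for comparison.

```python
import math


def rtanh(x: float) -> float:
    """Rectified hyperbolic tangent: max(0, tanh(x)), so the output lies in [0, 1]."""
    return max(0.0, math.tanh(x))


def rtanh_grad(x: float) -> float:
    """Derivative of rtanh: 1 - tanh(x)^2 for x > 0, and 0 otherwise."""
    return 1.0 - math.tanh(x) ** 2 if x > 0 else 0.0


def sigmoid(x: float) -> float:
    """Logistic sigmoid, shown only for comparison."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid; it never exceeds 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


if __name__ == "__main__":
    for x in (-1.0, 0.5, 2.0):
        print(x, rtanh(x), rtanh_grad(x), sigmoid_grad(x))
```

For small positive inputs the rtanh gradient is close to 1, while the sigmoid gradient is at most 0.25, which is one way to read the "steeper gradients" remark. For negative inputs the rtanh gradient is zero.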

Modern approaches, as illustrated in "Test-Time Adaptation for Generalizable Task Progress Estimation" (Ziakas et al., 11 Jun 2025), formulate RPE as a goal-conditioned value function:

$V: O \times G \rightarrow [0,1]$

where $o_t \in O$ is the current observation and $g \in G$ is the natural-language task description. Relative progress labels are assigned linearly from the trajectory step index, $y_t = t / T$ for a demonstration of $T$ steps, and the value function is trained with mean-squared error over demonstration sequences.
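A minimal sketch of the linear labeling and the MSE objective follows. It assumes steps are indexed $t = 1, \dots, T$ so that the final frame receives label 1; the source does not state the indexing convention, and the function names are illustrative.

```python
from typing import List, Sequence


def linear_progress_labels(num_steps: int) -> List[float]:
    """Return y_t = t / T for t = 1..T (assumed 1-based indexing)."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return [t / num_steps for t in range(1, num_steps + 1)]


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean-squared error between predicted and target progress values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    labels = linear_progress_labels(4)
    print(labels)  # [0.25, 0.5, 0.75, 1.0]
    print(mean_squared_error([0.2, 0.5, 0.8, 1.0], labels))
```

Because the labels depend only on $t/T$, two demonstrations of the same task recorded at different speeds receive the same label at the same relative position.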

These frameworks uniformly target relative, not absolute, progress, supporting cross-process generality and robustness to speed or duration variability.

2. Spatio-Temporal Architectures and Model Components

RPE models integrate spatial, temporal, and semantic information:

Component	Implementation in (Li et al., 2017)	Implementation in (Ziakas et al., 11 Jun 2025)
Visual Backbone	Pretrained AlexNet (depth), VGG (RGB)	Frozen OpenCLIP ViT-B/32
Audio/Text Backbone	MFSC audio via CNN/LSTM stack	CLIP text encoder (task description)
Temporal Modeling	Multiple stacked LSTM layers	Implicit in meta-learned adaptation MLP
Fusion	Concatenation and dense layers (FC1, FC2)	Concatenation of CLIP vision + language
Regression/Head	Linear + $\mathrm{rtanh}$ activation	Residual MLP, projection, final MLP head

Notably, (Ziakas et al., 11 Jun 2025) adapts progress estimation to the semantic context of each demonstration via a frozen contrastive vision-language model (OpenCLIP) feeding a small adaptation MLP, with self-supervised adaptation steps occurring at test time. This decouples progress semantics from raw temporal cues, allowing for generalization across domains and process variants.

3. Loss Functions and Adaptation Strategies

Supervised learning for RPE typically employs a regression loss such as mean absolute error (MAE) or mean-squared error (MSE) between the predicted progress $\hat y$ and the ground-truth progress $y$.

In (Li et al., 2017), the completeness loss is the MAE over the dataset $D$:

$\mathrm{Loss}_c(\theta) =\frac1{|D|}\sum_{i=1}^{|D|}|R(\theta, D_i) - p_i|$

where $p_i$ is the manually labeled completeness of sample $D_i$, and $R(\theta, D_i)$ denotes the regression model output for that sample.

Conditional Consistency Loss: When the predicted phase $\hat p$ differs from the true phase $p$, an auxiliary penalty proportional to the offset from the phase-mean completeness $\mu_p$ is added:

$\mathrm{Loss}_p = \begin{cases} 0 & \text{if } \hat p = p \\ |R(\theta, D) - \mu_p| & \text{if } \hat p \neq p \end{cases}$

The total loss is the weighted sum $\mathrm{Loss} = \alpha\,\mathrm{Loss}_c + \beta\,\mathrm{Loss}_p$ with $\alpha=0.6$ and $\beta=0.4$.
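The combined objective can be sketched as below. Two points are assumptions of this sketch rather than details confirmed by the source: $\mathrm{Loss}_p$ is averaged over the batch, and $\mu_p$ is taken to be the mean completeness of the true phase. All function and argument names are hypothetical.

```python
from typing import Dict, Sequence


def completeness_loss(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Loss_c: mean absolute error between predicted and labeled completeness."""
    return sum(abs(r - p) for r, p in zip(preds, labels)) / len(preds)


def phase_penalty(preds: Sequence[float],
                  predicted_phases: Sequence[int],
                  true_phases: Sequence[int],
                  phase_means: Dict[int, float]) -> float:
    """Loss_p: 0 when the phase is correct, else |R - mu_p| (batch-averaged, assumed)."""
    total = 0.0
    for r, p_hat, p in zip(preds, predicted_phases, true_phases):
        if p_hat != p:
            total += abs(r - phase_means[p])  # mu_p of the true phase (assumed reading)
    return total / len(preds)


def total_loss(preds, labels, predicted_phases, true_phases, phase_means,
               alpha: float = 0.6, beta: float = 0.4) -> float:
    """Weighted sum alpha * Loss_c + beta * Loss_p with the weights quoted above."""
    return (alpha * completeness_loss(preds, labels)
            + beta * phase_penalty(preds, predicted_phases, true_phases, phase_means))


if __name__ == "__main__":
    print(total_loss(preds=[0.2, 0.9], labels=[0.1, 0.8],
                     predicted_phases=[0, 0], true_phases=[0, 1],
                     phase_means={0: 0.15, 1: 0.75}))
```

In the example, only the second sample has a misclassified phase, so only it contributes to the penalty term.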

In (Ziakas et al., 11 Jun 2025), RPE models are meta-trained so that a small number of gradient steps on a self-supervised loss improve the prediction loss $L_\text{pred}$ after adaptation. The self-supervised loss is

$L_\text{self}(x; \theta) = \|f_\text{adapt}(P_K x; \theta) - P_v x\|_2^2$

where $x$ is the concatenated vision-text embedding and $f_\text{adapt}$ is a two-layer MLP. Meta-training uses a MAML-style outer loop, optimizing for fast, effective in-situ test-time adaptation. At inference, only one gradient step per frame on $L_\text{self}$ is required when operating in the implicit memory regime.

Test-Time Adaptation Variants:

Explicit-memory (EX): Reset adaptation parameters at each frame and adapt over a local window.
Implicit-memory (IM): Carry adaptation state forward, updating incrementally.
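The difference between the two regimes can be illustrated with a toy example. This sketch makes strong simplifications that are not in the source: the adapter is a single linear map `W` rather than a two-layer MLP, the vectors `k` and `v` stand in for the projected embeddings $P_K x$ and $P_v x$, and the learning rate and window size are arbitrary. The MAML-style meta-training is not shown.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Frame = Tuple[Vector, Vector]  # (k, v): stand-ins for P_K x and P_v x


def matvec(W: Matrix, k: Vector) -> Vector:
    return [sum(w * kj for w, kj in zip(row, k)) for row in W]


def self_loss(W: Matrix, k: Vector, v: Vector) -> float:
    """Toy self-supervised loss ||W k - v||^2."""
    return sum((p - t) ** 2 for p, t in zip(matvec(W, k), v))


def grad_step(W: Matrix, k: Vector, v: Vector, lr: float) -> Matrix:
    """One gradient step on ||W k - v||^2; the gradient is 2 (W k - v) k^T."""
    residual = [p - t for p, t in zip(matvec(W, k), v)]
    return [[w - lr * 2.0 * r * kj for w, kj in zip(row, k)]
            for row, r in zip(W, residual)]


def adapt_implicit(W0: Matrix, frames: Sequence[Frame], lr: float = 0.1) -> List[Matrix]:
    """Implicit memory: carry the adapted weights forward, one step per frame."""
    W = [row[:] for row in W0]
    history = []
    for k, v in frames:
        W = grad_step(W, k, v, lr)
        history.append(W)
    return history


def adapt_explicit(W0: Matrix, frames: Sequence[Frame],
                   window: int = 2, lr: float = 0.1) -> List[Matrix]:
    """Explicit memory: reset to W0 at every frame, then adapt on a local window."""
    history = []
    for t in range(len(frames)):
        W = [row[:] for row in W0]
        for k, v in frames[max(0, t - window + 1): t + 1]:
            W = grad_step(W, k, v, lr)
        history.append(W)
    return history


if __name__ == "__main__":
    W0 = [[0.0, 0.0], [0.0, 0.0]]
    frames = [([1.0, 0.0], [1.0, 0.0])] * 5
    im = adapt_implicit(W0, frames)
    ex = adapt_explicit(W0, frames)
    print(self_loss(im[-1], *frames[-1]), self_loss(ex[-1], *frames[-1]))
```

In this toy setting the implicit-memory weights keep improving over the whole sequence, while the explicit-memory weights never see more than `window` updates. It is not evidence about either regime on real data.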

Value Order Correlation (VOC), the Spearman ρ between predicted progress and time, shows that implicit-memory adaptation yields substantially higher OOD performance (e.g., $0.8203$ vs $0.0423$ for DeepThought tk_pnp), demonstrating the advantage of incremental, stateful test-time adaptation for RPE (Ziakas et al., 11 Jun 2025).
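As a rough illustration of the metric, the Spearman correlation between a sequence of predicted progress values and their frame indices can be computed with the standard library alone. This sketch uses average ranks for ties and returns 0.0 for a constant prediction sequence; both choices are conventions of the sketch, not details taken from (Ziakas et al., 11 Jun 2025).

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for m in range(i, j + 1):
            ranks[order[m]] = shared
        i = j + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def value_order_correlation(predicted_progress: Sequence[float]) -> float:
    """Spearman rho between predicted progress and the frame index."""
    time_index = list(range(len(predicted_progress)))
    return pearson(average_ranks(predicted_progress), average_ranks(time_index))


if __name__ == "__main__":
    print(value_order_correlation([0.0, 0.2, 0.5, 0.9]))  # monotone increasing
    print(value_order_correlation([0.1, 0.3, 0.2, 0.4]))  # one swapped pair
```

A value near 1 means the predicted progress rises in the same order as time; a value near 0 means the predictions carry little ordering information.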

4. Phase Segmentation and Remaining-Time Estimation

For processes with well-defined sequential phases, RPE supports phase inference and remaining-time estimation:

Phase Segmentation (Li et al., 2017):
- Probabilistic Modeling: Each phase $k$ is modeled as a Gaussian over completeness values, with parameters $(\mu_k, \Sigma_k, w_k)$ fit to ground-truth data.
- Phase Prediction: At test time, given estimated completeness $x$, select the phase whose Gaussian gives the highest log-likelihood.

$\hat{p} = \arg\max_{1 \le k \le K} \ell_k(x)$

where

$\ell_k(x) = \log(w_k) - \tfrac{1}{2} \log(\det(2\pi\Sigma_k)) - \tfrac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k)$
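For scalar completeness values the covariance $\Sigma_k$ reduces to a variance $\sigma_k^2$, and the decision rule can be sketched as follows. The phase parameters in the example are invented for illustration and are not fitted values from (Li et al., 2017); phases are 0-indexed in the code.

```python
import math
from typing import List, Tuple

# Each phase is (weight w_k, mean mu_k, variance sigma_k^2) over completeness.
Phase = Tuple[float, float, float]


def phase_log_likelihood(x: float, weight: float, mean: float, var: float) -> float:
    """l_k(x) = log w_k - 0.5 * log(2 * pi * var) - 0.5 * (x - mean)^2 / var."""
    return (math.log(weight)
            - 0.5 * math.log(2.0 * math.pi * var)
            - 0.5 * (x - mean) ** 2 / var)


def predict_phase(x: float, phases: List[Phase]) -> int:
    """Return the index of the phase with the highest log-likelihood at x."""
    scores = [phase_log_likelihood(x, w, mu, var) for w, mu, var in phases]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Hypothetical three-phase process: early, middle, late.
    phases = [(0.3, 0.1, 0.01), (0.4, 0.5, 0.02), (0.3, 0.9, 0.01)]
    for x in (0.12, 0.55, 0.95):
        print(x, predict_phase(x, phases))
```

Because the decision depends only on the estimated completeness, the same fitted Gaussians apply to processes of different absolute duration.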

Remaining-Time Estimation:

$t = \frac{\tau}{\rho}(1 - \rho)$

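where $t$ is the estimated remaining time, $\rho$ is the current estimated completeness, and $\tau$ is the elapsed time. The formula follows from extrapolating the total duration as $\tau/\rho$, assuming progress continues at the average rate so far, and subtracting the elapsed time $\tau$. It gives a consistent real-time remaining-time prediction that is robust in settings where process durations vary.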
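A direct transcription of the formula is shown below. The guard against $\rho \le 0$ is an addition of this sketch, since the formula is undefined when no progress has been estimated yet.

```python
def remaining_time(elapsed: float, completeness: float) -> float:
    """Estimate remaining time as (elapsed / completeness) * (1 - completeness)."""
    if not 0.0 < completeness <= 1.0:
        raise ValueError("completeness must be in (0, 1]")
    return elapsed / completeness * (1.0 - completeness)


if __name__ == "__main__":
    # 10 minutes elapsed at 25% completeness: 40 minutes projected in total.
    print(remaining_time(10.0, 0.25))  # 30.0
```

Early in a process, when $\rho$ is small, the estimate is very sensitive to errors in $\rho$, so in practice one might smooth the completeness estimate before applying the formula.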

A plausible implication is that with phase-inferred time estimation anchored in completeness, these models offer robust temporal reasoning even under pronounced process speed fluctuations or irregular event timings.

5. Evaluation Metrics and Empirical Performance

Evaluation schemes consistently employ both regression and classification metrics:

Dataset/Task	Completeness MAE	Phase Accuracy	F1-score	Remaining-Time MAE
Trauma Resuscitation	12.65%	86.06%	0.67	7.5 min (14%)
Olympic Swimming	6.32%	87.99%	0.58	2.2 min (18%)

(Li et al., 2017) further applies precision, recall, 2SET metrics, and Matthews correlation coefficient for segmentation analysis. Empirical comparisons to SVM, Random Forest, and previous CNN/classical systems (e.g., EndoVis, LapChole) validate the state-of-the-art performance achieved by the fusion of CNN-LSTM spatio-temporal modeling with GMM-based phase decoding.

(Ziakas et al., 11 Jun 2025) reports VOC as the primary metric. Implicit-memory adaptation (TTT-IM) achieves VOC in the range $0.60$–$0.82$ across out-of-distribution domain/embodiment shifts, substantially outperforming CLIP zero-shot regression, in-context Gemini 1.5 Pro, and non-adaptive baselines.

6. Domain Generalization and Robustness

RPE methods leveraging relative, rather than absolute, progress support automatic adaptation to variable process durations and speeds. In (Li et al., 2017), normalized output ensures logical ordering even when absolute speed differs, and phase segmentation via learned GMMs mitigates noise and sensor irregularities.

(Ziakas et al., 11 Jun 2025) further extends domain robustness using semantic conditioning and test-time adaptation. By constructing embeddings that jointly encode visual observations and language descriptions, and enabling self-supervised fine-tuning during inference, the method enables generalization from a single training environment to unseen tasks, robots, and contexts, outperforming in-context learning approaches such as Gemini-based vision-language models.

Retention of adaptation state over time, rather than per-frame resets, is crucial for robust long-horizon RPE. Ablations demonstrate that meta-learned test-time updates (implicit memory) are necessary for high correlation with true process order across OOD domains and robot embodiments.

7. Limitations and Directions for Future Work

Current RPE systems largely operationalize time-based normalization for completeness, i.e., $t/T$. While this is effective in domains where process progress aligns with elapsed time, it may be suboptimal in scenarios where progress is better represented by the completion of specific activities or semantic subgoals. The authors note that alternative, semantically grounded labeling strategies (e.g., activity-based completeness) remain an open research direction (Li et al., 2017).

Further, the efficacy of self-supervised adaptation in (Ziakas et al., 11 Jun 2025) depends on the quality of the learned embedding space and the ability of $L_\text{self}$ to encode semantic task progression across diverse domains. This suggests joint optimization of embedding and adaptation heads—or integration with more advanced temporal architectures—could yield further improvements.

Broadly, RPE frameworks that combine spatio-temporal multimodal modeling, normalization to relative progress, and online self-supervised adaptation constitute the state of the art for process progress understanding in multi-domain sequential tasks.

References

Li et al. (2017). Progress Estimation and Phase Detection for Sequential Processes.

Ziakas et al. (2025). Test-Time Adaptation for Generalizable Task Progress Estimation.