Act2Goal: Agent Goal Inference
- Act2Goal is a framework for inferring agent goals from observed actions using planning-based, model-free, and vision–language methods.
- It employs both symbolic planning and deep reinforcement learning to map action trajectories to probable goals in diverse, dynamic settings.
- Real-time integration of goal recognition, activity recommendation, and policy adaptation enhances decision-making in robotics, human-computer interaction, and process mining.
Act2Goal encompasses a range of approaches for inferring, modeling, and exploiting agent goals from observed actions or action traces. Across fields including robotics, process mining, video activity recognition, and human-computer interaction, Act2Goal formulations combine low-level observations with explicit or implicit reasoning about goal states, leading to actionable policies, improved user modeling, or real-time assistance. This article consolidates research contributions under the Act2Goal nomenclature as found in deep reinforcement learning, symbolic planning, vision–language modeling, automated recommendation, and plan recognition (Zhou et al., 29 Dec 2025, Agarwal et al., 2022, Amato et al., 2019, Nageris et al., 31 Dec 2024, Berkovitch et al., 20 Jun 2024, Granada et al., 2020, Roy et al., 2022, Borrajo et al., 2020, Vallurupalli et al., 11 Aug 2024).
1. Formal Problem Definitions and Paradigms
Act2Goal problems are structured as the inverse of traditional planning or policy learning: rather than mapping goals to action sequences, the system must map observed action traces (O) to probable goals (G). Common formalizations include:
- Plan Recognition: Given observations $O$ (actions) and a set $\mathcal{G}$ of candidate goals, compute $P(G \mid O)$ for each $G \in \mathcal{G}$ using either planning-based (model-based) or sequence-learning (model-free) approaches (Borrajo et al., 2020); a minimal sketch of this posterior computation follows the list.
- Goal Identification from Trajectories: For UI settings, let $\tau = \langle (s_1, a_1), \ldots, (s_T, a_T) \rangle$ be a trajectory of (state, action) pairs; learn a mapping from $\tau$ to a natural-language user goal $g$ (Berkovitch et al., 20 Jun 2024).
- Policy–Conditioned Goal Recognition: Learn one actor–critic policy $\pi_g$ per goal $g$, then assign posterior beliefs to each $g$ by scoring observed trajectories under its policy (Nageris et al., 31 Dec 2024).
- Goal-Oriented Next Activity Recommendation: Formulate next-activity selection in business processes as an MDP with reward structures tied to goal satisfaction and process conformance (Agarwal et al., 2022).
- Vision–Language and Model-Free Approaches: Infer abstract goal representations in a latent space conditioned on observed features, then use goal-consistency criteria to anticipate future actions (Roy et al., 2022).
- Active Goal Recognition (AGR): Model the observer’s combined information-gathering and task accomplishment as a single joint POMDP $\langle S, A, T, R, \Omega, O \rangle$, supporting cost-sensitive sensing and goal declaration (Amato et al., 2019).
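Across these formulations, the shared computational core is a posterior over a finite set of candidate goals given the observed trace. Below is a minimal sketch of that computation, assuming per-goal likelihoods $P(O \mid G)$ are already available from whichever paradigm is in use; the function name, goal labels, and uniform default prior are illustrative rather than drawn from any of the cited systems.

```python
def goal_posterior(likelihoods: dict, priors: dict | None = None) -> dict:
    """Compute P(G | O) ∝ P(O | G) · P(G) over a finite set of candidate goals.

    `likelihoods` maps each candidate goal to P(O | G), obtained from any of
    the paradigms above (planner cost differences, landmark coverage,
    per-goal policy scores, ...). With no prior supplied, goals are assumed
    equally likely a priori.
    """
    if priors is None:
        priors = {g: 1.0 / len(likelihoods) for g in likelihoods}
    unnormalized = {g: likelihoods[g] * priors[g] for g in likelihoods}
    z = sum(unnormalized.values())
    return {g: v / z for g, v in unnormalized.items()}

# Example: three candidate goals scored against one observation trace.
posterior = goal_posterior({"make_coffee": 0.6, "make_tea": 0.3, "wash_cup": 0.1})
print(max(posterior, key=posterior.get))  # -> "make_coffee"
```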
This diversity highlights the centrality of Act2Goal as an abstraction for rationalizing and forecasting agent behavior based on observed evidence in context-sensitive environments.
2. Principal Methodological Families
Several methodological classes are prevalent:
- Planning-Based Recognition: Compute noisy-rational likelihoods $P(O \mid G) \propto e^{-\beta\,\Delta(G,O)}$, where $\Delta(G,O) = c(G,O) - c(G,\bar{O})$ contrasts the optimal plan cost $c(G,O)$ that includes the observation constraints with the cost $c(G,\bar{O})$ that ignores them. Bayesian posteriors then select the most plausible goals (Borrajo et al., 2020).
- Landmark-Based Heuristics: Use fact-landmarks associated with each goal; the proportion of a goal’s landmarks already achieved in the observations, $|\mathit{AL}_G| / |\mathit{L}_G|$, serves as a goal-likelihood proxy (Borrajo et al., 2020).
- Sequence Learning (Model-Free): LSTM networks or ensembles (XGBoost) classify goal labels from discrete action traces (Borrajo et al., 2020, Granada et al., 2020).
- Actor–Critic Deep RL: Learn per-goal policy and value function pairs; score observed trajectories using critic-based values, Wasserstein distances on actions, or Z-score metrics for continuous domains (Nageris et al., 31 Dec 2024). Posterior probabilities are assigned by soft-min normalization over scores (see the sketch after this list).
- Vision–LLMs: Use frozen or fine-tuned large multimodal models to map trajectory sequences or images to goal descriptions, with specialized evaluation methods addressing paraphrase ambiguity and environment-specific intent (Berkovitch et al., 20 Jun 2024, Roy et al., 2022).
- Active Sensing (AGR): Explicitly optimize trade-offs between sensing costs and progress on own tasks within a joint POMDP; derive policies that balance information-gathering against action execution (Amato et al., 2019).
- Deep Policy for Goal-Conditioned Manipulation: Use goal-conditioned world models (Video DiT, 3D-VAE) to generate intermediate visual states bridging current and goal observations; cross-attention integrates multi-scale latent representations for motor policy execution (Zhou et al., 29 Dec 2025).
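For the actor–critic family in particular, the posterior comes from per-goal trajectory scores rather than planner calls. The sketch below shows soft-min normalization over such scores (lower score = the observed behavior better fits that goal’s policy); the numerical scores are hypothetical stand-ins for the critic-based, Wasserstein, or Z-score metrics described above.

```python
import numpy as np

def softmin_posterior(scores: dict, temperature: float = 1.0) -> dict:
    """Turn per-goal trajectory scores (lower = better fit to that goal's
    policy) into a posterior belief via soft-min normalization."""
    goals = list(scores)
    s = np.array([scores[g] for g in goals], dtype=float)
    logits = -s / temperature          # soft-min: negate before the softmax
    logits -= logits.max()             # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return dict(zip(goals, probs))

# Hypothetical scores, e.g. mean action distance between the observed
# trajectory and each goal-conditioned policy's preferred actions.
print(softmin_posterior({"goal_A": 0.8, "goal_B": 2.4, "goal_C": 3.1}))
```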
3. Evaluation Metrics and Benchmark Results
Evaluation schemes vary with paradigm, but characteristic metrics include:
- Plan Recognition Accuracy and Inference Time: Planner-based and learning-based approaches are compared on accuracy (fraction of correct goal assignments), F1 score, and runtime per inference (Borrajo et al., 2020, Nageris et al., 31 Dec 2024).
- Goal Satisfaction and Conformance: In recommendation settings, measure the rate of recommendations satisfying goal constraints, process-trace conformance, and the rate of recovery from goal violation back to satisfaction (Agarwal et al., 2022); an illustrative computation follows this list.
- Human–Model Agreement and Paraphrase Matching: Measure “match” rate for model-generated vs gold goals using human annotations and satisfaction-based paraphrase metrics; F1 scores indicate correlation with manual judgments (e.g., F1=0.75 for automatic evaluator vs manual in UI goal identification) (Berkovitch et al., 20 Jun 2024).
- Vision–Language Modeling Gains: Reported top-1/5 accuracy improvements on verb, noun, and action anticipation benchmarks for egocentric video datasets (EK55, EGTEA), with absolute gains up to +13.69% (Roy et al., 2022).
- Robot Manipulation Success Rates: Real-world and simulation success rates (e.g., 30%→90% for out-of-distribution tasks after online reward-free adaptation) and ablation results isolating components such as Multi-Scale Temporal Hashing (Zhou et al., 29 Dec 2025).
- Narrative Goal Inference: SAGA computes Fleiss’ kappa inter-annotator agreement for goal annotation (average 0.80), F1 metrics on goal-applicability and satisfaction inference, and human-rater scores for coherence and faithfulness (Vallurupalli et al., 11 Aug 2024).
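As a concrete illustration of the recommendation-setting metrics, the sketch below computes a goal-satisfaction rate and a recovery rate over a recommended activity trace; the trace representation and goal predicate are hypothetical simplifications, not the evaluation code of the cited work.

```python
from typing import Callable, Sequence

def satisfaction_and_recovery(trace: Sequence[str],
                              satisfies_goal: Callable[[str], bool]) -> tuple:
    """Return (fraction of recommended steps satisfying the goal,
    fraction of violations followed by a later satisfying step)."""
    flags = [satisfies_goal(step) for step in trace]
    satisfaction_rate = sum(flags) / len(flags)
    violations = [i for i, ok in enumerate(flags) if not ok]
    recovered = sum(1 for i in violations if any(flags[i + 1:]))
    recovery_rate = recovered / len(violations) if violations else 1.0
    return satisfaction_rate, recovery_rate

# Hypothetical business-process trace and goal predicate.
trace = ["register", "review", "escalate", "approve", "close"]
print(satisfaction_and_recovery(trace, lambda a: a != "escalate"))  # (0.8, 1.0)
```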
These results empirically validate the performance benefits and limitations of competing Act2Goal systems.
4. Integrative and Real-Time Pipelines
Several implementations integrate recognition, forecasting, and policy selection in real time:
- Hybrid Video Activity Recognition (HAPRec): CNN-based action recognizer yields per-frame activities; symbolic plan recognizer matches observed prefixes to candidate plans, incrementally updating goal ranking. Correct goal identification rates reach 90–95% prior to action sequence completion (Granada et al., 2020).
- Goal Recognition in Human-Computer Interaction: Act2Goal approaches in UI settings map multimodal action–state traces to natural-language goals; current LMMs underperform relative to humans, pointing to a need for further fine-tuning and improved vision–language grounding (Berkovitch et al., 20 Jun 2024).
- Active Goal Recognition (AGR): Policies derived from SARSOP POMDP solvers balance observer’s task rewards, sensing costs, and penalties for incorrect goal inference; empirical evaluations confirm near-upper-bound returns with minimized observation cost (Amato et al., 2019).
- General Goal-Conditioned Manipulation: The Act2Goal world model generates imagined state sequences toward the goal under a dense/sparse decomposition; the policy integrates these with proprioception via cross-attention and supports reward-free online adaptation through hindsight goal relabeling with low-rank adaptation (LoRA). Success rates improve dramatically after minutes of autonomous adaptation on physical robots (Zhou et al., 29 Dec 2025).
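Hindsight goal relabeling, the mechanism behind the reward-free adaptation step above, can be sketched as follows; the trajectory format and the rule of taking a reached state as the relabeled goal are illustrative assumptions rather than the exact Act2Goal implementation.

```python
import random

def hindsight_relabel(trajectory, k_last: int = 1):
    """Relabel a (possibly failed) rollout by treating a state it actually
    reached as the goal, so every rollout yields 'successful' training data.

    `trajectory` is a list of (observation, action) tuples; the relabeled
    goal is drawn from the last `k_last` observations.
    """
    goal_obs, _ = random.choice(trajectory[-k_last:])
    return [{"observation": obs, "action": act, "goal": goal_obs}
            for obs, act in trajectory]

# A failed rollout still produces supervised (observation, action, goal)
# triples for fine-tuning the goal-conditioned policy (e.g. via low-rank adapters).
rollout = [("o0", "a0"), ("o1", "a1"), ("o2", "a2")]
print(hindsight_relabel(rollout))
```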
5. Limitations, Trade-offs, and Foundational Insights
Key limitations and trade-offs are documented:
- Model-Based vs Model-Free: Model-based techniques involve slow planner calls but properly exploit domain theory; model-free methods require large labeled data sets but yield rapid inference and natural handling of irrational or noisy behavior (Borrajo et al., 2020).
- Scaling and Generalization: Off-the-shelf LMMs in UI, narrative, or vision settings fail to fully capture subtle intent, goal blocking, or context-dependent goal shifts. Fine-tuning on annotated data consistently yields higher negative-class detection, faithfulness, and intentionality (Vallurupalli et al., 11 Aug 2024).
- Robustness in Continuous Domains: Actor–critic approaches outperform tabular methods in scalability, memory footprint, and resilience to noise or partial observation. Stochastic policies impose only soft penalties on missing observations and enable generalization over state–action similarity (Nageris et al., 31 Dec 2024).
- Active Sensing Cost-Benefit: AGR formulations demonstrate principled exploration–exploitation balancing, delaying high-cost observations until expected information gain justifies interruption of task progress (Amato et al., 2019).
- Reward Structuring for Multiple Objectives: Multi-goal reward decomposition in next-activity recommendation handles conflicting metrics (e.g., completion time and process outcome) through tunable per-step bonuses, thereby producing conformant yet goal-efficient traces (Agarwal et al., 2022); a toy decomposition follows this list.
- Commonsense Goal Inference: Subtle edits to action descriptions in narratives can flip inferred goals or plans. SAGA shows that fine-tuned mid-size models can outperform larger LMs on specific inference tasks given high-quality annotations (Vallurupalli et al., 11 Aug 2024).
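A toy version of the per-step reward decomposition mentioned above is sketched below; the weights and component terms are hypothetical illustrations of how conflicting objectives can be traded off, not the cited system’s actual reward function.

```python
def decomposed_reward(step_conformant: bool, goal_progress: float,
                      time_penalty: float, w_conform: float = 1.0,
                      w_goal: float = 1.0, w_time: float = 0.1) -> float:
    """Combine conflicting objectives (conformance, goal progress, completion
    time) into a single per-step reward via tunable bonuses and penalties."""
    reward = w_conform if step_conformant else -w_conform
    reward += w_goal * goal_progress   # e.g. change in predicted goal satisfaction
    reward -= w_time * time_penalty    # e.g. expected duration of this activity
    return reward

# Example: a conformant activity that advances the goal but costs 2 time units.
print(decomposed_reward(step_conformant=True, goal_progress=0.3, time_penalty=2.0))
```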
6. Future Directions and Open Challenges
The surveyed work identifies several forward paths:
- Fine-Tuning Multimodal Agents: Directly train on trajectory–goal pairs to learn robust satisfaction relations and context-dependent paraphrasing for goal descriptions (Berkovitch et al., 20 Jun 2024).
- Cross-Paradigm Systems: Hybridize planning-based, learning-based, and policy-based approaches to maximize trade-off management and robustness to noisy, ambiguous environments (Borrajo et al., 2020).
- Ambiguity-Aware Modeling: Design models that predict ranked sets of plausible goals or support reasoning about goal applicability as environments change (Vallurupalli et al., 11 Aug 2024).
- Online Policy Self-Improvement: Enable reward-free adaptation through hindsight relabeling and rapid low-rank adaptation modules for fast improvement without external supervision (Zhou et al., 29 Dec 2025).
- Expanded Evaluation: Quantify downstream benefits in personalized assistance, workflow optimization, or agent collaboration using synthesized and real-world data (Berkovitch et al., 20 Jun 2024, Agarwal et al., 2022).
Collectively, Act2Goal research continues to expand methodology and domain coverage, providing essential structure for interpretable, adaptive, and scalable agent behavior understanding.