Cost-Aware Embodied Search

Updated 26 February 2026
  • Cost-aware embodied search is the optimization of sequential perception, navigation, and action for agents operating under explicit resource constraints using POMDP frameworks.
  • It models heterogeneous costs—physical, informational, and economic—to enable principled trade-offs between performance metrics like accuracy and operational expenditure.
  • State-of-the-art approaches integrate tree search, diffusion models, and reinforcement learning to balance exploration with cost, yielding scalable and efficient policies.

Cost-aware embodied search is the principled study and optimization of sequential perception, navigation, and action selection strategies for embodied agents under explicit resource constraints. The central objective is to locate, recover, or disambiguate targets or information in partially observable environments, while accounting for heterogeneous costs—physical (e.g., energy, time, maintenance), informational (e.g., cognitive load, uncertainty), or economic (e.g., opportunity costs, service-level penalties). Rigorous cost models, unified objective functions, and empirically validated algorithms now shape the field, with substantial technical developments appearing across active search, navigation, multimodal reasoning, and interactive AI planning.

1. Fundamental Problem Formulations

Cost-aware embodied search is typically formalized as a partially observable Markov decision process (POMDP), a stochastic control model that captures the agent's uncertainty, sequential actions, and the explicit cost structure:

M = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, \Omega, R, C, \gamma \rangle

Here:

  • \mathcal{S}: latent world and agent states (e.g., positions, belief over targets, scene layout)
  • \mathcal{A}: set of available actions (navigation, sensing, querying, communication, memory retrieval)
  • \mathcal{O}: observation space (sensor readings, responses)
  • T and \Omega: environment and observation dynamics, often including partial observability and noisy measurements
  • R: sparse or shaped task rewards (e.g., for target finding or correct task completion)
  • C: explicit cost functions, which can be heterogeneous and history-dependent, modeling:
    • physical travel/sensing effort (c_\text{nav}, c_\text{sense})
    • communication (token count, delay)
    • economic spend (hardware, maintenance, energy)
    • human attention or cognitive load
  • \gamma: discount factor

Policies \pi are trained or optimized to minimize expected accumulated cost or, equivalently, to maximize expected return adjusted by cost penalties:

J(\pi) = \mathbb{E}_\pi \left[ \sum_{t=1}^{T} \gamma^t \left( r_t - C(a_t) \right) \right]

The canonical objective is to synthesize agents that reliably solve the embodied search problem at minimal, often multi-criteria, cost (Seong et al., 25 Nov 2025, Banerjee et al., 23 Feb 2026, Zhou et al., 21 Dec 2025, Truong et al., 10 Jan 2026, Seo et al., 4 Feb 2026, Banerjee et al., 2022, Cho et al., 2 Feb 2026).
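
As a concrete illustration, the objective above can be sketched in a few lines of Python; the trajectory, reward, and cost values are illustrative placeholders, not quantities from any cited paper.

```python
# Sketch of the cost-adjusted objective J(pi): a discounted sum of
# per-step task reward minus per-step action cost. All inputs are
# illustrative; a real system would roll these out from a POMDP simulator.

def cost_adjusted_return(rewards, costs, gamma=0.99):
    """Compute sum_{t=1..T} gamma^t * (r_t - C(a_t)) for one trajectory."""
    return sum(gamma ** t * (r - c)
               for t, (r, c) in enumerate(zip(rewards, costs), start=1))
```

A policy optimizer would then maximize the empirical mean of this quantity over sampled trajectories.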

2. Modeling and Quantifying Costs

State-of-the-art research emphasizes quantitative models that reflect real-world operational economics, effort, and opportunity. For commercial settings such as delivery robotics, this includes revenue, hardware amortization, energy, collision-induced maintenance, SLA penalties, and human rescue costs. CostNav, for instance, models profit per delivery as:

\text{Profit} = R - \left( \frac{C_\text{hardware} + C_\text{train}}{N_\text{runs}} + C_\text{run} \right)

where C_\text{run} encompasses energy and maintenance, empirically shown to be dominated (99.7%) by collision-induced repair costs under standard learning-based navigation (Seong et al., 25 Nov 2025).
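
As a minimal sketch, the profit model can be written down directly; the figures in the example below are made up for illustration and are not CostNav's calibrated values.

```python
def profit_per_run(revenue, c_hardware, c_train, n_runs, c_run):
    """Profit = R - ((C_hardware + C_train) / N_runs + C_run):
    fixed costs are amortized over N_runs, variable costs paid per run."""
    return revenue - ((c_hardware + c_train) / n_runs + c_run)
```

Because C_run is dominated by collision-induced maintenance, reducing collision frequency moves profit far more than shaving path length.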

In multi-modal or interactive settings, heterogeneous action costs structure the agent's choices:

C(a_t) = \begin{cases} c_\text{nav} \cdot d(\cdot) & \text{if } a_t = \text{Navigate} \\ c_\text{ask} \cdot (1 + \alpha N_\text{ask}(t)) & \text{if } a_t = \text{Ask} \\ c_\text{mem} & \text{if } a_t = \text{GetMemory} \end{cases}

(Zhou et al., 21 Dec 2025)
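
A direct transcription of this piecewise cost is straightforward; the coefficient values below (c_nav, c_ask, c_mem, alpha) are illustrative placeholders, not values from the paper.

```python
def action_cost(action, *, distance=0.0, n_ask_so_far=0,
                c_nav=1.0, c_ask=2.0, c_mem=0.5, alpha=0.5):
    """Heterogeneous per-action cost mirroring the piecewise C(a_t) above."""
    if action == "Navigate":
        return c_nav * distance                    # scales with path length
    if action == "Ask":
        return c_ask * (1 + alpha * n_ask_so_far)  # repeated asks get pricier
    if action == "GetMemory":
        return c_mem                               # flat retrieval cost
    raise ValueError(f"unknown action: {action}")
```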

In communication- and reasoning-heavy domains, explicit penalties can be imposed for both movement (e.g., path length) and messaging (e.g., token budget), as in:

C(a) = \alpha\, d(a)\, \mathbf{1}_{\{\mathrm{move}(a)\}} + \beta\, \ell(a)\, \mathbf{1}_{\{\mathrm{comm}(a)\}}

(Seo et al., 4 Feb 2026)

This modeling enables rigorous trade-off analyses and allows the synthesis of rational policies under cost-sensitive objectives.

3. Algorithmic Paradigms

Several algorithmic paradigms, spanning probabilistic planning, reinforcement learning, and neural sequence modeling, have been developed for cost-aware embodied search:

3.1. Lookahead Tree Search and Multi-Objective Planning

Monte Carlo Tree Search (MCTS) and Pareto optimization are used to construct online planners that balance information-gain against cost. CAST (Cost-aware Active Search of Sparse Targets) integrates Thompson Sampling for exploration, UCT-tuned tree search for lookahead, and Pareto-optimal lower-confidence bounds to trade off cumulative reward and cost (Banerjee et al., 2022):

  • Action selection at each step maximizes reward per unit cost on the Pareto-front.
  • Parallel decentralized agents maintain independent posteriors and share measurements asynchronously.
  • Empirically, CAST outperforms myopic and deterministic baselines as cost structures vary.
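
The Pareto-front selection step admits a compact sketch: candidates are (expected reward, expected cost) pairs, non-dominated candidates are kept, and the winner maximizes reward per unit cost. This is a simplified illustration of CAST's selection mechanism, not its implementation.

```python
def pareto_front(candidates):
    """Keep (reward, cost) pairs not dominated by any other candidate,
    where domination means >= reward at <= cost (and not identical)."""
    return [(r, c) for r, c in candidates
            if not any(r2 >= r and c2 <= c and (r2, c2) != (r, c)
                       for r2, c2 in candidates)]

def select_action(candidates):
    """Pick the Pareto-optimal candidate with the best reward per unit cost."""
    return max(pareto_front(candidates), key=lambda rc: rc[0] / rc[1])
```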

3.2. Diffusion-Based and Neural Policy Lookahead

CD-AS (Cost-aware Diffusion Active Search) replaces expensive search trees with amortized sequence modeling: a diffusion-based generative model samples H-step action plans conditioned on the agent’s current belief, while a learned return estimator scores sequences for expected information gain minus travel and sensing cost (Banerjee et al., 23 Feb 2026). To avoid optimism bias (e.g., hallucinated early target discoveries), trajectory diffusion is conditioned only on belief, and explicit distance penalties are included in guidance.
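
The plan-scoring step can be sketched as follows: each sampled H-step plan is scored by expected information gain minus a distance penalty, and the best sample wins. The function names and penalty weight are illustrative assumptions, not CD-AS internals.

```python
def score_plan(info_gains, step_lengths, c_travel=0.1):
    """Expected information gain of a plan minus a travel penalty,
    in the spirit of CD-AS guidance (the weight is a placeholder)."""
    return sum(info_gains) - c_travel * sum(step_lengths)

def best_plan(sampled_plans):
    """Among sampled plans, keep the highest-scoring one. Each plan is
    a (per-step info gains, per-step distances) pair."""
    return max(sampled_plans, key=lambda p: score_plan(*p))
```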

Similarly, LookaHES applies a nonmyopic, pathwise-sampled H-Entropy Search in which a recurrent policy (e.g., GRU/LLM) amortizes multi-turn acquisition in high-dimensional or structured action spaces (e.g., spatial search with travel costs), using cost-sensitive acquisition functions (Truong et al., 10 Jan 2026).

3.3. Reinforcement and Policy Optimization with Cost-Aware Objectives

HC-GRPO (Heterogeneous Cost-aware Group Relative Policy Optimization) directly optimizes trajectory-level cost-adjusted return in long-horizon reasoning with MLLMs. Unlike PPO, advantage estimates are computed intra-group without a critic, and the MLLM is updated to prefer reasoning+action traces that resolve ambiguity at minimal dialogue, retrieval, or navigation cost, as shown in ESearch-R1 (Zhou et al., 21 Dec 2025).
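
The critic-free advantage computation at the heart of group-relative policy optimization can be sketched as below: each trajectory's cost-adjusted return is standardized against the other rollouts sampled for the same episode, rather than against a learned baseline. This is a generic GRPO-style sketch, not ESearch-R1's exact code.

```python
def group_relative_advantages(group_returns, eps=1e-8):
    """Advantage of each trajectory = its cost-adjusted return,
    standardized within the group of rollouts for the same episode."""
    n = len(group_returns)
    mean = sum(group_returns) / n
    std = (sum((g - mean) ** 2 for g in group_returns) / n) ** 0.5
    return [(g - mean) / (std + eps) for g in group_returns]
```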

Resource-rational RL approaches, as in Sensonaut, learn or search over belief-state policies that explicitly balance information-gain, physical effort, elapsed time, and error penalties (Cho et al., 2 Feb 2026).

3.4. Structured Cost-Aware Reasoning with LLMs

PCE (Planner–Composer–Evaluator) parses LLM chains-of-thought into explicit scenario trees, assigning cost-sensitive utility scores to each action-hypothesis pair by combining scenario likelihood, conditional gain, and estimated execution cost. This cost-aware selection avoids unnecessary exploration or communication, yielding efficient plans in multi-agent, partially observable environments (Seo et al., 4 Feb 2026).
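
The utility scoring described above can be sketched as: each (action, scenario) pair receives a likelihood-weighted gain minus a cost term, and utilities are summed per action. The λ weight and the dictionary layout are illustrative assumptions, not PCE's actual interface.

```python
def scenario_utility(likelihood, gain, cost, lam=1.0):
    """Cost-sensitive utility of an action under one hypothesized scenario."""
    return likelihood * gain - lam * cost

def best_action(scored):
    """Pick the action with the highest summed utility across scenarios.
    `scored` maps action name -> list of (likelihood, gain, cost) tuples."""
    return max(scored,
               key=lambda a: sum(scenario_utility(*s) for s in scored[a]))
```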

4. Empirical Evaluation and Benchmarks

Empirical studies have benchmarked cost-aware embodied search in diverse domains:

| Domain | Baseline Metric | Cost-Aware Metric | Dominant Cost Driver |
|---|---|---|---|
| Delivery Robotics | Success rate | Profit, per-run cost | Maintenance (99.7%) |
| AI2-THOR Navigation | Success, steps | Total Task Cost (TTC) | Navigation; Ask > Memory |
| Geospatial Search | Final value | Steps, travel, cost | Travel distance |
| Audiovisual Search | Accuracy, time | Effort, error cost | Head turns, displacement |
| Multi-agent Active Search | Recovery rate | Total time/cost | Sensing vs. travel |

For instance, in CostNav, a policy with 43% SLA compliance is not commercially viable: expected profit per run is −30.009 USD, with maintenance from collisions accounting for almost all loss (Seong et al., 25 Nov 2025). In ESearch-R1, cost-sensitive MLLM agents halve operational cost while improving task success under ambiguous instruction following relative to ReAct-based agents (Zhou et al., 21 Dec 2025).

Empirical ablations reveal that naive success or completion rates fail to indicate commercial or operational utility unless cost structure is explicit; marginal gains in success can be offset by unsustainable operational spend.

5. Actionable Guidelines for Designing Cost-Aware Embodied Search Systems

Research has yielded domain-agnostic and domain-specific recommendations:

  • Explicitly model all relevant fixed and variable costs (hardware, training, energy, maintenance, human-in-the-loop) with parameters derived or calibrated from field data (Seong et al., 25 Nov 2025).
  • Design cost functions that reflect true operational resource usage. For heterogeneous actions (navigation, communication, memory), weight each according to relative effort or opportunity cost (Zhou et al., 21 Dec 2025, Seo et al., 4 Feb 2026).
  • Use scenario trees or structured CoT-to-action planners to balance uncertain exploration against cost-limited exploitation (Seo et al., 4 Feb 2026).
  • Leverage recurrence or sequence modeling (diffusion, GRUs, transformers) for amortized long-horizon lookahead, especially when tree search is computationally intractable (Banerjee et al., 23 Feb 2026, Truong et al., 10 Jan 2026).
  • Employ group-wise policy optimization and relative advantages to stably update policies in high-dimensional, cost-heterogeneous reasoning (Zhou et al., 21 Dec 2025).
  • In multi-agent settings, enable decentralized, asynchronous sharing of observations and schedule planning in parallel (Banerjee et al., 2022, Banerjee et al., 23 Feb 2026).
  • Tune key hyperparameters (cost ratios, Lagrange multipliers, scenario depth) via ablation, and re-scale inputs and outputs for robust GP or neural surrogates (Truong et al., 10 Jan 2026, Seo et al., 4 Feb 2026).
  • Regularly validate cost models and agent trajectories with empirical/human-in-the-loop experiments to capture unmodeled error drivers (e.g., occlusion, distractors, perception limits) (Cho et al., 2 Feb 2026).

6. Impact, Limitations, and Future Directions

Cost-aware embodied search shifts the focus from raw task success metrics to economically rational, operationally viable, and resource-efficient behavior. This enables principled comparison among rule-based, imitation learning, cost-penalized reinforcement learning, and human-in-the-loop approaches, and helps prioritize interventions (e.g., collision avoidance over path length minimization in navigation).

Current limitations include:

  • Sensitivity to under-modeled or dynamic costs (hardware aging, irregular human attention, or reward misspecification)
  • Computational challenges for exhaustive lookahead in large or continuous domains
  • Inference latency of large-parameter MLLMs in cycle-critical applications (Zhou et al., 21 Dec 2025)
  • Open questions regarding regret bounds and theoretical guarantees in heterogeneous multi-agent cost settings (Banerjee et al., 2022)

Future directions emphasize meta-learning of cost functions, uncertainty-aware sensor integration, lightweight policy backbones for edge deployment, integration of multi-objective path planning under safety constraints, and formal economic viability certification (Seong et al., 25 Nov 2025).

7. Representative Algorithms and Comparative Results

To illustrate the algorithmic landscape and empirical advances, key approaches are summarized below.

| Approach | Principal Components | Reported Improvements | Reference |
|---|---|---|---|
| CostNav Benchmark | Micro-navigation, profit model | Reveals divergence in success vs. profit | (Seong et al., 25 Nov 2025) |
| HC-GRPO (ESearch-R1) | Critic-free group relative PO | ~50% lower operational cost vs. ReAct | (Zhou et al., 21 Dec 2025) |
| CD-AS | Diffusion, gradient guidance | Optimal recovery with 30% lower wall-clock | (Banerjee et al., 23 Feb 2026) |
| LookaHES | Multi-step HES, neural policy | Finds global max faster under cost constraints | (Truong et al., 10 Jan 2026) |
| Sensonaut | POMDP, leaky Bayes fusion | Human-like adaptation to search effort | (Cho et al., 2 Feb 2026) |
| PCE | CoT scenario tree, cost utility | Up to 30% reduction in steps vs. LLM baselines | (Seo et al., 4 Feb 2026) |
| CAST | MCTS + TS + Pareto front | Lowest total cost in multi-agent active search | (Banerjee et al., 2022) |

These results mark cost-aware embodied search as a distinct, rapidly maturing research area that combines economic rigor with statistical learning and search, enabling robust, resource-rational embodied agents across diverse environments.
