Real-World Reinforcement Learning of Active Perception Behaviors (2512.01188v1)
Abstract: A robot's instantaneous sensory observations do not always reveal task-relevant state information. Under such partial observability, optimal behavior typically involves explicitly acting to gain the missing information. Today's standard robot learning techniques struggle to produce such active perception behaviors. We propose a simple real-world robot learning recipe to efficiently train active perception policies. Our approach, asymmetric advantage weighted regression (AAWR), exploits access to "privileged" extra sensors at training time. The privileged sensors enable training high-quality privileged value functions that aid in estimating the advantage of the target policy. Bootstrapping from a small number of potentially suboptimal demonstrations and an easy-to-obtain coarse policy initialization, AAWR quickly acquires active perception behaviors and boosts task performance. In evaluations on 8 manipulation tasks on 3 robots spanning varying degrees of partial observability, AAWR synthesizes reliable active perception behaviors that outperform all prior approaches. When initialized with a "generalist" robot policy that struggles with active perception tasks, AAWR efficiently generates information-gathering behaviors that allow it to operate under severe partial observability for manipulation tasks. Website: https://penn-pal-lab.github.io/aawr/
Explain it Like I'm 14
What this paper is about
This paper is about teaching robots to “look around on purpose” so they can find what they need before acting. This is called active perception. For example, if a robot’s camera can’t see a toy because it’s hidden on a shelf, the robot should first move its camera to search smartly, then reach for the toy. The authors introduce a simple way to train robots to do this in the real world, called AAWR, that learns faster and works better than common methods.
The main questions the paper asks
The authors focus on four main questions:
- How can a robot learn to gather the right information (by moving and looking) when its sensors don’t show everything?
- Can we train these “search before act” behaviors efficiently on real robots, not just in simulation?
- Is there a smart way to use extra sensors during training to guide learning, even if those sensors won’t be on the robot later?
- Will this help today’s generalist robot policies (big pre-trained controllers) that often fail at search tasks?
How the method works (in everyday language)
Think of training like practice games with a coach:
- During practice, the coach can see everything (like the exact location of the toy), even if the player (the robot) only sees a camera image.
- The coach gives better feedback about which moves were truly good, because the coach knows the hidden information.
- On game day, the player must play without the coach’s extra information, but the earlier coaching helped them learn the right habits.
That’s the idea behind AAWR (Asymmetric Advantage Weighted Regression):
- “Asymmetric” means that during training, the “judge” (a value/critic network that scores actions) is allowed to use extra “privileged” sensors or labels (like object positions or segmentation masks). The policy (the robot’s controller) only uses the normal sensors it will have at test time (e.g., wrist camera, joint angles).
- “Advantage Weighted Regression” is a smart kind of imitation. Imagine you copy actions from a dataset, but you copy “good” actions more than “bad” ones. The method learns to assign bigger weights to actions that led to better outcomes. The advantage is like “how much better was this action than average in this situation?”
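For readers who want the underlying formula, the standard advantage-weighted regression objective (a generic restatement following the AWR literature the method builds on, with the asymmetric twist that the advantage comes from privileged-input critics) looks roughly like this:

```latex
% Advantage-weighted regression, generic form (hedged restatement):
% copy each dataset action, weighted by how much better than average it was.
\pi_{\mathrm{new}}
  = \arg\max_{\pi}\;
    \mathbb{E}_{(s,\,o,\,a)\sim\mathcal{D}}
    \Big[\, \exp\!\big(\tfrac{1}{\beta}\, A(s,a)\big)\, \log \pi(a \mid o) \,\Big]
```

Here $o$ is the robot's regular observation (what the policy sees), $s$ is the privileged state available only during training (what the critics see), $A(s,a)$ is the privileged advantage estimate, and $\beta$ is a temperature controlling how strongly good actions are favored.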
How training happens:
- Start with some demonstrations, even if they’re not perfect, and a basic policy (like a generalist robot controller).
- Train offline (from recorded data) so the critic learns to judge actions using the extra sensors, and the policy learns to favor high-scoring actions using its normal inputs.
- Optionally, fine-tune online (the robot tries things in the real world and keeps learning) to improve search behavior.
- At deployment, the robot runs only the policy with its regular sensors—no extra sensors needed.
Why this helps: In partially observable tasks (you can’t see everything), it’s hard to tell which actions were truly good. Letting the critic peek at extra information during training makes its feedback much more accurate, so the policy learns the right search behaviors faster and more reliably.
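To make the training recipe concrete, here is a minimal, hedged sketch of one offline update step in this asymmetric style. It is illustrative PyTorch code, not the authors' implementation: `q_net`, `v_net`, `policy`, and the `policy.log_prob(obs, action)` interface are assumed names, and target networks are omitted for brevity.

```python
# Minimal, hedged sketch of one AAWR-style offline update (illustrative, not the authors' code).
# Assumptions: `q_net(s, a)` and `v_net(s)` are privileged critics, `policy.log_prob(obs, a)`
# is an assumed interface for the partially observed policy; target networks are omitted.
import torch
import torch.nn.functional as F


def expectile_loss(diff, tau=0.7):
    # IQL-style expectile regression: |tau - 1(diff < 0)| * diff^2
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def aawr_update(batch, q_net, v_net, policy, beta=3.0, gamma=0.99, adv_clip=100.0):
    s_priv, obs, action, reward, s_priv_next, done = batch

    # 1) Asymmetric critics: condition on the privileged state (e.g., object pose or masks).
    with torch.no_grad():
        target_q = reward + gamma * (1.0 - done) * v_net(s_priv_next)
    q_loss = F.mse_loss(q_net(s_priv, action), target_q)

    with torch.no_grad():
        q_val = q_net(s_priv, action)
    v_loss = expectile_loss(q_val - v_net(s_priv))

    # 2) Policy: sees only deployment-time observations and imitates dataset actions,
    #    weighted by the privileged advantage estimate (clipped exponential weights).
    with torch.no_grad():
        advantage = q_val - v_net(s_priv)
        weight = torch.clamp(torch.exp(beta * advantage), max=adv_clip)
    policy_loss = -(weight * policy.log_prob(obs, action)).mean()

    return q_loss, v_loss, policy_loss
```

At deployment only `policy` runs; the privileged critics and the extra sensors they relied on are discarded, which is the point of the asymmetric setup.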
What they found and why it matters
Across 8 tasks (in simulation and on 3 real robots), the method learned strong active perception behaviors:
- It beat standard behavior cloning (plain imitation) and a version of the same algorithm without privileged info.
- In a simulated “find-then-pick” task with a wrist camera, AAWR was the only method to reach nearly 100% success by learning to scan the workspace first, then move to grasp.
- In a real “blind pick” task (the robot mainly used its joint sensors), AAWR showed big gains in grasp and pick success after fine-tuning online.
- For shelf and cabinet search tasks in the real world, AAWR quickly learned sensible scan patterns (e.g., zooming out to see multiple shelves, sweeping up/down and left/right, checking likely hiding spots) and then handed off to a generalist policy to grasp. It was both more successful and faster than baselines like:
- a non-privileged learner,
- plain imitation,
- an “exhaustive search” script (which was thorough but slow),
- and a vision-language model (VLM) prompting approach that tried to guide a generalist policy with language.
Why it matters: Many real-world robot failures come from not seeing the right thing at the right time. Teaching robots to actively gather information before acting makes them more reliable in messy, cluttered environments—like homes and warehouses—without requiring expensive, perfect sensors at test time.
What this could change in the future
- Smarter, more reliable robots at home and work: Robots could check drawers, scan shelves, or move around obstacles to see better—then act. This makes them useful in more realistic settings.
- More efficient training in the real world: By using extra sensors or labels only during training, we reduce the need for super-accurate simulation and still get strong real-world behavior.
- Better use of generalist policies: AAWR can “handhold” big pre-trained robot policies by doing the search part first, then letting the generalist finish the task. Over time, we could directly fine-tune those big policies to include search skills.
- Beyond manipulation: The same idea—learn with privileged feedback, act with normal sensors—could help drones, self-driving, or any task where seeing everything is hard.
Simple limitations and next steps:
- You still need some extra information during training (like object masks or positions), which might take effort to collect.
- Current experiments often “switch” from a search policy to a grasp policy; a future goal is a single end-to-end policy that does both.
- Longer, more complex tasks will need even stronger memory and planning, which is a promising direction.
In short, this paper shows a practical recipe for teaching robots to look before they leap—and it works in the real world.
Knowledge Gaps
Unresolved knowledge gaps, limitations, and open questions
Below is a single, actionable list of what remains missing, uncertain, or unexplored in the paper. Each point is phrased to guide follow‑up research.
- Finite-sample theory: Provide convergence guarantees, sample complexity bounds, and error analyses for AAWR with function approximation and off-policy data, beyond the idealized objective derivation.
- Privileged signal fidelity: Quantify AAWR’s sensitivity to privileged sensor errors (noise, misalignment, latency, false detections), including systematic robustness experiments and ablations.
- Privileged vs. unprivileged mismatch: Formalize conditions under which noisy or non-Markov privileged observations reliably approximate state in AAWR; extend the theoretical derivation to this setting and analyze aliasing effects.
- Hyperparameter sensitivity: Systematically study the impact of the expectile parameter (in IQL), the advantage-weighting temperature, and weight-clipping choices on stability and performance.
- Critic choice ablations: Compare IQL-based critics to alternatives (e.g., CQL, AWAC, TD3+BC, conservative value learning) under identical settings to identify the most reliable critic for AAWR in POMDPs.
- Memory design: Evaluate different agent-state architectures (history window size, GRU/LSTM/Transformer, belief-state networks) and quantify how memory capacity affects active perception performance in longer-horizon tasks.
- Belief tracking: Investigate explicit belief-state estimation (e.g., learned filters) within AAWR and compare against recurrent policies that implicitly track beliefs, including conditions where SAWR might suffice.
- Reward design constraints: Demonstrate AAWR performance with sparse rewards or preference-based feedback in real active perception tasks to reduce reliance on dense instrumentation.
- Safety in online finetuning: Incorporate safe exploration (constraints, risk-aware objectives) and report safety incidents (collisions, near misses) during online training on real robots.
- Switching logic reliability: Analyze the handoff criterion to the generalist policy (e.g., detector confirmations across intervals), quantify false positives/negatives, and design robust switching mechanisms under detection uncertainty.
- End-to-end fine-tuning of generalists: Test AAWR for direct fine-tuning of foundation VLA policies (π0) instead of relying on helper policies; study interference, catastrophic forgetting, and task retention.
- Integrated hierarchical control: Replace heuristic switching with learned hierarchical policies (e.g., options, subgoals) that jointly optimize search and manipulation; compare to hand-engineered exhaustive scans.
- Privileged modality selection: Identify the minimal set of privileged signals required per task; perform ablations across masks, bounding boxes, depth, tactile, audio, and LLM outputs to quantify contribution and cost.
- Scaling privileged data acquisition: Develop strategies to obtain privileged labels cheaply (weak supervision, self-supervision, auto-annotation via VLMs or simulation), and measure annotation cost vs. performance gains.
- Generalization under domain shift: Evaluate AAWR across different scenes, lighting, object sets, occlusion patterns, and robot embodiments; report cross-domain and cross-robot transfer results.
- Long-horizon scalability: Stress-test AAWR on tasks with compounded information-gathering (opening/closing drawers, multi-room search, mobile manipulation) and measure degradation over horizon.
- Interactive perception breadth: Extend beyond “scan-to-find” to actions that actively alter occlusions (opening doors, moving clutter) and quantify how AAWR handles contact-rich, compliant interactions and tactile feedback.
- Open-world object search: Test AAWR without predefined target classes, using open-vocabulary detectors or VLMs as privileged signals; report robustness to unknown objects and detector drift.
- Path planning and spatial memory: Integrate spatial memory and cost-aware path optimization (coverage planning) into AAWR; measure search efficiency vs. exhaustive baselines with trajectory length and energy metrics.
- Baseline coverage: Add strong POMDP baselines (e.g., information-gain RL, world models/MBRL, uncertainty-aware planners) to isolate AAWR’s advantages over task-agnostic active vision.
- Offline-to-online mixing strategy: Explore the ratio and scheduling of offline/online updates, replay prioritization, and data freshness; quantify sample efficiency across budgets and environments.
- Compute and latency: Report training/inference time, on-robot latency, and resource needs; study how computational constraints impact AAWR’s deployment viability.
- Failure mode taxonomy: Provide a detailed analysis of failure cases (tracking loss, suboptimal paths, manipulation slips), align them with diagnostics (advantage miscalibration, detector errors), and propose mitigation strategies.
- Metric standardization: Validate the custom “Search” rubric with inter-rater agreement, add confidence intervals to real-world metrics, and propose a benchmark suite for active perception under partial observability.
- Distillation vs. AAWR interplay: Investigate hybrid pipelines that train privileged experts then continue with AAWR for online improvement; measure how distillation initializations impact exploration and final performance.
- Advantage weighting robustness: Study over-weighting of noisy advantages, alternative weighting schemes (e.g., tempered exponentials, clipped advantages), and the effect on stability in off-policy settings.
- Calibration of “uncalibrated cameras”: Assess how extrinsic/intrinsic camera calibration quality impacts AAWR, especially in cluttered scenes requiring precise viewpoint control.
- Closed-loop manipulation integration: Move beyond open-loop grasping by incorporating contact feedback and closed-loop controllers; quantify improvements in completion when combined with AAWR-driven search.
- Multi-task generalist active perception: Train a single AAWR policy across multiple tasks and environments; evaluate task interference, transfer, and scaling to broader skill repertoires.
Glossary
- Active perception: Information-gathering behaviors where an agent moves sensors or interacts with the environment to improve sensing for a task. "We propose a simple real-world robot learning recipe to efficiently train active perception policies."
- Advantage: The performance gain of taking an action compared to the policy’s baseline value at a state. "aid in estimating the advantage of the target policy."
- Advantage Weighted Regression (AWR): A policy iteration algorithm that updates a policy via behavior cloning weighted by estimated advantages. "Advantage weighted regression (AWR) \citep{neumann2008fitted,peng2019advantage} is a policy iteration algorithm for fully observed MDPs"
- Agent state: A compact, recurrent representation of history used to condition policies in POMDPs. "it is common to consider an “agent state” that is recurrent"
- Asymmetric Advantage Weighted Regression (AAWR): An AWR variant that uses privileged information for critics during training while the policy receives partial observations. "We call this approach Asymmetric AWR (AAWR)."
- Asymmetric learning paradigm: Training regime where extra state or sensors are available to critics during training but not at deployment. "We consider the asymmetric learning paradigm in which the environment state is available during training (offline or online) but not during policy deployment."
- Bellman equations: Recursive equations defining the fixed point for value functions under a given policy. "we show that the privileged value functions are the fixed point of the Bellman equations described by IQL's objective."
- Behavior cloning (BC): Supervised imitation learning that mimics actions from demonstrations without considering reward. "Next, we compare against standard behavior cloning (BC), which performs imitation learning on the successful trajectories in the dataset."
- Behavior policy: The (mixture) policy that generated the data used to estimate advantages and train the target policy. "The behavior policy typically corresponds to the mixture of all past policy iterates that generated the dataset of online interactions ."
- Critic: A learned function (e.g., Q-function or value function) that evaluates actions or states to guide policy updates. "we give critics privileged access to object detectors to train open-loop policies"
- Discount factor: A scalar that weights future rewards relative to immediate rewards in the return. "where the discount factor weights the importance of future rewards."
- Distillation: Transferring knowledge from a privileged expert policy to a non-privileged student policy. "we compare AAWR against Distillation \citep{chen2023sequential}, which first trains a privileged expert policy and then distills it into a partially observed policy."
- Equivalent MDP: Reformulation of a POMDP into a fully observed MDP by augmenting the state with the agent state. "the POMDP can be transformed into an equivalent MDP whose state includes both the environment state and the agent state"
- Expectile regression: A regression objective used to learn value functions by emphasizing higher returns, as in IQL; its standard form is restated after this glossary. "The networks are trained using IQL's expectile regression objective, see \cref{app:aawr_implementation} for details."
- Generalist robot policy: A broad, foundation model-based policy trained on diverse teleoperation data that may struggle with active perception. "When initialized with a “generalist” robot policy that struggles with active perception tasks, AAWR efficiently generates information-gathering behaviors"
- Implicit Q-Learning (IQL): An offline/offline-to-online Q-learning algorithm that learns value functions via expectile regression. "we choose IQL \citep{kostrikov2022offline}, a well known Q-learning algorithm known for its effectiveness in offline RL, offline-to-online RL finetuning \citep{park2024ogbench} and real robot RL \citep{feng2023finetuning} tasks."
- Initial state density: The distribution over initial environment states in a POMDP. "and the initial state density."
- Kullback–Leibler (KL) constraint: A bound on policy divergence from the behavior policy during updates. "under KL constraint."
- Lagrangian relaxation: Converting a constrained optimization into an unconstrained one using a multiplier. "the Lagrangian relaxation with Lagrangian multiplier of the following constrained optimization problem"
- Markov Decision Process (MDP): A fully observed decision process where the current state suffices for optimal actions. "Advantage weighted regression (AWR) \citep{neumann2008fitted,peng2019advantage} is a policy iteration algorithm for fully observed MDPs"
- Monte Carlo estimation: Estimating values or returns by averaging sampled trajectories. "learning a value function with Monte Carlo estimation."
- Observation density: The distribution of observations conditioned on environment states. "the observation density"
- Occupancy measure: The state (or joint environment-state and agent-state) distribution induced by a policy, used in expectations for policy updates.
- Off-policy: Learning from data not generated by the current policy being optimized. "which improves sample efficiency by better leveraging off-policy samples."
- Offline RL: Reinforcement learning using a fixed dataset without further environment interaction. "known for its effectiveness in offline RL, offline-to-online RL finetuning \citep{park2024ogbench} and real robot RL \citep{feng2023finetuning} tasks."
- Offline-to-online RL: Pretraining on offline data followed by online finetuning with interaction. "We follow the offline-to-online RL paradigm \citep{nair2020awac,lee2022offline,kostrikov2022offline,feng2023finetuning,nakamoto2023cal,yu2023actor}"
- Open-loop policy: A policy that executes actions without closed-loop feedback on observations (or with limited sensing). "train open-loop policies that only receive proprioception and initial object positions."
- Partially Observable Markov Decision Process (POMDP): A decision process where the agent only receives partial observations of the true state. "are naturally modelled by partially observed Markov decision processes (POMDPs)~\cite{kaelbling1998planning}"
- Policy improvement: Increasing expected return by updating a policy using advantage or value estimates. "maximizes the expected surrogate improvement"
- Policy iteration: Alternating evaluation and improvement steps to converge to an optimal policy. "Advantage weighted regression (AWR) \citep{neumann2008fitted,peng2019advantage} is a policy iteration algorithm"
- Privileged information: Extra training-time-only signals (e.g., state or sensors) unavailable at deployment to help learning under partial observability. "exploiting privileged information \citep{vapnik2009new} during training time to improve policy training"
- Privileged sensors: Additional sensing modalities available during training to critics/value functions but not at test time. "exploits access to “privileged” extra sensors at training time."
- Proprioception: Internal sensing of the robot’s joints, positions, and forces. "ranging from entirely blind robots operating purely from proprioception"
- Q-function: A critic estimating expected return for state-action pairs under a policy. "by learning a Q-function with TD learning"
- Q-learning: Temporal-difference learning of action-value functions to derive optimal policies. "a well known Q-learning algorithm known for its effectiveness in offline RL"
- Reward density: The distribution of rewards conditioned on states and actions. "the reward density"
- Sim-to-real transfer: Transferring policies learned in simulation to real-world robots. "Moreover, sim-to-real transfer is hard for such tasks"
- Surrogate improvement: A proxy objective for policy improvement used in AWR. "maximizes the expected surrogate improvement"
- TD learning: Temporal-difference methods that bootstrap value estimates from subsequent predictions. "by learning a Q-function with TD learning"
- TD(λ): A temporal-difference algorithm that blends multi-step returns with parameter λ. "either a return-based estimate or a TD($\lambda$) estimate of the advantage"
- Transition density: The dynamics model specifying state transitions given current state and action. "the transition density"
- Value function: A critic estimating expected return from states (or agent states). "privileged value functions that aid in estimating the advantage of the target policy."
- Variational Information Bottleneck (VIB): A regularization approach that constrains information flow, here used to control privileged inputs to the policy. "We also compare against a variational information bottleneck approach (VIB) \citep{hsu2022visionbased}"
- Vision-Language Model (VLM): A multimodal model that interprets images and text to generate instructions or actions. "a VLM+$\pi_0$ variant that queries the Gemini-2.5 VLM \citep{team2023gemini}"
- Vision-Language-Action (VLA) policy: A foundation model policy that integrates visual, language, and action modalities for robotic control. "Handholding Foundation VLA Policies for Real Active Perception tasks."
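For reference, the expectile objective mentioned under "Expectile regression" and "Implicit Q-Learning (IQL)" above has a standard form (restated here from the IQL literature, not copied from this paper's appendix):

```latex
% Expectile regression loss used by IQL (standard form):
L_2^{\tau}(u) = \lvert \tau - \mathbb{1}(u < 0) \rvert \, u^{2},
\qquad
\mathcal{L}_V = \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[ L_2^{\tau}\big( Q(s,a) - V(s) \big) \Big],
\quad \tau \in (0.5, 1)
```

With τ > 0.5, the value function is pushed toward the upper expectiles of the Q-values, which is what "emphasizing higher returns" refers to; in the asymmetric setting both Q and V take the privileged state as input.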
Practical Applications
Immediate Applications
The following applications can be deployed now using the paper’s AAWR method and workflows demonstrated on real robots and simulated tasks. Each item includes its sector(s), potential tool/product/workflow, and key assumptions/dependencies that impact feasibility.
- Active Perception Add-on for Warehouse Picking Robots
- Sectors: robotics, logistics/warehouse, manufacturing
- Tool/Product/Workflow: A modular “search-before-grasp” policy that scans bins/shelves to reveal occluded items, then hands off to an existing grasping policy (e.g., a foundation VLA); packaged as a ROS2-compatible AAWR module with a policy switcher and detector hooks.
- Assumptions/Dependencies:
- Access to privileged signals during training (e.g., object masks/bounding boxes, external cameras, or rough object location annotations) from tele-op logs or limited instrumentation.
- Reliable object detection at deployment for triggering handoff; safety-certified motion control in clutter.
- Limited on-robot compute for recurrent agent state and detectors; reward functions to supervise search behavior.
- Retail Shelf Auditing and Restocking Assistance
- Sectors: retail, robotics, computer vision
- Tool/Product/Workflow: A mobile-manipulation robot that actively scans shelves (vertical and horizontal sweeps) to locate SKUs and verify placement/availability; switches to a restocking pick policy when an item is found.
- Assumptions/Dependencies:
- Training data with privileged annotations (masks/bboxes or staff-verified item locations).
- Detector performance robust to real store lighting/occlusions; integration with store inventory systems.
- Hospital Supply Room “Find-and-Fetch” Robots
- Sectors: healthcare, robotics
- Tool/Product/Workflow: A hospital assistant robot that searches shelves, cabinets, and drawers to locate supplies in clutter and hand off to a grasp policy; supports staff through time-critical retrieval tasks.
- Assumptions/Dependencies:
- Privileged training signals from staged environments or annotated videos; clinical safety, sterility, and privacy constraints; reliable detection of medical supplies.
- Home Service Robot “Find Lost Item” Skill
- Sectors: consumer robotics, daily life
- Tool/Product/Workflow: A search skill for home robots that scans bookshelves, drawers, and floors to locate small objects (keys, toys, remotes), then hands off to grasp or pointing behaviors.
- Assumptions/Dependencies:
- Privileged training data collected at setup or via user annotation; robust detectors for household items; failure-safe motion in tight spaces.
- Blind Grasping via Proprioception When Cameras Are Occluded or Unavailable
- Sectors: manufacturing, robotics
- Tool/Product/Workflow: An AAWR-trained open-loop pick policy using joints + initial object position estimates to recover from camera failures or heavy occlusions (as demonstrated in “Blind Pick”).
- Assumptions/Dependencies:
- Initial object location estimates available; repeatable fixtures; reward shaping for grasp success; safety interlocks for open-loop motion.
- Foundation VLA “Perception Shepherd” Wrapper
- Sectors: software for robotics, generalist manipulation
- Tool/Product/Workflow: A thin helper module that runs AAWR-trained active perception to guide a generalist policy (π0) to a good viewpoint, then switches control; includes policy switching logic and detector integration as in the paper’s handoff framework (a minimal sketch of such switching logic appears after this list).
- Assumptions/Dependencies:
- Access to π0 or other grasping skills; detector reliability thresholds to trigger switch; small offline datasets with suboptimal demos sufficient for AAWR.
- Academic Teaching and Benchmarking Toolkit for Real-World POMDP RL
- Sectors: academia, education
- Tool/Product/Workflow: Course-ready AAWR/IQL code, datasets, and tasks (bookshelves/cabinets, blind pick, simulated camouflaged objects) for labs teaching active perception, offline-to-online RL, and privileged critics.
- Assumptions/Dependencies:
- Lab access to robot arms or simulated environments; minimal instrumentation (uncalibrated RGB); ground-truth or annotated privileged signals for training.
- Inspection Robots for Industrial Facilities (e.g., racks, panels, small enclosures)
- Sectors: energy/utilities, industrial inspection, robotics
- Tool/Product/Workflow: An active scanning skill for locating indicators/parts behind clutter and occlusions (gauges, labels, connectors), followed by a task-specific manipulation or reporting step.
- Assumptions/Dependencies:
- Privileged training signals from commissioning; robust vision in harsh lighting; safe paths constrained by equipment layouts.
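As a concrete illustration of the detector-gated handoff mentioned in the "Perception Shepherd" and warehouse-picking items above, here is a minimal, hedged sketch. The gym-style environment interface, dict observation with an `"image"` key, `detector` confidence score, and thresholds are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a search-to-grasp handoff loop (illustrative names, not the paper's code).
# Assumptions: `search_policy` is an AAWR-trained active perception policy,
# `grasp_policy` is a generalist manipulation policy (e.g., a VLA), `detector` returns a
# confidence score for the target object, and `env` follows a gym-style step/reset API.

def run_episode(env, search_policy, grasp_policy, detector,
                confirm_steps=3, conf_threshold=0.8, max_steps=200):
    obs = env.reset()
    hits, handed_off = 0, False
    for _ in range(max_steps):
        # Phase 1: actively move the camera until the detector confirms the target;
        # Phase 2: once confirmed, hand control to the generalist policy.
        action = grasp_policy(obs) if handed_off else search_policy(obs)
        obs, _, done, info = env.step(action)
        if not handed_off:
            hits = hits + 1 if detector(obs["image"]) > conf_threshold else 0
            handed_off = hits >= confirm_steps  # require repeated confirmations
        if done:
            break
    return info
```

Requiring several consecutive detections before handing off is one simple way to filter spurious detections; the paper's actual switching criterion may differ.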
Long-Term Applications
The following applications need further research, scaling, or productization (e.g., longer horizons, reliability, safety, integrated tooling).
- End-to-End Finetuning of Foundation VLA Models with AAWR to Imbue Memory and Active Perception
- Sectors: robotics, software/ML platforms
- Tool/Product/Workflow: Directly fine-tune generalist VLA policies with AAWR (rather than handoff), using privileged critics to teach viewpoint selection, scanning, and fixation; integrates with robot learning stacks.
- Assumptions/Dependencies:
- Access to large-scale real-world datasets with privileged signals; stable finetuning pipelines; safety/compliance validation.
- Multi-Robot Coordinated Active Perception in Large Facilities
- Sectors: logistics/warehouse, manufacturing, smart buildings
- Tool/Product/Workflow: Teams of robots dividing search spaces and sharing privileged training signals to learn efficient global search policies; orchestration services for task allocation and map memory.
- Assumptions/Dependencies:
- Reliable multi-robot communication; fleet management; dataset breadth for generalization; robust detectors across zones.
- Surgical and Endoscopic Robotics with Active Viewpoint Control
- Sectors: healthcare, surgical robotics
- Tool/Product/Workflow: Active perception to optimize camera/endoscope viewpoints under occlusions (tissue, fluids), assisting surgeons with consistent visibility of target anatomy; could support autonomous camera holding.
- Assumptions/Dependencies:
- Regulatory approval; simulation-to-real transfer of privileged signals; safety and domain-specific reward design; high-fidelity sensing.
- Drones for Search-and-Rescue with Task-Centric Active Perception
- Sectors: public safety, defense, disaster response
- Tool/Product/Workflow: Active scanning around debris or complex terrain to locate persons or specific objects; task-centric AAWR policies that optimize success rather than generic information gain.
- Assumptions/Dependencies:
- Privileged training via annotated aerial datasets; robust detectors under weather/lighting; navigation safety in GPS-denied or cluttered environments.
- Representation Learning of Privileged Features (Foundation Model Outputs as Privileged Signals)
- Sectors: ML research, robotics
- Tool/Product/Workflow: Use segmentation, detection, language grounding, or radiance fields as privileged input during training; learn compact features to supervise partially observed policies.
- Assumptions/Dependencies:
- Access to high-quality foundation models; consistent labeling across domains; scalable training on real robot data.
- Long-Horizon Household or Facility Tasks Requiring Multi-Stage Information Gathering
- Sectors: consumer robotics, enterprise robotics
- Tool/Product/Workflow: Policies that search across rooms/shelves/drawers with memory of prior failures, plan revisits, and coordinate with manipulation; integrated task planning and AAWR-based perception modules.
- Assumptions/Dependencies:
- Strong recurrent/agent-state design; reliable switching between search/manipulation; extended training budgets; robust reward shaping for multi-step tasks.
- Standardization and Policy Guidance for Privileged Training Sensors in Real-World RL
- Sectors: policy/regulation, industry standards
- Tool/Product/Workflow: Best-practice guidelines on privacy, data retention, sensor placement, and annotation protocols for privileged training data (e.g., additional cameras used only during training, not deployment).
- Assumptions/Dependencies:
- Cross-industry collaboration; clear privacy frameworks; validation that privileged sensors are retired post-training; auditability.
- Shared Benchmarks and Datasets for Active Perception in Partially Observed Manipulation
- Sectors: academia, open-source community
- Tool/Product/Workflow: Public tasks (shelves, cabinets, blind pick) with standardized metrics (search, completion, latency) to compare AAWR, SAWR, BC, and other baselines across robots and sensors.
- Assumptions/Dependencies:
- Community buy-in; consistent data schemas; tooling to capture privileged signals ethically; hardware diversity to validate generalization.