Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

Published 21 May 2026 in cs.LG | (2605.22814v1)

Abstract: Exploration is a prerequisite for learning useful behaviors in sparse-reward, long-horizon tasks, particularly within 3D environments. Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent's predictive model of the world and reality. However, translating this intrinsic motivation to complex, photorealistic environments remains difficult, as agents can become trapped in local loops and receive fresh rewards for revisiting forgotten states. In this work, we demonstrate that this failure stems from a lack of spatial persistence and episodic context. We show that effective curiosity requires a model of the world that is persistent and continuously updated, paired with an agent that maintains an episodic trajectory history to navigate toward novel regions. We achieve this using an online 3D reconstruction as a persistent model of the world, while the agent policy is parameterized as a sequence model over RGB observations to maintain episodic context. This design enables effective exploration during training while allowing the agent to navigate using solely RGB frames at deployment. Trained purely via curiosity on HM3D, our agent outperforms RL-based active mapping baselines and generalizes zero-shot to Gibson and AI-generated worlds. Our end-to-end policy enables efficient adaptation to downstream tasks, such as apple picking and image-goal navigation, outperforming from-scratch baselines. Please see video results at https://recuriosity.github.io/.

Summary

  • The paper presents a persistent 3D Gaussian Splatting model that continuously updates environmental maps and generates stable intrinsic rewards for exploration.
  • It employs a transformer-based episodic policy with sliding-window causal and global linear attention to leverage long-horizon memory and avoid local loops.
  • Empirical results show higher scene coverage than RL-based active mapping baselines, zero-shot generalization to Gibson and AI-generated worlds, and efficient fine-tuning on downstream tasks such as apple picking and image-goal navigation.

Episodic Context and Persistent World Models for Curiosity-Driven Exploration in 3D Environments

Introduction

Curiosity-driven exploration has historically offered a principled solution for addressing sparse or delayed reward scenarios in reinforcement learning (RL), particularly in complex 3D environments. However, standard intrinsic motivation methods often suffer from a lack of persistent world memory and insufficient episodic trajectory context, leading to premature convergence to local behavior cycles and spurious novelty signals. In "Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration" (2605.22814), the authors advance the field by demonstrating that effective large-scale exploration is predicated on two requirements: a persistent, incrementally updated model of the environment, and a policy architecture capable of leveraging full episodic context via long-horizon memory.

Methodological Contributions

Persistent 3D World Model

The primary innovation is the use of an online 3D Gaussian Splatting (3DGS) system as an explicit, continuously updated forward model of the environment. This world model consumes privileged RGB-D and camera pose streams during training to reconstruct the encountered 3D scene in real time. Intrinsic curiosity rewards are issued based on the prediction error between the rendered view from this model and the agent's actual RGB observation, with a low-pass filter and downsampling mitigating incidental high-frequency error. Unlike traditional forward dynamics models with short-term or non-episodic memory, the 3DGS-based model produces stable novelty signals, is resistant to catastrophic forgetting within an episode, and enables spatial persistence necessary for exploration.
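
The reward computation described above can be illustrated with a small sketch. This is not the authors' implementation: the box filter, the function names (`box_downsample`, `curiosity_reward`), the grayscale list-of-lists image type, and the threshold value are all assumptions chosen to keep the example self-contained. It only shows how low-pass filtering before thresholding keeps isolated pixel noise from counting as novelty.

```python
from typing import List

Image = List[List[float]]  # hypothetical grayscale image with values in [0, 1]


def box_downsample(img: Image, factor: int) -> Image:
    """Average non-overlapping factor x factor blocks.

    Assumption: a box filter stands in for the low-pass filter and
    downsampling mentioned in the text.
    """
    rows, cols = len(img) // factor, len(img[0]) // factor
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            block = [img[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def curiosity_reward(rendered: Image, observed: Image,
                     factor: int = 2, pixel_threshold: float = 0.1) -> float:
    """Fraction of coarse pixels where the world-model render and the
    actual observation disagree by more than pixel_threshold (assumed value)."""
    coarse_render = box_downsample(rendered, factor)
    coarse_obs = box_downsample(observed, factor)
    errors = [abs(r - o)
              for row_r, row_o in zip(coarse_render, coarse_obs)
              for r, o in zip(row_r, row_o)]
    return sum(e > pixel_threshold for e in errors) / len(errors)
```

Under these assumptions, a view that the world model already renders correctly yields zero reward, a view where half of the coarse pixels are unexplained yields 0.5, and a single noisy pixel is averaged away by the filter.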

Episodic Sequence Policy for Novelty Seeking

The agent is parameterized as a transformer with per-step input streams comprising RGB observations and action representations (as Plücker-ray images). A novel fusion layer cross-attends learned frame queries to both image patches and DINOv2 visual features. The sequence model employs sliding-window causal attention for tractable training and a global linear-attention memory module to support very long-term context. This enables the policy to condition actions on full episodic trajectory information, moving beyond map-based RL approaches that bypass semantic context, or limited-memory policies that cannot coordinate back-tracking or strategic exploration when rewards are locally sparse.
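
To make the sliding-window causal attention concrete, the following minimal sketch builds the boolean mask such a layer would use. The function names and the convention that the window includes the current step are assumptions for illustration; the paper's actual window size and the linear-attention memory module are not modeled here.

```python
from typing import List


def sliding_window_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query step i may attend to key step j.

    A step sees itself and the (window - 1) steps before it, never the future.
    Assumption: the window counts the current step.
    """
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


def render(mask: List[List[bool]]) -> str:
    """Text view of the mask: 'x' = attention allowed, '.' = masked out."""
    return "\n".join("".join("x" if allowed else "." for allowed in row)
                     for row in mask)
```

Calling `print(render(sliding_window_causal_mask(6, 3)))` shows a band of width 3 along the diagonal. Everything older than the window is invisible to this attention layer, which is why a separate long-range memory is needed to carry context across a full episode.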

Training Procedure and Regularization

Policy optimization is executed with PPO, using the intrinsic reward as the sole signal. To prevent policy entropy collapse, a scheduled mixture of learned and random (uniform) policy actions is annealed through training, guaranteeing persistent exploratory behavior and robustness as intrinsic rewards become sparse. Critically, during deployment, the agent operates entirely from RGB; persistent mapping and privileged information are only utilized during training, aligning with real-world deployment constraints.
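
The scheduled mixture of learned and uniform-random actions can be sketched as follows. The linear schedule, the starting coefficient of 0.5, and the names `mix_coefficient` and `sample_action` are assumptions made for illustration; the summary does not specify the schedule the authors used.

```python
import random
from typing import List, Tuple


def mix_coefficient(step: int, total_steps: int,
                    start: float = 0.5, end: float = 0.0) -> float:
    """Linearly anneal the weight of the uniform-random policy.

    Assumption: a linear schedule with these endpoints.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def sample_action(policy_probs: List[float], epsilon: float,
                  rng: random.Random) -> Tuple[int, List[float]]:
    """Sample from (1 - epsilon) * learned policy + epsilon * uniform.

    Returns the action index and the mixed behavior distribution, which a
    PPO-style update would need in order to account for the mixture.
    """
    n = len(policy_probs)
    mixed = [(1.0 - epsilon) * p + epsilon / n for p in policy_probs]
    action = rng.choices(range(n), weights=mixed, k=1)[0]
    return action, mixed
```

Early in training the behavior policy keeps non-zero probability on every action even if the learned policy has collapsed onto one of them; as `epsilon` anneals toward zero, the behavior policy converges to the learned policy.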

Empirical Results

Superior Exploration in Static Realistic 3D Environments

Evaluated on the HM3D and Gibson datasets, the proposed agent achieves higher 3D completeness (scene coverage) than baselines such as Active Neural SLAM (ANS) and Occupancy Anticipation (OccAnt). Notably, performance exceeds even methods utilizing privileged depth at test time, while maintaining a strict RGB-only test regime. The agent rapidly explores and covers more unique scene surface area across rollout horizons (256, 512, 1024 steps), with substantially reduced mean distance between observed and ground-truth points.
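
For readers unfamiliar with coverage metrics, here is a minimal sketch of one common way to compute them. The exact definitions used in the paper are not given in this summary, so the distance threshold, the direction of the distance (from each ground-truth point to its nearest observed point), and the name `coverage_metrics` are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]  # (x, y, z)


def coverage_metrics(gt_points: List[Point], observed_points: List[Point],
                     threshold: float) -> Tuple[float, float]:
    """Return (completeness, mean_distance).

    completeness: fraction of ground-truth points that have an observed
    point within `threshold` (assumed definition).
    mean_distance: average distance from each ground-truth point to its
    nearest observed point. Brute force, O(N * M), for clarity only.
    """
    if not observed_points:
        return 0.0, math.inf
    nearest = [min(math.dist(g, o) for o in observed_points)
               for g in gt_points]
    completeness = sum(d <= threshold for d in nearest) / len(nearest)
    return completeness, sum(nearest) / len(nearest)
```

With four ground-truth points spaced one unit apart on a line and only the first two observed, a threshold of 0.5 gives a completeness of 0.5 and a mean distance of 0.75. A practical implementation would replace the brute-force search with a spatial index.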

Ablative Analysis of Memory and Persistence

Ablation studies confirm strong dependence on both persistent world models and sufficient agent memory. Removing persistence from the 3DGS or limiting policy sequence length degrades exploration, with agents devolving into local loops or failing to navigate away from explored territories. Asymmetric memory (e.g., memoryless actor with memoryful critic) performs better than no memory, but full transformer sequence models yield the best results.

Transfer to Downstream Tasks and Zero-Shot Generalization

The agent, pretrained purely via curiosity on exploration, can be rapidly fine-tuned for sparse-reward navigation tasks, such as apple picking and image-goal navigation, outperforming from-scratch policies, especially in extremely sparse reward settings. In addition, the agent generalizes zero-shot to AI-generated scenes with distinct appearance and rendering pipelines, demonstrating robustness to distribution shift.

Theoretical and Practical Implications

By explicitly separating deployment from training requirements, and decoupling spatial persistence from end-to-end policy memory, this work identifies two fundamental ingredients for scalable RL exploration: persistent closed-loop world models that mitigate state revisitation artifacts, and policy architectures that support long-horizon planning. The persistent 3DGS proxy grounds curiosity in observed environment structure, which, although restricted to static scenes, is an important baseline for the evaluation of future action-conditioned generative models.

The demonstrated flexibility of the policy to adapt to arbitrary semantic tasks post hoc, without pre-assumed geometric mapping at inference, underscores the potential for unified exploration architectures in multi-task embodied AI. Further, the empirical findings challenge the reliance on explicit map-based policies for RL exploration, advocating for end-to-end visually conditioned exploration as a more adaptable, scalable paradigm.

Future Directions

The authors identify a major limitation: the static world assumption of their current 3DGS-based forward model. Transitioning to learned, action-conditioned generative world models with persistent spatial memory will be crucial for extending these results to dynamic environments. As generative scene models mature, integrating intrinsic motivation grounded in persistent, adaptively updated representations will be necessary for robustness in the presence of environmental non-stationarity and temporal drift. Future work should also explore the implications for lifelong and continual learning, open-world task generalization, and embodied agents in real-world robotics.

Conclusion

This study rigorously demonstrates that curiosity-driven exploration in visually rich 3D environments requires both spatially persistent world models for reliable intrinsic reward and episodic sequence policies for long-horizon novelty planning. The presented system outperforms state-of-the-art RL map-based baselines, generalizes across domains, and is highly adaptable to downstream tasks with only RGB input. These findings set a new standard for scalable, task-agnostic exploration agents in embodied AI, delineating theoretical requirements for intrinsic motivation that must be met by future dynamic world model research.

Explain it Like I'm 14

Overview

This paper is about teaching a virtual "camera agent" (think of a tiny drone with a camera) to explore 3D spaces on its own by being curious. Instead of waiting for a teacher to give it rewards like "good job" when it reaches a goal, the agent gives itself points for discovering new parts of the world. The big idea is that curiosity works best when:

  • the agent keeps a strong memory of what it has already seen in this episode, and
  • there's a steady, sensible way to tell whether a view is truly new.

The authors show how to combine these two ideas so the agent explores houses and rooms efficiently, learns useful habits, and later adapts quickly to tasks like finding apples or matching a photo view.

Key Questions

The paper asks three simple questions:

  • How can we make a curious agent that doesn't get stuck wandering in circles?
  • What kind of "memory" does the agent need to avoid revisiting the same places over and over?
  • Can an agent trained to explore just by being curious later learn real tasks faster than starting from scratch?

How They Did It

To make this easy to picture, imagine you're exploring a new school building:

  • You keep a mental map of where you've been so you don't loop back by mistake.
  • You feel excited when you see a hallway or room you haven't visited before.

The authors give the agent two key tools that mirror this:

  1. A persistent world model during training
  • What it is: While the agent explores, a fast 3D builder makes a growing model of the world from the camera's pictures and depth (distance) information. Think of it as a living 3D scrapbook of everything seen so far.
  • Why it matters: When the agent looks from a new angle, the 3D scrapbook tries to "predict" what the camera should see. If the real camera image looks different in an important way, that means the agent has found something truly new, so it gets a curiosity reward. If it's the same old stuff, it gets little or no reward.
  • Important detail: This 3D builder is used only while training to compute the curiosity reward. At test time, the agent doesn't need a map; it acts just from the video it sees.
  2. An episodic memory inside the agent
  • What it is: The agent's brain is a sequence model (a transformer) that looks at a chain of recent images and actions, not just the current frame. It's like the agent keeps a running memory of the episode.
  • Why it matters: With this memory, the agent can backtrack through places it has already seen to reach new branches, instead of getting stuck or forgetting where it's been.

A few extra, human-friendly notes:

  • "Curiosity reward" = points for discovering truly new views, measured by how much the prediction from the 3D scrapbook disagrees with the actual camera view (after smoothing out tiny details so it doesn't get fooled by noisy textures).
  • "Sparse reward" = the world doesn't hand out points often, so the agent must care about its own curiosity signal to keep learning.
  • To prevent the agent from becoming too cautious, the authors sometimes mix in random actions during training. This keeps exploration lively and helps the agent escape slow or repetitive behavior.

Main Findings

Here are the big results and why they matter:

  • Better exploration with only a camera at test time: The agent covered more of new 3D homes faster than other methods that rely on hand-built maps or depth sensors during deployment. That's impressive because at test time it only uses RGB video frames: no special map, no extra sensors.
  • Memory matters (on both sides):
    • If the 3D scrapbook is short-term or non-persistent, the agent can get "fake" curiosity points by revisiting forgotten places and ends up looping.
    • If the agent itself doesn't remember its recent journey, it also falls into loops.
    • Together, a persistent world model (for the reward) and an agent with episodic memory (for decision-making) are crucial to unlock stable, long-range exploration.
  • Generalizes to new worlds: After training on realistic indoor scenes, the agent could explore different buildings and even AI-generated fantasy worlds without extra training. This means it learned general exploration skills, not just memorized specific maps.
  • Learns new tasks faster:
    • Apple picking: The agent found and "picked" more apples than a brand-new agent trained only on that task. This advantage was strongest when apples were rare (sparser rewards), showing the power of a curiosity-trained explorer.
    • Image-goal navigation: Given a target picture, the fine-tuned agent reached the matching viewpoint more often than a from-scratch agent. Its exploration habits helped it search smartly.

Why This Matters

  • A recipe for curiosity that scales: The paper shows curiosity can work in complex, realistic 3D spaces, if you pair it with both a persistent view of the world (for honest novelty signals) and an agent that remembers its own path (for smart decisions).
  • Less reliance on maps and extra sensors at deployment: The trained agent runs end-to-end from camera images alone, which makes it simpler and more flexible to use in different environments and tasks.
  • Faster learning on real tasks: Pretraining with curiosity gives the agent a "sense of direction" for exploration, helping it learn new goals with fewer trials, especially when rewards are rare.
  • A guide for future world models: As video and 3D world models improve, this work highlights that "spatial persistence" and continuous updating are must-haves if we want curiosity-driven agents to behave well in the real world.

In short, the paper's message is in its title: if you want an AI to be a great explorer, make it remember to be curious, and give it the memory and steady signals it needs to do that reliably.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, phrased to guide future research.

  • Dynamic environments are out of scope; how to extend curiosity with persistent world models to scenes with moving objects, changing lighting, and non-stationary dynamics without rewarding transient changes?
  • Training depends on privileged depth and pose to build the 3DGS forward model; what happens under noisy, biased, or self-supervised pose/depth estimates (e.g., DUSt3R, VGGT), and can curiosity be learned without ground-truth sensors?
  • The computational footprint is large (8×80GB H100 for 5.5 days; frequent online 3DGS optimization); what trade-offs exist between forward-model quality, update frequency, densification/pruning schedules, and exploration performance on resource-limited hardware?
  • Action space and embodiment are simplified (forward, look left/right, pause; spherical free-flight drone); how does the approach fare under continuous control, realistic dynamics (e.g., quadrotor physics), locomotion constraints, and richer action sets?
  • Motion is deterministic with ideal collision checking via ray-tracing; robustness to actuation noise, localization drift, and contact uncertainty remains untested.
  • Curiosity reward is a binary thresholded low-pass image discrepancy; its sensitivity to filter type, downsampling factor, and thresholds (T_new, T_old) is not analyzed; compare against principled novelty measures (information gain, model uncertainty, density estimation).
  • 3DGS provides geometry/appearance persistence but no explicit semantics; can semantic world models (e.g., panoptic 3D reconstructions) yield curiosity that prioritizes task-relevant novelty?
  • The forward model is only used during training; can leveraging a distilled or compact persistent state at test time improve targeted navigation without sacrificing end-to-end flexibility?
  • Episodic memory design (sliding window size W, placement/capacity of the linear-attention memory, query token formulation) lacks systematic exploration; what are scaling laws and failure modes for very long episodes?
  • No theoretical account explains why persistence plus episodic context mitigates curiosity loops; formalize conditions under which intrinsic rewards become stationary and policies converge.
  • Baseline coverage excludes several intrinsic motivation methods (RND, Disagreement, RIDE, E3B) in photorealistic 3D; run controlled head-to-head comparisons under matched sensors, action spaces, and collision handling.
  • Evaluation relies on 3D completeness and average distance; add metrics for exploration efficiency (time-to-novelty, loopiness), semantic coverage, safety (collision rate, near-misses), and compute/energy per unit coverage.
  • Generalization is shown in simulators (Gibson, two AI-generated worlds) but not in the real world; validate on physical robots with real sensing/actuation, clutter, and dynamic agents to assess sim-to-real transfer.
  • Multi-agent exploration is not considered; investigate shared persistent world models and coordinated episodic memory for cooperative coverage.
  • Reward sparsity mitigation uses a fixed random action injection schedule; evaluate alternative strategies (intrinsic goal-setting, option frameworks, adaptive entropy schedules) and their interactions with episodic memory.
  • Forward-model persistence ablations demonstrate qualitative benefits, but the minimal memory horizon required for effective exploration is not quantified; map exploration quality versus forward-model memory length.
  • Downstream tasks are limited to apple-picking and image-goal navigation; extend to manipulation, multi-step objectives, language-conditioned goals, and tasks requiring long-term semantic reasoning.
  • The image-goal success criterion requires privileged 3D point visibility; devise deployable evaluation protocols that avoid ground-truth meshes and depth at test time.
  • Potential 3DGS failure modes (specular/reflective surfaces, textureless regions, strong view-dependent effects) are not characterized; test and adapt curiosity under adverse visual conditions.
  • Impact of forward-model bias/artifacts on learning (rewarding reconstruction errors or lag) is unquantified; develop diagnostics and corrective reward shaping to handle model errors.
  • Episodic memory is reset per episode; study lifelong exploration where memory persists across episodes/scenes, avoiding relearning and enabling cumulative knowledge.
  • Safety is minimally modeled (collisions halt but aren't penalized strongly); incorporate risk-aware intrinsic rewards and explicit safety budgets to balance novelty-seeking and hazard avoidance.
  • Visual backbone choices (DINOv2 vs. alternatives), fusion strategies, and multimodal inputs at test time are not ablated; isolate which features most improve exploratory behavior.
  • The behavior policy is annealed from a mixture with uniform random during training but deterministic at test time; assess whether controlled stochasticity at deployment benefits coverage or goal-reaching.

Practical Applications

Immediate Applications

The following applications can be piloted or deployed today by leveraging the paper's core insights: (1) episodic, long-horizon policies that operate on RGB-only input at deployment; (2) curiosity-driven pretraining using a persistent world model (online 3D Gaussian Splatting) to supply stable intrinsic rewards; (3) efficient fine-tuning to sparse-reward downstream tasks; and (4) simple training-time regularization via intermittent random actions.

  • Robotics pretraining for sparse-reward tasks (navigation, object search)
    • Sectors: robotics, software, education/academia
    • What: Use the exploration-pretrained RGB-only policy as a general backbone, then fine-tune with minimal extrinsic rewards for tasks such as image-goal navigation or object finding (e.g., "apple picking" analogs like locating valves, tools, or QR tags).
    • Tools/products/workflows: โ€œCuriosity-pretrained policyโ€ checkpoint; fine-tuning scripts on PPO; reward wrappers for object detectors; ROS integration to map discrete actions to mobile base/mini-UAV commands.
    • Assumptions/dependencies: Static or mostly static indoor spaces; training-time pose and depth (obtainable via SLAM/LiDAR or motion capture); sim-to-real calibration; safety layer for collision avoidance.
  • After-hours facility exploration and coverage for security and maintenance
    • Sectors: security, facilities management, enterprise robotics
    • What: Deploy robots after hours to explore offices/warehouses, maximize coverage (3D completeness), and flag hard-to-reach spaces for human inspection.
    • Tools/products/workflows: Coverage analytics dashboard (based on the paper's 3D completeness and average-distance metrics); ROS package to execute the RGB-only policy on a perimeter/patrol robot; basic anomaly tagging via add-on detectors.
    • Assumptions/dependencies: Environments are largely static during runs; fallback safety controller; compliance with building access/IT rules.
  • Reality-capture "explorer-in-the-loop" for scanning teams
    • Sectors: AEC (architecture, engineering, construction), real estate, digital twins
    • What: Use the policy to suggest "next moves" to maximize novel viewpoints and reduce missed areas during photogrammetry or NeRF/3DGS capture (operator-in-the-loop on a handheld rig or tethered drone).
    • Tools/products/workflows: Laptop or edge device runs the policy and overlays waypoints on a tablet HUD; post-hoc coverage reporting using the paper's metrics.
    • Assumptions/dependencies: Static scenes during capture; calibrated rig; operator retains control; regulatory compliance for UAVs.
  • Automated playtesting and coverage QA for game and synthetic worlds
    • Sectors: gaming, simulation/content platforms
    • What: Run the agent to probe 3D maps for accessible coverage, dead-ends, and unreachable areas; OOD generalization makes it robust to diverse art styles and procedural content.
    • Tools/products/workflows: Editor plugin that spawns exploration episodes, computes coverage metrics, and outputs heatmaps of "unseen" spaces; CI hook for level regression tests.
    • Assumptions/dependencies: Stable control interface to the engine; primarily static level geometry for the exploration runs.
  • Retail layout onboarding and inventory mapping pilots
    • Sectors: retail robotics, logistics
    • What: Use exploration to quickly learn new/store-refit layouts, produce initial coverage sweeps, and seed downstream tasks (e.g., aisle patrolling or shelf scanning) with fine-tuning.
    • Tools/products/workflows: Initial exploration run with RGB-only policy; follow-up short fine-tune on store-specific targets; compliance logging for coverage.
    • Assumptions/dependencies: Runs scheduled when stores are closed; static shelving during runs; store policies on data capture and privacy.
  • Academic toolkit for persistent-curiosity research
    • Sectors: academia, open-source software
    • What: Package and release a training harness that couples online 3DGS-based intrinsic rewards with episodic transformer agents, plus memory ablations and evaluation scripts.
    • Tools/products/workflows: PyTorch training code; Habitat-based environment configs; 3DGS training-time module; standardized 3D completeness metrics and reporting.
    • Assumptions/dependencies: Multi-GPU training (the paper used 8×80GB H100); HM3D/Gibson licenses; 3DGS libraries.
  • Training-time regularization recipe for long-horizon RL
    • Sectors: robotics/software R&D
    • What: Adopt the mixed-policy sampling (scheduled uniform random action injection) to stabilize exploration when intrinsic rewards become sparse.
    • Tools/products/workflows: PPO wrappers that track behavior distribution and anneal a mixing coefficient; hyperparameter presets.
    • Assumptions/dependencies: Discrete action spaces or discretized controls; careful annealing and logging to avoid destabilization.
  • Benchmarks and metrics adoption for exploration coverage
    • Sectors: academia, evaluation services, robotics QA
    • What: Standardize coverage metrics (3D completeness at fixed step horizons, average surface-point distance) to compare exploration policies in simulators and labs.
    • Tools/products/workflows: Evaluation kit; dataset splits; reporting templates for leaderboards.
    • Assumptions/dependencies: Ground-truth mesh or sufficiently dense scan for evaluation.
  • Pilot deployments for inspection target search in controlled industrial spaces
    • Sectors: manufacturing, utilities (static bays/off-hours), data centers
    • What: Use the fine-tuned policy to locate visual targets (gauges, panels, indicators) in structured, mostly static environments; combine with small rewards for detections.
    • Tools/products/workflows: Detector-in-the-loop reward shaping; safety supervisor; coverage and revisit reporting.
    • Assumptions/dependencies: Static or low-dynamics periods; clear line-of-sight to targets; facility safety and compliance.
  • Educational demos for embodied AI
    • Sectors: education, outreach
    • What: Use the agent in Habitat or similar sims to teach curiosity, intrinsic motivation, and memory in RL with tangible downstream tasks.
    • Tools/products/workflows: Instructor notebooks; modular ablations to visualize the impact of episodic memory and world persistence.
    • Assumptions/dependencies: Access to GPUs and sim assets.

Long-Term Applications

These require further R&D, scaling, dynamic world modeling, or productization beyond current constraints (notably: static-scene assumption, training-time reliance on pose/depth, and heavy training compute).

  • Dynamic-world curiosity with persistent action-conditioned models
    • Sectors: robotics (service, healthcare, industry), autonomy research
    • What: Replace 3DGS with a spatially persistent, action-conditioned video/world model that updates online in dynamic scenes (moving people/objects), enabling curiosity in live environments.
    • Tools/products/workflows: Onboard world model with spatial memory; continual learning pipelines; drift detection and safe policy fallback.
    • Assumptions/dependencies: Robust spatial persistence in generative models; compute- and memory-efficient on-device inference; safety certification.
  • Home service robots that explore, then specialize with minimal supervision
    • Sectors: consumer robotics, smart home
    • What: Robots that autonomously explore new homes with RGB-only at runtime, then fine-tune to user-specific tasks (find objects, fetch-and-carry, room-aware reminders).
    • Tools/products/workflows: Privacy-preserving on-device training; user-in-the-loop reward signals ("found the keys"); episodic memory management.
    • Assumptions/dependencies: Robust perception under changing layouts; privacy/security guarantees; safe operation around people and pets.
  • Search-and-rescue and emergency response exploration
    • Sectors: public safety, defense, insurance
    • What: UAVs/UGVs autonomously probe unknown buildings post-incident to map coverage, locate exits/victims, and report occluded spaces.
    • Tools/products/workflows: Dynamic obstacle handling; thermal/specialty sensor fusion; explainable coverage reports; operator handoff mechanisms.
    • Assumptions/dependencies: Highly dynamic scenes; strict safety and regulatory constraints; adverse conditions (smoke, dust, low light).
  • Multi-robot cooperative exploration with shared episodic memory
    • Sectors: logistics, industrial inspection, construction
    • What: Teams of robots share a persistent, continuously updated world memory (distributed or cloud) to coordinate coverage and reduce redundancy.
    • Tools/products/workflows: Federated memory fusion (e.g., distributed 3DGS or successor models); comms robustness; task allocation.
    • Assumptions/dependencies: Reliable networking or delay-tolerant synchronization; consistent calibration across platforms; conflict resolution in shared maps.
  • On-device AR guidance for casual 3D capture
    • Sectors: consumer AR, creative tools, real estate
    • What: Smartphone AR apps that guide users with "next-best-view" prompts powered by an RGB-only exploration policy, targeting complete, artifact-free scans for digital twins or 3D listings.
    • Tools/products/workflows: Lightweight model distillation; on-device DINO-like features; UI that visualizes coverage/novelty in real time.
    • Assumptions/dependencies: Mobile inference efficiency; battery and thermal constraints; robust pose estimation on commodity devices.
  • Autonomous cinematography and tour generation
    • Sectors: media production, travel/tourism, cultural heritage
    • What: Camera robots that explore and then plan cinematic coverage of interiors (museums, venues) with minimal operator input.
    • Tools/products/workflows: Semantic priors for framing and aesthetics; shot planning over persistent memory; collision-aware smooth trajectories.
    • Assumptions/dependencies: Mixed static/dynamic crowds; venue permissions; high-level aesthetic reward models.
  • Foundation models for embodied exploration
    • Sectors: AI platforms, robotics vendors
    • What: Large-scale pretraining of exploration policies across diverse 3D worlds (scans + generative worlds), fine-tuned for downstream tasks (navigation, search, manipulation).
    • Tools/products/workflows: Data engines combining simulators and synthetic worlds; standardized reward APIs; cross-embodiment action abstractions.
    • Assumptions/dependencies: Broad sim-to-real generalization; governance around synthetic data bias; compute and carbon costs.
  • Regulatory and policy frameworks for curiosity-driven autonomy
    • Sectors: policy, standards bodies, enterprise governance
    • What: Safety, privacy, and accountability standards for intrinsically motivated robots that move through private spaces; audit trails for exploration decisions.
    • Tools/products/workflows: Explainability tools that reconstruct episodic memory used for actions; on-device redaction; geofencing and "do-not-explore" constraints.
    • Assumptions/dependencies: Consensus on acceptable data retention and use; certification pathways; integration with building access controls.
  • Large-scale facility digitization and continual updating
    • Sectors: industrial operations, energy, smart buildings
    • What: Periodic autonomous exploratory passes keep digital twins up-to-date, flagging structural changes or occluded areas needing human follow-up.
    • Tools/products/workflows: Scheduling across downtime windows; change detection atop persistent memory; operator dashboards.
    • Assumptions/dependencies: Mixed dynamics in live facilities; infrastructure for autonomous charging/dispatch; integration with CMMS/BIM.
  • Curriculum design for robust long-horizon control
    • Sectors: academia, industrial R&D
    • What: Use scheduled random-action mixtures and intrinsic rewards as a general curriculum for long-horizon tasks beyond exploration (e.g., multi-room manipulation, tool use).
    • Tools/products/workflows: RL training curricula templates; policy validation suites; ablation harnesses for memory modules.
    • Assumptions/dependencies: Task-specific safety envelopes; scalable training infrastructure.
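
The scheduled random-action mixture mentioned in the curriculum item above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the source only states that the random-action share is scheduled from 20% to zero over 5 million steps, so the linear decay shape and the names `random_action_prob` and `select_action` are assumptions made here.

```python
import random

def random_action_prob(step, start=0.20, anneal_steps=5_000_000):
    """Probability of overriding the policy with a uniform random action.

    The endpoints (20% -> 0 over 5M steps) come from the summary;
    the linear shape of the decay is an assumption.
    """
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return start * (1.0 - frac)

def select_action(policy_action, num_actions, step, rng=random):
    """Return a uniform random action with the scheduled probability,
    otherwise return the action proposed by the policy."""
    if rng.random() < random_action_prob(step):
        return rng.randrange(num_actions)
    return policy_action
```

Once the schedule reaches zero, `select_action` always returns the policy's own action, so the mixture only affects early rollouts.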

Cross-cutting assumptions and dependencies to keep in mind

  • Static-scene assumption: The presented method's strongest results are in static indoor environments; performance may degrade with frequent layout changes or moving agents/objects until dynamic persistent world models mature.
  • Training-time privileges: Depth and pose are needed during training to build the 3DGS forward model (both can be sourced via SLAM/LiDAR). Deployment uses RGB frames only.
  • Action space and embodiment: The paper used a discrete action set and a drone-like embodiment; real platforms need action mapping, safety layers, and possibly continuous control.
  • Compute and data: Curiosity pretraining is compute-intensive (hundreds of millions of steps) and data-hungry; distilled or smaller models may be needed for edge deployment.
  • Safety, privacy, and compliance: Exploration in private or regulated spaces requires data governance, fail-safe behaviors, and operator oversight.

Glossary

  • 3D Gaussian Splatting (3DGS): An explicit, real-time 3D radiance field representation using Gaussian primitives for reconstruction and rendering; used here as a persistent world model. "We instantiate the forward model as an online 3D Gaussian Splatting (3DGS) model of the world"
  • 3DGS-MCMC: A densification method for 3DGS that leverages Markov Chain Monte Carlo to refine and add Gaussian primitives. "densified via 3DGS-MCMC [15]."
  • A* local planner: A graph search algorithm commonly used for path planning; cited here as a baseline component incompatible with the authors' setup. "a test-time collision-unaware A* local planner"
  • action entropy coefficient: A hyperparameter scaling entropy regularization to maintain policy stochasticity during training. "the action entropy coefficient decayed at a rate of 0.99 from an initial value of 0.1."
  • action-conditioned video models: Generative models that predict future observations conditioned on the agent's actions. "action-conditioned video models show promise."
  • actor-critic: An RL architecture combining a policy (actor) and value function (critic) for learning control and state values. "connected to the actor and critic heads that output an action distribution and a value estimate"
  • annealing: Gradually reducing a training parameter (e.g., a mixing coefficient) over time to stabilize learning. "with the mixing coefficient annealed to zero over training"
  • bird's-eye-view: A top-down projection used for visualization or mapping. "trajectories are overlaid on bird's-eye-view for visualization only."
  • causal temporal self-attention: An attention mechanism that only attends to current and past tokens, preserving temporal causality. "Tokens are processed by causal temporal self-attention"
  • curiosity-driven reinforcement learning: An RL paradigm where intrinsic rewards based on novelty or prediction error drive exploration. "Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent's predictive model of the world and reality."
  • differentiable renderer: A rendering process that supports gradient-based optimization; here, used to render from 3DGS. "where R denotes the differentiable 3DGS renderer."
  • DINOv2: A self-supervised vision transformer producing robust visual features used to augment the agent's perceptions. "We also take the RGB image processed by DINOv2 [17] to provide richer visual features."
  • down-sampling operator: An operation that reduces image resolution, often to stabilize or simplify comparisons. "Ds is down-sampling operator by a factor of s"
  • episodic context: Short-term memory of observations within an episode that guides navigation and exploration decisions. "stems from a lack of spatial persistence and episodic context."
  • forward model: A predictive model estimating future observations given past observations and actions. "The forward model is tasked with predicting the next observation conditioned on an action"
  • Gibson (dataset): A benchmark of indoor environments for embodied AI and navigation evaluation. "generalizes zero-shot to Gibson and AI-generated worlds."
  • Habitat (simulator): A simulation platform for embodied agents to interact with 3D environments. "a 90° FOV forward camera in Habitat [23]."
  • HM3D (dataset): Habitat-Matterport 3D dataset of large-scale indoor scenes for embodied AI. "Trained purely via curiosity on HM3D, our agent outperforms active-mapping baselines"
  • image-goal navigation: A task where the agent must reach the viewpoint corresponding to a given target image. "our exploration agent, when fine-tuned for a few episodes on image-goal navigation reward, outperforms an agent trained from scratch"
  • intrinsic reward: An internally generated learning signal (e.g., surprise) that incentivizes exploration without external task rewards. "the agent derives intrinsic reward from surprise"
  • Intrinsic Curiosity Module (ICM): An approach that rewards prediction error in a learned dynamics model to drive exploration. "Traditional methods like ICM [5] lack this property"
  • linear attention: An attention variant with linear complexity, often maintaining a compact global state for long contexts. "a linear-attention module with a global hidden state"
  • LoGeR: A long-context memory architecture used as inspiration for the authors' global memory module. "LoGeR-style long-context architectures [18, 19]."
  • low-pass filter: A filter that attenuates high-frequency image details, used to stabilize novelty estimation. "where B is a low-pass filter"
  • navmesh: A navigation mesh representing traversable surfaces used by planners; here, explicitly avoided to prevent shortcuts. "Our drone agent is not constrained to the scene navmesh."
  • next-best-view (NBV): A strategy to select viewpoints that maximize expected information gain for mapping. "Traditional next-best-view (NBV) methods greedily select viewpoints to maximize geometric information gain"
  • Occupancy Anticipation (OccAnt): A learned mapping approach that anticipates occupancy to guide exploration. "Occupancy Anticipation (OccAnt) [8]"
  • on-policy RL: Reinforcement learning that updates the policy using data collected by the current policy. "a stable reward for the agent to optimize with on-policy RL."
  • online 3D reconstruction: Incremental building of a 3D scene model from streaming RGB-D observations. "We therefore utilize a state-of-the-art online 3D reconstruction method (3DGS) as a proxy"
  • Plücker-ray image: An image-based encoding of rays using Plücker coordinates to represent intended camera motion. "Plücker-ray image [16]"
  • privileged inputs: Training-only sensor information not available at test time (e.g., depth, pose). "the privileged inputs required at training time are the camera pose and depth image."
  • Proximal Policy Optimization (PPO): A policy gradient algorithm using clipped objectives to stabilize updates. "We optimize our actor-critic policy using PPO [20]."
  • random policy regularizer: A technique that intermittently samples random actions during rollouts to encourage exploration. "with the random policy regularizer scheduled from 20% to zero over 5 million steps"
  • self-supervised RL: Reinforcement learning driven by intrinsic signals or self-generated objectives rather than external labels. "we formulate exploration as a self-supervised RL problem"
  • sliding-window attention: Attention restricted to a recent temporal window to keep computation tractable over long sequences. "Sliding-window attention provides efficient direct local context"
  • transformer backbone: A transformer-based model serving as the core architecture for sequence processing and control. "we use a transformer backbone"
  • uniform random policy: A policy that selects actions uniformly at random, used here to maintain exploration during training. "we occasionally sample actions from a uniform random policy"
  • world model: An internal model predicting environmental dynamics or observations to support planning and curiosity. "the prediction error of a world model - trained alongside the agent - to anticipate the consequences of its actions"
  • zero-shot generalization: The ability to transfer to new environments or tasks without further training. "generalizes zero-shot to Gibson and AI-generated worlds."
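
To make the glossary entries on causal temporal self-attention and sliding-window attention concrete, here is a small sketch of the boolean mask that combines the two constraints. It is an illustration only: the function name `sliding_window_causal_mask` and the window size used below are hypothetical, and the paper's global linear-attention state is not modeled.

```python
def sliding_window_causal_mask(seq_len, window):
    """Build a mask where mask[i][j] is True if step i may attend to step j.

    Causal: a step never attends to later steps (j > i).
    Sliding window: a step attends to itself and at most window - 1
    immediately preceding steps.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Example: 4 time steps with a hypothetical window of 2.
mask = sliding_window_causal_mask(4, 2)
```

Each row of `mask` has at most `window` true entries, which is why attention cost per step stays constant as the episode grows.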
