Hierarchical Pose-free Memory Compressor
- The paper introduces HPMC, a pose-free memory compressor that achieves sub-linear memory complexity for long-horizon video modeling without requiring explicit pose data.
- It employs a two-stage compression strategy that uses direct downsampling for short horizons and recursive, hierarchical compression for longer temporal sequences.
- Empirical evaluations demonstrate improved spatial consistency and reduced GPU usage, enabling scalable and robust interactive world modeling over thousands of frames.
The Hierarchical Pose-free Memory Compressor (HPMC) is a compact, recursively compositional memory architecture introduced as the core component in Infinite-World, a model for interactive world modeling over thousand-frame video horizons without reliance on explicit pose information. HPMC enables a Video Diffusion Transformer (DiT) to attend to extensive temporal histories with a fixed, sub-linear memory and computational budget, while autonomously learning to preserve salient spatio-temporal cues necessary for visually and spatially consistent long-horizon simulation (Wu et al., 2 Feb 2026).
1. Objectives and Foundational Principles
HPMC is designed to address two primary limitations in conventional long-horizon video or world modeling:
- Bounded, sub-linear resource usage: Attending to thousands of prior frames in a naively parameterized transformer leads to quadratic growth in attention compute and linear growth in storage with respect to the temporal context length $T$. HPMC constrains both memory and compute to a fixed budget, ensuring scalability as $T$ increases.
- Pose-free anchoring: Traditional models often require precise pose or viewpoint metadata, which is unreliable or unattainable in real-world data. HPMC dispenses with explicit geometric priors, instead distilling, directly from the latent sequence, those features that promote spatial and semantic consistency (such as loop-closure landmarks), via joint training with the generative backbone.
By enabling end-to-end optimization of memory compression together with the world model, HPMC ensures retention of exactly those historical cues most predictive for future generation, without recourse to explicit camera pose (Wu et al., 2 Feb 2026).
2. Memory Compression Architecture
HPMC operates in one of two compression modes depending on the length $T$ of the context relative to a fixed token budget $M$. The transition between modes is automatic.
Mode 1: Short-Horizon (Direct) Compression
When the history fits within the token budget ($T \le M$), the entire history $x_{1:T}$ is directly downsampled along the temporal axis:
- A temporal encoder $\mathcal{E}_\phi$ transforms $x_{1:T}$ into a compressed sequence of length $T/r$, using a temporal compression factor $r$.
- The compressed history is concatenated with the latest clean latent, the noisy target latents, and a binary context mask to form the DiT's conditioning.
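As an illustrative sketch (not the paper's implementation), the direct mode can be viewed as strided temporal pooling over the latent sequence; the pooling operator, shapes, and compression factor below are assumptions standing in for the learned encoder:

```python
import numpy as np

def direct_compress(latents: np.ndarray, r: int) -> np.ndarray:
    """Toy stand-in for the temporal encoder: average-pool T latent
    frames (shape [T, D]) down to T // r compressed tokens."""
    T, D = latents.shape
    T_out = T // r
    # Drop any trailing frames that do not fill a full window of size r.
    trimmed = latents[: T_out * r].reshape(T_out, r, D)
    return trimmed.mean(axis=1)

# Example: 64 latent frames of dimension 8, compression factor r = 4.
history = np.random.randn(64, 8)
compressed = direct_compress(history, r=4)
print(compressed.shape)  # (16, 8)
```

In the actual method the encoder is learned jointly with the diffusion backbone; pooling here only illustrates the shape bookkeeping.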
Mode 2: Long-Horizon (Hierarchical) Compression
When the history exceeds the token budget ($T > M$), the latent sequence $x_{1:T}$ is recursively compressed in a hierarchical two-stage process:
- Local Compression:
- Partition $x_{1:T}$ into $K$ overlapping chunks of $S$ frames each, using a sliding window whose stride is chosen so that all $T$ frames are covered.
- Each chunk $c_i$ is compressed via the shared encoder $\mathcal{E}_\phi$ to a local summary $m_i$.
- Global Compression:
- Concatenate the local summaries $m_1, \dots, m_K$.
- Apply $\mathcal{E}_\phi$ again: $m = \mathcal{E}_\phi([m_1; \dots; m_K])$, with the output length fixed at the token budget $M$.
This architecture guarantees that the final memory footprint never exceeds the fixed token budget $M$, keeping self-attention at $O(M^2)$ and storage at $O(M)$ at the global compression stage.
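The two-stage process above can be sketched as follows; the chunking scheme and the values $K = 8$, $S = 156$, $r = 4$ are hypothetical, chosen so the arithmetic divides evenly, and the pooling again stands in for the learned encoder:

```python
import numpy as np

def chunk(latents: np.ndarray, K: int, S: int) -> list:
    """Split T frames into K chunks of S frames; the stride is chosen
    so the chunks overlap and together cover the full sequence."""
    T = latents.shape[0]
    stride = (T - S) // (K - 1)
    return [latents[i * stride : i * stride + S] for i in range(K)]

def compress(latents: np.ndarray, r: int) -> np.ndarray:
    # Same toy temporal pooling as in the direct mode.
    T, D = latents.shape
    return latents[: (T // r) * r].reshape(T // r, r, D).mean(axis=1)

def hierarchical_compress(latents, K, S, r):
    local = [compress(c, r) for c in chunk(latents, K, S)]   # stage 1: local summaries
    return compress(np.concatenate(local, axis=0), r)        # stage 2: global summary

history = np.random.randn(1024, 8)                  # T = 1024 latent frames
memory = hierarchical_compress(history, K=8, S=156, r=4)
print(memory.shape)  # (78, 8)
```

In the real method the second stage is additionally held to the fixed budget $M$ regardless of $T$; the sketch only shows the chunk-then-recompress mechanics.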
| Mode | Input Size | Compression Steps | Output Context Size |
|---|---|---|---|
| Direct | $T \le M$ | Single temporal downsampling by factor $r$ | $T/r$ tokens |
| Hierarchical | $T > M$ | Chunking + two-stage recursive compression | $\le M$ tokens |
3. Mathematical Formulation and Training Objective
The following key equations govern HPMC operation:
- Direct Compression (Mode 1):

$$\hat{x} = \mathcal{E}_\phi(x_{1:T}), \qquad |\hat{x}| = T/r$$

- Hierarchical Compression (Mode 2):

$$m_i = \mathcal{E}_\phi(c_i), \quad i = 1, \dots, K, \qquad m = \mathcal{E}_\phi([m_1; m_2; \dots; m_K])$$

- Denoising Diffusion Loss (Joint Training):

$$\mathcal{L} = \mathbb{E}_{x, \epsilon, t}\!\left[\big\| \epsilon - \epsilon_\theta(x_t, t, m) \big\|_2^2\right]$$

Here, both $\mathcal{E}_\phi$ (the compressor) and $\epsilon_\theta$ (the diffusion backbone) are optimized jointly so that the memory representation specializes for prediction.
- Memory Budget:

$$|m| \le M \quad \text{for all } T$$

This enforces a constant context length irrespective of $T$.
4. Implementation Details and Hyper-parameters
- Temporal compression factor:
- Number of chunks:
- Chunk size: frames
- Memory budget: tokens
- Compressor encoder: a compact 3D-ResNet latent encoder.
- Optimizer: AdamW, learning rate
Computational Complexity:
- Baseline transformer: $O(T^2)$ attention and $O(T)$ storage
- Direct compression: $O((T/r)^2)$ attention
- Hierarchical compression: $O(M^2)$ for self-attention, with $O(K \cdot S/r)$ local and $O(M)$ final global storage.
- Empirical resource benchmark: On an 80 GB H800, memory plateaued at approximately 45 GB over long horizons.
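To make these asymptotics concrete, a quick back-of-the-envelope comparison of attention interaction counts; the values of $T$, $r$, and $M$ below are illustrative only, not the paper's configuration:

```python
def attention_cost(context_len: int) -> int:
    """Pairwise attention interactions scale quadratically in context length."""
    return context_len ** 2

T = 4096      # hypothetical number of history frames
r = 4         # hypothetical temporal compression factor
M = 256       # hypothetical fixed token budget

baseline = attention_cost(T)          # vanilla transformer: O(T^2)
direct = attention_cost(T // r)       # direct mode: O((T/r)^2)
hierarchical = attention_cost(M)      # hierarchical mode: O(M^2), flat in T

print(baseline, direct, hierarchical)  # 16777216 1048576 65536
```

Note that only the hierarchical cost stays constant as $T$ grows, which is what produces the observed GPU-memory plateau.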
Training Procedure:
- Pre-training with direct compression on short clips (up to 4 chunks) emphasizes local dynamics.
- Fine-tuning with hierarchical mode employs a Revisit-Dense dataset (30 minutes with loop-closure scenarios) to enhance long-range consistency.
Practical Stability Techniques:
- Align action-embedding downsampling with the temporal compression factor for token synchrony.
- Employ a binary context mask to distinguish denoised from frozen tokens.
- Joint gradient propagation through the compressor ensures the memory focuses on predictive latents.
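A minimal sketch of the binary context mask mentioned above; the convention (0 = frozen, 1 = denoised) and the token counts are assumptions for illustration:

```python
import numpy as np

def build_context_mask(n_memory: int, n_target: int) -> np.ndarray:
    """0 = frozen tokens (compressed history and latest clean latent),
    1 = noisy target latents currently being denoised."""
    return np.concatenate([np.zeros(n_memory, dtype=np.int8),
                           np.ones(n_target, dtype=np.int8)])

mask = build_context_mask(n_memory=256, n_target=16)
print(mask.shape, int(mask.sum()))  # (272,) 16
```

The mask is concatenated with the conditioning sequence so the DiT knows which positions receive diffusion noise and which are read-only memory.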
5. Pose-Free Anchoring and Temporal Consistency
No explicit geometry, pose, or viewpoint metadata is ingested by HPMC at any stage. Long-range consistency arises solely from the capacity of the compressor to recursively "discover" and preserve, via end-to-end optimization, those latent codes in the distant past that are essential for loop closure or for reinforcing a globally consistent scene layout. This enables the Infinite-World model to preserve coherent spatial memory over 1000+ frames through self-supervised, scene-predictive cues embedded in the latent history.
By constraining memory to a fixed token budget, HPMC also maintains bounded compute and storage, enabling simulation and generation across arbitrarily long trajectories without cumulative degradation in scene consistency.
6. Comparative Significance and Observed Outcomes
Empirical evaluation in Infinite-World demonstrated that HPMC enabled superior visual fidelity, robust action controllability, and improved spatial consistency over long video horizons, as assessed by objective and subjective metrics (Wu et al., 2 Feb 2026). The recursive memory compression yielded a plateau in GPU usage, contrasting with common quadratic resource growth in vanilla transformer approaches.
Unlike alternatives that leverage geometric priors or require frequent viewpoint revisits, HPMC's pose-free nature facilitates robust deployment on challenging, real-world data sources where pose estimation is noisy or infeasible.
7. Summary and Broader Implications
The Hierarchical Pose-free Memory Compressor is a lightweight, two-stage latent summarizer for scalable, pose-agnostic sequence memory in long-horizon generative video modeling. Its recursive, hierarchically compressed memory mechanism:
- Enables transformer-based models to condition on thousands of frames at fixed resource cost.
- Avoids reliance on explicit geometry or pose via end-to-end training that captures exactly those historical cues needed for spatial and temporal consistency.
- Is broadly applicable to real-world video sequences, including scenarios with rare viewpoint revisits, by jointly learning memory summarization with the generative diffusion backbone.
A plausible implication is that this form of pose-free memory compression supports the development of interactive world models for agents and applications where traditional geometric supervision is either impractical or unreliable, thus offering a scalable foundation for general long-horizon video synthesis and world simulation (Wu et al., 2 Feb 2026).