
Hierarchical Pose-free Memory Compressor

Updated 9 February 2026
  • The paper introduces HPMC, a pose-free memory compressor that achieves sub-linear memory complexity for long-horizon video modeling without requiring explicit pose data.
  • It employs a two-stage compression strategy that uses direct downsampling for short horizons and recursive, hierarchical compression for longer temporal sequences.
  • Empirical evaluations demonstrate improved spatial consistency and reduced GPU usage, enabling scalable and robust interactive world modeling over thousands of frames.

The Hierarchical Pose-free Memory Compressor (HPMC) is a compact, recursively compositional memory architecture introduced as the core component in Infinite-World, a model for interactive world modeling over thousand-frame video horizons without reliance on explicit pose information. HPMC enables a Video Diffusion Transformer (DiT) to attend to extensive temporal histories with a fixed, sub-linear memory and computational budget, while autonomously learning to preserve salient spatio-temporal cues necessary for visually and spatially consistent long-horizon simulation (Wu et al., 2 Feb 2026).

1. Objectives and Foundational Principles

HPMC is designed to address two primary limitations in conventional long-horizon video or world modeling:

  • Bounded, sub-linear resource usage: Attending to thousands of prior frames in a naively parameterized transformer leads to quadratic growth in attention compute and memory with respect to the temporal context $L$. HPMC constrains both memory and compute to a fixed budget, ensuring scalability as $L$ increases.
  • Pose-free anchoring: Traditional models often require precise pose or viewpoint metadata, which is unreliable or unattainable in real-world data. HPMC dispenses with explicit geometric priors, instead distilling, directly from the latent sequence, the features that promote spatial and semantic consistency, such as loop-closure landmarks, via joint training with the generative backbone.

By enabling end-to-end optimization of memory compression together with the world model, HPMC ensures retention of exactly those historical cues most predictive for future generation, without recourse to explicit camera pose (Wu et al., 2 Feb 2026).

2. Memory Compression Architecture

HPMC operates in one of two compression modes depending on the length $L$ of the context relative to a fixed token budget $T_{\max}$. The transition between modes is automatic.

Mode 1: Short-Horizon (Direct) Compression

When $L \leq k \cdot T_{\max}$, the entire history is directly downsampled along the temporal axis:

  • A temporal encoder $f_\phi$ transforms $z_{1:L} \in \mathbb{R}^{L \times d}$ into $z_{\text{com}} = f_\phi(z_{1:L}) \in \mathbb{R}^{(L/k) \times d}$, using a compression factor $k=4$.
  • The compressed history $z_{\text{com}}$ is concatenated with the latest latent $z_{\text{loc}} = z_L$, the noisy target latents $z_t$, and a binary mask $\mathbf{m}$ to form the DiT's conditioning.
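The direct mode can be illustrated with a minimal sketch. Here a strided temporal mean-pool stands in for the learned encoder $f_\phi$ (the paper's $f_\phi$ is a compact 3D-ResNet; the pooling below is only a toy assumption to show the shape arithmetic):

```python
import numpy as np

def direct_compress(z, k=4):
    """Toy stand-in for f_phi: mean-pool the history z (L x d) along
    the temporal axis with factor k, yielding (L/k) x d memory tokens."""
    L, d = z.shape
    assert L % k == 0, "history length must be divisible by k"
    return z.reshape(L // k, k, d).mean(axis=1)

# Example: a 64-frame history of 8-dim latents compresses to 16 tokens.
z = np.random.randn(64, 8)
z_com = direct_compress(z, k=4)
print(z_com.shape)  # (16, 8)
```

In the real model the pooled tokens would then be concatenated with $z_{\text{loc}}$, $z_t$, and $\mathbf{m}$ before entering the DiT.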

Mode 2: Long-Horizon (Hierarchical) Compression

When $L > k \cdot T_{\max}$, the latent sequence is recursively compressed in a hierarchical two-stage process:

  1. Local Compression:
    • Partition $z_{1:L}$ into $N$ (typically $N=5$) overlapping chunks of size $W=64$ frames using a sliding window of stride $S$ such that the full $L$ frames are covered.
    • Each chunk $\mathrm{Chunk}_i \in \mathbb{R}^{W \times d}$ is compressed via $f_\phi$ to $h_i = f_\phi(\mathrm{Chunk}_i) \in \mathbb{R}^{(W/k) \times d} = \mathbb{R}^{16 \times d}$.
  2. Global Compression:
    • Concatenate the local summaries $H = \mathrm{Concat}(h_1, ..., h_N) \in \mathbb{R}^{(N W / k) \times d} = \mathbb{R}^{80 \times d}$.
    • Apply $f_\phi$ again: $z_{\text{com}} = f_\phi(H) \in \mathbb{R}^{T_{\max} \times d}$, with $T_{\max} = 20$ fixed.

This architecture guarantees that the final memory footprint never exceeds $T_{\max}$ tokens, yielding $O(T_{\max}^2)$ self-attention cost and $O(T_{\max} d)$ storage at the global compression stage.

| Mode | Input Size | Compression Steps | Output Context Size |
|---|---|---|---|
| Direct | $L \leq k\,T_{\max}$ | Single $f_\phi$ downsampling | $L/k$ |
| Hierarchical | $L > k\,T_{\max}$ | Chunking $\rightarrow$ 2-stage $f_\phi$ | $T_{\max}$ |
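The two-stage hierarchical pass can be sketched end to end. As above, a factor-$k$ mean-pool is an assumed stand-in for the shared learned encoder $f_\phi$, and the stride formula $S = (L - W)/(N - 1)$ is one plausible way to cover all $L$ frames with $N$ windows (the paper does not spell out the exact stride):

```python
import numpy as np

def pool(z, k=4):
    """Stand-in for the shared encoder f_phi: factor-k temporal mean-pool."""
    L, d = z.shape
    return z.reshape(L // k, k, d).mean(axis=1)

def hierarchical_compress(z, N=5, W=64, k=4, T_max=20):
    """Two-stage compression: N W-frame chunks -> local summaries of
    W/k tokens each -> one global pass down to T_max tokens."""
    L, d = z.shape
    S = (L - W) // (N - 1)           # stride chosen so the chunks span all L frames
    chunks = [z[i * S : i * S + W] for i in range(N)]
    H = np.concatenate([pool(c, k) for c in chunks])  # (N*W/k) x d = 80 x d
    z_com = pool(H, k)                                # T_max x d = 20 x d
    assert z_com.shape[0] == T_max
    return z_com

z = np.random.randn(320, 8)            # L = 320 > k * T_max = 80
print(hierarchical_compress(z).shape)  # (20, 8)
```

Note that with the default hyper-parameters the global stage is simply another factor-4 reduction, from 80 concatenated summary tokens down to $T_{\max} = 20$.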

3. Mathematical Formulation and Training Objective

The following key equations govern HPMC operation:

  • Direct Compression (Mode 1):

$$z_{\text{com}} = f_\phi(z_{1:L}), \qquad z_{\text{com}} \in \mathbb{R}^{(L/4) \times d}$$

  • Hierarchical Compression (Mode 2):

$$z_{\text{com}} = f_\phi\Bigl(\mathrm{Concat}\{ f_\phi(\mathrm{Chunk}_1), ..., f_\phi(\mathrm{Chunk}_N)\}\Bigr) \in \mathbb{R}^{T_{\max} \times d}$$

$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z_0,\,\varepsilon,\,t}\; \bigl\lVert \varepsilon - \varepsilon_{\theta}\bigl(z_t,\,t;\, [z_{\text{com}},\,z_{\text{loc}},\,\mathbf{m}]\bigr) \bigr\rVert_2^2$$

Here, both $f_\phi$ (the compressor) and $\varepsilon_\theta$ (the diffusion backbone) are optimized jointly so that the memory representation specializes for prediction.

  • Memory Budget:

$$|z_{\text{com}}|_{\text{time}} \leq T_{\max}, \qquad T_{\max} = 20$$

This enforces a constant context length irrespective of $L$.
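The $\varepsilon$-prediction objective above can be sketched numerically. The noising schedule and the zero-predicting stand-in backbone below are illustrative assumptions, not the paper's actual DiT or scheduler; the point is only the structure of the loss, where the conditioning $[z_{\text{com}}, z_{\text{loc}}, \mathbf{m}]$ is passed alongside the noisy latent:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(z0, z_com, z_loc, m, eps_theta):
    """Toy epsilon-prediction objective: noise z0, ask the backbone to
    recover the noise given [z_com, z_loc, m], take the MSE."""
    t = rng.uniform(0.0, 1.0)                     # sampled diffusion time
    eps = rng.standard_normal(z0.shape)           # target noise
    z_t = np.sqrt(1 - t) * z0 + np.sqrt(t) * eps  # simple noising schedule (assumption)
    eps_hat = eps_theta(z_t, t, z_com, z_loc, m)
    return np.mean((eps - eps_hat) ** 2)

# Stand-in backbone that ignores its conditioning and predicts zeros.
eps_theta = lambda z_t, t, z_com, z_loc, m: np.zeros_like(z_t)
loss = diffusion_loss(rng.standard_normal((16, 8)),
                      np.zeros((20, 8)), np.zeros((1, 8)), np.ones(21), eps_theta)
print(loss > 0)  # True
```

In training, gradients of this loss flow through both $\varepsilon_\theta$ and, via $z_{\text{com}}$, through the compressor $f_\phi$.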

4. Implementation Details and Hyper-parameters

  • Temporal compression factor: $k=4$
  • Number of chunks: $N=5$
  • Chunk size: $W=64$ frames
  • Memory budget: $T_{\max}=20$ tokens
  • Compressor encoder: $f_\phi$ is a compact 3D-ResNet latent encoder.
  • Optimizer: AdamW, learning rate $1 \times 10^{-5}$

Computational Complexity:

  • Baseline transformer attention/storage: $O(L^2)$
  • Direct compression: $O(L \cdot L/k)$
  • Hierarchical compression: $O(T_{\max}^2)$ for self-attention, with local $O(N W d)$ and final global $O(T_{\max} d)$ storage.
  • Empirical resource benchmark: On an 80 GB H800, memory plateaued at approximately 45 GB for $L > 180$.
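The memory plateau follows directly from the mode switch: the number of memory tokens grows as $L/k$ only until $L = k \cdot T_{\max}$, after which it is clamped. A one-line sketch of the token budget:

```python
def context_tokens(L, k=4, T_max=20):
    """Memory tokens the DiT conditions on for an L-frame history:
    L/k in direct mode, capped at T_max once hierarchical mode engages."""
    return L // k if L <= k * T_max else T_max

for L in (16, 64, 80, 320, 4000):
    print(L, context_tokens(L))
# The token count never exceeds T_max = 20, whatever L is.
```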

Training Procedure:

  1. Pre-training with direct compression on short clips (up to 4 chunks) emphasizes local dynamics.
  2. Fine-tuning with hierarchical mode employs a Revisit-Dense dataset (30 minutes with loop-closure scenarios) to enhance long-range consistency.

Practical Stability Techniques:

  • Align action-embedding downsampling with $k$ for token synchrony.
  • Employ context mask $\mathbf{m}$ to distinguish denoised and frozen tokens.
  • Joint gradient propagation through $f_\phi$ ensures memory focuses on predictive latents.
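The context mask $\mathbf{m}$ can be sketched as a binary vector over the DiT token sequence. The token layout $[z_{\text{com}}, z_{\text{loc}}, z_t]$ and the 0/1 convention below are assumptions for illustration; the paper only states that $\mathbf{m}$ separates frozen context tokens from the noisy tokens being denoised:

```python
import numpy as np

def build_context_mask(n_com, n_loc, n_target):
    """Binary mask m over the DiT token sequence: 0 for frozen context
    tokens (compressed memory + latest latent), 1 for the noisy target
    tokens being denoised. Layout [z_com, z_loc, z_t] is an assumption."""
    return np.concatenate([np.zeros(n_com + n_loc), np.ones(n_target)])

m = build_context_mask(n_com=20, n_loc=1, n_target=16)
print(m.sum())  # 16.0 -- only the target tokens are denoised
```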

5. Pose-Free Anchoring and Temporal Consistency

No explicit geometry, pose, or viewpoint metadata is ingested by HPMC at any stage. The mechanism for long-range consistency arises solely from the capacity of $f_\phi$ to recursively "discover" and preserve, via end-to-end optimization, those latent codes in the distant past that are essential for loop-closure or to reinforce a globally consistent scene layout. This enables the Infinite-World model to preserve coherent spatial memory over 1000+ frames solely through self-supervised, scene-predictive cues embedded in the latent history.

By constraining memory to $T_{\max}$ tokens, HPMC also maintains bounded compute and storage, enabling simulation and generation across arbitrarily long trajectories without cumulative degradation in scene consistency.

6. Comparative Significance and Observed Outcomes

Empirical evaluation in Infinite-World demonstrated that HPMC enabled superior visual fidelity, robust action controllability, and improved spatial consistency over long video horizons, as assessed by objective and subjective metrics (Wu et al., 2 Feb 2026). The recursive memory compression yielded a plateau in GPU usage, contrasting with common quadratic resource growth in vanilla transformer approaches.

Unlike alternatives that leverage geometric priors or require frequent viewpoint revisits, HPMC's pose-free nature facilitates robust deployment on challenging, real-world data sources where pose estimation is noisy or infeasible.

7. Summary and Broader Implications

The Hierarchical Pose-free Memory Compressor is a lightweight, two-stage latent summarizer for scalable, pose-agnostic sequence memory in long-horizon generative video modeling. Its recursive, hierarchically-compressed memory mechanism:

  • Enables transformer-based models to condition on thousands of frames at fixed resource cost.
  • Avoids reliance on explicit geometry or pose via end-to-end training that captures exactly those historical cues needed for spatial and temporal consistency.
  • Is broadly applicable to real-world video sequences, including scenarios with rare viewpoint revisits, by jointly learning memory summarization with the generative diffusion backbone.

A plausible implication is that this form of pose-free memory compression supports the development of interactive world models for agents and applications where traditional geometric supervision is either impractical or unreliable, thus offering a scalable foundation for general long-horizon video synthesis and world simulation (Wu et al., 2 Feb 2026).
