
Hierarchical Pose-free Memory Compressor

Updated 9 February 2026
  • The paper introduces HPMC, a pose-free memory compressor that achieves sub-linear memory complexity for long-horizon video modeling without requiring explicit pose data.
  • It employs a two-stage compression strategy that uses direct downsampling for short horizons and recursive, hierarchical compression for longer temporal sequences.
  • Empirical evaluations demonstrate improved spatial consistency and reduced GPU usage, enabling scalable and robust interactive world modeling over thousands of frames.

The Hierarchical Pose-free Memory Compressor (HPMC) is a compact, recursively compositional memory architecture introduced as the core component in Infinite-World, a model for interactive world modeling over thousand-frame video horizons without reliance on explicit pose information. HPMC enables a Video Diffusion Transformer (DiT) to attend to extensive temporal histories with a fixed, sub-linear memory and computational budget, while autonomously learning to preserve salient spatio-temporal cues necessary for visually and spatially consistent long-horizon simulation (Wu et al., 2 Feb 2026).

1. Objectives and Foundational Principles

HPMC is designed to address two primary limitations in conventional long-horizon video or world modeling:

  • Bounded, sub-linear resource usage: Attending to thousands of prior frames in a naively parameterized transformer leads to quadratic growth in attention compute and memory with respect to the temporal context $L$. HPMC constrains both memory and compute to a fixed budget, ensuring scalability as $L$ increases.
  • Pose-free anchoring: Traditional models often require precise pose or viewpoint metadata, which is unreliable or unattainable in real-world data. HPMC dispenses with explicit geometric priors, instead distilling, directly from the latent sequence, the features that promote spatial and semantic consistency, such as loop-closure landmarks, via joint training with the generative backbone.

By enabling end-to-end optimization of memory compression together with the world model, HPMC ensures retention of exactly those historical cues most predictive for future generation, without recourse to explicit camera pose (Wu et al., 2 Feb 2026).

2. Memory Compression Architecture

HPMC operates in one of two compression modes depending on the length $L$ of the context relative to a fixed token budget $T_{\max}$. The transition between modes is automatic.

Mode 1: Short-Horizon (Direct) Compression

When $L \leq k \cdot T_{\max}$, the entire history is directly downsampled along the temporal axis:

  • A temporal encoder $f_\phi$ transforms $z_{1:L} \in \mathbb{R}^{L \times d}$ into $z_{\text{com}} = f_\phi(z_{1:L}) \in \mathbb{R}^{(L/k) \times d}$, using a compression factor $k=4$.
  • The compressed history $z_{\text{com}}$ is concatenated with the latest latent $z_{\text{loc}} = z_L$, the noisy target latents $z_t$, and a binary mask $\mathbf{m}$ to form the DiT's conditioning.
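The direct mode can be illustrated with a minimal sketch. Here a strided temporal mean-pool stands in for the learned encoder $f_\phi$ (the paper's $f_\phi$ is a compact 3D-ResNet; the pooling below is only a toy assumption to show the shape arithmetic):

```python
import numpy as np

def direct_compress(z, k=4):
    """Toy stand-in for f_phi: mean-pool the history z (L x d) along
    the temporal axis with factor k, yielding (L/k) x d memory tokens."""
    L, d = z.shape
    assert L % k == 0, "history length must be divisible by k"
    return z.reshape(L // k, k, d).mean(axis=1)

# Example: a 64-frame history of 8-dim latents compresses to 16 tokens.
z = np.random.randn(64, 8)
z_com = direct_compress(z, k=4)
print(z_com.shape)  # (16, 8)
```

In the real model the pooled tokens would then be concatenated with $z_{\text{loc}}$, $z_t$, and $\mathbf{m}$ before entering the DiT.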

Mode 2: Long-Horizon (Hierarchical) Compression

When $L > k \cdot T_{\max}$, the latent sequence is recursively compressed in a hierarchical two-stage process:

  1. Local Compression:
    • Partition $z_{1:L}$ into $N$ (typically $N=5$) overlapping chunks of size $W=64$ frames using a sliding window of stride $S$ such that the full $L$ frames are covered.
    • Each chunk $\mathrm{Chunk}_i \in \mathbb{R}^{W \times d}$ is compressed via $f_\phi$ to $h_i = f_\phi(\mathrm{Chunk}_i) \in \mathbb{R}^{(W/k) \times d} = \mathbb{R}^{16 \times d}$.
  2. Global Compression:
    • Concatenate the local summaries $H = \mathrm{Concat}(h_1, ..., h_N) \in \mathbb{R}^{(N W / k) \times d} = \mathbb{R}^{80 \times d}$.
    • Apply $f_\phi$ again: $z_{\text{com}} = f_\phi(H) \in \mathbb{R}^{T_{\max} \times d}$, with $T_{\max} = 20$ fixed.

This architecture guarantees that the final memory footprint never exceeds $T_{\max}$ tokens, yielding $O(T_{\max}^2)$ self-attention cost and $O(T_{\max} d)$ storage at the global compression stage.

| Mode | Input Size | Compression Steps | Output Context Size |
|---|---|---|---|
| Direct | $L \leq k\,T_{\max}$ | Single $f_\phi$ downsampling | $L/k$ |
| Hierarchical | $L > k\,T_{\max}$ | Chunking $\rightarrow$ 2-stage $f_\phi$ | $T_{\max}$ |
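The two-stage hierarchical pass can be sketched end to end. As above, a factor-$k$ mean-pool is an assumed stand-in for the shared learned encoder $f_\phi$, and the stride formula $S = (L - W)/(N - 1)$ is one plausible way to cover all $L$ frames with $N$ windows (the paper does not spell out the exact stride):

```python
import numpy as np

def pool(z, k=4):
    """Stand-in for the shared encoder f_phi: factor-k temporal mean-pool."""
    L, d = z.shape
    return z.reshape(L // k, k, d).mean(axis=1)

def hierarchical_compress(z, N=5, W=64, k=4, T_max=20):
    """Two-stage compression: N W-frame chunks -> local summaries of
    W/k tokens each -> one global pass down to T_max tokens."""
    L, d = z.shape
    S = (L - W) // (N - 1)           # stride chosen so the chunks span all L frames
    chunks = [z[i * S : i * S + W] for i in range(N)]
    H = np.concatenate([pool(c, k) for c in chunks])  # (N*W/k) x d = 80 x d
    z_com = pool(H, k)                                # T_max x d = 20 x d
    assert z_com.shape[0] == T_max
    return z_com

z = np.random.randn(320, 8)            # L = 320 > k * T_max = 80
print(hierarchical_compress(z).shape)  # (20, 8)
```

Note that with the default hyper-parameters the global stage is simply another factor-4 reduction, from 80 concatenated summary tokens down to $T_{\max} = 20$.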

3. Mathematical Formulation and Training Objective

The following key equations govern HPMC operation:

  • Direct Compression (Mode 1):

$$z_{\text{com}} = f_\phi(z_{1:L}), \qquad z_{\text{com}} \in \mathbb{R}^{(L/4) \times d}$$

  • Hierarchical Compression (Mode 2):

$$z_{\text{com}} = f_\phi\Bigl(\mathrm{Concat}\{ f_\phi(\mathrm{Chunk}_1), ..., f_\phi(\mathrm{Chunk}_N)\}\Bigr) \in \mathbb{R}^{T_{\max} \times d}$$

$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z_0,\,\varepsilon,\,t}\; \bigl\lVert \varepsilon - \varepsilon_{\theta}\bigl(z_t,\,t;\, [z_{\text{com}},\,z_{\text{loc}},\,\mathbf{m}]\bigr) \bigr\rVert_2^2$$

Here, both $f_\phi$ (the compressor) and $\varepsilon_\theta$ (the diffusion backbone) are optimized jointly so that the memory representation specializes for prediction.

  • Memory Budget:

$$|z_{\text{com}}|_{\text{time}} \leq T_{\max}, \qquad T_{\max} = 20$$

This enforces a constant context length irrespective of $L$.
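The $\varepsilon$-prediction objective above can be sketched numerically. The noising schedule and the zero-predicting stand-in backbone below are illustrative assumptions, not the paper's actual DiT or scheduler; the point is only the structure of the loss, where the conditioning $[z_{\text{com}}, z_{\text{loc}}, \mathbf{m}]$ is passed alongside the noisy latent:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(z0, z_com, z_loc, m, eps_theta):
    """Toy epsilon-prediction objective: noise z0, ask the backbone to
    recover the noise given [z_com, z_loc, m], take the MSE."""
    t = rng.uniform(0.0, 1.0)                     # sampled diffusion time
    eps = rng.standard_normal(z0.shape)           # target noise
    z_t = np.sqrt(1 - t) * z0 + np.sqrt(t) * eps  # simple noising schedule (assumption)
    eps_hat = eps_theta(z_t, t, z_com, z_loc, m)
    return np.mean((eps - eps_hat) ** 2)

# Stand-in backbone that ignores its conditioning and predicts zeros.
eps_theta = lambda z_t, t, z_com, z_loc, m: np.zeros_like(z_t)
loss = diffusion_loss(rng.standard_normal((16, 8)),
                      np.zeros((20, 8)), np.zeros((1, 8)), np.ones(21), eps_theta)
print(loss > 0)  # True
```

In training, gradients of this loss flow through both $\varepsilon_\theta$ and, via $z_{\text{com}}$, through the compressor $f_\phi$.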

4. Implementation Details and Hyper-parameters

  • Temporal compression factor: $k=4$
  • Number of chunks: $N=5$
  • Chunk size: $W=64$ frames
  • Memory budget: $T_{\max}=20$ tokens
  • Compressor encoder: $f_\phi$ is a compact 3D-ResNet latent encoder.
  • Optimizer: AdamW, learning rate $1 \times 10^{-5}$

Computational Complexity:

  • Baseline transformer attention/storage: $O(L^2)$
  • Direct compression: $O(L \cdot L/k)$
  • Hierarchical compression: $O(T_{\max}^2)$ for self-attention, with local $O(N W d)$ and final global $O(T_{\max} d)$ storage.
  • Empirical resource benchmark: On an 80 GB H800, memory plateaued at approximately 45 GB for $L > 180$.
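The memory plateau follows directly from the mode switch: the number of memory tokens grows as $L/k$ only until $L = k \cdot T_{\max}$, after which it is clamped. A one-line sketch of the token budget:

```python
def context_tokens(L, k=4, T_max=20):
    """Memory tokens the DiT conditions on for an L-frame history:
    L/k in direct mode, capped at T_max once hierarchical mode engages."""
    return L // k if L <= k * T_max else T_max

for L in (16, 64, 80, 320, 4000):
    print(L, context_tokens(L))
# The token count never exceeds T_max = 20, whatever L is.
```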

Training Procedure:

  1. Pre-training with direct compression on short clips (up to 4 chunks) emphasizes local dynamics.
  2. Fine-tuning with hierarchical mode employs a Revisit-Dense dataset (30 minutes with loop-closure scenarios) to enhance long-range consistency.

Practical Stability Techniques:

  • Align action-embedding downsampling with $k$ for token synchrony.
  • Employ context mask $\mathbf{m}$ to distinguish denoised and frozen tokens.
  • Joint gradient propagation through $f_\phi$ ensures memory focuses on predictive latents.
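The context mask $\mathbf{m}$ can be sketched as a binary vector over the DiT token sequence. The token layout $[z_{\text{com}}, z_{\text{loc}}, z_t]$ and the 0/1 convention below are assumptions for illustration; the paper only states that $\mathbf{m}$ separates frozen context tokens from the noisy tokens being denoised:

```python
import numpy as np

def build_context_mask(n_com, n_loc, n_target):
    """Binary mask m over the DiT token sequence: 0 for frozen context
    tokens (compressed memory + latest latent), 1 for the noisy target
    tokens being denoised. Layout [z_com, z_loc, z_t] is an assumption."""
    return np.concatenate([np.zeros(n_com + n_loc), np.ones(n_target)])

m = build_context_mask(n_com=20, n_loc=1, n_target=16)
print(m.sum())  # 16.0 -- only the target tokens are denoised
```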

5. Pose-Free Anchoring and Temporal Consistency

No explicit geometry, pose, or viewpoint metadata is ingested by HPMC at any stage. The mechanism for long-range consistency arises solely from the capacity of $f_\phi$ to recursively "discover" and preserve, via end-to-end optimization, those latent codes in the distant past that are essential for loop-closure or to reinforce a globally consistent scene layout. This enables the Infinite-World model to preserve coherent spatial memory over 1000+ frames solely through self-supervised, scene-predictive cues embedded in the latent history.

By constraining memory to $T_{\max}$ tokens, HPMC also maintains bounded compute and storage, enabling simulation and generation across arbitrarily long trajectories without cumulative degradation in scene consistency.

6. Comparative Significance and Observed Outcomes

Empirical evaluation in Infinite-World demonstrated that HPMC enabled superior visual fidelity, robust action controllability, and improved spatial consistency over long video horizons, as assessed by objective and subjective metrics (Wu et al., 2 Feb 2026). The recursive memory compression yielded a plateau in GPU usage, contrasting with common quadratic resource growth in vanilla transformer approaches.

Unlike alternatives that leverage geometric priors or require frequent viewpoint revisits, HPMC's pose-free nature facilitates robust deployment on challenging, real-world data sources where pose estimation is noisy or infeasible.

7. Summary and Broader Implications

The Hierarchical Pose-free Memory Compressor is a lightweight, two-stage latent summarizer for scalable, pose-agnostic sequence memory in long-horizon generative video modeling. Its recursive, hierarchically-compressed memory mechanism:

  • Enables transformer-based models to condition on thousands of frames at fixed resource cost.
  • Avoids reliance on explicit geometry or pose via end-to-end training that captures exactly those historical cues needed for spatial and temporal consistency.
  • Is broadly applicable to real-world video sequences, including scenarios with rare viewpoint revisits, by jointly learning memory summarization with the generative diffusion backbone.

A plausible implication is that this form of pose-free memory compression supports the development of interactive world models for agents and applications where traditional geometric supervision is either impractical or unreliable, thus offering a scalable foundation for general long-horizon video synthesis and world simulation (Wu et al., 2 Feb 2026).
