
Slot-Based Historical Image Compression

Updated 13 March 2026
  • The paper introduces SHIC, which uses recurrent slot attention and GRU updates to compress lengthy visual histories into fixed-length embeddings for UAV navigation tasks.
  • SHIC processes multi-view UAV images by converting thousands of visual tokens into a concise set of learnable slot embeddings, enabling efficient long-horizon reasoning.
  • The module achieves over 3000× compression while preserving crucial semantic context and maintaining bounded memory and compute costs.

Slot-Based Historical Image Compression (SHIC) is an architectural module for efficiently distilling high-dimensional, temporally extended visual histories into compact, expressive fixed-length representations. First introduced within the LongFly framework for long-horizon UAV vision-and-language navigation, SHIC transforms large-scale multi-view image sequences into a concise set of learnable slot embeddings. Through recurrent attention and gated updates, SHIC preserves crucial semantic and landmark information over extensive visual histories while maintaining strictly bounded memory and compute costs. Its integration into complex, spatiotemporally aware agents enables tractable reasoning over long observation windows without prohibitive resource demands (Jiang et al., 26 Dec 2025).

1. Role within the LongFly Pipeline

SHIC operates as the initial sub-module in LongFly’s history-aware context modeling architecture for UAV navigation. The overall system comprises three major components:

  1. SHIC: Slot-based Historical Image Compression
  2. STE: Spatio-Temporal Trajectory Encoding
  3. PGM: Prompt-Guided Multimodal Integration

At each decision step $t$, SHIC processes the accumulated visual history, specifically all multi-view images $\{R_1,\ldots,R_{t-1}\}$, to produce a fixed-length matrix of slot embeddings $S_{t-1} \in \mathbb{R}^{K \times d}$. This set $S_{t-1}$, referred to as the compressed slot “memory,” feeds directly into the later multimodal PGM module, together with STE tokens and an instruction $L$. By abstracting the variable-length sequence of raw visual tokens to a fixed number of semantically rich slot representations, SHIC facilitates scalable long-horizon temporal reasoning while overcoming the bottlenecks associated with storing and computing over the entire observation history.

2. Inputs, Outputs, and Dataflow

At each time step $i = 1, \ldots, t-1$, the following inputs are provided to SHIC:

  • Multi-view RGB images $R_i = \{R_i^{(1)}, \ldots, R_i^{(5)}\}$ from front, rear, left, right, and bottom UAV cameras.
  • A pretrained CLIP-based visual encoder $F_v$ maps each image to spatial tokens:

$$Z_i = F_v(R_i) = \{z_{i,1}, \ldots, z_{i,N_i}\}, \quad z_{i,j} \in \mathbb{R}^d$$

A typical dimensionality is $d = 768$; the $N_i$ tokens at each step come from the patch grids of the five views.

After iterative integration of all $t-1$ histories, the output is a set of $K$ slots:

$$S_{t-1} = \{s_1, \ldots, s_K\} \in \mathbb{R}^{K \times d}$$

Here, $K$ is a hyperparameter that sets the number of slots. This output matrix substitutes for the vastly larger token set $\{Z_1, \ldots, Z_{t-1}\}$ in downstream processing.
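To make the shapes in this dataflow concrete, the following is a minimal sketch of how five per-view token sets could be concatenated into a single $Z_i$. The function `encode_view_stub` is a hypothetical stand-in for $F_v$ (it only averages image patches and is not CLIP), and the grid size, image size, and resulting feature dimension are illustrative assumptions rather than values from the paper.

```python
import numpy as np

VIEWS = ("front", "rear", "left", "right", "bottom")

def encode_view_stub(image, grid=4):
    """Hypothetical stand-in for F_v: mean-pool a grid x grid patch layout.

    Returns grid*grid tokens whose dimension equals the channel count.
    A real CLIP encoder would return learned features instead.
    """
    h, w, c = image.shape
    ph, pw = h // grid, w // grid
    patches = image[: ph * grid, : pw * grid].reshape(grid, ph, grid, pw, c)
    return patches.mean(axis=(1, 3)).reshape(grid * grid, c)

def encode_step(images):
    """Concatenate the tokens of all five views into Z_i with shape (N_i, d)."""
    return np.concatenate([encode_view_stub(images[v]) for v in VIEWS], axis=0)

# One time step of dummy 32x32 RGB images (assumed sizes, for illustration).
images = {v: np.zeros((32, 32, 3)) for v in VIEWS}
z_i = encode_step(images)
print(z_i.shape)  # (80, 3): 5 views x 16 tokens, stub feature dimension 3
```

In this toy setting $N_i = 80$; the point is only that every step contributes a fresh block of tokens, so the uncompressed history grows linearly with $t$.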

3. Core Mathematical Operations

SHIC compresses historical visual tokens via recurrent slot attention augmented by Gated Recurrent Unit (GRU) updates:

(a) Initialization

At $i = 0$, initialize the slot matrix as learnable parameters:

$$S_0 \in \mathbb{R}^{K \times d}$$

(b) Query-Key-Value (QKV) Projection

For each subsequent time step $i = 1, \ldots, t-1$:

  • For slots from previous step: $S_{i-1} \in \mathbb{R}^{K \times d}$
  • For new tokens: $Z_i \in \mathbb{R}^{N_i \times d}$

Project into QKV space:

$$Q_i = S_{i-1} W_Q, \quad K_i = Z_i W_K, \quad V_i = Z_i W_V$$

with $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$. Here $K_i$ denotes the key matrix and is distinct from the slot count $K$.

(c) Slot-Token Attention

Compute slot-to-token attention via scaled dot-product:

$$A_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{K \times N_i}$$

This yields an affinity matrix between all $K$ slots and $N_i$ tokens.

(d) Token Aggregation and Slot Update

Update each slot with a weighted sum of token values:

$$u_{i,k} = \sum_{j=1}^{N_i} A_i[k,j]\, v_{i,j}$$

where $v_{i,j}$ is the $j$-th row of $V_i$. Aggregate into matrix $U_i = A_i V_i \in \mathbb{R}^{K \times d}$.

(e) Recurrent (GRU) Fusion

Update slots recurrently, with $U_i$ as the GRU input and $S_{i-1}$ as its hidden state:

$$S_i = \mathrm{GRU}(U_i, S_{i-1})$$

After all steps, output $S_{t-1}$ as the compressed visual memory.

4. Pseudocode and Training Approach

The SHIC compression process can be summarized as follows:

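    Input:  token sets Z_1, ..., Z_{t-1}; learnable initial slots S_0
    S <- S_0
    for i = 1, ..., t-1:
        Q <- S W_Q;  K_i <- Z_i W_K;  V <- Z_i W_V     # QKV projection
        A <- softmax(Q K_i^T / sqrt(d))                # slot-token attention
        U <- A V                                       # token aggregation
        S <- GRU(U, S)                                 # recurrent fusion
    Output: S  (= S_{t-1}, the compressed visual memory)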
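A minimal NumPy sketch of the recurrence described in Section 3 may help make the shapes and update order concrete. It is an illustration under stated assumptions, not the authors' implementation: the class name `SHICSketch`, the random (untrained) weights, the bias-free GRU cell, and the choice to normalize the softmax over tokens are all assumptions introduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SHICSketch:
    """Illustrative slot-attention + GRU compressor (hypothetical, untrained)."""

    def __init__(self, num_slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.dim = dim
        # S_0: would be learnable parameters; randomly initialised here.
        self.slots0 = rng.normal(0.0, scale, (num_slots, dim))
        # Q/K/V projection matrices, each (d, d).
        self.w_q, self.w_k, self.w_v = (
            rng.normal(0.0, scale, (dim, dim)) for _ in range(3)
        )
        # Bias-free GRU weights for update gate, reset gate, and candidate.
        self.w_in = rng.normal(0.0, scale, (3, dim, dim))  # act on input U_i
        self.w_h = rng.normal(0.0, scale, (3, dim, dim))   # act on state S_{i-1}

    def gru(self, u, s):
        z = sigmoid(u @ self.w_in[0] + s @ self.w_h[0])        # update gate
        r = sigmoid(u @ self.w_in[1] + s @ self.w_h[1])        # reset gate
        n = np.tanh(u @ self.w_in[2] + (r * s) @ self.w_h[2])  # candidate state
        return (1.0 - z) * n + z * s

    def step(self, slots, tokens):
        q = slots @ self.w_q    # (K, d)
        k = tokens @ self.w_k   # (N_i, d)
        v = tokens @ self.w_v   # (N_i, d)
        # (K, N_i) affinities; each row sums to 1 (assumed normalisation axis).
        attn = softmax(q @ k.T / np.sqrt(self.dim), axis=-1)
        updates = attn @ v      # (K, d) weighted sums of token values
        return self.gru(updates, slots)

    def compress(self, history):
        slots = self.slots0
        for tokens in history:  # one Z_i per past time step
            slots = self.step(slots, tokens)
        return slots

# Toy history: three steps with varying token counts N_i and d = 16.
rng = np.random.default_rng(1)
history = [rng.normal(size=(n, 16)) for n in (40, 40, 25)]
model = SHICSketch(num_slots=4, dim=16)
memory = model.compress(history)
print(memory.shape)  # (4, 16)
```

Whatever the length of `history`, the returned memory keeps the shape `(num_slots, dim)`, which is the bounded-memory property the module is designed around.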

There is no explicit reconstruction or auxiliary loss in SHIC; its parameters (QKV projections, GRU, slot initialization) are updated jointly with the entire LongFly model. Supervised losses—such as those arising in PGM for waypoint regression—implicitly drive the training of SHIC parameters via back-propagation (Jiang et al., 26 Dec 2025).

5. Hyperparameters and Trade-Offs

The primary hyperparameters in SHIC are $K$ (number of slots) and $d$ (feature dimension):

  • $K$ (slots): Controls the memory capacity and compression fidelity. Smaller $K$ (e.g., 8) yields higher compression and lower compute, at the expense of semantic detail. Larger $K$ increases memory and fidelity but raises computational cost.
  • $d$ (token/slot dimension): Determined by the CLIP encoder, e.g. $d = 768$.
  • $N_i$ (tokens per step): The total token count produced from the 5 views.
  • Softmax temperature $\sqrt{d}$: Stabilizes gradient flow during slot-token attention.
  • Compression ratio: With $t-1$ history steps of $N_i$ tokens each reduced to $K$ slots of the same dimension $d$:

$$\text{ratio} = \frac{\sum_{i=1}^{t-1} N_i}{K}$$

  • Computational complexity:
    • Attention: $O(K N_i d)$ per step.
    • Memory: $O(K d)$, compared to $O(t N_i d)$ without compression.
  • Returns from increasing $K$ diminish past a certain point; excess slots often become redundant.
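As a rough worked example of these trade-offs, the sketch below computes a compression ratio and the corresponding memory footprints. It assumes the ratio is defined as total history tokens divided by the slot count, and the numeric values (100 steps, 1280 tokens per step, 32 slots) are purely illustrative placeholders, not settings reported for LongFly.

```python
def compression_ratio(num_steps, tokens_per_step, num_slots):
    """Total history tokens divided by the number of slots kept (assumed definition)."""
    return (num_steps * tokens_per_step) / num_slots

def memory_floats(num_steps, tokens_per_step, num_slots, dim):
    """Return (uncompressed, compressed) storage, counted in floats."""
    return num_steps * tokens_per_step * dim, num_slots * dim

# Illustrative values only; none of these come from the paper.
ratio = compression_ratio(num_steps=100, tokens_per_step=1280, num_slots=32)
raw, compressed = memory_floats(100, 1280, 32, dim=768)
print(ratio)            # 4000.0
print(raw, compressed)  # 98304000 24576
```

Doubling the number of history steps doubles `raw` and the ratio, while `compressed` is unchanged, which is why the slot count rather than the horizon determines the deployment footprint.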

These trade-offs control real-time performance, with practical $K$ values ensuring that the slot memory fits deployment constraints while maximizing informative context retention.

6. Architectural Dynamics and Information Flow

Conceptually, SHIC instantiates a fixed set of $K$ semantic “buckets,” each aggregating and abstracting distributed visual information across the trajectory:

  • At each time step, five camera images generate a “cloud” of $N_i$ CLIP tokens.
  • The $K$ slot nodes, initialized with learnable parameters, evolve recurrently over time as each attends to and integrates new observational tokens.
  • Slot attention yields incremental slot-wise updates, temporally propagating and refining landmark and contextual information through recurrent GRU fusion.
  • At the conclusion of the historical window, $S_{t-1}$ comprises a checkpointed set of embeddings characterizing the salient visual context for integration with trajectory and instruction data in downstream navigation planning.

This recurrent, slot-centric abstraction scheme enables generalization to arbitrarily long observation histories, eliminating the need for computationally intensive global attention over all tokens.

7. Compression Efficacy and Limitations

SHIC achieves substantial compression of spatiotemporal visual history, with empirical compression ratios exceeding 3,000× under typical navigation horizons and tokenization schemes. The architecture explicitly bounds memory and per-step compute irrespective of visual history length.

A plausible implication is that, while increasing $K$ improves representational fidelity, task performance improvements saturate rapidly and surplus slots add redundancy rather than new information. Moreover, the absence of explicit reconstruction loss may limit the preservation of certain fine-grained details. Nonetheless, in applied contexts such as vision-and-language navigation, SHIC demonstrates sufficient semantic precision, outperforming alternative memory architectures in both success rate and path-length-weighted success metrics (Jiang et al., 26 Dec 2025).

