Slot-Based Historical Image Compression

Updated 13 March 2026

The paper introduces SHIC, which uses recurrent slot attention and GRU updates to compress lengthy visual histories into fixed-length embeddings for UAV navigation tasks.
SHIC processes multi-view UAV images by converting thousands of visual tokens into a concise set of learnable slot embeddings, enabling efficient long-horizon reasoning.
The module achieves over 3000× compression while preserving crucial semantic context and maintaining bounded memory and compute costs.

Slot-Based Historical Image Compression (SHIC) is an architectural module for efficiently distilling high-dimensional, temporally extended visual histories into compact, expressive fixed-length representations. First introduced within the LongFly framework for long-horizon UAV vision-and-language navigation, SHIC transforms large-scale multi-view image sequences into a concise set of learnable slot embeddings. Through recurrent attention and gated updates, SHIC preserves crucial semantic and landmark information over extensive visual histories while maintaining strictly bounded memory and compute costs. Its integration into complex, spatiotemporally aware agents enables tractable reasoning over long observation windows without prohibitive resource demands (Jiang et al., 26 Dec 2025).

1. Role within the LongFly Pipeline

SHIC operates as the initial sub-module in LongFly’s history-aware context modeling architecture for UAV navigation. The overall system comprises three major components:

SHIC: Slot-based Historical Image Compression
STE: Spatio-Temporal Trajectory Encoding
PGM: Prompt-Guided Multimodal Integration

At each decision step $t$, SHIC processes the accumulated visual history, specifically all multi-view images $\{R_1,\ldots,R_{t-1}\}$, to produce a fixed-length matrix of slot embeddings $S_{t-1} \in \mathbb{R}^{K \times d}$. This set $S_{t-1}$, referred to as the compressed slot “memory,” feeds directly into the downstream multimodal PGM module, together with STE tokens and an instruction $L$. By abstracting the variable-length sequence of raw visual tokens into a fixed number of semantically rich slot representations, SHIC enables scalable long-horizon temporal reasoning and avoids the cost of storing and attending over the entire observation history.

2. Inputs, Outputs, and Dataflow

At each time step $i = 1, \ldots, t-1$, the following inputs are provided to SHIC:

Multi-view RGB images $R_i = \{R_i^{(1)}, \ldots, R_i^{(5)}\}$ from front, rear, left, right, and bottom UAV cameras.
A pretrained CLIP-based visual encoder $F_v$ maps each image to spatial tokens:

$Z_i = F_v(R_i) = \{z_{i,1}, \ldots, z_{i,N_i}\}, \quad z_{i,j} \in \mathbb{R}^d$

A typical dimensionality is $d = 768$, and $N_i$ is the total number of tokens pooled from the patch grids of all five views.

After iterative integration of all $t-1$ history steps, the output is a set of $K$ slots:

$S_{t-1} = \{s_1, \ldots, s_K\} \in \mathbb{R}^{K \times d}$

Here, $K$ is a hyperparameter, chosen to be far smaller than the number of history tokens. This output matrix substitutes for the vastly larger set $\{Z_1, \ldots, Z_{t-1}\}$ in downstream processing.
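As a rough illustration of the dataflow in this section, the NumPy sketch below stacks per-view token grids into a single token matrix $Z_i$. The 16×16 patch grid and the random stand-in for the encoder output are assumptions made only for illustration; in LongFly the tokens come from the CLIP-based encoder $F_v$, and the function names here are hypothetical.

```python
import numpy as np

NUM_VIEWS = 5   # front, rear, left, right, bottom (from the text)
GRID = 16       # hypothetical patch grid per view (assumption)
D = 768         # token dimension cited in the text


def fake_encode_view(rng, grid=GRID, d=D):
    """Stand-in for F_v on one image: returns (grid * grid, d) random tokens."""
    return rng.standard_normal((grid * grid, d)).astype(np.float32)


def build_step_tokens(rng):
    """Concatenate the tokens of all views into Z_i with shape (N_i, d)."""
    per_view = [fake_encode_view(rng) for _ in range(NUM_VIEWS)]
    return np.concatenate(per_view, axis=0)


rng = np.random.default_rng(0)
Z_i = build_step_tokens(rng)
print(Z_i.shape)  # (1280, 768) for the assumed 16x16 grid
```

With these assumed sizes every step contributes 1280 token vectors, whereas the slot matrix stays at $K$ rows no matter how many steps have elapsed.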

3. Core Mathematical Operations

SHIC compresses historical visual tokens via recurrent slot attention augmented by Gated Recurrent Unit (GRU) updates:

(a) Initialization

At $i = 0$, initialize the slot matrix as learnable parameters:

$S_0 \in \mathbb{R}^{K \times d}$

(b) Query-Key-Value (QKV) Projection

For each subsequent time step $i = 1, \ldots, t-1$:

For slots from previous step: $S_{i-1} \in \mathbb{R}^{K \times d}$
For new tokens: $Z_i \in \mathbb{R}^{N_i \times d}$
Project into QKV space: $Q_i = S_{i-1} W_Q$, $K_i = Z_i W_K$, $V_i = Z_i W_V$, with $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$. (The key matrix $K_i$ is distinct from the slot count $K$.)

(c) Slot-Token Attention

Compute slot-to-token attention via scaled dot-product:

$A_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d}}\right) \in \mathbb{R}^{K \times N_i}$

This yields an affinity matrix between all $K$ slots and $N_i$ tokens.

(d) Token Aggregation and Slot Update

Update each slot with a weighted sum of token values:

$u_{i,k} = \sum_{j=1}^{N_i} A_i[k, j]\, v_{i,j}$

Aggregate into matrix $U_i = A_i V_i \in \mathbb{R}^{K \times d}$.
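Steps (b) through (d) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the LongFly implementation: the weights are random stand-ins for the learned projections, the sizes are tiny and hypothetical, and the softmax is taken over the token axis. The text does not state the normalization axis; the original Slot Attention formulation normalizes over slots instead.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_token_attention(S_prev, Z, W_q, W_k, W_v):
    """One attention read: slots (K, d) query tokens (N, d).

    Returns the attention matrix A (K, N) and the aggregated updates U (K, d).
    Assumption: the softmax normalizes over the token axis.
    """
    d = S_prev.shape[-1]
    Q = S_prev @ W_q                                  # (K, d) queries from slots
    keys = Z @ W_k                                    # (N, d) keys from tokens
    V = Z @ W_v                                       # (N, d) values from tokens
    A = softmax(Q @ keys.T / np.sqrt(d), axis=-1)     # (K, N) scaled dot-product
    U = A @ V                                         # (K, d) weighted token sums
    return A, U


rng = np.random.default_rng(0)
K_SLOTS, N_TOKENS, D = 4, 20, 8                       # tiny hypothetical sizes
S_prev = rng.standard_normal((K_SLOTS, D))
Z = rng.standard_normal((N_TOKENS, D))
W_q, W_k, W_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

A, U = slot_token_attention(S_prev, Z, W_q, W_k, W_v)
print(A.shape, U.shape)  # (4, 20) (4, 8)
```

Each row of `A` sums to one under this choice of axis, so every slot receives a convex combination of the token values.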

(e) Recurrent (GRU) Fusion

Update slots recurrently:

$S_i = \mathrm{GRU}(S_{i-1}, U_i)$

where $U_i$ is the aggregated update matrix from step (d). After all steps, output $S_{t-1}$ as the compressed visual memory.
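The recurrence can be illustrated with a hand-written GRU cell folded over a short history. Several things here are assumptions rather than details confirmed by the source: the aggregated update is treated as the GRU input and the previous slots as its hidden state, bias terms are omitted, one common GRU gating convention is used, the softmax is over tokens, and all parameter names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def init_params(rng, d):
    """Random stand-ins for the learnable weights (hypothetical names)."""
    names = ["W_q", "W_k", "W_v", "W_z", "U_z", "W_r", "U_r", "W_h", "U_h"]
    return {n: rng.standard_normal((d, d)) / np.sqrt(d) for n in names}


def gru_fuse(S_prev, U, p):
    """GRU cell applied row-wise: U is the input, S_prev the hidden state."""
    z = sigmoid(U @ p["W_z"] + S_prev @ p["U_z"])         # update gate
    r = sigmoid(U @ p["W_r"] + S_prev @ p["U_r"])         # reset gate
    h = np.tanh(U @ p["W_h"] + (r * S_prev) @ p["U_h"])   # candidate slots
    return (1.0 - z) * S_prev + z * h


def compress_history(S0, history, p):
    """Fold a list of token matrices (N_i, d) into a fixed (K, d) memory."""
    S = S0
    d = S0.shape[-1]
    for Z in history:
        A = softmax_rows((S @ p["W_q"]) @ (Z @ p["W_k"]).T / np.sqrt(d))
        U = A @ (Z @ p["W_v"])   # aggregated token values, (K, d)
        S = gru_fuse(S, U, p)    # recurrent fusion with the previous slots
    return S


rng = np.random.default_rng(0)
K_SLOTS, D = 4, 8
params = init_params(rng, D)
S0 = rng.standard_normal((K_SLOTS, D))    # stands in for the learnable S_0
history = [rng.standard_normal((n, D)) for n in (12, 20, 7)]  # varying N_i
memory = compress_history(S0, history, params)
print(memory.shape)  # (4, 8) regardless of history length
```

The per-step token count may vary freely, since only the `(K, d)` slot matrix is carried from one step to the next.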

4. Pseudocode and Training Approach

The SHIC compression process can be summarized as follows:

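Input: token matrices Z_1, ..., Z_{t-1}; learnable S_0, W_Q, W_K, W_V; GRU parameters
S <- S_0
for i = 1, ..., t-1:
    Q_i <- S W_Q;  K_i <- Z_i W_K;  V_i <- Z_i W_V
    A_i <- softmax(Q_i K_i^T / sqrt(d))
    U_i <- A_i V_i
    S <- GRU(S, U_i)
return S    (the compressed memory S_{t-1})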

There is no explicit reconstruction or auxiliary loss in SHIC; its parameters (QKV projections, GRU, slot initialization) are updated jointly with the entire LongFly model. Supervised losses—such as those arising in PGM for waypoint regression—implicitly drive the training of SHIC parameters via back-propagation (Jiang et al., 26 Dec 2025).

5. Hyperparameters and Trade-Offs

The primary hyperparameters in SHIC are $K$ (number of slots) and $d$ (feature dimension):

K (slots): Controls the memory capacity and compression fidelity. Smaller $K$ (e.g., 8) yields higher compression and lower compute, at the expense of semantic detail. Larger $K$ increases memory and fidelity but raises computational cost.
d (token/slot dimension): Determined by the CLIP encoder, e.g. $d = 768$.
$N_i$ (tokens per step): The total number of CLIP tokens from the 5 views.
Softmax temperature $\sqrt{d}$: Stabilizes gradient flow during slot-token attention.
Compression ratio: For a history of $t-1$ steps with $N$ tokens per step and $K$ slots:

$\frac{(t-1)\, N\, d}{K\, d} = \frac{(t-1)\, N}{K}$

Computational complexity:
- Attention: $O(K N_i d)$ per step.
- Memory: $O(K d)$, compared to $O(t N d)$ without compression.
Returns from increasing $K$ diminish beyond a moderate slot count; excess slots often become redundant.
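The arithmetic behind the compression ratio and the complexity terms in the list above can be checked with a few plain-Python helpers. All concrete sizes below (a 16×16 patch grid per view, $K = 32$, the step counts) are hypothetical values chosen for illustration, and the cost count covers only the attention scores and aggregation, not the QKV projections or the GRU.

```python
def compression_ratio(num_steps, tokens_per_step, num_slots):
    """Ratio of raw history tokens to slot vectors (both of dimension d)."""
    if min(num_steps, tokens_per_step, num_slots) <= 0:
        raise ValueError("all arguments must be positive")
    return (num_steps * tokens_per_step) / num_slots


def memory_floats(num_steps, tokens_per_step, num_slots, d):
    """Floats held for the history: (raw token cache, slot memory)."""
    raw = num_steps * tokens_per_step * d   # grows linearly with the history
    slots = num_slots * d                   # constant in the history length
    return raw, slots


def attention_macs_per_step(tokens_per_step, num_slots, d):
    """Multiply-adds for the K x N score matrix plus the K x d aggregation."""
    return 2 * num_slots * tokens_per_step * d


# Hypothetical setting: 5 views x 16x16 patches = 1280 tokens per step.
N, K, D = 5 * 16 * 16, 32, 768
for steps in (10, 75, 100):
    ratio = compression_ratio(steps, N, K)
    raw, slots = memory_floats(steps, N, K, D)
    print(f"{steps:>3} steps: {ratio:,.0f}x, raw={raw:,} floats, slots={slots:,} floats")

print(f"attention cost per step: {attention_macs_per_step(N, K, D):,} MACs")
```

Under these assumed sizes the ratio reaches 3000× at 75 history steps and keeps growing with the horizon, while the slot memory and the per-step attention cost stay fixed.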

These trade-offs control real-time performance, with practical $K$ values ensuring that the $K \times d$ slot memory fits deployment constraints while maximizing informative context retention.

6. Architectural Dynamics and Information Flow

Conceptually, SHIC instantiates a fixed set of $K$ semantic “buckets,” each aggregating and abstracting distributed visual information across the trajectory:

At each time step, five camera images generate a “cloud” of $N_i$ CLIP tokens.
The $K$ slot nodes, initialized with learnable parameters, evolve recurrently over time as each attends to and integrates new observational tokens.
Slot attention yields incremental slot-wise updates, temporally propagating and refining landmark and contextual information through recurrent GRU fusion.
At the conclusion of the historical window, $S_{t-1}$ comprises a checkpointed set of embeddings characterizing the salient visual context for integration with trajectory and instruction data in downstream navigation planning.

This recurrent, slot-centric abstraction scheme enables generalization to arbitrarily long observation histories, eliminating the need for computationally intensive global attention over all tokens.

7. Compression Efficacy and Limitations

SHIC achieves substantial compression of spatiotemporal visual history, with empirical compression ratios exceeding 3,000× under typical navigation horizons and tokenization schemes. The architecture explicitly bounds memory and per-step compute irrespective of visual history length.

A plausible implication is that, while increasing $K$ improves representational fidelity, task performance improvements saturate rapidly and surplus slots add redundancy rather than new information. Moreover, the absence of explicit reconstruction loss may limit the preservation of certain fine-grained details. Nonetheless, in applied contexts such as vision-and-language navigation, SHIC demonstrates sufficient semantic precision, outperforming alternative memory architectures in both success rate and path-length-weighted success metrics (Jiang et al., 26 Dec 2025).

References (1)

Jiang et al. (26 Dec 2025). LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration.
