In-Context Autoencoder (ICAE)

Updated 19 December 2025
  • ICAE is a lightweight module for LLMs that compresses long contexts into compact memory slots, preserving key sequence semantics with only about 1% extra parameters.
  • It combines a trainable, LoRA-adapted encoder with a frozen LLM decoder, supporting efficient autoencoding and strong downstream task performance.
  • Empirical results highlight high reconstruction fidelity and notable latency improvements, positioning ICAE as a promising approach for scalable transformer-based context management.

The In-context Autoencoder (ICAE) is a lightweight module for LLMs designed to address the long context problem by compressing lengthy text into compact “memory slots.” These memory slots, produced by a learned encoder and interpreted by a frozen LLM decoder, enable efficient context handling with minimal memory and computational overhead. ICAE involves only approximately 1% additional parameters over the base LLM, preserves sequence semantics with high fidelity, and demonstrates promising behavior both in autoencoding accuracy and downstream generation tasks. Its design draws inspiration from working memory in cognitive science and suggests a novel architectural direction for scalable context management in transformer-based models (Ge et al., 2023).

1. Model Architecture and Components

ICAE consists of two principal modules integrated around a pre-trained, frozen causal LLM (e.g., Llama):

  • Encoder ($f_\theta$): The encoder maps a long input context $c = (w_1, ..., w_L)$ to a compact representation $z = (z_1, ..., z_k)$ with $k \ll L$ memory slots. This is implemented by appending $k$ trainable “memory token” embeddings $(m_1, ..., m_k)$ to the end of $c$ (so that, under causal attention, the memory tokens can attend to the full context) and running the sequence through the LLM augmented with LoRA adapters (rank $r$) applied to the query and value projections. Only the LoRA parameters ($\Theta_{LoRA}$) and the memory embedding lookup table ($e_m$) are learned; the LLM weights ($\Theta_{LLM}$) are kept frozen.
  • Decoder ($\Theta_{LLM}$): The original LLM is reused, unmodified, to reconstruct the input context, generate continuations, or respond to prompts, all conditioned on the compressed representation $z$ rather than the full context.

Formally, given a context $c = (w_1, ..., w_L)$, the memory slots are computed as $z = f_\theta(c) = (\hat{z}_1, ..., \hat{z}_k)$, where each $\hat{z}_j \in \mathbb{R}^d$ is the hidden state at the $j$-th memory token position.
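
As a concrete illustration, the encoding step can be written in a few lines of PyTorch. This is a minimal sketch rather than the authors' released implementation: it assumes the HuggingFace transformers and peft libraries, and the model name, LoRA rank, and identifiers such as `memory_embeddings` and `encode` are illustrative choices. The frozen decoder is loaded separately here only for clarity; conceptually it shares the same frozen LLM weights.

```python
# Minimal sketch of ICAE-style encoding: append k trainable memory-token embeddings to the
# context, run the LoRA-adapted (otherwise frozen) LLM, and keep the hidden states at the
# memory-token positions as the memory slots z_1..z_k.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "meta-llama/Llama-2-7b-hf"                 # any causal LLM; illustrative choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
decoder = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)  # frozen decoder
base = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

k = 128                                            # number of memory slots
d = base.config.hidden_size
memory_embeddings = nn.Parameter(torch.randn(k, d, dtype=torch.bfloat16) * 0.02)  # e_m (trainable)

# LoRA on the query/value projections; only these adapters and e_m are trained.
encoder = get_peft_model(base, LoraConfig(r=64, target_modules=["q_proj", "v_proj"]))

def encode(context: str) -> torch.Tensor:
    """Map a context of up to 512 tokens to k memory-slot vectors of dimension d."""
    ids = tokenizer(context, return_tensors="pt", truncation=True, max_length=512).input_ids
    ctx_emb = encoder.get_input_embeddings()(ids)                 # (1, L, d)
    mem_emb = memory_embeddings.unsqueeze(0)                      # (1, k, d)
    inputs = torch.cat([ctx_emb, mem_emb], dim=1)                 # memory tokens follow the context
    hidden = encoder(inputs_embeds=inputs, output_hidden_states=True).hidden_states[-1]
    return hidden[:, -k:, :]                                      # memory slots z = (z_1, ..., z_k)
```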

2. Training Protocol and Objectives

ICAE employs a two-phase training regime:

  • Pretraining: The encoder and the newly introduced components are optimized on large-scale raw text using the following combined loss (a short sketch of computing this objective follows this list):

    • Autoencoding loss:

      $$\mathcal{L}_{AE}(\theta) = -\,\mathbb{E}_c \sum_{t=1}^{L} \log P\big(w_t \mid \hat{z}_1, ..., \hat{z}_k, \text{[AE]}, w_{<t};\, \Theta_{LLM}\big)$$

      where [AE] is a special decode-from-memory token.

    • Language modeling loss:

      $$\mathcal{L}_{LM}(\theta) = -\,\mathbb{E}_{c,o} \sum_{t=1}^{N} \log P\big(o_t \mid \hat{z}_1, ..., \hat{z}_k, o_{<t};\, \Theta_{LLM}\big)$$

      where $o = (o_1, ..., o_N)$ is the text continuing $c$.

    • Joint pretraining objective:

      $$\mathcal{L}_{pre}(\theta) = \lambda\, \mathcal{L}_{AE}(\theta) + (1-\lambda)\, \mathcal{L}_{LM}(\theta)$$

      with $\lambda \approx 0.4$–$0.6$.

  • Instruction Fine-tuning: To enable the model to use memory slots for prompt-based tasks, fine-tuning is performed on a Prompt-with-Context (PwC) dataset comprising 240,000 tuples $(c, p, r)$, where $p$ is a GPT-4–generated instruction and $r$ an “ideal” response. The objective is:

      $$\mathcal{L}_{FT}(\theta) = -\,\mathbb{E}_{c,p,r} \sum_{t=1}^{|r|} \log P\big(r_t \mid \hat{z}_1, ..., \hat{z}_k, p, r_{<t};\, \Theta_{LLM}\big)$$
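
To make the objectives concrete, the following is a minimal sketch of how the joint pretraining loss could be computed, assuming the `encode` and `decoder` objects from the architecture sketch above; the loss-masking helper and the caller-supplied `[AE]` embedding `ae_token_emb` are illustrative conventions, not the authors' released code.

```python
# Sketch of the joint pretraining objective L_pre = lambda * L_AE + (1 - lambda) * L_LM.
# `decoder` is the frozen base LLM; `slots` are memory-slot states from encode();
# `ae_token_emb` is a (1, 1, d) embedding for the special [AE] token (illustrative).
import torch
import torch.nn.functional as F

def target_ce(logits, labels, prefix_len):
    """Next-token cross-entropy computed only over the target positions."""
    logits = logits[:, prefix_len - 1 : -1, :]     # logits that predict the target tokens
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))

def pretrain_loss(decoder, slots, ae_token_emb, ctx_ids, cont_ids, lam=0.5):
    embed = decoder.get_input_embeddings()

    # L_AE: reconstruct the original context w_1..w_L from [slots, [AE], w_<t].
    ae_inputs = torch.cat([slots, ae_token_emb, embed(ctx_ids)], dim=1)
    loss_ae = target_ce(decoder(inputs_embeds=ae_inputs).logits, ctx_ids,
                        prefix_len=slots.size(1) + 1)

    # L_LM: predict the continuation o_1..o_N from [slots, o_<t].
    lm_inputs = torch.cat([slots, embed(cont_ids)], dim=1)
    loss_lm = target_ce(decoder(inputs_embeds=lm_inputs).logits, cont_ids,
                        prefix_len=slots.size(1))

    return lam * loss_ae + (1.0 - lam) * loss_lm
```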

Inference involves encoding the context $c$ into memory slots $z = f_\theta(c)$, concatenating them with the prompt $p$ as input to the unchanged LLM, and generating the response.
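
This procedure can be summarized in a short Python sketch. It reuses the `encode`, `decoder`, and `tokenizer` objects from the architecture sketch above; the prompt formatting is illustrative, and generating from `inputs_embeds` assumes a reasonably recent transformers version.

```python
# Sketch of ICAE inference: compress the context once, then answer prompts against the
# memory slots instead of the raw text.
import torch

@torch.no_grad()
def answer(context: str, prompt: str, max_new_tokens: int = 256) -> str:
    slots = encode(context)                                  # (1, k, d) memory slots
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_emb = decoder.get_input_embeddings()(prompt_ids)
    inputs = torch.cat([slots, prompt_emb], dim=1)           # [z_1, ..., z_k, p]
    out_ids = decoder.generate(inputs_embeds=inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```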

3. Context Compression and Computational Efficiency

ICAE achieves a default compression ratio of $4\times$, mapping $L = 512$ tokens into $k = 128$ memory slots. Empirical results demonstrate further scalability: concatenating four $512 \rightarrow 128$ compressors enables processing 4096-token contexts at only double the cost of a 2048-token context, reducing GPU VRAM consumption by approximately 20 GiB for a 7B Llama model.
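
One way to read this span-concatenation setup is sketched below: a long context is split into 512-token spans, each span is compressed to 128 slots with the encoder sketched earlier, and the slot sequences are concatenated so the decoder sees roughly a quarter as many positions as the raw text. Span boundaries and special-token handling are glossed over, and the helper name is illustrative.

```python
# Sketch of span-wise compression: compress each 512-token span to 128 memory slots and
# concatenate the slots (e.g. four spans of 512 tokens -> 4 x 128 = 512 slots).
# Reuses `tokenizer` and `encode` from the earlier sketch; span handling is simplified.
import torch

def compress_long_context(text: str, span_len: int = 512) -> torch.Tensor:
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    spans = [ids[i : i + span_len] for i in range(0, len(ids), span_len)]
    slot_chunks = [encode(tokenizer.decode(span)) for span in spans]   # each (1, 128, d)
    return torch.cat(slot_chunks, dim=1)                               # (1, 128 * num_spans, d)
```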

The table below summarizes latency improvements:

| Input (Batch × Len) | Compress Time (ms) | Decode Time (ms) | Total (ms, speed-up) |
|---|---|---|---|
| 8 × 2048 tokens | 3.4 | 3.9 | 7.3 (3.3×) |
| 8 × 512 tokens | 0.6 | 3.7 | 4.3 (2.2×) |
| 32 × 512 tokens | 2.6 | 4.2 | 6.8 (3.6×) |

These improvements make ICAE attractive for deployment in resource-constrained or latency-sensitive environments.

4. Empirical Evaluation

ICAE exhibits strong reconstruction fidelity, effective language modeling with compressed contexts, and robust downstream instruction response, as demonstrated across multiple LLMs:

| Target LLM | BLEU (%) | Reconstruction Loss | ΔPPL (512 → 128 slots) |
|---|---|---|---|
| Llama-7B | 99.1 | 0.017 | +0.49 |
| Llama-2-7B | 99.5 | 0.009 | +0.37 |
| Llama-2-13B | 99.8 | 0.004 | +0.30 |

Memorization Patterns: For normal text compressed from 512 tokens to 128 slots, ICAE attains a reconstruction loss of 0.01 and BLEU of 99.3%. Patterned random text yields a loss of 1.63 and BLEU of 3.5%, while fully random sequences degrade to a loss of 4.55 and BLEU of 0.2%. This reflects selective memorization: reconstruction accuracy tracks the structure and information-theoretic compressibility of the input.
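
Reconstruction-fidelity numbers of this kind can be reproduced, in outline, by decoding the context back out of its memory slots via the [AE] token and scoring the output against the original text with BLEU. The sketch below assumes the `encode`, `decoder`, `tokenizer`, and `ae_token_emb` objects from the earlier sketches and uses the sacrebleu package; it illustrates the evaluation recipe rather than the paper's evaluation code.

```python
# Sketch of measuring reconstruction fidelity: regenerate the context from its memory slots
# (conditioning on [z_1..z_k, [AE]]) and compare against the original with corpus BLEU.
import torch
from sacrebleu import corpus_bleu   # pip install sacrebleu

@torch.no_grad()
def reconstruct(context: str, max_new_tokens: int = 512) -> str:
    slots = encode(context)
    inputs = torch.cat([slots, ae_token_emb], dim=1)          # [z_1, ..., z_k, [AE]]
    out_ids = decoder.generate(inputs_embeds=inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)

def reconstruction_bleu(contexts: list[str]) -> float:
    hypotheses = [reconstruct(c) for c in contexts]
    return corpus_bleu(hypotheses, [contexts]).score          # BLEU on a 0-100 scale
```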

Downstream Evaluation: On PwC tasks with Llama-7B and $k = 128$, ICAE matches or exceeds baseline systems in human and GPT-4–based judgments:

| System | vs Alpaca | vs StableLM | vs GPT-4 | Win+Tie vs GPT-4 |
|---|---|---|---|---|
| ICAE (128 slots) | 56.7% | 74.1% | 30.6% | 30.6% |

Upgrading to Llama-2-7B-chat with $k = 128$ increases the on-par rate against GPT-4 to approximately 74%.

5. Cognitive Analogy, Scalability, and Limitations

ICAE’s behavior has notable parallels to theories of working memory in cognitive science. Compression into memory slots results in minor paraphrasing and selective retention, resembling “recall errors” characteristic of human short-term memory. Empirical evidence indicates “knowledgeable” LLMs require fewer slots for accurate context recall, suggesting effective abstraction by stronger models.

Scalability: Stronger LLMs (e.g., Llama-2-13B) support even higher-fidelity compression, with BLEU up to 99.8% and $\Delta$PPL of only +0.30, indicating the approach scales well to larger models.

Limitations and Future Work: Current evaluations are limited to models up to 13B parameters. Open research questions include extension to ultra-large and multimodal LLMs, exploration of discrete rather than continuous memory slots, and hierarchical or multi-span context compression strategies.

6. Implications for LLM Context Management

ICAE introduces an orthogonal solution to the long context challenge without disrupting pretrained model weights. By learning to encode entire contexts as compact, continuous embeddings that serve as direct surrogates for original text, ICAE substantially reduces inference costs and unlocks new operational regimes for LLMs in terms of sequence length, memory, and task generality. The observed parallels to cognitive memory and the model’s robustness across task domains underscore its relevance for the evolving landscape of efficient and scalable representation learning in LLMs (Ge et al., 2023).

References

Ge et al. (2023). In-context Autoencoder for Context Compression in a Large Language Model. arXiv:2307.06945.
