In-Context Autoencoder (ICAE)

Updated 19 December 2025
  • ICAE is a lightweight module for LLMs that compresses long contexts into compact memory slots, preserving key sequence semantics with only about 1% extra parameters.
  • It combines a trainable, LoRA-adapted encoder with a frozen LLM decoder, supporting efficient autoencoding and strong downstream task performance.
  • Empirical results highlight high reconstruction fidelity and notable latency improvements, positioning ICAE as a promising approach for scalable transformer-based context management.

The In-context Autoencoder (ICAE) is a lightweight module for LLMs designed to address the long context problem by compressing lengthy text into compact “memory slots.” These memory slots, produced by a learned encoder and interpreted by a frozen LLM decoder, enable efficient context handling with minimal memory and computational overhead. ICAE involves only approximately 1% additional parameters over the base LLM, preserves sequence semantics with high fidelity, and demonstrates promising behavior both in autoencoding accuracy and downstream generation tasks. Its design draws inspiration from working memory in cognitive science and suggests a novel architectural direction for scalable context management in transformer-based models (Ge et al., 2023).

1. Model Architecture and Components

ICAE consists of two principal modules integrated around a pre-trained, frozen causal LLM (e.g., Llama):

  • Encoder ($f_\theta$): The encoder maps a long input context $c = (w_1, ..., w_L)$ to a compact representation $z = (z_1, ..., z_k)$ with $k \ll L$ memory slots. This is implemented by appending $k$ trainable “memory token” embeddings $(m_1, ..., m_k)$ to the end of $c$ (so that, under causal attention, the memory tokens can attend to the full context) and running the sequence through the LLM augmented with LoRA adapters (rank $r$) applied to the query and value projections. Only the LoRA parameters ($\Theta_{LoRA}$) and the memory embedding lookup table ($e_m$) are learned; the LLM weights ($\Theta_{LLM}$) are kept frozen.
  • Decoder ($\Theta_{LLM}$): The original LLM is reused, unmodified, to reconstruct the input context, generate continuations, or respond to prompts, all conditioned on the compressed representation $z$ rather than the full context.

Formally, given a context $c = (w_1, ..., w_L)$, the memory slots are computed as $z = f_\theta(c) = (\hat{z}_1, ..., \hat{z}_k)$, where each $\hat{z}_j \in \mathbb{R}^d$ is the hidden state at the $j$-th memory token position.
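
As a concrete illustration, the encoding step can be written in a few lines of PyTorch. This is a minimal sketch rather than the authors' released implementation: it assumes the HuggingFace transformers and peft libraries, and the model name, LoRA rank, and identifiers such as `memory_embeddings` and `encode` are illustrative choices. The frozen decoder is loaded separately here only for clarity; conceptually it shares the same frozen LLM weights.

```python
# Minimal sketch of ICAE-style encoding: append k trainable memory-token embeddings to the
# context, run the LoRA-adapted (otherwise frozen) LLM, and keep the hidden states at the
# memory-token positions as the memory slots z_1..z_k.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "meta-llama/Llama-2-7b-hf"                 # any causal LLM; illustrative choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
decoder = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)  # frozen decoder
base = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

k = 128                                            # number of memory slots
d = base.config.hidden_size
memory_embeddings = nn.Parameter(torch.randn(k, d, dtype=torch.bfloat16) * 0.02)  # e_m (trainable)

# LoRA on the query/value projections; only these adapters and e_m are trained.
encoder = get_peft_model(base, LoraConfig(r=64, target_modules=["q_proj", "v_proj"]))

def encode(context: str) -> torch.Tensor:
    """Map a context of up to 512 tokens to k memory-slot vectors of dimension d."""
    ids = tokenizer(context, return_tensors="pt", truncation=True, max_length=512).input_ids
    ctx_emb = encoder.get_input_embeddings()(ids)                 # (1, L, d)
    mem_emb = memory_embeddings.unsqueeze(0)                      # (1, k, d)
    inputs = torch.cat([ctx_emb, mem_emb], dim=1)                 # memory tokens follow the context
    hidden = encoder(inputs_embeds=inputs, output_hidden_states=True).hidden_states[-1]
    return hidden[:, -k:, :]                                      # memory slots z = (z_1, ..., z_k)
```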

2. Training Protocol and Objectives

ICAE employs a two-phase training regime:

  • Pretraining: The encoder and the newly introduced components are optimized on large-scale raw text using the following combined loss (a short sketch of computing this objective follows this list):

    • Autoencoding loss:

      $$\mathcal{L}_{AE}(\theta) = -\,\mathbb{E}_c \sum_{t=1}^{L} \log P\big(w_t \mid \hat{z}_1, ..., \hat{z}_k, \text{[AE]}, w_{<t};\, \Theta_{LLM}\big)$$

      where [AE] is a special decode-from-memory token.

    • Language modeling loss:

      $$\mathcal{L}_{LM}(\theta) = -\,\mathbb{E}_{c,o} \sum_{t=1}^{N} \log P\big(o_t \mid \hat{z}_1, ..., \hat{z}_k, o_{<t};\, \Theta_{LLM}\big)$$

      where $o = (o_1, ..., o_N)$ is the text continuing $c$.

    • Joint pretraining objective:

      $$\mathcal{L}_{pre}(\theta) = \lambda\, \mathcal{L}_{AE}(\theta) + (1-\lambda)\, \mathcal{L}_{LM}(\theta)$$

      with $\lambda \approx 0.4$–$0.6$.

  • Instruction Fine-tuning: To enable the model to use memory slots for prompt-based tasks, fine-tuning is performed on a Prompt-with-Context (PwC) dataset comprising 240,000 tuples $(c, p, r)$, where $p$ is a GPT-4–generated instruction and $r$ an “ideal” response. The objective is:

      $$\mathcal{L}_{FT}(\theta) = -\,\mathbb{E}_{c,p,r} \sum_{t=1}^{|r|} \log P\big(r_t \mid \hat{z}_1, ..., \hat{z}_k, p, r_{<t};\, \Theta_{LLM}\big)$$
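
To make the objectives concrete, the following is a minimal sketch of how the joint pretraining loss could be computed, assuming the `encode` and `decoder` objects from the architecture sketch above; the loss-masking helper and the caller-supplied `[AE]` embedding `ae_token_emb` are illustrative conventions, not the authors' released code.

```python
# Sketch of the joint pretraining objective L_pre = lambda * L_AE + (1 - lambda) * L_LM.
# `decoder` is the frozen base LLM; `slots` are memory-slot states from encode();
# `ae_token_emb` is a (1, 1, d) embedding for the special [AE] token (illustrative).
import torch
import torch.nn.functional as F

def target_ce(logits, labels, prefix_len):
    """Next-token cross-entropy computed only over the target positions."""
    logits = logits[:, prefix_len - 1 : -1, :]     # logits that predict the target tokens
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))

def pretrain_loss(decoder, slots, ae_token_emb, ctx_ids, cont_ids, lam=0.5):
    embed = decoder.get_input_embeddings()

    # L_AE: reconstruct the original context w_1..w_L from [slots, [AE], w_<t].
    ae_inputs = torch.cat([slots, ae_token_emb, embed(ctx_ids)], dim=1)
    loss_ae = target_ce(decoder(inputs_embeds=ae_inputs).logits, ctx_ids,
                        prefix_len=slots.size(1) + 1)

    # L_LM: predict the continuation o_1..o_N from [slots, o_<t].
    lm_inputs = torch.cat([slots, embed(cont_ids)], dim=1)
    loss_lm = target_ce(decoder(inputs_embeds=lm_inputs).logits, cont_ids,
                        prefix_len=slots.size(1))

    return lam * loss_ae + (1.0 - lam) * loss_lm
```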

Inference involves encoding the context $c$ into memory slots $z = f_\theta(c)$, concatenating them with the prompt $p$ as input to the unchanged LLM, and generating the response.
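
This procedure can be summarized in a short Python sketch. It reuses the `encode`, `decoder`, and `tokenizer` objects from the architecture sketch above; the prompt formatting is illustrative, and generating from `inputs_embeds` assumes a reasonably recent transformers version.

```python
# Sketch of ICAE inference: compress the context once, then answer prompts against the
# memory slots instead of the raw text.
import torch

@torch.no_grad()
def answer(context: str, prompt: str, max_new_tokens: int = 256) -> str:
    slots = encode(context)                                  # (1, k, d) memory slots
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_emb = decoder.get_input_embeddings()(prompt_ids)
    inputs = torch.cat([slots, prompt_emb], dim=1)           # [z_1, ..., z_k, p]
    out_ids = decoder.generate(inputs_embeds=inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```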

3. Context Compression and Computational Efficiency

ICAE achieves a default compression ratio of $4\times$, mapping $L = 512$ tokens into $k = 128$ memory slots. Empirical results demonstrate further scalability: concatenating four $512 \rightarrow 128$ compressors enables processing 4096-token contexts at only double the cost of a 2048-token context, reducing GPU VRAM consumption by approximately 20 GiB for a 7B Llama model.
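
One way to read this span-concatenation setup is sketched below: a long context is split into 512-token spans, each span is compressed to 128 slots with the encoder sketched earlier, and the slot sequences are concatenated so the decoder sees roughly a quarter as many positions as the raw text. Span boundaries and special-token handling are glossed over, and the helper name is illustrative.

```python
# Sketch of span-wise compression: compress each 512-token span to 128 memory slots and
# concatenate the slots (e.g. four spans of 512 tokens -> 4 x 128 = 512 slots).
# Reuses `tokenizer` and `encode` from the earlier sketch; span handling is simplified.
import torch

def compress_long_context(text: str, span_len: int = 512) -> torch.Tensor:
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    spans = [ids[i : i + span_len] for i in range(0, len(ids), span_len)]
    slot_chunks = [encode(tokenizer.decode(span)) for span in spans]   # each (1, 128, d)
    return torch.cat(slot_chunks, dim=1)                               # (1, 128 * num_spans, d)
```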

The table below summarizes latency improvements:

| Input (Batch × Len) | Compress Time (ms) | Decode Time (ms) | Total (ms, speed-up) |
|---|---|---|---|
| 8 × 2048 tokens | 3.4 | 3.9 | 7.3 (3.3×) |
| 8 × 512 tokens | 0.6 | 3.7 | 4.3 (2.2×) |
| 32 × 512 tokens | 2.6 | 4.2 | 6.8 (3.6×) |

These improvements make ICAE attractive for deployment in resource-constrained or latency-sensitive environments.

4. Empirical Evaluation

ICAE exhibits strong reconstruction fidelity, effective language modeling with compressed contexts, and robust downstream instruction response, as demonstrated across multiple LLMs:

| Target LLM | BLEU (%) | Reconstruction Loss | ΔPPL (512 → 128 slots) |
|---|---|---|---|
| Llama-7B | 99.1 | 0.017 | +0.49 |
| Llama-2-7B | 99.5 | 0.009 | +0.37 |
| Llama-2-13B | 99.8 | 0.004 | +0.30 |

Memorization Patterns: For normal text compressed from 512 tokens to 128 slots, ICAE attains a reconstruction loss of 0.01 and BLEU of 99.3%. Patterned random text yields a loss of 1.63 and BLEU of 3.5%, while fully random sequences degrade to a loss of 4.55 and BLEU of 0.2%. This reflects selective memorization: reconstruction accuracy tracks the structure and information-theoretic compressibility of the input.
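
Reconstruction-fidelity numbers of this kind can be reproduced, in outline, by decoding the context back out of its memory slots via the [AE] token and scoring the output against the original text with BLEU. The sketch below assumes the `encode`, `decoder`, `tokenizer`, and `ae_token_emb` objects from the earlier sketches and uses the sacrebleu package; it illustrates the evaluation recipe rather than the paper's evaluation code.

```python
# Sketch of measuring reconstruction fidelity: regenerate the context from its memory slots
# (conditioning on [z_1..z_k, [AE]]) and compare against the original with corpus BLEU.
import torch
from sacrebleu import corpus_bleu   # pip install sacrebleu

@torch.no_grad()
def reconstruct(context: str, max_new_tokens: int = 512) -> str:
    slots = encode(context)
    inputs = torch.cat([slots, ae_token_emb], dim=1)          # [z_1, ..., z_k, [AE]]
    out_ids = decoder.generate(inputs_embeds=inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)

def reconstruction_bleu(contexts: list[str]) -> float:
    hypotheses = [reconstruct(c) for c in contexts]
    return corpus_bleu(hypotheses, [contexts]).score          # BLEU on a 0-100 scale
```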

Downstream Evaluation: On PwC tasks with Llama-7B and $k = 128$, ICAE matches or exceeds baseline systems in human and GPT-4–based judgments:

| System | vs Alpaca | vs StableLM | vs GPT-4 | Win+Tie vs GPT-4 |
|---|---|---|---|---|
| ICAE (128 slots) | 56.7% | 74.1% | 30.6% | 30.6% |

Upgrading to Llama-2-7B-chat with $k = 128$ increases the on-par rate against GPT-4 to approximately 74%.

5. Cognitive Analogy, Scalability, and Limitations

ICAE’s behavior has notable parallels to theories of working memory in cognitive science. Compression into memory slots results in minor paraphrasing and selective retention, resembling “recall errors” characteristic of human short-term memory. Empirical evidence indicates “knowledgeable” LLMs require fewer slots for accurate context recall, suggesting effective abstraction by stronger models.

Scalability: Stronger LLMs (e.g., Llama-2-13B) support even higher-fidelity compression, with BLEU up to 99.8% and $\Delta$PPL of only +0.30, indicating the approach scales well to larger models.

Limitations and Future Work: Current evaluations are limited to models up to 13B parameters. Open research questions include extension to ultra-large and multimodal LLMs, exploration of discrete rather than continuous memory slots, and hierarchical or multi-span context compression strategies.

6. Implications for LLM Context Management

ICAE introduces an orthogonal solution to the long context challenge without disrupting pretrained model weights. By learning to encode entire contexts as compact, continuous embeddings that serve as direct surrogates for original text, ICAE substantially reduces inference costs and unlocks new operational regimes for LLMs in terms of sequence length, memory, and task generality. The observed parallels to cognitive memory and the model’s robustness across task domains underscore its relevance for the evolving landscape of efficient and scalable representation learning in LLMs (Ge et al., 2023).

References

Ge et al. (2023). In-context Autoencoder for Context Compression in a Large Language Model. arXiv:2307.06945.
