Compressed Input Representations

Updated 24 May 2026

Compressed input representations are algorithmic or learned transformations that convert raw data into concise formats with reduced entropy and improved efficiency.
They employ dynamic programming, deep learning, and variational techniques to achieve significant compression ratios while ensuring near-lossless reconstruction.
Applications include long-context language models, multimodal systems, and code retrieval, though challenges such as generalization and token overflow remain.

Compressed input representations are algorithmic or learned transformations of raw data into compact forms that facilitate storage, transmission, efficient computation, or improved generalization. Methods span theoretical, algorithmic, and deep learning domains, encompassing lossless entropy minimization, semantic tokenization, information-theoretic and variational learning, and specialized schemes for modalities including language, images, audio, and code. Compressed input representations are pivotal across tasks such as retrieval-augmented generation, multimodal modeling, efficient search, and neural network training and inference.

1. Entropy-Bounded and Parsing-Based Compression

A classic formalization of compressed input representations is the construction of parsings (factorizations) of a string $S$ into phrases $y_1\ldots y_{|Y_S|}$ of bounded length $m$, such that the empirical entropy $|Y_S|\,H_k(Y_S)$ (zeroth or first order, where $H_k(\cdot)$ denotes $k$-th order entropy) is minimized. Computing the exact minimum-entropy parsing is intractable due to the non-locality of entropy. Efficient dynamic programming heuristics operate by assigning to each candidate phrase $y$ a cost $-\log_2 p_{H_0}(y)$, where

$p_{H_0}(y) = \frac{1}{m}\cdot\frac{|y|}{|S|}.$

Recurrence relations enable $O(|S|m)$-time algorithms for the zeroth-order case, and a corresponding recurrence handles first-order entropy minimization. Theoretical bounds show that the total entropy of the parsing is upper-bounded by an expression involving the alphabet size, and similar bounds hold for the first-order case. This methodology directly improves both dictionary and entropy-compressed text representations and supports random access with minimal redundancy. Empirically, such approaches yield 1–8% drops in entropy compared to equal-length parsing and reduce phrase dictionaries by up to 50% (Gańczorz, 2019).
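The dynamic program described above can be sketched in a few lines. This is a minimal illustration rather than the cited implementation: the function names are hypothetical, the default cost is the $-\log_2 p_{H_0}(y)$ formula given earlier, and `parsing_entropy` computes the zeroth-order quantity $|Y_S|\,H_0(Y_S)$ so that different parsings can be compared.

```python
import math
from collections import Counter


def phrase_cost(phrase: str, text_len: int, max_len: int) -> float:
    """Cost -log2 p(y) with p(y) = (1/m) * (|y| / |S|)."""
    p = (1.0 / max_len) * (len(phrase) / text_len)
    return -math.log2(p)


def min_cost_parsing(text: str, max_len: int, cost=phrase_cost) -> list[str]:
    """Split text into phrases of length <= max_len minimizing the summed cost.

    best[i] is the cheapest cost of parsing text[:i]; each position tries at
    most max_len phrase lengths, giving O(|S| * m) cost evaluations.
    """
    n = len(text)
    best = [0.0] + [math.inf] * n
    back = [0] * (n + 1)  # back[i] = start index of the last phrase ending at i
    for i in range(1, n + 1):
        for length in range(1, min(max_len, i) + 1):
            start = i - length
            candidate = best[start] + cost(text[start:i], n, max_len)
            if candidate < best[i]:
                best[i], back[i] = candidate, start
    # Walk the back-pointers from the end to recover the phrases.
    phrases = []
    i = n
    while i > 0:
        phrases.append(text[back[i]:i])
        i = back[i]
    return phrases[::-1]


def parsing_entropy(phrases: list[str]) -> float:
    """Total zeroth-order empirical entropy |Y| * H_0(Y) of a parsing, in bits."""
    total = len(phrases)
    counts = Counter(phrases)
    return sum(c * math.log2(total / c) for c in counts.values())
```

The `cost` argument is pluggable, so a content-dependent estimate (for example, one based on phrase frequencies) could replace the length-only default without changing the recurrence.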

2. Generalization and Information-Theoretic Limits of Compressed Inputs

Compressed input representations alter the statistical and computational properties of learning algorithms. An information-theoretic analysis establishes that if the entropy of the input space is known and the compression mapping is fixed in advance, the PAC sample complexity required to reach a target accuracy admits an explicit bound.

The exponential dependence in this bound explains the generalization benefits of low-entropy, pre-defined compressed representations. However, if the encoder is learned from the same data used to train the classifier, as in end-to-end deep learning or the Information Bottleneck (IB) approach, covering arguments become vacuous, and no non-trivial generalization guarantee persists. The dual challenges are the double-exponential dependence in the bound and the breakdown of the arguments when feature compression itself is optimized using labels (Hafez-Kolahi et al., 2019).

3. Tokenized and Learned Soft Compression for Language and Multimodal Systems

Recent models implement compressed representations by replacing long input sequences with learned compressed tokens ("memory", "gist", "compressed") which summarize context for large transformer models. The position of such tokens is crucial: with Rotary Positional Embedding (RoPE) mechanisms, assigning compressed tokens position IDs that are locally near their corresponding source segments yields significantly higher attention weights and improves reconstruction and downstream task quality (e.g., BLEU-4, ROUGE). Enhanced Position Layout (EPL) formalizes this by interleaving the compressed tokens among the original tokens, with position IDs chosen so that each compressed token is placed close to the tokens it summarizes. EPL raises the compression ratio at which reconstruction remains near-lossless and accelerates convergence compared to naively appending compressed tokens to the end of the sequence (Zhao et al., 2024).
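The difference between a local layout and an appended layout can be made concrete with a small position-ID sketch. The segment rule below (equal-size contiguous segments, each compressed token reusing the position ID of the last token in its segment) is an assumption chosen for illustration and is not the EPL formula from the cited work; all function names are hypothetical.

```python
def appended_layout(n_tokens: int, n_compressed: int) -> list[int]:
    """Baseline: compressed tokens take the positions right after the sequence."""
    return list(range(n_tokens, n_tokens + n_compressed))


def local_layout(n_tokens: int, n_compressed: int) -> list[int]:
    """Hypothetical local layout: token j sits at the end of the j-th segment.

    Source tokens 0..n_tokens-1 are split into n_compressed contiguous
    segments of near-equal size; each compressed token reuses the position ID
    of the last token in its segment.
    """
    positions = []
    for j in range(n_compressed):
        segment_end = (j + 1) * n_tokens // n_compressed  # exclusive end
        positions.append(segment_end - 1)
    return positions


def mean_distance_to_segment(n_tokens: int, positions: list[int]) -> float:
    """Average position gap between each source token and its compressed token."""
    k = len(positions)
    total = 0
    for j, pos in enumerate(positions):
        start = j * n_tokens // k
        end = (j + 1) * n_tokens // k
        total += sum(abs(pos - t) for t in range(start, end))
    return total / n_tokens
```

Because RoPE attention scores depend on relative position offsets, the average gap returned by `mean_distance_to_segment` is a rough proxy for how "near" each compressed token is to the content it summarizes under a given layout.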

In retrieval-augmented generation (RAG) and reranking, entire input documents are projected into fixed-size embeddings—typically using a small set of learned memory tokens or a soft-compression module, as in the PISCO architecture. Downstream models such as RRK (Reranking with Compressed Document Representation) consume only these embeddings (e.g., 8 memory tokens per document) concatenated with queries, maintaining state-of-the-art effectiveness at a fraction of latency and computational cost. Listwise and pointwise distillation procedures are used to teach the reranker to mimic a strong teacher model given only the compressed input (Déjean et al., 21 May 2025, Déjean et al., 29 Apr 2026).
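The two distillation styles mentioned above can be sketched as loss functions over teacher and student relevance scores. The exact losses used in the cited work are not given here, so the choices below (a KL divergence over softmax-normalized list scores for the listwise case, mean squared error for the pointwise case) are common formulations assumed for illustration; the function names are hypothetical.

```python
import math


def softmax(scores: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over one candidate list."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_kl(teacher: list[float], student: list[float],
                temperature: float = 1.0) -> float:
    """KL(teacher || student) between score distributions over one document list."""
    p = softmax(teacher, temperature)
    q = softmax(student, temperature)
    # Terms with zero teacher probability contribute nothing to the divergence.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def pointwise_mse(teacher: list[float], student: list[float]) -> float:
    """Mean squared error between per-document teacher and student scores."""
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)
```

The listwise loss only constrains the relative ordering and spacing of scores within a list (it is invariant to adding a constant to all student scores), whereas the pointwise loss also pins down their absolute values.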

4. Compressed Representations in Multimodal and Code Domains

In vision and multimodal modeling, compressed continuous representations supplant both discrete tokenizers and high-dimensional raw features. For example, UniCom compresses SigLIP-2 features to a much smaller channel dimension, using a shallow Transformer and channel-only reduction. This preserves semantic and spatial information, accelerates training, yields near-lossless reconstructions (e.g., rFID 0.42 vs. 0.38), and supports high-fidelity image generation and editing without variational autoencoder latents (Zhao et al., 11 Mar 2026).

Similarly, in code completion, frameworks like LLavaCode compress each retrieved code passage into a single continuous vector using a projector module, shortening the prompt and decreasing time-to-first-token (TTFT) by 20–38% compared to full retrieval-augmented generation pipelines, while maintaining semantic efficacy as measured by exact match (EM) and edit similarity (ES) (Cherniuk et al., 22 Oct 2025).

5. Algorithmic and Lossless Compression: Classical and Hybrid Approaches

Beyond learned representations, classic algorithmic compressed representations remain foundational. Direct conversion between grammar-based, Lempel-Ziv, run-length, and other compressed formats is feasible in time polynomial in input and output compressed sizes, bypassing full decompression. This is realized using symbolic manipulation of compressed representations, compressed-pattern-matching, and structure-preserving transformations—enabling format conversion even for exponentially compressed data sets (Goto et al., 2011).
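A toy case shows what converting between compressed formats without decompression looks like. The sketch below turns a run-length encoding into LZ77-style factors by touching only the runs; it is an illustration of the general idea, not the algorithm of the cited work, and the factor format and function names are assumptions. The resulting factorization is valid but not necessarily the shortest LZ77 parse.

```python
def rle_to_lz(runs: list[tuple[str, int]]) -> list[tuple]:
    """Convert run-length pairs (char, count) into LZ77-style factors.

    Each run becomes a literal ("lit", char) followed, when count > 1, by a
    self-overlapping copy ("copy", offset=1, length=count-1). The text itself
    is never materialized, so the work is linear in the number of runs.
    """
    factors = []
    for char, count in runs:
        if count < 1:
            raise ValueError("run lengths must be positive")
        factors.append(("lit", char))
        if count > 1:
            factors.append(("copy", 1, count - 1))
    return factors


def decode_lz(factors: list[tuple]) -> str:
    """Reference decoder, used here only to check the conversion."""
    out = []
    for factor in factors:
        if factor[0] == "lit":
            out.append(factor[1])
        else:
            _, offset, length = factor
            for _ in range(length):
                out.append(out[-offset])  # copies may overlap their own output
    return "".join(out)
```

A single run such as `("a", 10**9)` converts to two factors in constant time, even though the text it describes is far too long to expand comfortably, which is the situation the polynomial-time conversion results address.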

Lossless schemes also include ternary-binary coding exploiting patterns in ternary digit strings (such as the B₂₃ mapping, which leverages the "12" ternary pair for bit reduction) (Katugampola, 2010), and unconventional number-theoretic schemes like Logarithmic Positional Partition Interval Encoding (LPPIE). LPPIE applies iterative base-10 logarithms to substrings of decimal input, storing each as a (mantissa, iteration count) pair. This approach is reported to achieve higher compression ratios than ZIP, but at very high computational cost and with a need for high-precision arithmetic (Alevizos et al., 2024).
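The iterated-logarithm idea, and its dependence on arithmetic precision, can be illustrated with a toy round trip. This is not the LPPIE algorithm from the cited work: the function names, the stopping threshold of 10, and the default 60-digit working precision are all assumptions of this sketch, and it encodes one whole integer rather than partitioned substrings.

```python
from decimal import Decimal, localcontext


def iterlog_encode(n: int, precision: int = 60) -> tuple[Decimal, int]:
    """Apply log10 repeatedly until the value drops below 10.

    Returns (mantissa, iterations). `precision` is the number of significant
    decimal digits kept at every step (a parameter of this sketch).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    with localcontext() as ctx:
        ctx.prec = precision
        value, iterations = Decimal(n), 0
        while value >= 10:
            value = value.log10()
            iterations += 1
    return value, iterations


def iterlog_decode(mantissa: Decimal, iterations: int, precision: int = 60) -> int:
    """Invert the encoding by raising 10 to the running value, `iterations` times."""
    with localcontext() as ctx:
        ctx.prec = precision
        value = mantissa
        for _ in range(iterations):
            value = Decimal(10) ** value
        return int(value.to_integral_value())
```

For example, `iterlog_decode(*iterlog_encode(12345678901234567890))` recovers the input at 60 digits of precision, while the same round trip at `precision=10` does not. Each exponentiation amplifies rounding error in the mantissa, which is consistent with the high-precision requirement noted above.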

6. Information-Theoretic and Variational Learning Perspectives

Compressed sensing, information bottlenecks, and variational learning connect compressed representation learning with fundamental theory. Uncertainty Autoencoders (UAEs) frame the compressed representation as a noisy channel, maximizing a variational lower bound on the mutual information between the input and its noisy compressed measurements via end-to-end learned (possibly nonlinear) encoders and amortized decoders. This approach subsumes PCA in the high-noise limit, classical compressed sensing with known measurement matrices, and generative modeling. Empirical results show UAEs outperform baseline compressed sensing and generative approaches by about 32% in reconstruction error (Grover et al., 2018).
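For reference, a variational lower bound of this kind can be written in the standard Barber–Agakov form. The notation is assumed here rather than taken from the cited work: $X$ is the input, $Y$ its noisy compressed measurement, and $q_\phi(x\mid y)$ a learned decoder distribution.

$I(X;Y) = H(X) - H(X\mid Y) \;\ge\; H(X) + \mathbb{E}_{p(x,y)}\left[\log q_\phi(x\mid y)\right].$

The inequality holds because replacing the true posterior $p(x\mid y)$ with $q_\phi(x\mid y)$ can only lower the expected log-likelihood, the gap being a non-negative KL divergence. Since $H(X)$ does not depend on the model parameters, maximizing the bound amounts to maximizing the expected decoder log-likelihood; with a fixed-variance Gaussian decoder this reduces to minimizing squared reconstruction error.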

In sequential data, Compressed Predictive Information Coding (CPIC) seeks to minimize input–compression complexity while maximizing predictability in latent space. This is formalized as a single objective combining the two terms and is optimized by variational bounds (Barber–Agakov, InfoNCE) with stochastic encoders and critics. Stochasticity in the encoder robustifies extraction of predictive latent structure, surpassing deterministic Gaussian methods for recovering latent dynamics under severe noise (Meng et al., 2022).
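The InfoNCE bound mentioned above can be computed from a matrix of critic scores. The sketch below is a generic InfoNCE estimator and not the CPIC training code; the function name is hypothetical, and in a predictive-coding setting `scores[i][j]` would be assumed to compare the latent code of past window `i` with that of future window `j`.

```python
import math


def info_nce_bound(scores: list[list[float]]) -> float:
    """InfoNCE estimate from an N x N critic score matrix.

    scores[i][j] is the critic value f(x_i, y_j); the diagonal holds the
    positive (paired) samples and the off-diagonal entries act as negatives.
    Returns log(N) + mean_i log softmax_j(scores[i])[i], which is at most log(N).
    """
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        peak = max(row)
        # Stable log-sum-exp over the row (positive plus negatives).
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in row))
        total += row[i] - log_norm
    return math.log(n) + total / n
```

An uninformative critic (all scores equal) gives an estimate of zero, and the estimate saturates at $\log N$ when positives are scored far above negatives, so the batch size caps how much mutual information this bound can certify.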

7. Applications, Limitations, and Frontiers

Applications of compressed input representations include:

Long-context language and multimodal models, allowing context-length reductions without accuracy loss (Zhao et al., 2024, Déjean et al., 21 May 2025, Zhao et al., 11 Mar 2026).
Accelerated inference and reranking with constant input length for large document sets (Déjean et al., 29 Apr 2026).
Lossless or lossy semantic compression to optimize context window usage in LLM-based pipelines (Gilbert et al., 2023).
Efficient storage, transmission, and symbolic reasoning (e.g., retrograde game solving using DFA-compressed position sets, achieving sublinear space bounds) (Considine, 2024).

However, limitations are prominent: generalization guarantees for learned compressions collapse if feature maps are data-dependent (Hafez-Kolahi et al., 2019). In token-based compression, the risk of "token overflow"—where the compressed input lacks information sufficient for downstream tasks—necessitates joint query-context probing and detection, as naive statistics on the compressed tokens are insufficient for overflow diagnosis (Belikova et al., 12 Feb 2026). Some number-theoretic or non-standard methods (e.g., LPPIE) yield extreme compression at prohibitive processing cost (Alevizos et al., 2024).

Overall, compressed input representations are a confluence of theoretical, algorithmic, and practical innovations, each informed by domain-specific requirements and fundamental information-theoretic limits. They remain indispensable for scaling modern machine learning, retrieval, and multimodal systems.