CompressARC: Advanced Compression Methods
- CompressARC is a suite of advanced compression techniques spanning neural context compression, visual reasoning, large-archive retrieval, and adaptive database indexing.
- The ARC-Encoder method compresses LLM context $4\times$ to $8\times$ by pooling token representations, preserving near open-book performance while enabling multi-decoder adaptation.
- RLZ-based archive compression and adaptive column strategies offer high compression ratios and rapid random access, while MDL-driven models enable data-efficient visual reasoning.
CompressARC refers to a collection of advanced compression techniques and systems applied across distinct domains, most notably in (1) neural context compression for LLMs (Pilchen et al., 23 Oct 2025), (2) data-efficient learning for ARC-AGI visual reasoning benchmarks (Liao et al., 5 Dec 2025), (3) efficient random-access archiving of large textual datasets via RLZ (relative Lempel-Ziv) (Petri et al., 2016), and (4) adaptive integer column compression for in-memory self-driving databases (Fehér et al., 2022). Each instantiation introduces a unique algorithmic and engineering approach to compression, emphasizing specific trade-offs between efficiency, fidelity, generalization ability, and deployment feasibility.
1. CompressARC for Neural Context Compression in LLMs
CompressARC, also formalized as ARC-Encoder, constitutes a plug-and-play “soft” context compressor designed for transformer-based LLMs without modifying their architectures (Pilchen et al., 23 Oct 2025). The architecture consists of a Transformer-based encoder, a pooling operation in the last self‐attention block, and a two-layer MLP projector.
Given a sequence of tokens, ARC-Encoder emits continuous representations (with $4$ or $8$ as typical pooling factors $P$), which are injected directly into the decoder's embedding layer. The decoder is frozen, and only the encoder/MLP are trainable. Pooling is performed by averaging non-overlapping $P$-length chunks of queries in the last encoder layer:

$$\bar{q}_j = \frac{1}{P} \sum_{i=(j-1)P+1}^{jP} q_i, \qquad j = 1, \dots, \lceil n/P \rceil,$$

so a length-$n$ context is reduced to $\lceil n/P \rceil$ soft context vectors.
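A minimal NumPy sketch of the pooling step follows; the function name, shapes, and the assumption that the sequence is pre-padded to a multiple of $P$ are illustrative, not taken from the paper.

```python
import numpy as np

def mean_pool_queries(q: np.ndarray, pool_factor: int) -> np.ndarray:
    """Average non-overlapping chunks of last-layer query states.

    q: (seq_len, d_model) token representations from the encoder's
    final self-attention block; seq_len is assumed to already be
    padded to a multiple of pool_factor (a simplification here).
    Returns (seq_len // pool_factor, d_model) soft context vectors.
    """
    seq_len, d_model = q.shape
    assert seq_len % pool_factor == 0, "pad the sequence first"
    chunks = q.reshape(seq_len // pool_factor, pool_factor, d_model)
    return chunks.mean(axis=1)

# 32 token states compressed 4x into 8 continuous vectors, which would
# then be projected by the MLP and injected into the decoder's embeddings.
q = np.random.randn(32, 64)
print(mean_pool_queries(q, pool_factor=4).shape)  # (8, 64)
```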
The encoder can be trained to serve multiple decoder LLMs simultaneously, with decoder-specific MLP adapters accounting for only a small fraction of the encoder's parameters.
Training utilizes a mix of two alternating objectives: (a) reconstruction of the original sequence, and (b) continuation, i.e., predicting the sequence that follows a compressed segment. The composite loss is

$$\mathcal{L} = \lambda\,\mathcal{L}_{\text{recon}} + (1-\lambda)\,\mathcal{L}_{\text{cont}},$$

where $\lambda$ mixes the two objectives; an intermediate mixing ratio proved optimal. Fine-tuning preserves in-context learning by interleaving few-shot examples and restricts updates to the encoder and MLP only.
Empirical results show that at $4\times$ compression, ARC-Encoder achieves nearly open-book performance on QA, translation, and summarization benchmarks, with a substantial measured prefill speed-up. At $8\times$ compression, degradation increases on tasks requiring token-level precision, but performance remains above the no-context (closed-book) baseline. The system generalizes across LLM families such as Llama and Mistral with minimal per-decoder adaptation cost.
2. CompressARC in ARC-AGI Without Pretraining (MDL-driven Visual Reasoning)
CompressARC in the ARC-AGI domain designates a 76,000-parameter model that forgoes pretraining and instead optimizes a Minimum Description Length (MDL) criterion at inference, learning from scratch on a per-puzzle basis (Liao et al., 5 Dec 2025). The objective is to discover the model/program minimizing the total description length

$$L(M) + L(D \mid M),$$

where $L(M)$ and $L(D \mid M)$ measure the bits to encode the model and the data given the model, respectively. CompressARC converts this to a differentiable variational form using a Gaussian latent seed $z$ and a neural decoder $f_\theta$, yielding the loss

$$\mathcal{L}(\theta, q) = \underbrace{D_{\mathrm{KL}}\big(q(z)\,\|\,p(z)\big)}_{\approx\,L(M)} + \underbrace{\mathbb{E}_{z \sim q}\big[-\log p_\theta(D \mid z)\big]}_{\approx\,L(D \mid M)}.$$
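The sketch below computes this objective in bits, assuming a diagonal-Gaussian posterior over the latent seed and a decoder negative log-likelihood supplied externally; all names and shapes are illustrative rather than drawn from the paper's code.

```python
import numpy as np

def description_length_bits(mu: np.ndarray, log_sigma: np.ndarray,
                            decoder_nll_bits: float) -> float:
    """Variational MDL objective in bits.

    The KL divergence of the diagonal-Gaussian posterior q(z) from the
    standard-normal prior p(z) approximates L(M), the bits to encode
    the latent seed; decoder_nll_bits = -log2 p_theta(D | z) (e.g. the
    cross-entropy of per-cell color predictions) approximates L(D | M).
    """
    sigma_sq = np.exp(2.0 * log_sigma)
    kl_nats = 0.5 * np.sum(mu**2 + sigma_sq - 2.0 * log_sigma - 1.0)
    return kl_nats / np.log(2.0) + decoder_nll_bits

# Toy usage: a 16-dim latent seed, and a decoder that assigns the
# correct color probability 0.9 to each cell of a 5x5 output grid.
mu, log_sigma = np.zeros(16), np.full(16, -1.0)
print(description_length_bits(mu, log_sigma, -25 * np.log2(0.9)))
```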
The architecture leverages multitensor representations and group-equivariant networks, supporting operations such as mean reduce/broadcast, softmax sharpening across spatial or color dimensions, and geometric directional communication. All learning occurs at inference time, with no use of any training set or pretrained weights.
Empirical results on the ARC-AGI benchmark show that CompressARC solves 20% of evaluation puzzles (pass@2), a substantial rate considering the zero-shot, no-pretraining condition. The model exhibits strong inductive biases in geometric reasoning (object localization, infilling, directional extension), but is limited in long-range recurrence and multi-step algorithmic reasoning. CompressARC stands out by providing a compression-based alternative to large-scale pretraining for intelligence benchmarks.
3. CompressARC for Large-Scale Archive Compression (RLZ for Random Access)
In high-redundancy, large-scale archives such as web or ARC files, CompressARC refers to an RLZ-based system for efficient compression and rapid random-access retrieval (Petri et al., 2016). The method constructs a semi-static dictionary $\mathcal{D}$ by uniformly sampling the archive, partitions the data into fixed-length blocks, and then greedily factorizes each block against $\mathcal{D}$.
Each block is encoded as a sequence of $\langle \text{offset}, \text{length} \rangle$ factors referencing $\mathcal{D}$, with literals used for unmatched bytes. Factor streams are then encoded with static bit-width and variable-byte codes. The compression ratio (compressed size over original size) and access cost are derived analytically; empirically, ratios in the 15–25% range are attainable with 0.3–0.4 ms block random-access latency on SSD for KiB-scale blocks and MiB-scale (or larger) dictionaries.
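The greedy factorization can be sketched as below. This quadratic-time version is for exposition only; a production RLZ system would match against a suffix-array index of the dictionary.

```python
def rlz_factorize(block: bytes, dictionary: bytes):
    """Greedy left-to-right factorization of one block against a
    semi-static dictionary. Emits (offset, length) copy factors and
    (byte, 0) literals for unmatched bytes. A real implementation
    would also write the factor streams with static bit-width and
    variable-byte codes rather than returning Python tuples.
    """
    factors, i = [], 0
    while i < len(block):
        best_off, best_len = 0, 0
        for off in range(len(dictionary)):  # longest dictionary match
            length = 0
            while (i + length < len(block)
                   and off + length < len(dictionary)
                   and dictionary[off + length] == block[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        if best_len >= 2:                       # copy factor
            factors.append((best_off, best_len))
            i += best_len
        else:                                   # literal byte
            factors.append((block[i], 0))
            i += 1
    return factors

print(rlz_factorize(b"abracadabra", b"cadabra_abra"))
# [(3, 4), (0, 7)] -- two copy factors cover the whole block
```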
In practical terms, RLZ-based CompressARC provides near-optimal compression with sub-millisecond random access on SSDs, which is essential for workloads requiring intermittent, partial retrieval, such as web search archives. System integration requires keeping $\mathcal{D}$ and a block index in memory and, optionally, incorporating block priming or three-stream extensions for further marginal improvements.
4. CompressARC for Adaptive Column Compression in Databases
In columnar in-memory databases, CompressARC specifies an adaptive integer-column compressor based on generalized deduplication (GD) and the "LastBit" transformation (Fehér et al., 2022). Each 32-bit value $v$ is split into a base $b$ (the high $32-k$ bits) and a deviation $d$ (the low $k$ bits), where $k$ (the deviation size) is chosen adaptively; bases are deduplicated and stored in a sorted array, along with compressed deviation indices.
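A compact sketch of the base/deviation split and base deduplication, assuming the LastBit-style low-bits deviation described above (function and variable names are illustrative):

```python
from typing import List, Tuple

def gd_compress(values: List[int], k: int):
    """Split each 32-bit value into a base (high 32-k bits) and a
    deviation (low k bits), deduplicate the bases into a sorted
    array, and store a (base_index, deviation) pair per value."""
    mask = (1 << k) - 1
    bases = sorted({v >> k for v in values})
    index = {b: i for i, b in enumerate(bases)}
    pairs = [(index[v >> k], v & mask) for v in values]
    return bases, pairs

def gd_decompress(bases: List[int], pairs: List[Tuple[int, int]], k: int):
    """Reassemble original values from the base array and deviations."""
    return [(bases[i] << k) | d for i, d in pairs]

vals = [1000, 1001, 1007, 2048, 2050]
bases, pairs = gd_compress(vals, k=4)
print(bases)    # [62, 128] -- five values share two deduplicated bases
assert gd_decompress(bases, pairs, k=4) == vals
```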
Four GD variants mediate trade-offs between access speed, scan speed, and compression ratio, using per-value arrays, deduplication, and per-base deviation organization. The segmentation adaptation mechanism profiles all potential deviation sizes $k$, normalizes the resulting compression and latency metrics, and selects the $k$ maximizing an application-weighted utility function:

$$k^{\ast} = \arg\max_{k} \big[\, w_c\,\hat{C}(k) + w_r\,\hat{R}(k) + w_s\,\hat{S}(k) + w_q\,\hat{Q}(k) \,\big],$$

where $w_c$, $w_r$, $w_s$, and $w_q$ are tunable weights for compression, random access, sequential access, and scan, respectively, and $\hat{C}$, $\hat{R}$, $\hat{S}$, $\hat{Q}$ are the corresponding normalized scores.
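A toy version of the selection loop is sketched below; the bit-count size model and the access-score proxy are illustrative stand-ins for the paper's measured metrics and weights.

```python
import math

def compressed_bits(values, k: int) -> int:
    """Simple size model: each unique base costs 32 - k bits; each
    value stores a base index plus its k-bit deviation."""
    bases = {v >> k for v in values}
    index_bits = max(1, math.ceil(math.log2(len(bases))))
    return len(bases) * (32 - k) + len(values) * (index_bits + k)

def choose_k(values, w_compression: float = 1.0,
             w_access: float = 0.1) -> int:
    """Profile every candidate deviation size, normalize the metrics,
    and pick the k maximizing the weighted utility."""
    ks = range(1, 32)
    sizes = {k: compressed_bits(values, k) for k in ks}
    lo, hi = min(sizes.values()), max(sizes.values())

    def utility(k: int) -> float:
        c_score = 1.0 - (sizes[k] - lo) / max(1, hi - lo)  # 1 = smallest
        a_score = k / 31.0  # toy access proxy, not the paper's cost model
        return w_compression * c_score + w_access * a_score

    return max(ks, key=utility)

vals = [1000, 1001, 1007, 2048, 2050, 2051] * 100
print(choose_k(vals))
```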
Integrations into systems such as Hyrise yield compression notably superior to PFoR (5–15% better) with only 20–30% query overhead, and at least $8\times$ faster access compared to LZ4. CompressARC supports both late- and early-materialization workloads, automatically retraining segments as access patterns change.
5. Performance Trade-offs and Practical Considerations
CompressARC implementations are distinguished by their explicit attention to efficiency and accuracy under domain constraints:
- ARC-Encoder neural compressors maintain few-shot LLM capabilities, delivering prefill speed-ups at $4\times$ context compression with minimal loss in downstream metrics (average exact match drops from 49.2 to 45.5 at a pooling factor of 4).
- RLZ-based archive compressors reach compression ratios in the 15–25% range, with random-access latency below 0.4 ms (SSD) and on the order of milliseconds (HDD), suitable for large datasets with sporadic access.
- Adaptive columnar CompressARC attains strong compression on synthetic integer data (GD1-LM, default configuration) and further gains on real workloads, preserving random-access/query efficiency within 20–30% of PFoR.
- MDL-driven CompressARC models in ARC-AGI, despite severe data constraints, achieve a 20% (pass@2) solve rate, far surpassing baselines that use neither pretraining nor additional data.
Practical limitations include performance degradation at extreme compression factors (e.g., $8\times$ in ARC-Encoder), the need for adequate RAM to hold the RLZ dictionary, and the limited expressive power of MDL-driven models on algorithmically deep ARC puzzles. Configuration parameters, such as the pooling factor $P$ in LLMs, dictionary/block sizes in RLZ, and the deviation size $k$ in column compression, should be tuned to balance accuracy, speed, and resource constraints.
6. Research Impact and Future Directions
CompressARC methodologies represent state-of-the-art solutions across diverse compression challenges, with broad implications:
- In LLM applications, portable soft-compression (ARC-Encoder) enables scalable, efficient prompt engineering and retrieval-augmented reasoning, decoupling encoder and decoder development across model families (Pilchen et al., 23 Oct 2025).
- MDL-based puzzle solvers suggest that data-efficient intelligence is feasible via per-instance inference, motivating exploration of richer model classes and more efficient optimization for universal induction in vision (Liao et al., 5 Dec 2025).
- RLZ-based compression is highly relevant for large-scale NLP, scientific, and web archives, providing precise engineering guidance for system builders (Petri et al., 2016).
- Adaptive database compression addresses a longstanding gap in self-driving DBMSs, with automatic balancing of compression and access cost based on workload and data patterns (Fehér et al., 2022).
A plausible implication is that the convergence of soft, universal compression principles (MDL, RLZ, soft neural pooling) with practical system design will become increasingly critical as dataset scale and diversity increase. Further research may address multi-modal context compression, learnable block-dictionary optimization in RLZ, and compression-augmented self-supervised learning.
References
| Application Area | Approach/Implementation | Reference |
|---|---|---|
| Neural context compression (LLMs) | ARC-Encoder (soft compressor) | (Pilchen et al., 23 Oct 2025) |
| Visual reasoning (ARC-AGI puzzles) | MDL-driven, no-pretraining model | (Liao et al., 5 Dec 2025) |
| Archive compression & random access | RLZ-based CompressARC | (Petri et al., 2016) |
| Column-store database compression | GD-based adaptive CompressARC | (Fehér et al., 2022) |