Selective Compression via Reinforcement Learning
- Selective compression via reinforcement learning is a field that leverages RL to automate and optimize the reduction of redundant data and model parameters while preserving essential performance.
- The approach integrates techniques like policy evaluation, entropy regularization, and resource-constrained optimization to achieve significant reductions in model size and computational load.
- Applications include deep network pruning, contextual prompt compression, and efficient policy space reduction, leading to improved sample efficiency and real-time adaptability.
Selective compression via reinforcement learning refers to a diverse and rapidly advancing set of methodologies where reinforcement learning (RL) is leveraged to automate, optimize, or directly control the process of compressing information, data, or policy/behavioral representations in learning agents and systems. These methods seek to identify and eliminate redundancy, irrelevant details, or parameter excess, while ensuring that informational, behavioral, or functional fidelity is retained for core downstream tasks. RL-driven selective compression has produced rigorous frameworks and strong empirically validated results in deep network model compression, context summarization for LLMs, policy space reduction, sensory and action sequence regularization, lifelong learning, and other domains.
1. Foundations and Theoretical Principles
The core principle underpinning selective compression in RL is to formally couple compression objectives—such as reduction in parameter count, memory, token length, or information bits—with performance metrics relevant to the system’s task, and to employ RL techniques (policy gradient, actor–critic, GDA, DQN, PPO, or specialized algorithms) as the optimization driver.
Several pivotal abstractions and techniques emerge:
- Compression as Policy Evaluation: In “Compress and Control” (Veness et al., 2014), value estimation for an RL agent is reframed as density modeling (or compression) of observed experience. This yields plug-in estimators that use density estimates obtained from any probabilistic model (e.g., context tree weighting, Lempel–Ziv, autoregressive models), with rigorous convergence guarantees under stationarity and ergodicity assumptions (a minimal sketch follows this list).
- Compression as Action/Sequence Priors: In “Reinforcement Learning with Simple Sequence Priors” (Saanum et al., 2023) and “Robust Predictable Control” (Eysenbach et al., 2021), RL objectives are regularized by penalizing the information content or complexity of policy outputs. This is formalized via entropy terms, explicit coding penalties (e.g., bits-back coding or autoregressive sequence priors), or information bottleneck constraints on internal representations.
- Compression as Resource-Constrained RL: In model compression works such as ADC (Hakkak, 2018), DECORE (Alwani et al., 2021), and two-stage DRL approaches (Zhan et al., 2019), the RL agent sequentially chooses compression hyperparameters (e.g., per-layer pruning ratios, quantization levels) for each part of a deep model to optimize multi-objective reward functions balancing size, latency, energy, and accuracy.
- Compression for Policy Space Coverage: “Reward-Free Policy Space Compression” (Mutti et al., 2022) seeks to represent an (uncountable) family of policies by a finite set, covering induced state–action distributions up to a divergence threshold, using set cover and game-theoretic formulations executed through gradient-based adversarial policy optimization.
- Lossless Compression in Retrieval-Augmented Generation: In “CORE” (Cui et al., 24 Aug 2025), RL is employed to compress retrieved contexts for LLMs by maximizing end-task answer quality with a compressor trained by a policy-optimization objective, sidestepping hand-constructed compression targets.
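To make the policy-evaluation-as-compression idea concrete, the following minimal Python sketch substitutes a Laplace-smoothed count model for a real compressor/density model and forms a plug-in value estimate as the expected return under the modeled conditional. All names here (`CountDensityModel`, `plugin_value`, the discretized return support) are illustrative assumptions, not constructs from the cited paper.

```python
from collections import defaultdict

RETURNS = [0.0, 0.5, 1.0]  # assumed discretized return support

class CountDensityModel:
    """Toy density model over (state, return) pairs.

    Stands in for a real compressor/density model such as context tree
    weighting or an autoregressive sequence model.
    """
    def __init__(self):
        self.joint = defaultdict(int)     # counts for (state, return)
        self.marginal = defaultdict(int)  # counts for state

    def update(self, state, ret):
        self.joint[(state, ret)] += 1
        self.marginal[state] += 1

    def conditional(self, ret, state):
        # Laplace-smoothed estimate of P(return | state)
        return (self.joint[(state, ret)] + 1) / (self.marginal[state] + len(RETURNS))

def plugin_value(model, state):
    """Plug-in value estimate: expected return under the modeled conditional."""
    return sum(r * model.conditional(r, state) for r in RETURNS)

# Usage: update the model from observed (state, discretized return) pairs,
# then query the plug-in estimate for any visited state.
model = CountDensityModel()
for s, g in [("s0", 1.0), ("s0", 0.5), ("s0", 1.0), ("s1", 0.0)]:
    model.update(s, g)
print(plugin_value(model, "s0"))  # expected return estimate at state s0
```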
2. Methodologies and Optimization Frameworks
Selective compression RL methods instantiate a diverse suite of algorithmic frameworks:
| Approach | Compression Target | RL Framework |
|---|---|---|
| Model Compression (ADC, DECORE) | NN channels, weights, layers | Actor–critic, REINFORCE, PPO |
| Policy Space Compression | Distribution over policies | Gradient descent–ascent games |
| Sequence Compression | Action/observation sequence entropy | Entropy-regularized SAC, PPO |
| Experience Replay Compression | Experience sample sets | Bandit optimization, clustering |
| Document/Prompt Compression | Token selection for LMs | Policy gradient, SCST, GRPO |
For model pruning and quantization, the RL agent often makes layer-wise or channel-wise decisions, guided by composite rewards (e.g., weighted sums of accuracy loss, FLOPs, and memory). Policy optimization is conducted via variants of DDPG, PPO, REINFORCE, or lightweight policy gradient updates, sometimes under resource constraints formalized as dual objectives with Lagrange multipliers (Grönland et al., 2023).
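The composite-reward pattern can be sketched as below, assuming a simple weighted reward over accuracy drop, FLOPs, and latency plus a dual-ascent update on a Lagrange multiplier for a FLOPs budget; the weights, names, and budget handling are illustrative, not the exact formulation of any cited method.

```python
def compression_reward(acc_drop, flops_ratio, latency_ms, w_acc=1.0, w_lat=0.01):
    """Composite reward: penalize accuracy loss and latency, favor FLOP savings."""
    return -w_acc * acc_drop + (1.0 - flops_ratio) - w_lat * latency_ms

class LagrangianBudget:
    """Dual-variable update enforcing a FLOPs budget as a soft constraint."""
    def __init__(self, budget_ratio, lr=0.05):
        self.budget = budget_ratio  # e.g. keep <= 40% of the original FLOPs
        self.lmbda = 0.0
        self.lr = lr

    def penalized_reward(self, base_reward, flops_ratio):
        # Constraint: flops_ratio - budget <= 0
        violation = flops_ratio - self.budget
        # Dual ascent on the multiplier; projection keeps it non-negative.
        self.lmbda = max(0.0, self.lmbda + self.lr * violation)
        return base_reward - self.lmbda * violation

# Usage with hypothetical per-episode measurements:
budget = LagrangianBudget(budget_ratio=0.4)
r = compression_reward(acc_drop=0.012, flops_ratio=0.55, latency_ms=8.3)
print(budget.penalized_reward(r, flops_ratio=0.55))
```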
In high-dimensional, sequence, or black-box settings, the policy acts as a discrete selector (for tokens, prompt components, or actions), and is trained via single-step or group-based policy gradient updates with custom reward functions (e.g., answer quality for document compression, ROUGE for prompt compression) and variance control via entropy or KL-divergence regularization (Jung et al., 2023, Cui et al., 24 Aug 2025).
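A minimal sketch of the discrete-selector pattern follows, assuming a per-token keep/drop Bernoulli policy trained with a plain REINFORCE estimator and a placeholder task reward; the linear scorer and `task_reward` are stand-ins, not the models or rewards used in the cited works.

```python
import torch

torch.manual_seed(0)
vocab_dim, seq_len = 32, 10
scorer = torch.nn.Linear(vocab_dim, 1)        # per-token keep logit
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)

def task_reward(mask):
    """Placeholder reward: trade a quality proxy against the number of kept tokens."""
    return -0.1 * mask.sum().item() + float(mask[0])  # pretend token 0 is essential

token_feats = torch.randn(seq_len, vocab_dim)  # stand-in token embeddings
for step in range(100):
    logits = scorer(token_feats).squeeze(-1)          # (seq_len,)
    dist = torch.distributions.Bernoulli(logits=logits)
    mask = dist.sample()                              # 1 = keep token, 0 = drop
    reward = task_reward(mask)
    loss = -(reward * dist.log_prob(mask).sum())      # REINFORCE estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice the scalar reward would come from the downstream task (e.g., answer accuracy of an LLM given the compressed prompt), and variance-control terms such as entropy or KL regularization would be added to the loss.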
3. Applications Across Domains
RL-driven selective compression is now prevalent in several domains:
- Deep Model Compression: ADC (Hakkak, 2018), DECORE (Alwani et al., 2021), ShrinkML (Dudziak et al., 2019), and the two-stage DRL approach (Zhan et al., 2019) demonstrate substantial reductions in model size (up to 99% in some VGG networks) and FLOPs (up to 61.8% in ResNet-110) while retaining or even improving accuracy. These methods automate finding per-layer sparsity/quantization rates, outperforming manual engineering.
- Policy Compression and RL Generalization: Policy space compression (Mutti et al., 2022) achieves finite, sample-efficient policy sets with bounded divergence from any policy in the original space, enabling more efficient off-policy evaluation and robust, task-agnostic exploration.
- Retrieval-Augmented and Prompt Compression: The CORE method (Cui et al., 24 Aug 2025) compresses large retrieval sets for LLMs to as little as 3% of their original context size without degrading, and sometimes even improving, answer accuracy. Discrete prompt compression via RL (Jung et al., 2023) yields over 24% token reduction in instruction prompts and adapts across black-box LMs.
- Video and Signal Compression: Adaptive video compressive sensing (Lu et al., 2021) uses RL to dynamically select the compression ratio (e.g., the number of frames per measurement), balancing detection accuracy against bandwidth and enabling low-latency, real-time operation in edge settings (a bandit-style sketch follows this list).
- Lifelong and Continual Learning Replay Compression: Reward distribution-preserving coreset compression (Zheng et al., 2023) for experience replay yields 10×–40× ERB size reductions in medical RL without significant localization performance drop, alleviating catastrophic forgetting and memory bottlenecks.
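As referenced in the video compressive sensing item above, the adaptive choice of compression ratio can be sketched as an epsilon-greedy bandit over a small grid of candidate ratios, with a reward trading detection accuracy against bandwidth savings; the ratio grid, reward weights, and function names are assumptions for illustration only.

```python
import random

RATIOS = [4, 8, 16, 32]          # candidate frames-per-measurement settings
q_values = {r: 0.0 for r in RATIOS}
counts = {r: 0 for r in RATIOS}

def ratio_reward(detection_acc, ratio, w_bw=0.02):
    """Reward detection accuracy plus a bonus for bandwidth saved at higher ratios."""
    return detection_acc + w_bw * ratio

def select_ratio(eps=0.1):
    """Epsilon-greedy selection over candidate compression ratios."""
    if random.random() < eps:
        return random.choice(RATIOS)
    return max(RATIOS, key=lambda r: q_values[r])

def update(ratio, reward):
    counts[ratio] += 1
    # Incremental mean update of the action-value estimate.
    q_values[ratio] += (reward - q_values[ratio]) / counts[ratio]

# Usage with a hypothetical detection-accuracy measurement for one episode:
r = select_ratio()
update(r, ratio_reward(detection_acc=0.91, ratio=r))
```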
4. Performance Metrics, Trade-offs, and Empirical Findings
Reported empirical results across these works consistently show that RL-based selective compression can yield dramatic efficiency improvements with controlled or even improved performance metrics:
- Model Size and Throughput: Compression ratios exceeding 30× on ImageNet/VGG-16 (Zhan et al., 2019), 99% size reduction (Alwani et al., 2021), and up to 400× neural network size reductions for RL agents in control/Atari settings (Ivanov et al., 13 May 2024), all with negligible accuracy loss and, in some cases, improved test accuracy.
- End-Task Performance: In retrieval-augmented LLMs, CORE (Cui et al., 24 Aug 2025) achieves a 3.3-point improvement in average Exact Match at a 3% context compression ratio.
- Sample Efficiency: Compression of policy space directly improves sample efficiency in offline evaluation and learning (Mutti et al., 2022).
- Robustness and Interpretability: Compression-driven agents are more robust to observation noise (Saanum et al., 2023), exhibit reduced overfitting, and achieve higher generalization across tasks (Eysenbach et al., 2021).
Trade-offs are inevitable:
- Aggressiveness vs. Fidelity: High compression ratios eventually degrade performance (e.g., 30×–40× coreset compression (Zheng et al., 2023)), but a modest range (10×–20×) yields negligible loss.
- Reward Shaping and Stability: Careful reward crafting is crucial; poorly balanced rewards can degrade accuracy or cause unstable convergence (Hakkak, 2018, Zhan et al., 2019).
- Scalability: Gradient-based set cover algorithms for policy space compression circumvent combinatorial explosion but provide only locally optimal solutions (Mutti et al., 2022).
5. Emerging Directions and Practical Implications
The successful deployment of RL-based selective compression is enabling:
- Adaptive, Real-Time and Resource-Constrained AI: Systems can now dynamically adjust compression in response to changing resource and latency constraints, as demonstrated in C-RAN fronthaul optimization (Grönland et al., 2023) and adaptive image/video sensing (Lu et al., 2021).
- Task-Independent Skill Discovery and Hierarchical RL: MDL-inspired compression of latent skill labels (Jiang et al., 2022) enables discovery of diverse, temporally extended skills that accelerate learning in new tasks, yielding high sample efficiency.
- Black-box and Transferable Compression Policies: RL-based discrete prompt compression (Jung et al., 2023) and context compressors (Cui et al., 24 Aug 2025) are applicable across different LLMs, with policy transfer assured by token-level actions and reward signals derived from end-task outputs rather than architecture details.
- Robustness and Redundancy Mitigation: Regularizing on information complexity, as in RPC (Eysenbach et al., 2021) and sequence prior agents (Saanum et al., 2023), yields policies that are robust to input noise, promote simplicity, and support open-loop control.
6. Mathematical Formulations and Algorithmic Patterns
Key mathematical constructs include:
- Plug-in Value Estimators: $\hat{V}^{\pi}(s) = \sum_{g} g\, \hat{p}(g \mid s)$, where $\hat{p}(g \mid s)$ is the conditional return distribution obtained from the learned density (compression) model (Veness et al., 2014).
- Compression-Coded RL Objective: $\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t} r_t\right] - \lambda\, C(a_{1:T})$, with $C(\cdot)$ measuring sequence complexity (e.g., code length under a sequence prior or an information bottleneck term) per (Saanum et al., 2023, Eysenbach et al., 2021).
- Policy Space Set Cover: $\min_{\Pi_n \subseteq \Pi} |\Pi_n| \;\; \text{s.t.} \;\; \sup_{\pi \in \Pi}\, \min_{\pi_i \in \Pi_n} D\!\left(d^{\pi} \,\|\, d^{\pi_i}\right) \le \epsilon$, where $d^{\pi}$ is the state–action distribution induced by $\pi$; the problem is solved via gradient descent–ascent to obtain a Stackelberg equilibrium (Mutti et al., 2022).
- GRPO Objective for RL-Based Context Compression: $\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\big(\rho_i(\theta)\,\hat{A}_i,\; \mathrm{clip}(\rho_i(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_i\big)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)$, where $\rho_i(\theta)$ is the importance ratio of the $i$-th sampled compression and $\hat{A}_i = \big(r_i - \mathrm{mean}(r_{1:G})\big)/\mathrm{std}(r_{1:G})$ is the group-normalized advantage (Cui et al., 24 Aug 2025); a short numeric sketch of the advantage computation follows this list.
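As referenced above, the group-normalized advantage can be illustrated with a short numeric sketch: rewards from a group of sampled compressions of the same query are standardized within the group. This shows only the advantage computation, not the full clipped policy-gradient update, and the reward values are made up for illustration.

```python
import statistics

def group_normalized_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one sampled group."""
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Example: four sampled compressions of the same retrieved context,
# each scored by downstream answer quality (e.g., exact match or F1).
print(group_normalized_advantages([0.0, 1.0, 1.0, 0.5]))
```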
7. Broader Impact, Limitations, and Open Problems
Selective compression via reinforcement learning is reshaping the way resource constraints, interpretability, sample efficiency, and computational cost are addressed across the RL and deep learning landscape. Practical methods now offer seamless RL-driven automation of model compression and contextual information pruning for LLMs, jointly optimizing accuracy, bandwidth, latency, and robustness.
Current limitations and open questions include:
- Unsupervised Reward Formulation: The selection and balancing of reward terms remain highly application-dependent and often heuristic.
- Scalability and Search Space: While RL-based search outperforms brute-force or manual engineering, high-dimensional or continuous design spaces (e.g., policy sets, per-layer architectural parameters) continue to pose optimization challenges.
- Transferability and Generalization: Although some works demonstrate transfer of compression policies across models and deployment contexts, formal guarantees and large-scale empirical evaluations are still in development.
- Extension Beyond Vision and NLP: While domains such as wireless systems and lifelong learning have benefited, adaptive compression for complex multimodal or multi-agent scenarios remains a fertile area for exploration.
In sum, the field has advanced from theoretical connections between compression and value estimation (Veness et al., 2014) to large-scale practical RL-driven compression systems that deliver robust, competitive, and efficient models and policies across modalities and domains (Cui et al., 24 Aug 2025, Alwani et al., 2021, Saanum et al., 2023). This trajectory suggests ongoing expansion and deepening integration of selective compression within future reinforcement learning architectures.