- The paper introduces a sample-specific neural compression method that overfits tiny networks to individual images or videos, achieving competitive rate-distortion performance.
- The paper employs innovations such as soft-rounding, Kumaraswamy noise, and conditional entropy modeling to drastically reduce decoding complexity: under 3k MACs/pixel for images and under 5k MACs/pixel for videos.
- The study demonstrates that per-instance tailored optimization results in efficient compression with quality on par with state-of-the-art codecs, ideal for resource-constrained applications.
Overview and Motivation
C3 presents a significant advancement in neural compression by focusing on single-image or single-video instance optimization. Departing from conventional general-purpose neural codecs that require dataset-scale training and heavyweight decoders, C3 aggressively overfits small neural architectures directly to each sample. This yields rate-distortion (RD) performance competitive with state-of-the-art classical codecs like VTM (the H.266/VVC reference software) and neural codecs such as VCT, yet dramatically reduces decoding complexity to under 3k MACs/pixel for images and 5k MACs/pixel for videos.
This strategy is motivated by the observation that generalization across unseen data typically forces neural codecs to use deep, overparameterized decoders unsuitable for resource-constrained deployments. By dedicating bespoke, tiny models to individual samples, C3 unlocks extreme decoder efficiency while matching or exceeding the compression efficacy of much larger models.
(Figure 1)
Figure 1: Rate-distortion trade-off versus decoding complexity on the Kodak benchmark, highlighting C3 as the most favorable approach among neural codecs.
Core Methodology and Architectural Innovations
Foundation: Single-Instance Neural Fields and COOL-CHIC
C3 builds upon the COOL-CHIC framework, which fits all model components—multi-resolution latent grids, a synthesis transform, and an autoregressive entropy model—per instance. The latent grids capture both coarse and fine spatial structure through hierarchical downsampling, and are upsampled before being decoded to RGB.
There is no shared encoder network: encoding is the per-instance optimization itself, after which all weights and latents are quantized and entropy-coded. Both the entropy and synthesis networks are extremely shallow (depth ≤ 4, width ≤ 40).
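The decode path described above can be sketched in a few lines: upsample each latent grid to full resolution, stack the results per pixel, and apply a tiny per-image MLP. This is an illustrative NumPy sketch, not the paper's code; nearest-neighbour upsampling and ReLU stand in for the smoother upsampling and GELU used in practice, and all names and shapes are hypothetical.

```python
import numpy as np

def upsample_nearest(grid, out_h, out_w):
    """Nearest-neighbour upsampling of a 2-D latent grid to full resolution."""
    h, w = grid.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return grid[np.ix_(rows, cols)]

def synthesis(latents, w1, b1, w2, b2):
    """Tiny per-image MLP applied independently at every pixel: latents -> RGB."""
    h = np.maximum(latents @ w1 + b1, 0.0)  # ReLU here for brevity
    return h @ w2 + b2

H, W = 8, 8
rng = np.random.default_rng(0)
# Hierarchy of latent grids at decreasing resolution (full, half, quarter).
grids = [rng.normal(size=(H, W)), rng.normal(size=(H // 2, W // 2)),
         rng.normal(size=(H // 4, W // 4))]
stacked = np.stack([upsample_nearest(g, H, W) for g in grids], axis=-1)  # (H, W, 3)
rgb = synthesis(stacked.reshape(-1, 3),
                rng.normal(size=(3, 12)), np.zeros(12),
                rng.normal(size=(12, 3)), np.zeros(3)).reshape(H, W, 3)
```

Because the MLP is so small and applied pixelwise, the per-pixel MAC count is just the sum of its layer sizes, which is what keeps decoding in the low thousands of MACs/pixel.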
(Figure 2)
Figure 2: C3/COOL-CHIC decoding pipeline: autoregressive entropy coding of latent grids (A), upsampling and synthesis of RGB output (B).
Improvements in Quantization-Aware Optimization
C3 introduces multiple optimizations to quantization-aware training:
- Soft-Rounding: Employs a temperature-controlled smooth approximation of rounding for latent quantization, improving the fidelity of optimization and gradient flow.
- Kumaraswamy Noise: Replaces additive uniform noise with analytically tractable Kumaraswamy noise, adjustable to match the distributional structure of quantization errors, thereby improving convergence.
- Cosine Learning Rate Decay: Optimizes hyperparameter annealing for stability and fast convergence.
- Finer Quantization Step Sizes: Latents are quantized in sub-integer steps to control input magnitude and network stability.
- Adaptive Learning Rates: Learning rates decrease in response to plateaued objective values, especially in the quantized Stage 2 optimization.
- Stage-specific Estimators: Stage 1 uses annealed soft-rounding; Stage 2 employs adaptive soft-rounding for backpropagation rather than crude straight-through estimators.
(Figure 3)
Figure 3: The soft-rounding and Kumaraswamy noise schedule, showing reduced quantization error and improved gradient properties across optimization.
Architectural Enhancements
- Conditional Entropy Modeling: Contexts for latent prediction can include information from lower-resolution grids, increasing expressivity for correlated multi-scale representations.
- Resolution-Dependent Entropy Networks: Via FiLM conditioning or separate networks per grid, the entropy model adapts to grid resolution.
- GELU Activations: Replaces ReLU with GELU, offsetting expressiveness limits of small networks.
- Shifted Log-scale Parameterization: Shifting the log-scale for Laplace entropy output parameters optimizes initialization and convergence.
- Adaptive Per-instance Sweeps: Hyperparameters and architecture choices are tuned for each sample to maximize RD trade-off.
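Several of the enhancements above (FiLM conditioning, GELU, the shifted log-scale) fit in a single tiny network. The sketch below is an assumption-laden toy, not the paper's architecture: `entropy_params`, the layer widths, and the shift value -2.0 are all illustrative. It shows how per-resolution FiLM parameters modulate hidden features and how shifting the log-scale output gives a small, well-behaved initial Laplace scale.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def entropy_params(context, weights, film_gamma, film_beta, log_scale_shift=-2.0):
    """Tiny MLP mapping a causal context vector to Laplace (mu, log_scale) for one
    latent. film_gamma/film_beta are per-resolution FiLM parameters; the shift
    on the log-scale output sets a small initial scale."""
    w1, b1, w2, b2 = weights
    h = gelu(context @ w1 + b1)
    h = film_gamma * h + film_beta  # resolution-dependent feature modulation
    mu, log_scale = h @ w2 + b2
    return mu, log_scale + log_scale_shift

# Zero-initialised toy weights: the prediction defaults to mu = 0, scale = e^-2.
weights = (np.zeros((8, 16)), np.zeros(16), np.zeros((16, 2)), np.zeros(2))
mu, log_scale = entropy_params(np.zeros(8), weights, np.ones(16), np.zeros(16))
```

Sharing one MLP across grids while swapping only the FiLM vectors keeps the parameter count (which must itself be entropy-coded) nearly flat as the number of resolutions grows.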
Video-Specific Methodology
To handle the temporal dimension in video, C3 generalizes the latent grids and entropy contexts to 3D (time × space), optimized in patches for tractability. Temporal context windows are widened to capture fast keypoint displacements (see Figure 4), and custom causal masks for the entropy model's context are chosen based on video motion statistics, limiting entropy-model parameter growth while maintaining efficiency.
(Figure 5)
Figure 5: Jockey sequence from UVG and keypoint displacement analyzed via optical flow, motivating expanded context windows for video entropy coding.
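The widened temporal context can be sketched as follows: for each latent, gather a masked window from the previous frame (which may be wide, so fast motion stays inside it) plus a strictly raster-causal window from the current frame. This is a hypothetical NumPy illustration under assumed conventions (`video_context`, zero-padding at boundaries), not the paper's implementation.

```python
import numpy as np

def video_context(latent, t, y, x, prev_mask, cur_mask):
    """Causal context for latent[t, y, x] in a (T, H, W) latent grid.

    prev_mask: boolean window over frame t-1, centred on (y, x); widening it
    keeps fast inter-frame motion inside the context.
    cur_mask:  boolean window over frame t; only offsets strictly before (y, x)
    in raster order may be True, so decoding stays autoregressive."""
    T, H, W = latent.shape

    def gather(frame_idx, mask):
        h, w = mask.shape
        vals = []
        for j in range(h):
            for k in range(w):
                if not mask[j, k]:
                    continue
                yy, xx = y + j - h // 2, x + k - w // 2
                inside = 0 <= frame_idx < T and 0 <= yy < H and 0 <= xx < W
                vals.append(latent[frame_idx, yy, xx] if inside else 0.0)  # zero-pad
        return vals

    return np.array(gather(t - 1, prev_mask) + gather(t, cur_mask))

latent = np.arange(32, dtype=float).reshape(2, 4, 4)
prev_mask = np.ones((3, 3), dtype=bool)           # full window over the previous frame
cur_mask = np.zeros((3, 3), dtype=bool)
cur_mask[0, :] = True
cur_mask[1, 0] = True                              # only raster-causal neighbours
ctx = video_context(latent, 1, 1, 1, prev_mask, cur_mask)
```

Masking entries out of a fixed maximal window, rather than enlarging the network input, is what keeps the entropy model's parameter count from growing with the context size.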
Experimental Results and Comparative Analysis
Image Compression
On the CLIC2020 benchmark, C3 delivers RD performance nearly equal to VTM while requiring below 3k MACs/pixel, an order of magnitude lower than neural codecs of similar quality. C3 Adaptive even surpasses VTM (-2.0% BD-rate), showing that per-instance architectural sweeps are highly advantageous. Ablation studies attribute the majority of the gains to soft-rounding, GELU activations, and Kumaraswamy noise. Typical decoding times are under 100 ms per 768×512 image on CPU, even with serial autoregressive entropy rollouts.
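The BD-rate figures quoted here follow the standard Bjøntegaard delta metric: fit log-bitrate as a cubic in PSNR for each codec, integrate both fits over the overlapping quality range, and report the average bitrate change at equal quality. A minimal NumPy sketch (illustrative, not the paper's evaluation code; the sample RD points are made up):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta-rate: average % bitrate change of `test` vs `anchor`
    at equal quality, via cubic fits of log-rate against PSNR."""
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))   # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a, int_t = np.polyint(p_a), np.polyint(p_t)
    avg_diff = ((np.polyval(int_t, hi) - np.polyval(int_t, lo))
                - (np.polyval(int_a, hi) - np.polyval(int_a, lo))) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

psnr = np.array([30.0, 33.0, 36.0, 39.0])
rate = np.array([0.1, 0.2, 0.4, 0.8])            # bits per pixel (made-up points)
# A codec that needs 2% less rate at every quality level scores -2.0% BD-rate:
print(round(bd_rate(rate, psnr, 0.98 * rate, psnr), 2))  # → -2.0
```

A negative BD-rate means the test codec needs less bitrate than the anchor for the same quality, which is the sense in which C3 Adaptive's -2.0% beats VTM.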
(Figure 6)
Figure 6: Rate-distortion curve for CLIC2020 with C3 outmatching most neural and classical baselines at dramatically reduced complexity.
Qualitative comparisons reveal that C3 eliminates artifacts observed in prior single-instance methods at similar bitrates.
(Figure 7)
Figure 7: Visual artifact comparison between C3 (top) and COOL-CHICv2 (bottom) for a CLIC2020 image.
Video Compression
On the UVG-1k benchmark, C3 matches the performance of VCT with under 0.1% of its MACs/pixel. It consistently outperforms FFNeRV and, though trailing HiNeRV and MIMT at the highest RD levels, maintains extreme efficiency. Encoding times are substantial per patch due to iterative optimization, but can be amortized in use cases where decoding dominates (e.g., streaming platforms).
(Figure 8)
Figure 8: BD-rate versus MACs/pixel on UVG, displaying C3's orders-of-magnitude advantage in decoding complexity among competitive codecs.
Practical Considerations and Implementation Guidance
- Resource Requirements: Decoding is feasible even on constrained hardware; encoding remains slow but could be parallelized or accelerated via meta-learning or better initialization.
- Deployment: Ideal for scenarios where encoding is infrequent but decoding latency or power is critical (e.g., mobile, edge, streaming).
- Scalability: Patchwise fitting enables handling of large images/videos; custom entropy masks and adaptive architecture sweeps maintain scalability as instance size grows.
- Limitations: Autoregressive entropy coding is serial, limiting hardware parallelism; encoding cost precludes real-time applications unless amortized.
- Potential Extensions: Faster encoding via meta-learning, parallel entropy models, partially sharing models across similar instances, or improved non-autoregressive entropy models could further boost practicality.
Conclusion
C3 demonstrates that overfitting small, sample-specific neural fields can deliver class-leading rate-distortion performance without general-purpose decoders, essentially bridging the gap between neural and classical low-complexity codecs for both images and videos. The methodology is robustly validated via ablations and extensive benchmarking. Future directions include accelerating encoding, further reducing serial dependencies in decoding, and exploring model sharing across instances for further compression gains.
(Figure 4)
Figure 4: Sample video frames and computed keypoint displacement, justifying custom context size and mask learning in entropy modeling for fast-motion video patches.