COOL-CHIC: Hierarchical Neural Codecs
- COOL-CHIC is a framework of overfitted, coordinate-based neural codecs designed for efficient image and video compression using hierarchical latent representations.
- It employs a lightweight, instance-adaptive decoder with auto-regressive entropy coding to achieve competitive rate–distortion performance and high perceptual quality.
- The video codec extension integrates advanced motion compensation and block-based filtering, substantially reducing decoding complexity while narrowing the BD-rate gap to conventional codecs.
The COOL-CHIC framework refers to a family of overfitted, coordinate-based, hierarchical neural codecs initially developed for learned image compression and later extended to video coding applications. Distinguished by their low decoder complexity, competitive rate–distortion performance, and instance-adaptive, neural network-based methodology, these codecs integrate hierarchical latent representations with lightweight decoders, efficient entropy coding, and, for video, advanced motion compensation paradigms. The following sections analyze the central design principles, algorithmic components, rate–distortion optimization strategies, architectural and efficiency trade-offs, adaptation for perceptual coding, and recent innovations for video compression.
1. Architecture and Core Methodology
COOL-CHIC departs from conventional autoencoder-based codecs by implementing a coordinate-based neural representation (CNR) overfitted to each image. The codec comprises several key modules:
- Hierarchical Latent Representation: For each image, a latent pyramid is constructed across multiple spatial resolutions (e.g., seven from full resolution down to 1/64 scale). These latent channels are upsampled (bicubic interpolation or learnable upsampling kernels such as 8×8 kernels) and concatenated to form a dense representation.
- Overfitted Decoder: A lightweight neural decoder (usually a small MLP, sometimes followed by residual convolutions) is tailored for each image through direct optimization. Decoder parameters and latent variables are transmitted in the bitstream, serving dual roles as both the reconstruction mechanism and part of the compressed code.
- Entropy Coding: An auto-regressive probability model, typically a small neural network, predicts the mean and variance for latent values conditioned on a local context (e.g., 24 nearest preceding latents for images; extended contexts for video). Quantized latents and decoder weights are entropy-coded via arithmetic range coding based on Laplace or parameterized distributions.
- Decoder Complexity: Decoder architectures are engineered for efficiency, spanning configurations from ultra-light (≈300 MAC/pixel) to richer designs (≈2200 MAC/pixel). Even the least complex variant outperforms established codecs such as HEVC (Blard et al., 18 Mar 2024).
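To make the decoding path concrete, here is a minimal numpy sketch of a latent pyramid being upsampled, concatenated, and mapped to RGB by a tiny MLP. All shapes, the nearest-neighbour upsampler (standing in for the bicubic or learned 8×8 kernels described above), and the random weights are illustrative, not the actual COOL-CHIC implementation.

```python
import numpy as np

def upsample_nearest(latent, target_hw):
    """Upsample a single-channel latent to the target resolution.
    COOL-CHIC uses bicubic or learned 8x8 kernels; nearest-neighbour
    is used here only to keep the sketch dependency-free."""
    h, w = latent.shape
    H, W = target_hw
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return latent[np.ix_(rows, cols)]

def decode(latents, w1, b1, w2, b2):
    """Upsample each pyramid level, concatenate per pixel, then run a
    tiny two-layer ReLU MLP to predict RGB. Shapes are illustrative."""
    H, W = latents[0].shape  # full-resolution level defines output size
    dense = np.stack([upsample_nearest(l, (H, W)) for l in latents], axis=-1)
    x = dense.reshape(-1, dense.shape[-1])   # (H*W, n_levels)
    h = np.maximum(x @ w1 + b1, 0.0)         # hidden layer, ReLU
    rgb = np.maximum(h @ w2 + b2, 0.0)       # 3 output channels
    return rgb.reshape(H, W, 3)

# Toy 3-level pyramid: full, 1/2, and 1/4 resolution.
rng = np.random.default_rng(0)
latents = [rng.standard_normal((16, 16)),
           rng.standard_normal((8, 8)),
           rng.standard_normal((4, 4))]
w1, b1 = rng.standard_normal((3, 40)) * 0.1, np.zeros(40)
w2, b2 = rng.standard_normal((40, 3)) * 0.1, np.zeros(3)
print(decode(latents, w1, b1, w2, b2).shape)  # (16, 16, 3)
```

The per-pixel cost of such a decoder is dominated by the two matrix products, which is why hidden sizes stay small in the configurations above.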
2. Rate–Distortion Optimization and Mathematical Formulation
The primary optimization target is the Lagrangian rate–distortion objective:

$$ \mathcal{L} = D(\mathbf{x}, \hat{\mathbf{x}}) + \lambda R $$

where $D$ is a distortion metric (typically MSE and, for perceptual tuning, combined with MS-SSIM), $\lambda$ is the rate–distortion multiplier, and $R$ represents the bit cost for encoding latent codes (and, occasionally, model parameters).
Distortion is measured in both pixel-based (MSE, PSNR) and perceptual (MS-SSIM, VMAF) metrics; for perceptual tuning the two are combined as

$$ D = \mathrm{MSE}(\mathbf{x}, \hat{\mathbf{x}}) + \beta \, \bigl(1 - \text{MS-SSIM}(\mathbf{x}, \hat{\mathbf{x}})\bigr) $$

with the weighting $\beta$ typically set empirically.
Latent variables and decoder weights are quantized; during training, quantization is initially approximated by adding noise (uniform or Kumaraswamy-sampled) and, for improved optimization stability, by soft rounding with a temperature-annealed function. Post-training, hard quantization and entropy coding are employed (Kim et al., 2023).
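The objective and the train/test quantization mismatch can be sketched in a few lines. This is a toy illustration, assuming MSE distortion, rate given in bits, and plain uniform noise as the training-time proxy (the Kumaraswamy-sampled noise and temperature-annealed soft rounding mentioned above are refinements of this idea).

```python
import numpy as np

def rd_loss(x, x_hat, rate_bits, lam, n_pixels):
    """Lagrangian objective L = D + lambda * R, with MSE distortion
    and the rate expressed in bits per pixel."""
    mse = np.mean((x - x_hat) ** 2)
    bpp = rate_bits / n_pixels
    return mse + lam * bpp

def noisy_quantize(latent, rng):
    """Training-time proxy: additive uniform noise in [-0.5, 0.5)
    stands in for rounding while keeping gradients informative."""
    return latent + rng.uniform(-0.5, 0.5, size=latent.shape)

def hard_quantize(latent):
    """Inference-time quantization: plain rounding before entropy coding."""
    return np.round(latent)

rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 8))
train_latent = noisy_quantize(latent, rng)   # used while optimizing
test_latent = hard_quantize(latent)          # used in the real bitstream
print(np.max(np.abs(train_latent - latent)) < 0.5)  # True: noise stays within half a step
```

In practice both the latents and the decoder weights pass through this proxy during overfitting, so the rate term sees a differentiable surrogate of the entropy coder.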
3. Hierarchical Latent Design and Lightweight Neural Processing
The hierarchical latent scheme structures a multi-scale representation analogous to classical wavelet decompositions. Each resolution channel captures distinct spatial-frequency content; upsampling and concatenation ensure the decoder receives a rich, multi-scale signal. Typical decoder flows:
- [Latent Pyramid] → [Upsampling Cascade] → [First MLP (ReLU, 40 features)] → [Second MLP (3 outputs, ReLU)] → [Residual 3×3 Convolutions] → [Final RGB output].
Decoder complexity is minimized through limited depth, small hidden sizes, and selective use of convolutional layers. Optimization zeroes out many latent values, further sparsifying the bitstream.
Auto-regressive entropy decoders parallelize decoding across channels, using context neural networks to estimate distribution parameters for each latent code (Ladune et al., 4 Jan 2024).
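A rough sketch of how such a context model estimates the rate: each latent's distribution parameters are predicted from already-decoded neighbours, and the bit cost is the negative log probability of its quantization bin. The 4-latent raster-scan context and the linear "ARM" below are simplifications of the 24-neighbour neural context described above; the Laplace bin-probability computation follows the standard form.

```python
import numpy as np

def causal_context(grid, i, j, n=4):
    """The n latents decoded before (i, j) in raster order, zero-padded
    at the image start. COOL-CHIC's ARM uses up to 24 neighbours; n=4
    keeps the sketch short."""
    flat = grid.flatten()
    idx = i * grid.shape[1] + j
    ctx = flat[max(0, idx - n):idx]
    return np.pad(ctx, (n - len(ctx), 0))

def laplace_bits(y, mu, b):
    """Code length (bits) of a quantized value y under Laplace(mu, b):
    -log2 of the probability mass on the bin [y-0.5, y+0.5)."""
    def cdf(v):
        return np.where(v < mu, 0.5 * np.exp((v - mu) / b),
                        1.0 - 0.5 * np.exp(-(v - mu) / b))
    p = np.clip(cdf(y + 0.5) - cdf(y - 0.5), 1e-12, 1.0)
    return -np.log2(p)

def estimate_rate(grid, w_mu, w_b):
    """Sum the estimated bits over all latents, each conditioned on its
    causal context through a linear 'ARM' (illustrative weights)."""
    total = 0.0
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            ctx = causal_context(grid, i, j)
            mu = ctx @ w_mu
            b = np.exp(ctx @ w_b)  # scale must stay positive
            total += float(laplace_bits(grid[i, j], mu, b))
    return total

rng = np.random.default_rng(0)
grid = np.round(rng.standard_normal((8, 8)))  # quantized latent channel
bits = estimate_rate(grid, w_mu=np.full(4, 0.1), w_b=np.zeros(4))
print(bits > 0.0)  # True
```

Because the context only contains previously decoded latents, the same computation can run at decode time, which is what makes the range coder's probability tables reproducible.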
4. Perceptual Coding: Metrics and Loss
To improve subjective image quality, perceptual metrics are directly incorporated:
- MS-SSIM, which penalizes loss of multi-scale structural similarity, is combined with MSE in the loss.
- VMAF guides bitstream selection: for a target bpp, the coding level maximizing the minimum VMAF score across the dataset is chosen.
This perceptual tuning ensures subjective fidelity for images across a bitrate spectrum, with performance competitive to AVIF and HEVC per VMAF, PSNR, and MS-SSIM evaluations (Ladune et al., 4 Jan 2024).
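The VMAF-guided selection rule can be written as a one-line maximin choice. The sketch below assumes VMAF scores are precomputed externally (e.g., with the libvmaf tool); the level names, scores, and bitrates are invented for illustration.

```python
def pick_coding_level(vmaf_scores, bpp_per_level, target_bpp):
    """Among coding levels whose bitrate fits the target bpp, pick the
    one maximizing the worst-case (minimum) VMAF across the dataset."""
    feasible = [lvl for lvl, bpp in bpp_per_level.items() if bpp <= target_bpp]
    return max(feasible, key=lambda lvl: min(vmaf_scores[lvl]))

# Illustrative numbers: 3 coding levels, VMAF of 4 images each.
vmaf_scores = {0: [72, 70, 75, 71], 1: [85, 83, 88, 80], 2: [93, 91, 95, 90]}
bpp_per_level = {0: 0.10, 1: 0.25, 2: 0.60}
print(pick_coding_level(vmaf_scores, bpp_per_level, target_bpp=0.30))  # 1
```

Maximizing the minimum score, rather than the mean, guards against individual images with poor subjective quality slipping through at a given bitrate.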
5. Complexity Reduction and Practical Implementations
Encoding complexity, determined by iterations of gradient descent (“overfitting”), can be substantial. The authors propose:
- Shortened training: Reducing the number of optimization iterations lowers computation substantially, by a factor of up to 1000× when overfitting is skipped altogether in favor of a single-shot analysis transform.
- Non-overfitted Cool-chic variant: An additional analysis network generates latents in a single forward pass; rate–distortion minimization proceeds via ensemble expectation.
Decoder architectures are further streamlined; switching to binary arithmetic coding and implementing ARM modules in highly optimized CPU instructions (AVX2) enables near real-time decoding (∼100 ms per image) for even the more complex models, and ∼50 ms for the ARM module (Blard et al., 18 Mar 2024). This approach makes deployment on conventional CPUs and mobile devices tractable.
| Decoder Configuration | MAC/pixel | RD Performance |
|---|---|---|
| Tiny (ARM only) | 300 | Surpasses HEVC |
| Standard | 2300 | Competitive with VVC |
6. Video Codec Extension: Motion Compensation and Efficiency
The extension to video incorporates advanced sub-pixel motion compensation:
- Interpolation Filters: Instead of simple bilinear interpolation (2-tap), Cool-chic uses longer N-tap filters (e.g., 8-tap, windowed sinc-based) as in HEVC/VVC.
- Block-Based Motion: Fractional motion vectors are quantized (e.g., 64 fractional values), shared over pixel blocks (e.g., B×B, with B=4), enabling reuse of filtering computations and lowering complexity.
- 2D Filtering Cost: Block-wise filtering lowers motion-handling decoding complexity from 391 MAC/pixel (pixel-wise, bilinear) to 214 MAC/pixel (block-wise, 8-tap) (Ladune et al., 29 Jul 2025).
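The block-based scheme above can be sketched as follows: the fractional part of the motion vector is quantized to 1/64, one interpolation filter is built per block, and that filter is reused for all B×B pixels, which is where the MAC/pixel savings (391 → 214 in the figures above) come from. This toy version filters horizontally only and uses a float windowed sinc; real HEVC/VVC-style codecs use fixed integer filter tables and separable 2D filtering.

```python
import numpy as np

def sinc_filter(frac, taps=8):
    """Windowed-sinc interpolation filter for a fractional phase in
    [0, 1). Float version for illustration; codecs use integer tables."""
    n = np.arange(taps) - taps // 2 + 1     # tap positions around 0
    h = np.sinc(n - frac) * np.hanning(taps)
    return h / h.sum()                       # normalize DC gain

def compensate_block(ref, top, left, mv_x, B=4, taps=8):
    """Horizontal-only compensation of one BxB block. The fractional
    motion is quantized to 1/64 and the filter is built once per block,
    so its construction cost is shared over B*B pixels."""
    frac = round((mv_x % 1.0) * 64) / 64.0   # quantized sub-pixel phase
    base = int(np.floor(mv_x))               # integer displacement
    h = sinc_filter(frac, taps)
    pad = taps // 2
    out = np.zeros((B, B))
    for r in range(B):
        for c in range(B):
            x = left + c + base
            window = ref[top + r, x - pad + 1 : x - pad + 1 + taps]
            out[r, c] = window @ h
    return out

rng = np.random.default_rng(0)
ref = rng.standard_normal((16, 16))
blk = compensate_block(ref, top=6, left=6, mv_x=0.0)
print(np.allclose(blk, ref[6:10, 6:10]))  # True: integer MV reproduces the reference
```

With pixel-wise motion, a fresh filter (or a bilinear blend) would be evaluated per pixel; sharing one quantized-phase filter per block trades a negligible motion-field granularity loss for a large drop in per-pixel filtering work.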
Compression efficiency improves markedly: these changes reduce Cool-chic's BD-rate gap versus HEVC from 41.9% to 14.3%. All algorithmic contributions are released as open source (Ladune et al., 29 Jul 2025).
7. Evaluation and Benchmarks
COOL-CHIC demonstrates competitive performance versus established codecs across several benchmarks:
- Image coding: Compression performance is on par with standards like HEVC and VVC. Even the 300 MAC/pixel configuration outperforms HEVC in rate–distortion (Blard et al., 18 Mar 2024).
- Perceptual quality: Maintains high VMAF and MS-SSIM scores at practical bitrates.
- Video: BD-rate reductions of over 10%, with decoder complexity comparable to or lower than that of conventional codecs thanks to block-based, quantized motion compensation (Ladune et al., 29 Jul 2025).
A plausible implication is that the COOL-CHIC family of codecs can provide competitive compression for both images and video with minimal hardware requirements, supporting practical deployment on consumer and mobile platforms. Ongoing developments focus on entropy coding refinement, optimized perceptual metrics, and architectural advances for further reductions in bitrate and complexity.