AVS-EEM: End-to-End Intelligent Video Coding Model
- The paper presents a dual-branch framework integrating neural motion estimation and residual coding to significantly enhance rate–distortion efficiency.
- It employs an end-to-end training strategy with hierarchical quality optimization, ensuring compliance with AVS3 common test conditions and low computational cost.
- Experimental results show notable BD-rate reductions over conventional AVS3 codecs, demonstrating the model’s practical potential for real-world deployment.
The AVS End-to-End Intelligent Video Coding Exploration Model (AVS-EEM) is a standardization project initiated by the AVS video coding working group to investigate and advance end-to-end learned video compression for real-world deployment. Developed from mid-2023 onward, AVS-EEM is distinguished by its focus on practical constraints—specifically, maintaining low computational complexity and strict compliance with the AVS3 common test conditions—and by its iterative integration of proposals from the broader research community. The framework adopts a modular architecture with conditional coding branches for motion and residuals, both optimized end-to-end, enabling substantial improvements in rate–distortion efficiency within a deployable computational budget. Recent results indicate that AVS-EEM surpasses conventional AVS3 reference software in compression efficiency under identical testing conditions (Sheng et al., 31 Jan 2026).
1. Motivations and Design Principles
AVS-EEM addresses stagnating coding gains from traditional hybrid codecs (e.g., AVS3), where further improvements require significant complexity for marginal rate–distortion (R–D) benefit. By leveraging neural end-to-end learning, AVS-EEM seeks to jointly optimize motion estimation, prediction, entropy coding, and reconstruction, overcoming the limitations of hand-crafted coding tools.
Key design targets include:
- Encoder complexity ≤ 300 KMAC/pixel, decoder ≤ 200 KMAC/pixel.
- Strict adherence to AVS3 common test conditions (CTC).
- Two-branch structure separating motion and residual coding, maintaining the classical motion–residual paradigm for interpretability and modularity.
- Standard-style, iterative development: progressing from version 0.1 (Sept 2023) to version 9.2 (Jan 2026) via community-driven architectural and algorithmic proposals.
2. Model Architecture and Modules
AVS-EEM v9.2 uses a conditional coding framework with separate branches for motion and residual information:
- Motion Branch: Motion estimation is performed with a lightweight optical flow network (FastFlowNet), obtained by first training with the heavier PWC-Net and then fine-tuning toward the lighter network to reduce complexity. The motion field is then transformed, quantized, and compressed using a hyper-prior entropy model based on Gaussian assumptions. Feature-domain group-wise alignment, which splits the reference features and the motion flow into groups, refines motion compensation and the creation of temporal contexts.
- Residual Branch: Residuals are encoded and modeled via stride-2 convolutions and residual blocks, with latent variables subjected to checkerboard autoregressive entropy coding for efficient bit allocation. Multi-scale temporal contexts aid residual encoding and are refined using confidence maps.
- Frame Synthesis: The decoder reconstructs video frames by fusing coded residuals and temporal contexts within a U-Net equipped with sub-pixel convolutions, guided by learned confidence maps.
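The two entropy-modeling ideas named above can be illustrated with a minimal sketch. It assumes the common formulation in which a quantized latent is charged the negative log-probability of its unit-width bin under a Gaussian, and it assumes a checkerboard convention in which positions with an even row-plus-column index are coded first. The function names and the anchor convention are illustrative and are not taken from the AVS-EEM software.

```python
import math


def gaussian_cdf(x: float) -> float:
    """Standard normal CDF, computed with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def estimated_bits(y_hat: int, mean: float, scale: float) -> float:
    """Estimated bits for one quantized latent under N(mean, scale^2).

    The probability mass is the Gaussian integrated over the
    quantization bin [y_hat - 0.5, y_hat + 0.5].
    """
    scale = max(scale, 1e-6)  # guard against division by zero
    upper = gaussian_cdf((y_hat + 0.5 - mean) / scale)
    lower = gaussian_cdf((y_hat - 0.5 - mean) / scale)
    prob = max(upper - lower, 1e-9)  # floor avoids log(0)
    return -math.log2(prob)


def checkerboard_split(height: int, width: int):
    """Split latent positions into two passes (assumed convention).

    Anchors are coded first from the hyper-prior alone; non-anchors
    are coded second and may also use the decoded anchors as context.
    """
    anchors, non_anchors = [], []
    for r in range(height):
        for c in range(width):
            (anchors if (r + c) % 2 == 0 else non_anchors).append((r, c))
    return anchors, non_anchors
```

Under this model a latent whose predicted scale is very small costs almost no bits, since nearly all probability mass falls inside a single bin.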
Process Overview
| Stage | Description | Complexity (KMAC/pixel) |
|---|---|---|
| Motion Estimation | FastFlowNet optical flow | 27.44 |
| Motion Encoder/Decoder | Stride-2 convolutions, residual, sub-pixel blocks | 44.56 / 32.32 |
| Motion Compensator | Group-wise feature alignment, context mining | 36.10 |
| Residual Encoder/Decoder | Stride-2 convolutions, context injection | 24.77 / 36.10 |
| Entropy Model Enc/Dec | Hyper-prior modeling, quantization | ~5 each |
Total pipeline: encoder 294.6 KMAC/pixel, decoder 175.1 KMAC/pixel (cf. MPAI-EEV ≈ 3127 KMAC/pixel) (Sheng et al., 31 Jan 2026).
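As a quick arithmetic check, the reported totals can be compared against the design targets from Section 1 and converted into a per-frame operation count. This is a sketch using only the figures quoted in this article; the helper names are illustrative, and the 1920x1080 frame size is an example input rather than a reported measurement.

```python
# Design targets and reported totals, both in KMAC/pixel.
BUDGET = {"encoder": 300.0, "decoder": 200.0}
REPORTED = {"encoder": 294.6, "decoder": 175.1}


def headroom(side: str) -> float:
    """Remaining budget in KMAC/pixel for 'encoder' or 'decoder'."""
    return BUDGET[side] - REPORTED[side]


def macs_per_frame(kmac_per_pixel: float, width: int, height: int) -> float:
    """Total multiply-accumulate operations for one frame."""
    return kmac_per_pixel * 1e3 * width * height


for side in ("encoder", "decoder"):
    gmac = macs_per_frame(REPORTED[side], 1920, 1080) / 1e9
    print(f"{side}: headroom {headroom(side):.1f} KMAC/pixel, "
          f"about {gmac:.1f} GMAC per 1080p frame")
```

Both totals sit inside their targets, with the encoder much closer to its limit than the decoder.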
3. Rate–Distortion Optimization and Training Workflow
The end-to-end optimization uses a composite rate–distortion loss per frame that combines the bit cost for motion and residuals (including hyper-priors) with the reconstruction distortion. The multi-stage training strategy includes:
- Motion-only: Optimize the motion coding branch on its own.
- Residual-only: Optimize the residual coding branch on its own.
- Joint P-frame: Training with combined motion and residual loss.
- Hierarchical quality: Introduces cyclic weighting for varying frame qualities across groups.
- Multi-frame cascaded: Sequence-based training to simulate coding drift.
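A minimal sketch of how such a loss and the later training stages might fit together is given below. It assumes the usual Lagrangian form, rate plus lambda times distortion, which the text does not spell out, and it uses a hypothetical four-frame weight cycle to stand in for hierarchical quality. The weight values, the function names, and the choice to apply the weight to the distortion term are all assumptions for illustration.

```python
# Hypothetical cyclic weights: one higher-quality frame per group of four.
HIER_WEIGHTS = (0.5, 1.2, 0.5, 0.9)


def quality_weight(frame_index: int, weights=HIER_WEIGHTS) -> float:
    """Cyclic per-frame weight used to vary the quality target."""
    return weights[frame_index % len(weights)]


def rd_loss(bits_motion: float, bits_residual: float, distortion: float,
            lam: float, weight: float = 1.0) -> float:
    """Assumed per-frame loss: total rate + lam * weight * distortion."""
    rate = bits_motion + bits_residual
    return rate + lam * weight * distortion


def sequence_loss(frames, lam: float) -> float:
    """Average loss over a cascaded sequence of P-frames.

    `frames` is a list of (bits_motion, bits_residual, distortion)
    tuples, one per frame, in coding order.
    """
    total = 0.0
    for i, (bm, br, d) in enumerate(frames):
        total += rd_loss(bm, br, d, lam, quality_weight(i))
    return total / len(frames)
```

In a real training loop each frame would be reconstructed from the previous decoded frame, so averaging the loss over the sequence exposes the model to its own accumulated error.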
Training employs both short-span (Vimeo-90k, 7 frames) and long-sequence datasets (BVI-DVC, 64 frames; 30-frame 360p clips). Augmentation involves I-frame switching and repeated P-frame compressions to simulate drift and variable reference quality. Heavy-to-light flow training pre-trains with a heavier model (PWC-Net), then distills to the lightweight FastFlowNet to balance accuracy and complexity (Sheng et al., 31 Jan 2026).
4. Inference-Time Complexity Reduction Techniques
To obey stringent complexity budgets during inference, multiple optimizations are used:
- Adaptive Downsampling for Motion Estimation: ME is performed at a lower resolution when the resulting PSNR drop is insignificant.
- Adaptive Skipping in Entropy Coding: Latent variables whose estimated variance falls below a dynamically chosen threshold skip entropy coding and are replaced by their predicted mean.
- Decoded Feature Refresh: Periodically clears cached features to prevent feature drift over long sequences.
These optimizations enable real-time operation within target budgets, facilitating deployment scenarios otherwise unmanageable for neural codecs (Sheng et al., 31 Jan 2026).
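The skipping and refresh ideas can be sketched as follows. The sketch assumes that the skip decision depends only on the predicted scale, which the decoder also knows, so no per-element flag has to be transmitted. The threshold value, the refresh period, and all function names are hypothetical, and how the threshold is chosen dynamically is not modeled here.

```python
def encode_latents(latents, means, scales, threshold):
    """Return (index, symbol) pairs for the latents that are actually coded."""
    coded = []
    for i, (y, mu, sigma) in enumerate(zip(latents, means, scales)):
        if sigma >= threshold:
            coded.append((i, round(y - mu)))  # quantized offset from the mean
    return coded


def decode_latents(coded, means, scales, threshold):
    """Rebuild latents; skipped positions fall back to the predicted mean."""
    symbols = dict(coded)
    out = []
    for i, (mu, sigma) in enumerate(zip(means, scales)):
        if sigma < threshold:
            out.append(mu)  # skipped: nothing was transmitted
        else:
            out.append(mu + symbols[i])
    return out


def should_refresh(frame_index: int, period: int) -> bool:
    """True when cached decoded features should be cleared (assumed policy)."""
    return frame_index > 0 and frame_index % period == 0
```

For example, with scales `[0.05, 1.0, 0.8]` and a threshold of `0.1`, only the second and third latents reach the entropy coder, and the first is reconstructed directly as its mean.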
5. Experimental Results and Component Ablations
Testing under AVS3 CTC (low-delay-P, multi-rate, YUV420, compound PSNR 6:1:1, anchor: HPM-15.1) demonstrates:
- Average BD-Rate Reductions (v9.2 vs. AVS3 HPM-15.1):
  - Y (luma): –4.14%
  - U (chroma): –9.58%
  - V (chroma): –24.72%
- Resolution-wise Results:
  - 4K: Y=–3.58%, U=+28.6%, V=+2.73%
  - 1080p: Y=–4.22%, U=–42.24%, V=–41.13%
  - 720p: Y=–4.61%, U=–15.10%, V=–35.75%
- Selected Sequence (Tango2, 4K): Y=–19.46%, U=–59.11%, V=–47.74%.
- Ablation Studies: Key contributors to R–D improvement include content and motion feature conditioning (–15.9% Y), temporal context mining with confidence maps (–10.2% Y), and hierarchical quality training (–13.7% Y) (Sheng et al., 31 Jan 2026).
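The overall averages quoted above are numerically consistent with an unweighted mean over the three resolution classes. The short check below reproduces that arithmetic from the figures in this section. Whether the source averages per class or per sequence is not stated, so the averaging rule here is an assumption.

```python
# BD-rate figures in percent, copied from the results above.
PER_RESOLUTION = {
    "4K":    {"Y": -3.58, "U": 28.60,  "V": 2.73},
    "1080p": {"Y": -4.22, "U": -42.24, "V": -41.13},
    "720p":  {"Y": -4.61, "U": -15.10, "V": -35.75},
}
REPORTED_AVERAGE = {"Y": -4.14, "U": -9.58, "V": -24.72}


def class_mean(component: str) -> float:
    """Unweighted mean of one component across resolution classes."""
    values = [row[component] for row in PER_RESOLUTION.values()]
    return sum(values) / len(values)


for comp in ("Y", "U", "V"):
    print(f"{comp}: mean {class_mean(comp):.2f}%, "
          f"reported {REPORTED_AVERAGE[comp]:.2f}%")
```

The breakdown also shows that the chroma averages are driven by the 1080p and 720p classes, while both chroma components regress at 4K.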
Performance has improved from v0.1 (+201% Y BD-rate) to v9.2 (–4.1% Y BD-rate relative to AVS3), demonstrating iterative gains as architectural and training innovations were added.
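For readers unfamiliar with the metric, BD-rate expresses the average bitrate difference between two codecs at equal quality, so +201% corresponds to roughly three times the anchor bitrate and –4.1% to roughly 0.96 times. The sketch below illustrates the idea with a simplified variant that interpolates log-rate linearly between measured points, whereas the standard Bjøntegaard calculation uses a cubic fit. It is not the tool used to produce the reported numbers. The compound PSNR helper assumes that 6:1:1 means a (6Y + U + V) / 8 weighting.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x outside interpolation range")


def _mean_log_rate(psnr, rate, lo, hi, steps=1000):
    """Average ln(rate) over the PSNR interval [lo, hi] (trapezoidal rule)."""
    pts = sorted(zip(psnr, rate))
    xs = [p for p, _ in pts]
    ys = [math.log(r) for _, r in pts]
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = min(lo + k * h, hi)  # clamp against floating-point overshoot
        w = 0.5 if k in (0, steps) else 1.0
        total += w * _interp(x, xs, ys)
    return total * h / (hi - lo)


def bd_rate(anchor_psnr, anchor_rate, test_psnr, test_rate):
    """Simplified BD-rate in percent; negative means the test codec saves bits."""
    lo = max(min(anchor_psnr), min(test_psnr))
    hi = min(max(anchor_psnr), max(test_psnr))
    diff = (_mean_log_rate(test_psnr, test_rate, lo, hi)
            - _mean_log_rate(anchor_psnr, anchor_rate, lo, hi))
    return (math.exp(diff) - 1.0) * 100.0


def compound_psnr(y, u, v):
    """Assumed 6:1:1 weighting of the Y, U and V PSNR values."""
    return (6.0 * y + u + v) / 8.0


def rate_ratio(bd_rate_percent):
    """Bitrate relative to the anchor implied by a BD-rate figure."""
    return 1.0 + bd_rate_percent / 100.0
```

A test codec that needs exactly 10% fewer bits than the anchor at every measured quality point yields a BD-rate of about –10% under this simplified calculation.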
6. Standardization Roadmap and Future Directions
The future development of AVS-EEM involves:
- Complexity Reduction: Further architectural pruning, variable-rate training, and fixed-point quantization to lower computational cost.
- Random Access Support: Exploration of B-frame (bi-directional prediction) coding strategies.
- Perceptual Optimization: Integration of GANs, knowledge distillation, and diffusion priors for perceptual quality at low bitrates.
- System Integration: Advancement toward real-time, hardware-friendly implementations, including GPU and ASIC prototypes.
AVS-EEM thus constitutes a milestone in the application of end-to-end learning for standard-grade, low-complexity, and high-performance video coding under realistic operational constraints (Sheng et al., 31 Jan 2026).