AVS-EEM: End-to-End Intelligent Video Coding Model

Updated 7 February 2026

The paper presents a dual-branch framework that combines neural motion estimation with residual coding to improve rate–distortion efficiency.
It uses an end-to-end training strategy with hierarchical quality optimization, is evaluated under the AVS3 common test conditions, and keeps computational cost low.
Experimental results show BD-rate reductions over the conventional AVS3 reference software (–4.14% on average for luma), indicating practical potential for real-world deployment.

The AVS End-to-End Intelligent Video Coding Exploration Model (AVS-EEM) is a standardization project initiated by the AVS video coding working group to investigate and advance end-to-end learned video compression for real-world deployment. Developed from mid-2023 onward, AVS-EEM is distinguished by its focus on practical constraints (specifically, low computational complexity and strict compliance with the AVS3 common test conditions) and by its iterative integration of proposals from the broader research community. The framework adopts a modular architecture with conditional coding branches for motion and residuals, both optimized end-to-end, enabling substantial improvements in rate–distortion efficiency within a deployable computational budget. Recent results indicate that AVS-EEM surpasses the conventional AVS3 reference software in compression efficiency under identical testing conditions (Sheng et al., 31 Jan 2026).

1. Motivations and Design Principles

AVS-EEM addresses stagnating coding gains from traditional hybrid codecs (e.g., AVS3), where further improvements require significant complexity for marginal rate–distortion (R–D) benefit. By leveraging neural end-to-end learning, AVS-EEM seeks to jointly optimize motion estimation, prediction, entropy coding, and reconstruction, overcoming the limitations of hand-crafted coding tools.

Key design targets include:

Encoder complexity ≤ 300 KMAC/pixel, decoder ≤ 200 KMAC/pixel.
Strict adherence to AVS3 common test conditions (CTC).
Two-branch structure separating motion and residual coding, maintaining the classical motion–residual paradigm for interpretability and modularity.
Standard-style, iterative development: progressing from version 0.1 (Sept 2023) to version 9.2 (Jan 2026) via community-driven architectural and algorithmic proposals.

2. Model Architecture and Modules

AVS-EEM v9.2 utilizes a conditional coding framework with discrete processes for motion and residual information:

Motion Branch: Motion estimation is performed with an optical flow network (FastFlowNet), pre-trained using PWC-Net and subsequently fine-tuned for reduced complexity. The motion field $v_t = \mathrm{FastFlowNet}(x_t, x_{t-1})$ is then transformed, quantized, and compressed using a hyper-prior entropy model based on Gaussian assumptions. Feature-domain group-wise alignment, which splits the reference features and the motion flow into groups, refines motion compensation and the construction of temporal contexts.
Residual Branch: Residuals are encoded and modeled via stride-2 convolutions and residual blocks, with latent variables subjected to checkerboard autoregressive entropy coding for efficient bit allocation. Multi-scale temporal contexts aid residual encoding and are refined using confidence maps.
Frame Synthesis: The decoder reconstructs video frames by fusing coded residuals and temporal contexts within a U-Net equipped with sub-pixel convolutions, guided by learned confidence maps.
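
The two entropy-coding ideas named above can be illustrated with a small sketch. It assumes the discretized-Gaussian likelihood commonly used with hyper-prior models and a two-pass checkerboard ordering in which positions with even `i + j` are coded first. Neither detail is specified in the text for AVS-EEM, and the function names (`estimated_bits`, `checkerboard_split`) and the example values are hypothetical.

```python
import math

def gaussian_cdf(x, mean, scale):
    """CDF of a normal distribution N(mean, scale^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))

def estimated_bits(symbol, mean, scale, eps=1e-9):
    """Estimated bits for an integer-quantized latent under a discretized Gaussian.

    The probability mass is the Gaussian integrated over [symbol-0.5, symbol+0.5].
    """
    p = gaussian_cdf(symbol + 0.5, mean, scale) - gaussian_cdf(symbol - 0.5, mean, scale)
    return -math.log2(max(p, eps))  # eps guards against log2(0)

def checkerboard_split(height, width):
    """Split a latent grid into anchor and non-anchor positions.

    Assumption: anchors are the positions where (i + j) is even. Anchors are
    coded first; non-anchors can then use the decoded anchors as context.
    """
    anchors, non_anchors = [], []
    for i in range(height):
        for j in range(width):
            (anchors if (i + j) % 2 == 0 else non_anchors).append((i, j))
    return anchors, non_anchors

# Hypothetical latents as (quantized value, predicted mean, predicted scale).
latents = [(0, 0.1, 0.5), (3, 2.6, 1.0), (-2, 0.0, 1.5)]
total_bits = sum(estimated_bits(s, m, sc) for s, m, sc in latents)
anchors, non_anchors = checkerboard_split(4, 4)
```

A latent whose predicted scale is small and whose value sits near the predicted mean costs almost nothing, which is why an accurate entropy model translates directly into bitrate savings.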

Process Overview

Stage	Description	Complexity (KMAC/pixel)
Motion Estimation	FastFlowNet, domain optical flow	27.44
Motion Encoder/Decoder	Stride-2 convolutions, residual, sub-pixel blocks	44.56 / 32.32
Motion Compensator	Group-wise feature alignment, context mining	36.10
Residual Encoder/Decoder	Stride-2 convolutions, context injection	24.77 / 36.10
Entropy Model Enc/Dec	Hyper-prior modeling, quantization	~5 each

Total pipeline: encoder 294.6 KMAC/pixel, decoder 175.1 KMAC/pixel (cf. MPAI-EEV ≈ 3127 KMAC/pixel) (Sheng et al., 31 Jan 2026).
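
To put the totals in perspective, the arithmetic below checks them against the design targets from Section 1 and converts the per-pixel cost into a per-frame cost for a 1920×1080 frame. This is only a sketch built from the figures quoted in the text; the helper names are hypothetical, and the ratio to MPAI-EEV is taken against the encoder total as an assumption, since the text does not say which side the MPAI-EEV figure refers to.

```python
# Figures quoted in the text, all in KMAC/pixel.
ENCODER_KMAC = 294.6
DECODER_KMAC = 175.1
ENCODER_BUDGET = 300.0
DECODER_BUDGET = 200.0
MPAI_EEV_KMAC = 3127.0

def within_budget(cost, budget):
    """True when a per-pixel cost does not exceed its target."""
    return cost <= budget

def gmac_per_frame(kmac_per_pixel, width, height):
    """Convert a per-pixel cost in KMAC into GMAC for one frame."""
    return kmac_per_pixel * 1e3 * width * height / 1e9

encoder_ok = within_budget(ENCODER_KMAC, ENCODER_BUDGET)
decoder_ok = within_budget(DECODER_KMAC, DECODER_BUDGET)

# Per-frame cost at 1080p: roughly 611 GMAC to encode and 363 GMAC to decode.
encoder_1080p = gmac_per_frame(ENCODER_KMAC, 1920, 1080)
decoder_1080p = gmac_per_frame(DECODER_KMAC, 1920, 1080)

# Assumption: comparing MPAI-EEV against the AVS-EEM encoder total.
eev_ratio = MPAI_EEV_KMAC / ENCODER_KMAC
```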

3. Rate–Distortion Optimization and Training Workflow

The end-to-end optimization minimizes a per-frame rate–distortion loss $L_t = R_t + \lambda D_t$, where $R_t = R_t^m + R_t^r$ is the bit cost of motion and residuals (including their hyper-priors) and $D_t = \|x_t - \hat x_t\|_2^2$ is the distortion between the input frame $x_t$ and its reconstruction $\hat x_t$. The multi-stage training strategy includes:

Motion-only: Optimize motion coding using $L_1 = R_t^m + \lambda\,D_t^m$.
Residual-only: Optimize residual coding with $L_2 = R_t^r + \lambda\,D_t^r$.
Joint P-frame: Training with combined motion and residual loss.
Hierarchical quality: Introduces cyclic weighting for varying frame qualities across groups.
Multi-frame cascaded: Sequence-based training to simulate coding drift.
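
The loss and the last two stages can be sketched as follows. The per-frame term follows $L_t = R_t + \lambda D_t$ from the text. How the hierarchical quality weights enter the loss is not stated there, so this sketch assumes a cyclic weight $w_t$ that scales the distortion term, and the weight values used in the example are hypothetical.

```python
def squared_error(x, x_hat):
    """Squared L2 distance between a frame and its reconstruction (flat lists)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def frame_loss(rate_motion, rate_residual, distortion, lam, weight=1.0):
    """Per-frame loss R_t + lambda * w_t * D_t with R_t = R_t^m + R_t^r.

    Assumption: the hierarchical quality weight w_t multiplies the distortion term.
    """
    return rate_motion + rate_residual + lam * weight * distortion

def cascaded_loss(frames, lam, cycle_weights):
    """Average loss over a sequence of frames.

    `frames` holds (rate_motion, rate_residual, distortion) per frame, and the
    weights repeat cyclically so frame quality varies periodically.
    """
    total = 0.0
    for t, (rate_m, rate_r, dist) in enumerate(frames):
        weight = cycle_weights[t % len(cycle_weights)]
        total += frame_loss(rate_m, rate_r, dist, lam, weight)
    return total / len(frames)

# Hypothetical 4-frame cycle: one high-quality frame followed by cheaper ones.
example_weights = [1.2, 0.5, 0.9, 0.5]
example_frames = [(0.02, 0.10, 0.004)] * 8
example_loss = cascaded_loss(example_frames, lam=256.0, cycle_weights=example_weights)
```

In actual cascaded training each frame's distortion depends on the previously reconstructed frame, so errors accumulate along the sequence; the sketch only shows how the per-frame terms are weighted and averaged.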

Training employs both short-span (Vimeo-90k, 7 frames) and long-sequence datasets (BVI-DVC, 64 frames; 30-frame 360p clips). Augmentation involves I-frame switching and repeated P-frame compressions to simulate drift and variable reference quality. Heavy-to-light flow training pre-trains with a heavier model (PWC-Net), then distills to the lightweight FastFlowNet to balance accuracy and complexity (Sheng et al., 31 Jan 2026).

4. Inference-Time Complexity Reduction Techniques

To obey stringent complexity budgets during inference, multiple optimizations are used:

Adaptive Downsampling for Motion Estimation: Motion estimation (ME) is performed at a lower resolution when the resulting PSNR drop is insignificant, using the criterion $\mathrm{PSNR}_d > \mathrm{PSNR}_o + \theta$, where $\mathrm{PSNR}_d$ and $\mathrm{PSNR}_o$ are the PSNR values with downsampled and original-resolution ME and $\theta$ is a threshold.
Adaptive Skipping in Entropy Coding: Latent variables whose estimated variance satisfies $\sigma^2 < \eta_t$ skip entropy coding and are replaced by their mean; the threshold $\eta_t$ is adjusted dynamically.
Decoded Feature Refresh: Cached features ($\hat F^v, \hat m, \hat y$) are cleared periodically to prevent feature drift over long sequences.
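
The three decisions above reduce to simple tests, sketched below. The function names, the example threshold values, and the refresh period are hypothetical. The downsampling test uses the inequality exactly as given in the text, so a negative $\theta$ corresponds to tolerating a small PSNR drop.

```python
def use_downsampled_me(psnr_down, psnr_orig, theta):
    """Choose low-resolution ME when PSNR_d > PSNR_o + theta."""
    return psnr_down > psnr_orig + theta

def apply_skip(latents, means, variances, eta):
    """Skip entropy coding for latents whose estimated variance is below eta.

    Returns the (index, value) pairs that still need coding and the values the
    decoder ends up with: skipped latents are replaced by their mean.
    """
    coded, reconstructed = [], []
    for i, (y, mu, var) in enumerate(zip(latents, means, variances)):
        if var < eta:
            reconstructed.append(mu)   # nothing is transmitted for this latent
        else:
            coded.append((i, y))
            reconstructed.append(y)
    return coded, reconstructed

def should_refresh(frame_index, period):
    """Clear cached decoder features every `period` frames (period is hypothetical)."""
    return frame_index > 0 and frame_index % period == 0

coded, recon = apply_skip([1.0, 3.0], [0.2, 2.5], [0.01, 1.0], eta=0.05)
```

Because the decoder derives the same mean and variance from the hyper-prior, it can make the same skip decision without any extra signalling, provided the threshold is known on both sides.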

These optimizations enable real-time operation within target budgets, facilitating deployment scenarios otherwise unmanageable for neural codecs (Sheng et al., 31 Jan 2026).

5. Experimental Results and Component Ablations

Testing under AVS3 CTC (low-delay-P, multi-rate, YUV420, compound PSNR 6:1:1, anchor: HPM-15.1) demonstrates:

Average BD-Rate Reductions (v9.2 vs. AVS3 HPM-15.1):
- Y (luma): –4.14%
- U (chroma): –9.58%
- V (chroma): –24.72%
Resolution-wise Results:
- 4K: Y=–3.58%, U=+28.6%, V=+2.73%
- 1080p: Y=–4.22%, U=–42.24%, V=–41.13%
- 720p: Y=–4.61%, U=–15.10%, V=–35.75%
Selected Sequence (Tango2, 4K): Y=–19.46%, U=–59.11%, V=–47.74%.
Ablation Studies: Key contributors to R–D improvement include content and motion feature conditioning (–15.9% Y), temporal context mining with confidence maps (–10.2% Y), and hierarchical quality training (–13.7% Y) (Sheng et al., 31 Jan 2026).
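
As a consistency check on the figures above, the reported averages match an unweighted mean of the three resolution classes. This is an observation from the numbers in the text, not a statement of how the averages were actually computed; the variable names are hypothetical.

```python
# Per-resolution BD-rate changes (%) quoted in the text.
per_resolution = {
    "4K":    {"Y": -3.58, "U": 28.6,   "V": 2.73},
    "1080p": {"Y": -4.22, "U": -42.24, "V": -41.13},
    "720p":  {"Y": -4.61, "U": -15.10, "V": -35.75},
}

def class_mean(component):
    """Unweighted mean of one colour component over the resolution classes."""
    values = [results[component] for results in per_resolution.values()]
    return sum(values) / len(values)

averages = {c: round(class_mean(c), 2) for c in ("Y", "U", "V")}

# Averages reported in the text.
reported = {"Y": -4.14, "U": -9.58, "V": -24.72}
matches = all(abs(averages[c] - reported[c]) < 1e-9 for c in reported)
```

The check also makes visible that the average chroma gains are driven by the 1080p and 720p classes, while 4K chroma shows a BD-rate increase.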

Performance has improved from v0.1 (+201% Y BD-rate) to v9.2 (–4.1% Y BD-rate relative to AVS3), demonstrating iterative gains as architectural and training innovations were added.

6. Standardization Roadmap and Future Directions

The future development of AVS-EEM involves:

Complexity Reduction: Further architectural pruning, variable-rate training, and fixed-point quantization to lower computational cost.
Random Access Support: Exploration of B-frame (bi-directional prediction) coding strategies.
Perceptual Optimization: Integration of GANs, knowledge distillation, and diffusion priors for perceptual quality at low bitrates.
System Integration: Advancement toward real-time, hardware-friendly implementations, including GPU and ASIC prototypes.

AVS-EEM thus constitutes a milestone in the application of end-to-end learning for standard-grade, low-complexity, and high-performance video coding under realistic operational constraints (Sheng et al., 31 Jan 2026).


References (1)

Recent Advances of End-to-End Video Coding Technologies for AVS Standard Development (2026)
