
MPEG-AI Feature Coding for Machines

Updated 7 March 2026
  • MPEG-AI Feature Coding for Machines (FCM) is a framework that compresses intermediate deep neural network features for machine-centric visual analytics.
  • It optimizes the rate–accuracy trade-off for tasks like detection, segmentation, and tracking using both classical codecs and advanced learned methods.
  • FCM employs a split-inference architecture with standardized bitstream syntax to ensure reproducible performance and interoperability across devices and cloud systems.

Feature Coding for Machines (FCM) is a standardized framework under the MPEG-AI umbrella, targeting the efficient compression and transmission of intermediate deep neural network (DNN) features for vision-centric applications in both edge–cloud collaboration and distributed inference. Unlike traditional visual coding standards focused on perceptual quality for humans, FCM is fundamentally designed for machine inference, optimizing the rate–accuracy trade-off by minimizing the overall bitstream required to preserve key performance metrics (e.g., mAP, MOTA) for downstream tasks such as detection, segmentation, and tracking. FCM provides a reproducible and interoperable infrastructure, leveraging both classical and learning-based compression methods, with a normative evaluation environment and bitstream syntax that systematically address the requirements of intelligent visual analytics.

1. Core Objectives and Rate–Accuracy Formulation

The central goal of MPEG-AI FCM is to optimize the transmission of visual information such that the total bit-rate R required to encode inference inputs—either raw pixels or intermediate network features—is minimized, subject to constraints on the downstream task performance. The standard formalizes this with the joint cost:

\min_{\theta,q}\;\mathbb{E}[R(\theta,q)]\;+\;\lambda\,\mathbb{E}[L_{\text{task}}(\hat y(F;\theta),\, y_{\text{gt}})]

where θ denotes the codec and network parameters, q the quantization step, F the pre-coded feature tensor, ŷ the task output from the inference network, and L_task the downstream loss (e.g., cross-entropy or detection loss) (Choi et al., 25 Sep 2025). Additional regularization can be incorporated to limit feature distortion:

L_{\text{feat}} = \|F - \hat{F}\|_2^2

yielding composite objectives that jointly minimize R, L_task, and L_feat as needed for specific deployment cases. This rate–accuracy paradigm distinguishes FCM from perceptually oriented schemes and anchors its evaluation metrics to task-level accuracy rather than pixel-domain fidelity.
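
As a toy illustration of this objective, the sketch below sweeps a handful of hypothetical operating points (the rates and losses are assumed numbers, not from the standard) and selects the one minimizing R + λ·L_task:

```python
import numpy as np

def fcm_cost(rate_bits, task_loss, feat_mse=0.0, lam=1.0, mu=0.0):
    """Composite FCM-style objective: R + lambda * L_task (+ mu * L_feat)."""
    return rate_bits + lam * task_loss + mu * feat_mse

# Hypothetical operating points: coarser quantization lowers the rate
# but raises the downstream task loss.
rates = np.array([8000.0, 4000.0, 2000.0, 1000.0])   # bits per frame
task_losses = np.array([0.10, 0.12, 0.20, 0.45])     # e.g. detection loss

costs = fcm_cost(rates, task_losses, lam=1e4)
best = int(np.argmin(costs))   # operating point with the lowest joint cost
```

Larger λ shifts the optimum toward higher-rate, lower-loss points; in practice λ is chosen per deployment case.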

2. System Architecture and Standardized Bitstream

FCM standardizes a split-inference architecture. Early layers (“NN-Part 1”) execute on the edge device, extracting feature tensors X ∈ ℝ^{C×H×W}. These features are processed through a reduction and packing pipeline:

  • Feature Reduction (FENet): Maps multiple feature layers to a low-dimensional tensor x_f using convolutional, residual, and attention modules, with optional per-channel gain scaling.
  • Packing and Quantization: Channels are tiled into 2D images and linearly quantized (typically 10 bits/sample). Auxiliary parameters (min, max, gain, shape) are side-coded.
  • Entropy Coding (Inner Codec): Quantized frames are compressed using existing or new video codecs (VVC, HEVC, AVC, or machine-optimized VCM-RS), producing a compliant bitstream.
  • Metadata and Bitstream Syntax: All parameters necessary for tensor restoration (e.g., statistics, activity maps, pruning masks) are signaled in dedicated Network Abstraction Layer (NAL) units, preceding the main payload (Eimon et al., 10 Dec 2025, Eimon et al., 11 Dec 2025, Eimon et al., 11 Dec 2025).

The server decodes, unpacks, reconstructs, and applies a restoration network (DRNet) to recover features, handing them to the back-end network for inference.
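
A minimal sketch of the packing and quantization step, assuming a simple raster tiling of channels and min/max side information (the normative FCTM layout and parameter syntax may differ):

```python
import numpy as np

def pack_and_quantize(feat, bits=10):
    """Tile a C x H x W feature tensor into one 2D frame and linearly
    quantize to `bits` per sample; min/max are returned as side info."""
    c, h, w = feat.shape
    cols = int(np.ceil(np.sqrt(c)))
    rows = int(np.ceil(c / cols))
    frame = np.zeros((rows * h, cols * w), dtype=feat.dtype)
    for i in range(c):
        r, col = divmod(i, cols)
        frame[r * h:(r + 1) * h, col * w:(col + 1) * w] = feat[i]
    fmin, fmax = float(frame.min()), float(frame.max())
    q = np.round((frame - fmin) / (fmax - fmin) * (2**bits - 1))
    return q.astype(np.uint16), (fmin, fmax, feat.shape)

def dequantize_and_unpack(q, side, bits=10):
    """Invert the linear quantization and tiling using the side info."""
    fmin, fmax, (c, h, w) = side
    frame = q.astype(np.float32) / (2**bits - 1) * (fmax - fmin) + fmin
    cols = q.shape[1] // w
    tiles = [frame[(i // cols) * h:(i // cols + 1) * h,
                   (i % cols) * w:(i % cols + 1) * w] for i in range(c)]
    return np.stack(tiles)

feat = np.random.default_rng(0).standard_normal((8, 4, 4)).astype(np.float32)
q, side = pack_and_quantize(feat)
recon = dequantize_and_unpack(q, side)
```

The round-trip error is bounded by half a quantization step; the packed 2D frame is what the inner codec actually compresses.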

3. Coding Tools and Advanced Compression Techniques

The FCM ecosystem encompasses both conventional and learned compression tools, with prominent techniques including:

  • Classical Codecs: Established standards such as VVC, HEVC, and AVC, serving as inner codecs following feature packing and quantization.
  • Learned Compressors: Architectures employing hyperprior models, joint autoregressive-hyperprior structures, and end-to-end differentiable encoders/decoders (e.g., as in CompressAI).
  • Multi-Scale Feature Fusion: Fusion and compression schemes that interleave encoding and feature combination to exploit cross-scale redundancy (e.g., Kim et al., L-MSFC) and achieve significant BD-rate reductions compared to VVC-based anchors (Kim et al., 2023).
  • Channel Truncation and Packing: Methods that adaptively prune low-importance channels and re-pack remaining channels with minimal overhead, yielding substantial rate savings at fixed task accuracy (Merlos et al., 11 Dec 2025).
  • Global Statistics Preservation: Z-score normalization and re-coloring at the decoder, with per-tensor or aggregated mean/std signaling, enable near-lossless restoration of feature statistics and up to 66% additional bitrate savings for tracking (Eimon et al., 10 Dec 2025).
  • Bit Allocation and Importance Modeling: Algorithms such as MFIBA permit dynamic, per-scale rate allocation based on predicted feature-task importance, supporting adaptive optimization of multi-scale representations (Liu et al., 25 Mar 2025).
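
The global-statistics-preservation idea above can be sketched as a per-tensor Z-score re-coloring at the decoder (a simplified version; the standardized tool signals mean/std as side information and supports aggregated variants):

```python
import numpy as np

def recolor(decoded, mean, std, eps=1e-8):
    """Re-normalize the decoded tensor to zero mean / unit std, then
    re-color it with the mean/std signaled from the encoder side."""
    z = (decoded - decoded.mean()) / (decoded.std() + eps)
    return z * std + mean

rng = np.random.default_rng(1)
feat = rng.standard_normal((16, 8, 8)) * 2.5 + 1.0        # original stats
decoded = feat + rng.standard_normal(feat.shape) * 0.3    # codec distortion
restored = recolor(decoded, feat.mean(), feat.std())
```

After re-coloring, the restored tensor matches the original first- and second-order statistics almost exactly, which is what drives the reported tracking-rate savings.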

These toolsets are systematically benchmarked using the CompressAI-Vision platform, which supports full pipeline specification, reproducibility, and deployment-level evaluation (Choi et al., 25 Sep 2025).

4. Evaluation Methodology and Empirical Performance

Evaluation of FCM schemes employs standardized datasets and task metrics:

  • Datasets: OpenImages, FLIR, SFU-HW-Obj, TVD, HiEVE.
  • Tasks: Instance segmentation (COCO mAP), object detection (mAP), multi-object tracking (MOTA), pose estimation.
  • Metrics: Rate–accuracy curves (accuracy vs. bpp), Bjøntegaard Delta-Bitrate (BD-Rate) computed at equal accuracy relative to various baselines (local, remote, VCM-RS).
  • Scenarios: Local (full inference on device), remote (pixel transmission), split (intermediate feature coding) (Choi et al., 25 Sep 2025, Eimon et al., 10 Dec 2025).
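
BD-Rate at equal accuracy follows the usual Bjøntegaard construction: fit log-rate as a polynomial in the accuracy metric for anchor and test codecs, integrate the gap over the overlapping accuracy range, and convert back to a percentage. A compact sketch with the common cubic fit over four rate points:

```python
import numpy as np

def bd_rate(rates_anchor, acc_anchor, rates_test, acc_test):
    """Bjontegaard delta-rate: average log-rate difference at equal
    accuracy, via cubic fits integrated over the overlapping accuracy
    interval; returned as a percentage (negative = rate savings)."""
    pa = np.polyfit(acc_anchor, np.log(rates_anchor), 3)
    pt = np.polyfit(acc_test, np.log(rates_test), 3)
    lo = max(min(acc_anchor), min(acc_test))
    hi = min(max(acc_anchor), max(acc_test))
    ia = np.polyval(np.polyint(pa), [lo, hi])
    it = np.polyval(np.polyint(pt), [lo, hi])
    avg_diff = ((it[1] - it[0]) - (ia[1] - ia[0])) / (hi - lo)
    return float((np.exp(avg_diff) - 1.0) * 100.0)

# Sanity check with hypothetical numbers: a codec that uses exactly half
# the anchor's rate at every accuracy should report -50%.
acc = [0.30, 0.40, 0.50, 0.60]
anchor = [1.0, 2.0, 4.0, 8.0]          # bpp (illustrative)
halved = [r / 2 for r in anchor]
delta = bd_rate(anchor, acc, halved, acc)
```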

Selected results demonstrate the impact:

Codec/Method             Dataset/Task                       BD-Rate Savings
FCM (VTM anchor)         SFU-HW-Obj (det., RA+e2e)          –79.35%
FCM (overall)            All tasks/datasets vs. remote      –75.90%
L-MSFC (learned)         Object detection (OpenImagesV6)    –93.95%
Z-score preservation     HiEve-1080p (tracking)             –65.69%
MFIBA+ELIC (adaptive)    COCO detection                     –38.20%

In general, FCM approaches deliver 75–95% bitrate reductions at equivalent mAP or MOTA relative to pixel-based transmission, with negligible accuracy loss. Across these pipelines, decoder complexity is consistently lower than that of the back-end network, while encoder overhead (especially for learned or hybrid pipelines) is nontrivial and remains an active area for hardware and architectural optimization (Eimon et al., 11 Dec 2025, Eimon et al., 10 Dec 2025).

5. Specialized Codec Profiles, Standardization, and Interoperability

Emerging MPEG-AI FCM profiles exploit insights from coding tool ablations:

  • Task-Centric Codecs: VVC profiles (“Fast”, “Faster”, “Fastest”) disable perceptual tools (e.g., in-loop filters, certain intra/inter-prediction modes) to improve throughput or reduce bitrate, confirming that components targeting human perception are suboptimal for intermediate feature coding (Eimon et al., 9 Dec 2025).
  • Hardware Backward Compatibility: HEVC and AVC hardware can be deployed without significant loss in task performance, with HEVC achieving near-identical results to VVC (BD-Rate +1.39% for detection/tracking) but with widespread hardware availability (Eimon et al., 11 Dec 2025).
  • Normative Test Models and Reference Pipelines: The Feature Coding Test Model (FCTM) delineates standard-compliant codecs and benchmarking setups, defining bitstream syntax, feature reduction modules, and quantization interfaces, enabling any FCM-compliant encoder/decoder to interoperate irrespective of the underlying DNN or split point (Eimon et al., 11 Dec 2025, Choi et al., 25 Sep 2025).
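
As a toy illustration of side-information signaling (a hypothetical fixed layout, not the normative FCTM/NAL syntax), a 15-byte header could carry the tensor shape, bit depth, and quantization range ahead of the packed-feature payload:

```python
import struct

# Hypothetical header layout: channels, height, width (uint16 each),
# bit depth (uint8), then min/max statistics (float32 each), big-endian.
_FMT = ">HHHBff"  # 2+2+2+1+4+4 = 15 bytes

def write_header(shape, bits, fmin, fmax):
    """Serialize tensor-restoration parameters ahead of the payload."""
    c, h, w = shape
    return struct.pack(_FMT, c, h, w, bits, fmin, fmax)

def read_header(buf):
    """Parse the parameters back out; the payload follows at offset 15."""
    c, h, w, bits, fmin, fmax = struct.unpack(_FMT, buf[:15])
    return (c, h, w), bits, fmin, fmax

hdr = write_header((256, 32, 32), 10, -3.5, 4.25)
shape, bits, fmin, fmax = read_header(hdr)
```

Any decoder that understands the header layout can restore the tensor regardless of which encoder produced it, which is the interoperability property the standardized syntax provides.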

Interoperability across devices and the cloud is a cornerstone, simplifying deployment in large-scale, heterogeneous environments (e.g., smart cities) and facilitating collaborative intelligence, privacy preservation (intermediate features obfuscate the original appearance), and resource-adaptive compute offloading (Eimon et al., 11 Dec 2025, Eimon et al., 11 Dec 2025).

6. Open Challenges, Research Directions, and Extension Points

Key areas of ongoing and future research within MPEG-AI FCM include:

  • Optimal Split Point and Feature Selection: Information-theoretic analysis shows that coding features from deeper DNN layers yields better rate–distortion performance at a fixed task accuracy; practical bitstream syntax should signal the feature layer used to ensure correct receiver-side integration (Harell et al., 2022).
  • End-to-End Learnability and Task Loss Integration: Fully differentiable pipelines (L-MSFC) suggest improved codec adaptation to feature statistics and cross-scale dependencies. Co-optimizing for downstream loss, perceptual, or adversarial objectives is feasible (Kim et al., 2023).
  • Advanced Bit Allocation: Incorporating per-scale or per-instance feature importance models allows even finer rate–accuracy trade-offs, especially in object-centric or multi-task scenarios (Liu et al., 25 Mar 2025).
  • Standardization of Feature Preservation Tools: Wrapper-aware RDO, input-dependent squared error metrics, and related side-information (e.g., Jacobian sketches, importance maps) offer encoder-side, block-level optimization without requiring decoder modifications, and can be seamlessly integrated as optional enhancement layers (Fernández-Menduiña et al., 3 Apr 2025, Fernández-Menduiña et al., 29 Jan 2026).
  • Beyond Gaussian Feature Models: Exploration of more complex feature statistics (e.g., higher-moment re-coloring) and dynamic refresh signaling can further close the rate–accuracy gap, especially at extreme compression rates (Eimon et al., 10 Dec 2025).
  • Human–Machine Scalable Coding: Two-layer and multi-layer architectures (base feature + enhancement for human visual quality) provide a unified framework for hybrid machine/human consumption (Hu et al., 2020, Xia et al., 2020).

FCM continues to evolve as a flexible, extensible standard, aligning codec development with the needs of machine-driven vision at massive scale, and enabling future integration of generative reconstruction, multi-task/ROI flexible streams, and bio-inspired sensor event coding within a single interoperable protocol.


References:

  • Choi et al., 25 Sep 2025
  • Eimon et al., 10 Dec 2025
  • Eimon et al., 11 Dec 2025
  • Merlos et al., 11 Dec 2025
  • Kim et al., 2023
  • Eimon et al., 9 Dec 2025
  • Liu et al., 25 Mar 2025
  • Harell et al., 2022
  • Fernández-Menduiña et al., 29 Jan 2026
  • Fernández-Menduiña et al., 3 Apr 2025
  • Hu et al., 2020
  • Xia et al., 2020
