
Video Coding for Machines

Updated 13 December 2025
  • Video Coding for Machines (VCM) is a paradigm that optimizes encoding for machine vision tasks by replacing traditional perceptual metrics with task-specific objectives such as mAP and MOTA.
  • VCM integrates both pixel- and feature-domain approaches with scalable, layered bitstream designs to efficiently transmit semantic content for applications like surveillance and smart cities.
  • Recent advances in VCM include hybrid pixel-feature codecs and end-to-end learnable models that achieve significant BD-rate reductions while maintaining high task performance.

Video Coding for Machines (VCM) encompasses algorithmic, architectural, and standardization advances in compressing and transmitting visual data to optimize the performance of automated machine-vision tasks, such as object detection, segmentation, tracking, and analytics, rather than for human visual consumption. In this paradigm, traditional perceptual-fidelity criteria are replaced or augmented by semantic or task-driven objectives, often resulting in fundamentally different codec designs, representations, and measurable rate–task trade-offs compared to human-oriented video coding. VCM integrates multiple research tracks—including feature coding, learning-based codecs, scalable bitstream design, and split inference pipelines—and is actively addressed in recent MPEG and MPEG-AI standardization activities.

1. Fundamental Concepts and Motivation

VCM arises from the divergence between classical video coding—optimized for preserving pixel fidelity and perceptual metrics such as PSNR, SSIM, or VMAF—and the requirements of machine vision systems, where only task-relevant information must be retained at the receiver. In conventional settings, visual signals are compressed, transmitted, and reconstructed for human viewing, and subsequent analytics pipelines are an afterthought, leading to bandwidth inefficiencies and degraded accuracy, especially at low bitrates. In contrast, VCM formulates the coding objective in terms of a rate–task metric, e.g., minimizing bit-rate $R$ while maximizing task accuracy $A$ (such as mAP for detection or MOTA for tracking), allowing a flexible trade-off expressed as $R + \lambda \cdot D_{\text{task}}$, where $D_{\text{task}}$ quantifies deviation from task-optimal performance (Fischer et al., 2022, Eimon et al., 11 Dec 2025).
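A minimal training-loss sketch of this objective in PyTorch, assuming a cross-entropy proxy for $D_{\text{task}}$ and an externally estimated bits-per-pixel rate (both are illustrative placeholders, not a standardized formulation):

```python
import torch
import torch.nn.functional as F

def rate_task_loss(bpp, task_logits, task_targets, lam=0.02):
    """Rate-task Lagrangian R + lambda * D_task.

    bpp          : estimated bits per pixel of the transmitted payload,
                   e.g., from a learned entropy model (placeholder input).
    task_logits  : downstream model outputs on the decoded signal.
    task_targets : task labels; cross-entropy stands in for D_task here.
    lam          : trade-off weight between rate and task distortion.
    """
    d_task = F.cross_entropy(task_logits, task_targets)
    return bpp + lam * d_task

# Toy usage with random tensors standing in for a real detection pipeline.
logits = torch.randn(8, 80)                      # e.g., 80 object classes
targets = torch.randint(0, 80, (8,))
loss = rate_task_loss(torch.tensor(0.15), logits, targets)
```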

VCM thus encompasses both pixel-domain coding (where coding tools are tailored to preserve semantic content) and feature-domain coding (where intermediate representations, such as neural tensor activations, are the compressed payload). Crucial use cases include surveillance, smart-city infrastructure, intelligent transportation, distributed sensor networks, and any application where machine-generated or machine-consumed video outpaces human viewing (Gao et al., 2021, Eimon et al., 11 Dec 2025).

2. Processing Pipelines, Formats, and Standards

VCM is structured around several canonical processing pipelines:

  1. Video-only Pixel-domain Coding: The encoder compresses video in a perceptual or slightly task-aware manner; the downstream task runs on reconstructed frames.
  2. Intermediate Feature Coding: The edge runs part of a DNN, producing compressed feature maps, which are transmitted and fed to a cloud-based task network (a minimal sketch follows this list).
  3. Hybrid or Scalable Layered Coding: The bitstream is partitioned into a base layer (optimized for machines, often carrying features or semantic descriptors) and one or more enhancement layers (enabling human reconstruction or higher-fidelity analytics) (Xia et al., 2020, Hadizadeh et al., 2023).
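The intermediate-feature pipeline (item 2) can be illustrated with a minimal PyTorch sketch; the split point, layer sizes, and the uniform 8-bit quantizer standing in for the feature codec are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical split of a small CNN: the "edge" half produces the feature
# tensor that would be quantized, entropy-coded, and transmitted; the
# "cloud" half consumes the dequantized tensor and runs the task head.
edge_net = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)
cloud_net = nn.Sequential(
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),  # toy task head
)

frame = torch.randn(1, 3, 224, 224)     # one input frame
features = edge_net(frame)              # intermediate activations at the split

# Stand-in for the feature codec: uniform 8-bit min-max quantization.
lo, hi = features.min(), features.max()
q = torch.round((features - lo) / (hi - lo) * 255).to(torch.uint8)  # "payload"
dequantized = q.float() / 255 * (hi - lo) + lo                      # decoder side

logits = cloud_net(dequantized)         # cloud-side task inference
```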

Recent standardization efforts formalize these pipelines:

  • MPEG VCM (Part 2) focuses on pixel-domain, rate–accuracy–optimized coding pipelines, inserting pre-processing (e.g., temporal/spatial/ROI sampling, bit-depth truncation) to eliminate redundancies invisible to the downstream model (Eimon et al., 11 Dec 2025).
  • MPEG-AI FCM (Part 4) specifies feature-domain codecs, in which high-dimensional neural activations are quantized, statically or adaptively packed, and compressed with modified VVC or specialized tools (Eimon et al., 9 Dec 2025, Eimon et al., 11 Dec 2025).

BD-Rate (Bjøntegaard-Delta Rate) remains the core metric for quantifying bitrate savings at iso-task-accuracy, with mAP, MOTA, or other task scores replacing conventional distortion measures (Fischer et al., 2022, Eimon et al., 11 Dec 2025).
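A compact sketch of that computation, following the standard Bjøntegaard procedure with a task score substituted for PSNR (reference common-test-condition scripts may differ in interpolation details, e.g., piecewise-cubic fitting):

```python
import numpy as np

def bd_rate(rates_anchor, acc_anchor, rates_test, acc_test):
    """BD-rate with a task metric (e.g., mAP) in place of PSNR.

    Fits log-rate as a cubic polynomial of accuracy for both codecs
    (four rate points each, as in common test conditions), integrates
    the gap over the overlapping accuracy range, and returns the average
    bitrate difference in percent (negative = the test codec saves bits).
    """
    log_a, log_t = np.log(rates_anchor), np.log(rates_test)
    poly_a = np.polyfit(acc_anchor, log_a, 3)
    poly_t = np.polyfit(acc_test, log_t, 3)
    lo = max(min(acc_anchor), min(acc_test))
    hi = min(max(acc_anchor), max(acc_test))
    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100
```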

3. Codec Architectures and Algorithmic Advances

VCM codec research develops along several major directions:

  • Hybrid Pixel + Feature Coding: Solutions such as NN-VVC combine a learned image codec for key frames with a VVC (or HEVC) inter-frame codec, using task-aligned loss functions (self-supervised proxy loss, perceptual loss) and adapters to suppress artifacts and tune references for both pixel and feature optimization. These approaches achieve up to –43.2% BD-rate for images and –26.8% for video against VVC, measured by mAP/MOTA (Ahonen et al., 19 Jan 2024).
  • Pure Feature Coding: Split-inference and feature-coding frameworks (e.g., CompressAI-Vision, FCM) extract intermediate representations from DNN backbones (e.g., FPN layers, Darknet-53), prune and quantize them, and compress them via pseudo-video formats targeted to the feature statistics (a simplified packing sketch follows this list). Packing, quantization, and entropy coding are fine-tuned for sparse, task-specific activations rather than perceptual redundancy. Bitrate reductions of 90–97% over remote inference (pixel streaming) have been reported, with task accuracy close to that of the unsplit model preserved (Eimon et al., 11 Dec 2025).
  • End-to-End Learnable and Task-Harmonized Codecs: Many VCM models replace hand-crafted module boundaries with end-to-end differentiable architectures, where rate–task Lagrangians are implemented with feature-matching, adversarial (GAN), or direct task-losses. For instance, multi-scale feature compressors interleave fusion and encoding across FPN levels to eliminate redundancy and show an order-of-magnitude BD-rate reduction over VVC encoded features (Kim et al., 2023).
  • Unified Semantic Compression: New frameworks enable the codec to align bidirectionally with frozen or trainable visual backbones (e.g., Swin Transformers). Symmetric entropy-constrained coding, as in SEC-VCM, imposes multi-scale alignment between codec-feature-space and foundation-model feature spaces, promoting transferability and multi-task readiness. Dual-path fusion modules inject pixel-level detail where needed, compensating for overly semantic compression to harmonize downstream task performance across detection, segmentation, and tracking (Sun et al., 17 Oct 2025).
  • Scalable and Layered Representations: Several systems implement physically split or functionally layered bitstreams: a minimal base layer for analytics (edge maps, sparse motion descriptors) plus one or more enhancement layers for partial or full signal reconstruction. This design pattern generalizes from images to video, enabling variable service profiles and fine-grained bitrate control (Xia et al., 2020, Hadizadeh et al., 2023, Hu et al., 2020).
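As a simplified stand-in for the packing step referenced in the feature-coding item above, the sketch below quantizes one activation tensor with a min-max quantizer and tiles its channels into a monochrome frame that a conventional encoder could treat as pseudo-video; the 10-bit depth and near-square tiling are illustrative choices, not the standardized layout:

```python
import numpy as np

def pack_features(tensor, bitdepth=10):
    """Quantize a (C, H, W) activation tensor and tile its channels into a
    single monochrome frame, a simplified stand-in for FCM-style packing
    before the result is fed to a VVC/HEVC encoder as pseudo-video.
    """
    c, h, w = tensor.shape
    lo, hi = tensor.min(), tensor.max()
    scale = (2 ** bitdepth - 1) / max(hi - lo, 1e-9)   # min-max quantizer
    q = np.round((tensor - lo) * scale).astype(np.uint16)
    cols = int(np.ceil(np.sqrt(c)))                    # near-square channel grid
    rows = int(np.ceil(c / cols))
    frame = np.zeros((rows * h, cols * w), dtype=np.uint16)
    for i in range(c):
        r, col = divmod(i, cols)
        frame[r * h:(r + 1) * h, col * w:(col + 1) * w] = q[i]
    return frame, (lo, hi)   # side information needed for dequantization

# Toy usage: a 256-channel FPN-like tensor packed into one 10-bit frame.
features = np.random.randn(256, 34, 60).astype(np.float32)
frame, side_info = pack_features(features)   # frame shape: (16*34, 16*60)
```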

4. Evaluation Metrics, Datasets, and Benchmarks

VCM necessitates new evaluation protocols:

  • Task-driven Metrics: Standard detection (mAP@[0.5:0.95]), segmentation (mean IoU), tracking (MOTA), and action-pose metrics replace or complement pixel fidelity.
  • Datasets: Large annotated datasets such as TVD (Tencent Video Dataset, 86 sequences at 4K, partial annotations), SFU-HW, OpenImages V6, and Cityscapes provide a rigorous basis for establishing anchor results. The MPEG VCM group defines common test conditions, fixed task-models (Faster R-CNN, Mask R-CNN, JDE), and benchmarking pipelines (Xu et al., 2021, Eimon et al., 11 Dec 2025, Gao et al., 2021).
  • Pseudo-GT Evaluation: In response to limited annotated video, pseudo-ground-truth evaluation measures task performance using a model’s own inference on uncompressed frames as the reference (sketched after this list). This yields BD-rate results with <0.7 pp error relative to true ground truth at mid rates, enabling large-scale VCM benchmarking without full human annotation (Fischer et al., 2022).
  • Semantic Perceptual Metrics: Metrics such as the Satisfied Machine Ratio (SMR), which aggregates the proportion of models in a library whose predictions or detections remain consistent above a threshold, offer a statistical, model-agnostic rate–quality curve and serve as a new QP-selection target (Zhang et al., 2022).
  • Just Recognizable Difference for Machines: New learning-based models, such as DT-JRD, predict the minimum quantization or distortion at which model performance degrades, allowing content-adaptive QP selection and up to ~30% bitrate reduction with maintained detection accuracy (Liu et al., 14 Nov 2024).
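The pseudo-ground-truth protocol reduces to a few lines; in the sketch below, `detector` and `evaluate_map` are hypothetical stand-ins for a fixed task model (e.g., Faster R-CNN) and a COCO-style mAP evaluator:

```python
def pseudo_gt_map(detector, frames_uncompressed, frames_decoded, evaluate_map):
    """Score a codec against the task model's own output on pristine frames.

    No human annotation is used: detections on uncompressed frames act as
    the reference, and detections on decoded frames are evaluated against
    them, yielding one point of the rate-accuracy curve.
    """
    pseudo_gt = [detector(f) for f in frames_uncompressed]   # reference labels
    predictions = [detector(f) for f in frames_decoded]      # codec-output detections
    return evaluate_map(predictions, pseudo_gt)
```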

5. Toolchain, Coding Profiles, and Interoperability

A substantial body of work addresses adapting legacy codecs (VVC, HEVC, AVC) to the feature-coding use case:

  • Profile Optimization: Ablation and tool-level analyses revealed that many VVC tools (advanced motion, affine, loop filters, rare intra modes) are rarely utilized by RDO on feature “videos.” Removing or restricting such tools yields “Fast,” “Faster,” and “Fastest” VVC profiles, delivering up to –2.96% BD-rate (i.e., additional bitrate savings) and up to 95.6% encoding-time reduction with <0.5 pp accuracy loss on detection/tracking (Eimon et al., 9 Dec 2025).
  • Interoperability: Successful VCM approaches, such as NN-VVC, preserve full bitstream compatibility with standard VVC decoders (through re-packing or flattening), facilitating standardization and deployment in hybrid or transitional scenarios (Ahonen et al., 19 Jan 2024). FCM systems demonstrate that HEVC and VVC are nearly equivalent for feature compression, with <2% BD-rate difference (an illustrative comparison follows this list), enabling broad hardware and ecosystem support without accuracy penalties (Eimon et al., 11 Dec 2025).
  • Split Inference and Privacy: FCM accelerates and secures distributed analytics by never transmitting raw pixels, instead streaming obfuscated activation maps. Compute offload is facilitated by edge/cloud division along the network split, supporting progressive analytics and privacy preservation (Eimon et al., 11 Dec 2025).
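To make the HEVC/VVC comparison concrete, the `bd_rate` helper sketched in Section 2 can be applied to each codec’s rate–accuracy points; the operating points below are illustrative placeholders, not measured results:

```python
# Reuses bd_rate() from the Section 2 sketch. Rates (kbps) and mAP values
# are hypothetical placeholders for illustration only.
rates_hevc = [120, 240, 480, 960]
map_hevc   = [38.1, 41.0, 42.9, 43.8]
rates_vvc  = [112, 228, 462, 930]
map_vvc    = [38.2, 41.1, 43.0, 43.9]

delta = bd_rate(rates_hevc, map_hevc, rates_vvc, map_vvc)
print(f"VVC vs. HEVC as inner codec: {delta:+.2f}% BD-rate at iso-mAP")
```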

6. Open Challenges and Future Directions

Multiple open questions persist in the VCM domain:

  • Multi-task and Foundation Model Support: Generalizing codecs for multi-task pipelines (e.g., supporting detection and segmentation with the same bitstream), and exploiting foundation models and their semantics as universal feature spaces for unified compression (Sun et al., 17 Oct 2025, Yang et al., 2021).
  • Semantic Distortion Modeling: Quantitative metrics for semantic or task distortion, enabling rate–distortion optimization over a set of downstream models or over future model upgrades (Eimon et al., 11 Dec 2025, Zhang et al., 2022). Universal distortion metrics that span human- and machine-centric profiles remain an open research problem.
  • Scalability and Adaptivity: Layered bitstreams and scalable codecs that adapt dynamically to requested use (analytics-only, hybrid, human review), and support variable and content-adaptive bitrate provisioning (Hu et al., 2020, Hadizadeh et al., 2023).
  • Standardization: Ongoing MPEG and MPEG-AI initiatives are defining interoperable bitstream syntax, encoding tools, reference software, and objective evaluation protocols for VCM and FCM (Gao et al., 2021, Eimon et al., 11 Dec 2025, Eimon et al., 9 Dec 2025).
  • Efficiency vs. Complexity: Trade-offs between compression efficiency, computational complexity, and inference latency, particularly for real-time, edge-deployed VCM systems; lightweight, parallelizable models are an area of active investigation (Eimon et al., 9 Dec 2025, Kim et al., 2023).
  • Privacy, Security, and Robustness: Ensuring feature bitstreams carry only task-essential content and resist leakage of sensitive details; robustifying codecs against domain shifts and adversarial attacks (Eimon et al., 11 Dec 2025, Zhang et al., 2022).

VCM thus defines a rapidly advancing research and standards track at the intersection of video coding, deep learning, and large-scale automated analytics, with its central goal being the efficient, scalable, and secure transport of visual information optimized for the demands of machine consumers.
