
ROI-Packing: Efficient Region-Based Compression

Updated 17 December 2025
  • ROI-Packing is an image compression framework that identifies and packs regions of interest to reduce bitrate without compromising task accuracy.
  • It employs ROI detection, padding, convex hull merging, and greedy bin-packing to efficiently encode key subregions for detection and segmentation.
  • Empirical evaluations show up to 44.10% bitrate reduction and improved mAP, all without retraining existing vision models.

ROI-Packing is an efficient region-based image compression framework designed for machine vision contexts where downstream task performance is of primary importance. The approach uses explicit identification and selection of Regions of Interest (ROIs) in an input image, packs these regions into a compact representation, and leverages standard codecs for transmission, enabling substantial reductions in bitrate with minimal loss—or in some cases improvement—in downstream task accuracy. ROI-Packing operates without the need to retrain or fine-tune end-task models and supports both detection and segmentation pipelines (Eimon et al., 10 Dec 2025).

1. Pipeline Structure and Operation

ROI-Packing comprises two main modules: an encoder (typically on an edge device) and a decoder (often deployed server-side). The process begins with ROI extraction from the input image $X$, targeting regions critical to the target visual task (e.g., object detection, instance segmentation). The key steps are as follows:

  1. ROI Detection: A region detector (e.g., YOLOv7) outputs bounding boxes $B = \{b_1, \ldots, b_n\}$.
  2. Region Padding: Each detected box $b_i$ is expanded by $p$ pixels (with $p = 15$ as the default).
  3. Overlap Merging: The convex hull is constructed over all corners of the padded boxes, then aligned to a grid matching the encoder’s coding units (e.g., $16 \times 16$).
  4. Slicing and Merging: The aligned convex hull is split into rectangular sub-boxes by slicing at polygon vertices; adjacent slices are merged to maximize rectangle sizes.
  5. Region Scaling: Less critical regions (background, objects below a size or importance threshold) are optionally downscaled.
  6. Packing: All sub-boxes are bin-packed into a single packed frame of size $(W_{\text{bin}}, H_{\text{bin}})$ using a greedy best-fit heuristic.
  7. Encoding and Transmission: The packed image is converted to YUV, encoded with a standard All-Intra profile (e.g., VVC), and multiplexed with metadata (positions, sizes, scales) for transmission.

The decoder performs the inverse steps: after standard decoding, it unpacks the received rectangles, optionally rescales subregions, pastes them into the original image canvas, fills background with zeros, and passes the result to the target vision model.

2. Mathematical Formulation

2.1 ROI Representation and Merging

Each detected ROI $R_i$ is described by top-left coordinates $(x_i, y_i)$ and dimensions $w_i, h_i$, giving four corner points

$$v_1^i = (x_i, y_i), \quad v_2^i = (x_i + w_i, y_i), \quad v_3^i = (x_i, y_i + h_i), \quad v_4^i = (x_i + w_i, y_i + h_i),$$

with $P = \bigcup_{i=1}^n \{v_1^i, \ldots, v_4^i\}$. The convex hull $CH(P)$ produces the minimal enclosing polygon, which is then grid-aligned. Rectangular sub-boxes are constructed by greedy slicing to maximize packing efficiency.
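The corner collection, hull construction, and grid alignment above can be sketched in plain Python. This is an illustrative sketch, not the paper's implementation: the helper names are invented, and `convex_hull` is Andrew's monotone chain.

```python
def roi_corners(boxes, pad=15):
    """Collect the four corners v_1..v_4 of each padded box (x, y, w, h)."""
    pts = []
    for x, y, w, h in boxes:
        x0, y0 = x - pad, y - pad
        x1, y1 = x + w + pad, y + h + pad
        pts += [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    return pts

def convex_hull(points):
    """Andrew's monotone-chain convex hull, O(n log n), counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def snap_to_grid(p, g=16, up=False):
    """Align a point to the encoder's g x g coding-unit grid."""
    f = (lambda v: -(-v // g) * g) if up else (lambda v: (v // g) * g)
    return (f(p[0]), f(p[1]))
```

Snapping hull vertices down for the top-left and up for the bottom-right of each sub-box guarantees the resulting rectangles land on coding-unit boundaries.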

2.2 Packing Optimization

With $m$ final rectangles $R_j = (w_j, h_j)$, each placed at position $(x_j, y_j)$, the packing must satisfy

$$x_j + w_j \leq W_{\text{bin}}, \quad y_j + h_j \leq H_{\text{bin}}, \quad \text{and} \quad R_i \cap R_j = \emptyset \ \ \forall i \neq j.$$

A best-fit bin-packing heuristic is used to minimize the unused area:

$$S_{\text{unused}} = W_{\text{bin}} \cdot H_{\text{bin}} - \sum_{\text{placed } k} w_k h_k.$$
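The source does not spell out the exact greedy heuristic; one plausible reading is a shelf-based best-fit placement, sketched below under that assumption (the function name and shelf layout are illustrative):

```python
def best_fit_pack(rects, W_bin, H_bin):
    """Greedy best-fit shelf packing: place each rectangle (largest area
    first) on the shelf whose remaining width leaves the least slack."""
    shelves = []   # each shelf: [y, height, used_width]
    placed = []    # (x, y, w, h) positions inside the bin
    next_y = 0
    for w, h in sorted(rects, key=lambda r: r[0] * r[1], reverse=True):
        best, best_slack = None, None
        for s in shelves:
            if h <= s[1] and s[2] + w <= W_bin:
                slack = W_bin - (s[2] + w)
                if best_slack is None or slack < best_slack:
                    best, best_slack = s, slack
        if best is None:                      # no shelf fits: open a new one
            if next_y + h > H_bin:
                raise ValueError("bin too small for remaining rectangles")
            best = [next_y, h, 0]
            shelves.append(best)
            next_y += h
        placed.append((best[2], best[0], w, h))
        best[2] += w
    unused = W_bin * H_bin - sum(w * h for _, _, w, h in placed)
    return placed, unused
```

Shelf packing trades some unused area for O(m·shelves) placement cost and trivially non-overlapping output, consistent with the quadratic bound quoted in Section 5.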

2.3 Rate–Distortion-Accuracy Trade-off

While not explicitly Lagrangian, the optimization can be framed as

$$\min R_{\text{total}} \quad \text{subject to} \quad D_{\text{task}}(\hat{Y}, X) \leq D_{\max},$$

or, equivalently,

$$\min D_{\text{task}} + \lambda R_{\text{total}},$$

where $R_{\text{total}}$ is the total bitrate (including metadata) and $D_{\text{task}}$ is an application-specific distortion, typically the accuracy drop in downstream tasks (e.g., mAP). Empirical metrics reported include BD-Rate (bitrate change at constant accuracy) and BD-mAP (accuracy change at constant bitrate).
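BD-Rate compares two rate–accuracy curves by the standard Bjøntegaard procedure: fit a cubic polynomial of log-rate against the quality metric (here mAP in place of PSNR) and integrate the difference over the overlapping metric range. A minimal NumPy sketch, assuming that standard procedure:

```python
import numpy as np

def bd_rate(rates_anchor, map_anchor, rates_test, map_test):
    """BD-Rate: average percent bitrate difference at equal accuracy.
    Negative values mean the test codec saves bitrate."""
    lr_a = np.log(np.asarray(rates_anchor, dtype=float))
    lr_t = np.log(np.asarray(rates_test, dtype=float))
    p_a = np.polyfit(map_anchor, lr_a, 3)     # log(rate) as cubic in mAP
    p_t = np.polyfit(map_test, lr_t, 3)
    lo = max(min(map_anchor), min(map_test))  # overlapping mAP interval
    hi = min(max(map_anchor), max(map_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```

BD-mAP is the dual computation: fit mAP as a function of log-rate and integrate the vertical gap between the curves.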

3. Implementation: Pseudocode and Workflow

Encoder

B_boxes = D.detect(X)                      # ROI detection (e.g., YOLOv7)
B_boxes = [expand(b, p) for b in B_boxes]  # pad each box by p pixels
P = union_of_corners(B_boxes)
Hull = convex_hull(P)
AlignedHull = align_to_grid(Hull, G)       # snap to coding-unit grid
Rects = slice_and_merge(AlignedHull)
Rects = [scale_if_needed(R, S) for R in Rects]  # optional downscaling
Packed = bin_pack(Rects, W_bin, H_bin)     # greedy best-fit packing
YUV = rgb2yuv(Packed)
B_frame = VVC_encode(YUV, QP)              # standard All-Intra encode
Meta = serialize_metadata(Rect_positions, sizes, scales)
B = mux(B_frame, Meta)
return B

Decoder

YUV_packed, Meta = demux(B)
Packed = VVC_decode(YUV_packed)
Rects = deserialize(Meta)
Y_hat = zeros(original_size)               # background filled with zeros
for R in Rects:
    sub = extract(Packed, R.packed_pos, R.size_packed)
    sub_orig = rescale(sub, R.size_orig)   # undo optional downscaling
    paste(Y_hat, sub_orig, R.orig_pos)
return Y_hat

No retraining or adaptation of the downstream network is required; the reconstructed $\hat{Y}$ serves as input to the pre-existing vision model.
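To make the pack/unpack round trip concrete, here is a toy NumPy sketch (the codec and region scaling are omitted, and the side-by-side packing layout and helper name are illustrative). It shows ROI pixels surviving intact while the background is zero-filled:

```python
import numpy as np

def pack_and_restore(img, rois):
    """Toy round trip: crop each ROI (x, y, w, h), stack the crops side by
    side into a packed frame (the part a codec would compress), then paste
    them back onto a zero background at their original positions."""
    meta, x_cursor = [], 0
    h_max = max(h for _, _, _, h in rois)
    packed = np.zeros((h_max, sum(w for _, _, w, _ in rois), 3), img.dtype)
    for x, y, w, h in rois:                        # encoder side
        packed[:h, x_cursor:x_cursor + w] = img[y:y + h, x:x + w]
        meta.append((x, y, w, h, x_cursor))        # position metadata
        x_cursor += w
    restored = np.zeros_like(img)                  # decoder side
    for x, y, w, h, px in meta:
        restored[y:y + h, x:x + w] = packed[:h, px:px + w]
    return packed, restored
```

Only the packed frame (here 16×32 pixels for two small ROIs in a 64×64 image) would reach the codec, which is where the bitrate savings come from.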

4. Empirical Evaluation

Datasets and Tasks

Experiments span five datasets used in the MPEG Common Test Conditions, including FLIR (infrared), TVD (RGB), OpenImages-Det, and OpenImages-Seg. Two popular models are deployed: Faster R-CNN X101-FPN for object detection and Mask R-CNN X101-FPN for instance segmentation.

Baselines and Metrics

The baseline is the MPEG Remote Inference Anchor: VVC (VTM, All-Intra) with full-frame encoding at six quantization parameters (QPs). Evaluation metrics:

  • BD-Rate (%): Average bitrate difference for the same mAP.
  • BD-mAP (%): Average mAP difference at identical bitrate.
  • Rate–accuracy curves: mAP plotted against bits-per-pixel (bpp).

Results Table

| Task | Network | Dataset | Size | BD-Rate | BD-mAP |
|---|---|---|---|---|---|
| Detection | FRCNN X101 | FLIR IR | 100% | −10.86% | +0.60% |
| Detection | FRCNN X101 | FLIR IR | 75% | −18.10% | +2.17% |
| Detection | FRCNN X101 | TVD RGB | 100% | −40.03% | +7.26% |
| Detection | FRCNN X101 | TVD RGB | 75% | −41.94% | +8.88% |
| Segmentation | Mask R-CNN | OpenImg Seg | 100% | −31.00% | +3.50% |
| Segmentation | Mask R-CNN | TVD RGB | 75% | −44.10% | +7.18% |

These results indicate up to a 44.10% reduction in bitrate at equal accuracy, or up to +8.88% mAP at equal bitrate compared to full-frame VVC (Eimon et al., 10 Dec 2025). Rate-accuracy curves (mAP vs bpp) consistently favor ROI-Packing, especially on RGB tasks. Robustness extends across RGB and IR images, though IR images at very low bpp exhibit increased sensitivity.

5. Computational Complexity and Hyperparameters

Critical operations and their complexities:

  • Region Detection (YOLOv7): 10–20 ms per frame (GPU).
  • Convex Hull: $O(n \log n)$, where $n$ is the number of boxes.
  • Grid Alignment & Slicing: $O(k)$, where $k$ is approximately the number of hull vertices.
  • Greedy Bin-Packing: $O(m^2)$, where $m$ is the number of rectangles ($m \ll$ image size).
  • VVC Encoding/Decoding: Same order as standard All-Intra HEVC/VVC.
  • Metadata Overhead: Order of a few bytes per ROI.
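A few bytes per ROI is plausible with fixed-width fields. The sketch below uses a hypothetical record layout (the source does not specify the actual bitstream format): six 16-bit coordinates plus an 8-bit scale factor, i.e., 13 bytes per ROI.

```python
import struct

# Hypothetical fixed-width record: original x, y, w, h and packed x, y
# (uint16 each) plus a scale factor in 1/16ths (uint8) -> 13 bytes per ROI.
REC = struct.Struct("<6HB")

def serialize_metadata(rois):
    """rois: list of dicts with keys x, y, w, h, px, py, scale16."""
    blob = struct.pack("<H", len(rois))            # 2-byte ROI-count header
    for r in rois:
        blob += REC.pack(r["x"], r["y"], r["w"], r["h"],
                         r["px"], r["py"], r["scale16"])
    return blob

def deserialize_metadata(blob):
    (n,) = struct.unpack_from("<H", blob, 0)
    keys = ("x", "y", "w", "h", "px", "py", "scale16")
    return [dict(zip(keys, REC.unpack_from(blob, 2 + i * REC.size)))
            for i in range(n)]
```

Even with dozens of ROIs, this metadata stream stays well under the size of a single coding unit, so its contribution to $R_{\text{total}}$ is negligible.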

Key hyperparameters used:

| Parameter | Value(s) | Description |
|---|---|---|
| Padding $p$ | 15 px | Pixel padding for each ROI |
| Grid size $G$ | 16 | Encoder coding-unit alignment |
| QPs | 22, 27, 32, 37, 42, 47 | Quantization parameters |
| Scaling policy | Downscale by importance | E.g., small/background regions |

6. Observations, Limitations, and Directions for Advancement

ROI-Packing achieves up to 44.10% bitrate reduction without loss in downstream task accuracy and may yield up to +8.88% mAP improvement at constant bitrate. No retraining or fine-tuning of the post-decoding vision model is necessary, simplifying deployment. The method displays robustness over both RGB and IR modalities, though IR images at extremely low bpp are more sensitive.

A notable limitation is the method’s current restriction to still frames; the ROI-packing process may disrupt spatial layouts relied upon by certain models, especially for tasks sensitive to contextual or global scene structure. Extension to video remains an open direction: possible approaches include applying ROI-Packing per intra-frame period, exploiting temporal coherence of ROIs, introducing more advanced placement optimizers or learned packing, or adapting scaling policies via learned importance models. These are presented as directions for future research (Eimon et al., 10 Dec 2025).
