ROI-Packing: Efficient Region-Based Compression
- ROI-Packing is an image compression framework that identifies and packs regions of interest to reduce bitrate without compromising task accuracy.
- It employs ROI detection, padding, convex hull merging, and greedy bin-packing to efficiently encode key subregions for detection and segmentation.
- Empirical evaluations show up to 44.10% bitrate reduction and improved mAP, all without retraining existing vision models.
ROI-Packing is an efficient region-based image compression framework designed for machine vision contexts where downstream task performance is of primary importance. The approach explicitly identifies and selects Regions of Interest (ROIs) in an input image, packs these regions into a compact representation, and leverages standard codecs for transmission, enabling substantial reductions in bitrate with minimal loss (or in some cases improvement) in downstream task accuracy. ROI-Packing operates without the need to retrain or fine-tune end-task models and supports both detection and segmentation pipelines (Eimon et al., 10 Dec 2025).
1. Pipeline Structure and Operation
ROI-Packing comprises two main modules: an encoder (typically on an edge device) and a decoder (often deployed server-side). The process begins with ROI extraction from the input image $X$, targeting regions critical to the target visual task (e.g., object detection, instance segmentation). The key steps are as follows:
- ROI Detection: Use of a region detector (e.g., YOLOv7), which outputs bounding boxes $b_1, \dots, b_n$.
- Region Padding: Expansion of each detected box by $p$ pixels (with $p = 15$ as the default).
- Overlap Merging: Construction of the convex hull over all corners of the padded boxes, followed by alignment of the hull to a grid corresponding to the encoder’s coding units (e.g., $16 \times 16$).
- Slicing and Merging: Split the aligned convex hull into rectangular sub-boxes by slicing at polygon vertices and merging adjacent slices to maximize the rectangle sizes.
- Region Scaling: Optional downscaling of less critical regions (background, objects below a certain size or importance threshold).
- Packing: Bin-packing all sub-boxes into a single packed frame of size $W \times H$ using a greedy best-fit heuristic.
- Encoding and Transmission: Convert the packed image to YUV, encode with a standard All-Intra profile (e.g., VVC), and multiplex metadata (positions, sizes, scales) for transmission.
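As a concrete illustration of the padding and grid-alignment steps above, the following sketch (function and parameter names are illustrative, not from the paper; padding and grid values follow the paper's defaults, the image size is assumed) snaps a padded box outward to the codec's coding-unit grid:

```python
def pad_and_align(box, pad=15, grid=16, img_w=1920, img_h=1080):
    """Expand a (x, y, w, h) box by `pad` px, clamp it to the image,
    then snap its corners outward to multiples of `grid` (the codec's
    coding-unit size)."""
    x, y, w, h = box
    # Padding, clamped to the image bounds.
    x0 = max(x - pad, 0)
    y0 = max(y - pad, 0)
    x1 = min(x + w + pad, img_w)
    y1 = min(y + h + pad, img_h)
    # Grid alignment: round the top-left down, the bottom-right up.
    x0 = (x0 // grid) * grid
    y0 = (y0 // grid) * grid
    x1 = -(-x1 // grid) * grid   # ceiling division
    y1 = -(-y1 // grid) * grid
    return (x0, y0, x1 - x0, y1 - y0)

aligned = pad_and_align((100, 50, 40, 30))   # → (80, 32, 80, 64)
```

Rounding the top-left down and the bottom-right up guarantees the aligned box always contains the padded one.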
The decoder performs the inverse steps: after standard decoding, it unpacks the received rectangles, optionally rescales subregions, pastes them into the original image canvas, fills background with zeros, and passes the result to the target vision model.
2. Mathematical Formulation
2.1 ROI Representation and Merging
Each detected ROI $b_i$ is described by its top-left coordinates $(x_i, y_i)$ and dimensions $(w_i, h_i)$, with four key points $(x_i, y_i)$, $(x_i + w_i, y_i)$, $(x_i, y_i + h_i)$, and $(x_i + w_i, y_i + h_i)$, for $i = 1, \dots, n$. The convex hull of all such points produces the minimal enclosing polygon, which is then grid-aligned. Rectangular sub-boxes are constructed by greedy slicing to maximize packing efficiency.
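The corner-collection and hull steps can be sketched with Andrew's monotone-chain algorithm (a standard $O(n \log n)$ convex-hull method; the helper names are illustrative, not from the paper):

```python
def box_corners(boxes):
    """Four key points per box (x, y, w, h): all corner combinations."""
    pts = []
    for x, y, w, h in boxes:
        pts += [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]
    return pts

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counterclockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

hull = convex_hull(box_corners([(0, 0, 10, 10), (5, 5, 10, 10)]))
```

For the two overlapping boxes above, the hull is a hexagon: interior corners such as $(5, 5)$ are dropped.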
2.2 Packing Optimization
With final rectangles $R_1, \dots, R_k$, each $R_j$ of size $w_j \times h_j$ placed at position $(u_j, v_j)$ in the packed frame of size $W \times H$, packing must satisfy containment, $0 \le u_j$, $u_j + w_j \le W$, $0 \le v_j$, $v_j + h_j \le H$, and pairwise non-overlap, $R_j \cap R_{j'} = \emptyset$ for $j \ne j'$. A best-fit bin-packing heuristic is used to minimize the unused area

$$W \cdot H - \sum_{j=1}^{k} w_j h_j.$$
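A simple shelf-based variant of best-fit packing illustrates the idea; this is a simplification of the paper's greedy best-fit packer (axis-aligned placement, no rotation, fixed bin width):

```python
def shelf_pack(rects, bin_w):
    """Greedy shelf packing: place rectangles tallest-first, choosing the
    existing shelf whose leftover width fits best (smallest slack), else
    opening a new shelf. Returns placements [(x, y, w, h)] and the
    resulting frame height."""
    order = sorted(rects, key=lambda r: r[1], reverse=True)  # tallest first
    shelves = []      # each shelf: [y_offset, shelf_height, used_width]
    placements = []
    y_cursor = 0
    for w, h in order:
        best = None
        for s in shelves:
            if s[1] >= h and bin_w - s[2] >= w:
                slack = bin_w - s[2] - w
                if best is None or slack < best[0]:
                    best = (slack, s)
        if best is None:                       # no shelf fits: open one
            shelves.append([y_cursor, h, 0])
            y_cursor += h
            best = (bin_w - w, shelves[-1])
        shelf = best[1]
        placements.append((shelf[2], shelf[0], w, h))
        shelf[2] += w
    return placements, y_cursor

placements, height = shelf_pack([(30, 20), (20, 10), (10, 10)], bin_w=40)
```

Here the $10 \times 10$ rectangle slots into the leftover width of the first shelf rather than widening the second, which is exactly the best-fit (smallest-slack) criterion at work.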
2.3 Rate–Distortion-Accuracy Trade-off
While not explicitly Lagrangian, the optimization can be framed as

$$\min R \quad \text{s.t.} \quad D \le D_{\max},$$

or, equivalently,

$$\min D \quad \text{s.t.} \quad R \le R_{\max},$$

where $R$ is the total bitrate (including metadata), and $D$ is application-specific distortion, typically the accuracy drop in downstream tasks (e.g., mAP). Empirical metrics reported include BD-Rate (bitrate change for constant accuracy) and BD-mAP (accuracy change for constant bitrate).
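In practice the constrained form corresponds to sweeping the codec's QP values and selecting an operating point. A hypothetical selection routine (the measurement values are invented for illustration and are not from the paper):

```python
def pick_qp(measurements, map_full, max_drop=0.5):
    """Among QPs whose mAP drop (in points, vs. the full-quality mAP
    `map_full`) stays within `max_drop`, pick the lowest-bitrate one.
    `measurements` maps QP -> (bpp, mAP)."""
    feasible = [(bpp, qp) for qp, (bpp, m) in measurements.items()
                if map_full - m <= max_drop]
    if not feasible:  # constraint unsatisfiable: fall back to best accuracy
        return max(measurements, key=lambda qp: measurements[qp][1])
    return min(feasible)[1]

# Invented per-QP measurements (bpp, mAP), for illustration only:
qps = {22: (0.80, 41.9), 27: (0.45, 41.6), 32: (0.25, 41.0), 37: (0.12, 39.2)}
```

Tightening or loosening `max_drop` moves the chosen QP along the rate-accuracy curve, which is the constrained trade-off in miniature.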
3. Implementation: Pseudocode and Workflow
Encoder
```
B_boxes = D.detect(X)                    # ROI detection (e.g., YOLOv7)
for b in B_boxes:
    b = expand(b, p)                     # pad each box by p pixels
P = union_of_corners(B_boxes)
Hull = convex_hull(P)
AlignedHull = align_to_grid(Hull, G)     # snap to the coding-unit grid
Rects = slice_and_merge(AlignedHull)
for R in Rects:
    R = scale_if_needed(R, S)            # optional downscaling
Packed = bin_pack(Rects, W_bin, H_bin)   # greedy best-fit packing
YUV = rgb2yuv(Packed)
B_frame = VVC_encode(YUV, QP)
Meta = serialize_metadata(Rect_positions, sizes, scales)
B = mux(B_frame, Meta)
return B
```
Decoder
```
YUV_packed, Meta = demux(B)
Packed = VVC_decode(YUV_packed)
Rects = deserialize(Meta)
Y_hat = zeros(original_size)             # background filled with zeros
for R in Rects:
    sub = extract(Packed, R.packed_pos, R.size_packed)
    sub_orig = rescale(sub, R.size_orig)
    paste(Y_hat, sub_orig, R.orig_pos)
return Y_hat
```
No retraining or adaptation of the downstream network is required; the reconstructed image $\hat{Y}$ serves as input to the pre-existing vision model.
4. Empirical Evaluation
Datasets and Tasks
Experiments span five datasets used in the MPEG Common Test Conditions, including FLIR (infrared), TVD (RGB), OpenImages-Det, and OpenImages-Seg. Two popular models are deployed: Faster R-CNN X101-FPN for object detection and Mask R-CNN X101-FPN for instance segmentation.
Baselines and Metrics
The baseline is the MPEG Remote Inference Anchor: VVC (VTM, All-Intra) with full-frame encoding at six quantization parameters (QP values). Evaluation metrics:
- BD-Rate (%): Average bitrate difference for the same mAP.
- BD-mAP (%): Average mAP difference at identical bitrate.
- Rate–Accuracy curves: mAP plotted against bits per pixel (bpp).
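BD-Rate is conventionally computed with the Bjøntegaard delta: fit a cubic polynomial of log-rate against the quality metric for each curve and average their gap over the overlapping metric range. A sketch of that standard procedure (not code from the paper):

```python
import numpy as np

def bd_rate(rate_anchor, map_anchor, rate_test, map_test):
    """Bjøntegaard-delta rate (%): average log10-rate difference between
    two rate-accuracy curves over their overlapping metric range, using
    the customary cubic fit of log-rate as a function of the metric."""
    lr_a = np.log10(rate_anchor)
    lr_t = np.log10(rate_test)
    pa = np.polyfit(map_anchor, lr_a, 3)
    pt = np.polyfit(map_test, lr_t, 3)
    lo = max(min(map_anchor), min(map_test))   # overlapping metric range
    hi = min(max(map_anchor), max(map_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_diff = (it - ia) / (hi - lo)
    return (10 ** avg_diff - 1) * 100   # negative = bitrate savings

r = [0.1, 0.2, 0.4, 0.8]         # anchor bpp (illustrative values)
m = [30.0, 35.0, 38.0, 40.0]     # anchor mAP (illustrative values)
```

As a sanity check, a test curve with half the anchor's rate at every accuracy level yields −50%.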
Results Table
| Task | Network | Dataset | Input Scale | BD-Rate | BD-mAP |
|---|---|---|---|---|---|
| Detection | FRCNN X101 | FLIR IR | 100% | −10.86% | +0.60% |
| Detection | FRCNN X101 | FLIR IR | 75% | −18.10% | +2.17% |
| Detection | FRCNN X101 | TVD RGB | 100% | −40.03% | +7.26% |
| Detection | FRCNN X101 | TVD RGB | 75% | −41.94% | +8.88% |
| Segmentation | Mask R-CNN | OpenImg Seg | 100% | −31.00% | +3.50% |
| Segmentation | Mask R-CNN | TVD RGB | 75% | −44.10% | +7.18% |
These results indicate up to a 44.10% reduction in bitrate at equal accuracy, or up to +8.88% mAP at equal bitrate compared to full-frame VVC (Eimon et al., 10 Dec 2025). Rate-accuracy curves (mAP vs bpp) consistently favor ROI-Packing, especially on RGB tasks. Robustness extends across RGB and IR images, though IR images at very low bpp exhibit increased sensitivity.
5. Computational Complexity and Hyperparameters
Critical operations and their complexities:
- Region Detection (YOLOv7): $10$–$20$ ms per frame (GPU).
- Convex Hull: $O(n \log n)$, with $n$ the number of boxes.
- Grid Alignment & Slicing: $O(m)$, with $m$ the number of hull vertices.
- Greedy Bin-Packing: $O(k^2)$ in the worst case, with $k$ the number of rectangles ($k \ll$ image size).
- VVC Encoding/Decoding: Same order as standard All-Intra HEVC/VVC.
- Metadata Overhead: Order of a few bytes per ROI.
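To make the "few bytes per ROI" estimate concrete, a hypothetical fixed-layout record (the field choices are illustrative and not the paper's actual bitstream format) could be serialized as:

```python
import struct

# Hypothetical record per ROI: packed-frame position (u, v), original
# position (x, y), and size (w, h) as little-endian uint16, plus a
# one-byte scale code.
ROI_RECORD = struct.Struct("<6HB")   # 6 * 2 + 1 = 13 bytes per ROI

def serialize_metadata(rois):
    """rois: iterable of (u, v, x, y, w, h, scale_code) tuples."""
    return b"".join(ROI_RECORD.pack(*r) for r in rois)

blob = serialize_metadata([(0, 0, 128, 64, 256, 192, 1)])   # 13 bytes
```

Even a few dozen ROIs would add well under a kilobyte, negligible next to the coded frame.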
Key hyperparameters used:
| Parameter | Value(s) | Description |
|---|---|---|
| Padding | 15 px | Pixel padding for each ROI |
| Grid Size | 16 | Encoder coding unit alignment |
| QPs | 22, 27, 32, 37, 42, 47 | Quantization parameters |
| Scaling Policy | Downscale by importance | E.g., small/background regions |
6. Observations, Limitations, and Directions for Advancement
ROI-Packing achieves up to 44.10% bitrate reduction without loss in downstream task accuracy and may yield up to +8.88% mAP improvement at constant bitrate. No retraining or fine-tuning of the post-decoding vision model is necessary, simplifying deployment. The method displays robustness over both RGB and IR modalities, though IR images at extremely low bpp are more sensitive.
A notable limitation is the method’s current restriction to still frames; the ROI-packing process may disrupt spatial layouts relied upon by certain models, especially for tasks sensitive to contextual or global scene structure. Extension to video remains an open direction: possible approaches include applying ROI-Packing per intra-frame period, exploiting temporal coherence of ROIs, introducing more advanced placement optimizers or learned packing, or adapting scaling policies via learned importance models. These are presented as directions for future research (Eimon et al., 10 Dec 2025).