ROI-Packing: Efficient Region-Based Compression
- ROI-Packing is an image compression framework that identifies and packs regions of interest to reduce bitrate without compromising task accuracy.
- It employs ROI detection, padding, convex hull merging, and greedy bin-packing to efficiently encode key subregions for detection and segmentation.
- Empirical evaluations show up to 44.10% bitrate reduction and improved mAP, all without retraining existing vision models.
ROI-Packing is an efficient region-based image compression framework designed for machine vision contexts where downstream task performance is of primary importance. The approach explicitly identifies and selects Regions of Interest (ROIs) in an input image, packs these regions into a compact representation, and leverages standard codecs for transmission, enabling substantial reductions in bitrate with minimal loss (or in some cases improvement) in downstream task accuracy. ROI-Packing operates without the need to retrain or fine-tune end-task models and supports both detection and segmentation pipelines (Eimon et al., 10 Dec 2025).
1. Pipeline Structure and Operation
ROI-Packing comprises two main modules: an encoder (typically on an edge device) and a decoder (often deployed server-side). The process begins with ROI extraction from the input image $X$, targeting regions critical to the target visual task (e.g., object detection, instance segmentation). The key steps are as follows:
- ROI Detection: Use of a region detector (e.g., YOLOv7), which outputs bounding boxes $b_1, \dots, b_n$.
- Region Padding: Expansion of each detected box by $p$ pixels (with $p = 15$ as the default).
- Overlap Merging: Construction of the convex hull over all corners of the padded boxes, followed by alignment of the hull to a grid corresponding to the encoder’s coding units (e.g., $16 \times 16$).
- Slicing and Merging: Split the aligned convex hull into rectangular sub-boxes by slicing at polygon vertices and merging adjacent slices to maximize the rectangle sizes.
- Region Scaling: Optional downscaling of less critical regions (background, objects below a certain size or importance threshold).
- Packing: Bin-packing all sub-boxes into a single packed frame of size $W \times H$ using a greedy best-fit heuristic.
- Encoding and Transmission: Convert the packed image to YUV, encode with a standard All-Intra profile (e.g., VVC), and multiplex metadata (positions, sizes, scales) for transmission.
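As a concrete illustration of the padding and grid-alignment steps above, the following sketch (function and parameter names are illustrative, not from the paper; padding and grid values follow the paper's defaults, the image size is assumed) snaps a padded box outward to the codec's coding-unit grid:

```python
def pad_and_align(box, pad=15, grid=16, img_w=1920, img_h=1080):
    """Expand a (x, y, w, h) box by `pad` px, clamp it to the image,
    then snap its corners outward to multiples of `grid` (the codec's
    coding-unit size)."""
    x, y, w, h = box
    # Padding, clamped to the image bounds.
    x0 = max(x - pad, 0)
    y0 = max(y - pad, 0)
    x1 = min(x + w + pad, img_w)
    y1 = min(y + h + pad, img_h)
    # Grid alignment: round the top-left down, the bottom-right up.
    x0 = (x0 // grid) * grid
    y0 = (y0 // grid) * grid
    x1 = -(-x1 // grid) * grid   # ceiling division
    y1 = -(-y1 // grid) * grid
    return (x0, y0, x1 - x0, y1 - y0)

aligned = pad_and_align((100, 50, 40, 30))   # → (80, 32, 80, 64)
```

Rounding the top-left down and the bottom-right up guarantees the aligned box always contains the padded one.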
The decoder performs the inverse steps: after standard decoding, it unpacks the received rectangles, optionally rescales subregions, pastes them into the original image canvas, fills background with zeros, and passes the result to the target vision model.
2. Mathematical Formulation
2.1 ROI Representation and Merging
Each detected ROI $b_i$ is described by its top-left coordinates $(x_i, y_i)$ and dimensions $(w_i, h_i)$, with four key points $(x_i, y_i)$, $(x_i + w_i, y_i)$, $(x_i, y_i + h_i)$, and $(x_i + w_i, y_i + h_i)$, for $i = 1, \dots, n$. The convex hull of all such points produces the minimal enclosing polygon, which is then grid-aligned. Rectangular sub-boxes are constructed by greedy slicing to maximize packing efficiency.
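The corner-collection and hull steps can be sketched with Andrew's monotone-chain algorithm (a standard $O(n \log n)$ convex-hull method; the helper names are illustrative, not from the paper):

```python
def box_corners(boxes):
    """Four key points per box (x, y, w, h): all corner combinations."""
    pts = []
    for x, y, w, h in boxes:
        pts += [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]
    return pts

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counterclockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

hull = convex_hull(box_corners([(0, 0, 10, 10), (5, 5, 10, 10)]))
```

For the two overlapping boxes above, the hull is a hexagon: interior corners such as $(5, 5)$ are dropped.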
2.2 Packing Optimization
With final rectangles $R_1, \dots, R_k$, each $R_j$ of size $w_j \times h_j$ placed at position $(u_j, v_j)$ in the packed frame of size $W \times H$, packing must satisfy containment, $0 \le u_j$, $u_j + w_j \le W$, $0 \le v_j$, $v_j + h_j \le H$, and pairwise non-overlap, $R_j \cap R_{j'} = \emptyset$ for $j \ne j'$. A best-fit bin-packing heuristic is used to minimize the unused area

$$W \cdot H - \sum_{j=1}^{k} w_j h_j.$$
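A simple shelf-based variant of best-fit packing illustrates the idea; this is a simplification of the paper's greedy best-fit packer (axis-aligned placement, no rotation, fixed bin width):

```python
def shelf_pack(rects, bin_w):
    """Greedy shelf packing: place rectangles tallest-first, choosing the
    existing shelf whose leftover width fits best (smallest slack), else
    opening a new shelf. Returns placements [(x, y, w, h)] and the
    resulting frame height."""
    order = sorted(rects, key=lambda r: r[1], reverse=True)  # tallest first
    shelves = []      # each shelf: [y_offset, shelf_height, used_width]
    placements = []
    y_cursor = 0
    for w, h in order:
        best = None
        for s in shelves:
            if s[1] >= h and bin_w - s[2] >= w:
                slack = bin_w - s[2] - w
                if best is None or slack < best[0]:
                    best = (slack, s)
        if best is None:                       # no shelf fits: open one
            shelves.append([y_cursor, h, 0])
            y_cursor += h
            best = (bin_w - w, shelves[-1])
        shelf = best[1]
        placements.append((shelf[2], shelf[0], w, h))
        shelf[2] += w
    return placements, y_cursor

placements, height = shelf_pack([(30, 20), (20, 10), (10, 10)], bin_w=40)
```

Here the $10 \times 10$ rectangle slots into the leftover width of the first shelf rather than widening the second, which is exactly the best-fit (smallest-slack) criterion at work.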
2.3 Rate–Distortion-Accuracy Trade-off
While not explicitly Lagrangian, the optimization can be framed as

$$\min R \quad \text{s.t.} \quad D \le D_{\max},$$

or, equivalently,

$$\min D \quad \text{s.t.} \quad R \le R_{\max},$$

where $R$ is the total bitrate (including metadata), and $D$ is application-specific distortion, typically the accuracy drop in downstream tasks (e.g., mAP). Empirical metrics reported include BD-Rate (bitrate change for constant accuracy) and BD-mAP (accuracy change for constant bitrate).
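In practice the constrained form corresponds to sweeping the codec's QP values and selecting an operating point. A hypothetical selection routine (the measurement values are invented for illustration and are not from the paper):

```python
def pick_qp(measurements, map_full, max_drop=0.5):
    """Among QPs whose mAP drop (in points, vs. the full-quality mAP
    `map_full`) stays within `max_drop`, pick the lowest-bitrate one.
    `measurements` maps QP -> (bpp, mAP)."""
    feasible = [(bpp, qp) for qp, (bpp, m) in measurements.items()
                if map_full - m <= max_drop]
    if not feasible:  # constraint unsatisfiable: fall back to best accuracy
        return max(measurements, key=lambda qp: measurements[qp][1])
    return min(feasible)[1]

# Invented per-QP measurements (bpp, mAP), for illustration only:
qps = {22: (0.80, 41.9), 27: (0.45, 41.6), 32: (0.25, 41.0), 37: (0.12, 39.2)}
```

Tightening or loosening `max_drop` moves the chosen QP along the rate-accuracy curve, which is the constrained trade-off in miniature.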
3. Implementation: Pseudocode and Workflow
Encoder
```
B_boxes = D.detect(X)                    # ROI detection (e.g., YOLOv7)
for b in B_boxes:
    b = expand(b, p)                     # pad each box by p pixels
P = union_of_corners(B_boxes)
Hull = convex_hull(P)
AlignedHull = align_to_grid(Hull, G)     # snap to the coding-unit grid
Rects = slice_and_merge(AlignedHull)
for R in Rects:
    R = scale_if_needed(R, S)            # optional downscaling
Packed = bin_pack(Rects, W_bin, H_bin)   # greedy best-fit packing
YUV = rgb2yuv(Packed)
B_frame = VVC_encode(YUV, QP)
Meta = serialize_metadata(Rect_positions, sizes, scales)
B = mux(B_frame, Meta)
return B
```
Decoder
```
YUV_packed, Meta = demux(B)
Packed = VVC_decode(YUV_packed)
Rects = deserialize(Meta)
Y_hat = zeros(original_size)             # background filled with zeros
for R in Rects:
    sub = extract(Packed, R.packed_pos, R.size_packed)
    sub_orig = rescale(sub, R.size_orig)
    paste(Y_hat, sub_orig, R.orig_pos)
return Y_hat
```
No retraining or adaptation of the downstream network is required; the reconstructed image $\hat{Y}$ serves as input to the pre-existing vision model.
4. Empirical Evaluation
Datasets and Tasks
Experiments span five datasets used in the MPEG Common Test Conditions, including FLIR (infrared), TVD (RGB), OpenImages-Det, and OpenImages-Seg. Two popular models are deployed: Faster R-CNN X101-FPN for object detection and Mask R-CNN X101-FPN for instance segmentation.
Baselines and Metrics
The baseline is the MPEG Remote Inference Anchor: VVC (VTM, All-Intra) with full-frame encoding at six quantization parameters (QP values). Evaluation metrics:
- BD-Rate (%): Average bitrate difference for the same mAP.
- BD-mAP (%): Average mAP difference at identical bitrate.
- Rate–Accuracy curves: mAP plotted against bits per pixel (bpp).
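BD-Rate is conventionally computed with the Bjøntegaard delta: fit a cubic polynomial of log-rate against the quality metric for each curve and average their gap over the overlapping metric range. A sketch of that standard procedure (not code from the paper):

```python
import numpy as np

def bd_rate(rate_anchor, map_anchor, rate_test, map_test):
    """Bjøntegaard-delta rate (%): average log10-rate difference between
    two rate-accuracy curves over their overlapping metric range, using
    the customary cubic fit of log-rate as a function of the metric."""
    lr_a = np.log10(rate_anchor)
    lr_t = np.log10(rate_test)
    pa = np.polyfit(map_anchor, lr_a, 3)
    pt = np.polyfit(map_test, lr_t, 3)
    lo = max(min(map_anchor), min(map_test))   # overlapping metric range
    hi = min(max(map_anchor), max(map_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_diff = (it - ia) / (hi - lo)
    return (10 ** avg_diff - 1) * 100   # negative = bitrate savings

r = [0.1, 0.2, 0.4, 0.8]         # anchor bpp (illustrative values)
m = [30.0, 35.0, 38.0, 40.0]     # anchor mAP (illustrative values)
```

As a sanity check, a test curve with half the anchor's rate at every accuracy level yields −50%.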
Results Table
| Task | Network | Dataset | Input Scale | BD-Rate | BD-mAP |
|---|---|---|---|---|---|
| Detection | FRCNN X101 | FLIR IR | 100% | −10.86% | +0.60% |
| Detection | FRCNN X101 | FLIR IR | 75% | −18.10% | +2.17% |
| Detection | FRCNN X101 | TVD RGB | 100% | −40.03% | +7.26% |
| Detection | FRCNN X101 | TVD RGB | 75% | −41.94% | +8.88% |
| Segmentation | Mask R-CNN | OpenImg Seg | 100% | −31.00% | +3.50% |
| Segmentation | Mask R-CNN | TVD RGB | 75% | −44.10% | +7.18% |
These results indicate up to a 44.10% reduction in bitrate at equal accuracy, or up to +8.88% mAP at equal bitrate compared to full-frame VVC (Eimon et al., 10 Dec 2025). Rate-accuracy curves (mAP vs bpp) consistently favor ROI-Packing, especially on RGB tasks. Robustness extends across RGB and IR images, though IR images at very low bpp exhibit increased sensitivity.
5. Computational Complexity and Hyperparameters
Critical operations and their complexities:
- Region Detection (YOLOv7): $10$–$20$ ms per frame (GPU).
- Convex Hull: $O(n \log n)$, with $n$ the number of boxes.
- Grid Alignment & Slicing: $O(m)$, with $m$ the number of hull vertices.
- Greedy Bin-Packing: $O(k^2)$ in the worst case, with $k$ the number of rectangles ($k \ll$ image size).
- VVC Encoding/Decoding: Same order as standard All-Intra HEVC/VVC.
- Metadata Overhead: Order of a few bytes per ROI.
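To make the "few bytes per ROI" estimate concrete, a hypothetical fixed-layout record (the field choices are illustrative and not the paper's actual bitstream format) could be serialized as:

```python
import struct

# Hypothetical record per ROI: packed-frame position (u, v), original
# position (x, y), and size (w, h) as little-endian uint16, plus a
# one-byte scale code.
ROI_RECORD = struct.Struct("<6HB")   # 6 * 2 + 1 = 13 bytes per ROI

def serialize_metadata(rois):
    """rois: iterable of (u, v, x, y, w, h, scale_code) tuples."""
    return b"".join(ROI_RECORD.pack(*r) for r in rois)

blob = serialize_metadata([(0, 0, 128, 64, 256, 192, 1)])   # 13 bytes
```

Even a few dozen ROIs would add well under a kilobyte, negligible next to the coded frame.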
Key hyperparameters used:
| Parameter | Value(s) | Description |
|---|---|---|
| Padding | 15 px | Pixel padding for each ROI |
| Grid Size | 16 | Encoder coding unit alignment |
| QPs | 22, 27, 32, 37, 42, 47 | Quantization parameters |
| Scaling Policy | Downscale by importance | E.g., small/background regions |
6. Observations, Limitations, and Directions for Advancement
ROI-Packing achieves up to 44.10% bitrate reduction without loss in downstream task accuracy and may yield up to +8.88% mAP improvement at constant bitrate. No retraining or fine-tuning of the post-decoding vision model is necessary, simplifying deployment. The method displays robustness over both RGB and IR modalities, though IR images at extremely low bpp are more sensitive.
A notable limitation is the method’s current restriction to still frames; the ROI-packing process may disrupt spatial layouts relied upon by certain models, especially for tasks sensitive to contextual or global scene structure. Extension to video remains an open direction: possible approaches include applying ROI-Packing per intra-frame period, exploiting temporal coherence of ROIs, introducing more advanced placement optimizers or learned packing, or adapting scaling policies via learned importance models. These are presented as directions for future research (Eimon et al., 10 Dec 2025).