D-FINE-seg: Transformer-Based Instance Segmentation

Updated 4 July 2026

D-FINE-seg is a transformer-based framework that extends the D-FINE detector by adding a lightweight, query-conditioned mask head and segmentation-aware training.
It employs a shared HybridEncoder with multi-scale feature fusion and an integrated transformer decoder to simultaneously predict boxes and masks.
Evaluated on the TACO dataset, it achieves roughly 41% higher mask mAP and about 65% higher deployment-style F1 than YOLO26-seg, at roughly 10% latency overhead.

D-FINE-seg is a transformer-based framework for real-time object detection and instance segmentation that extends the D-FINE detector with a lightweight mask head, segmentation-aware training, and a multi-backend deployment pipeline spanning PyTorch, ONNX, TensorRT, and OpenVINO. It retains D-FINE’s end-to-end set prediction regime, denoising queries, Fine-grained Distribution Refinement for box regression, and Global Optimal Localization Self-Distillation, while adding box-cropped BCE and Dice mask losses, auxiliary and denoising mask supervision, and an adapted Hungarian matching cost with mask terms. The framework is released as open-source under the Apache-2.0 license and is evaluated primarily on the TACO dataset under a unified TensorRT FP16 end-to-end benchmarking protocol (Saakyan et al., 26 Feb 2026).

1. Architectural lineage and system definition

D-FINE-seg is defined as an instance segmentation extension of D-FINE rather than a separate detector family. The inherited detection stack comprises a CNN backbone, a HybridEncoder with FPN top-down and PAN bottom-up connections, and a transformer decoder with object queries and contrastive denoising queries. The HybridEncoder outputs multi-scale feature maps at strides 8, 16, and 32, and these encoder outputs are shared by both the detection decoder and the segmentation branch. For each image, the decoder produces a set of query embeddings, and each query feeds both the detection head, which predicts class logits and bounding box distributions, and the mask head, which predicts per-instance mask embeddings (Saakyan et al., 26 Feb 2026).

The detector-side behavior remains DETR-like: end-to-end set prediction, direct use of final decoder detections, and no NMS in the described inference path. The detection head continues to use Fine-grained Distribution Refinement, which predicts distributions over positions and refines them iteratively, and Global Optimal Localization Self-Distillation, which distills knowledge from the last decoder layer into earlier layers.

The framework is implemented in five scales—N, S, M, L, and X—with increasing backbone depth and decoder width. Example parameter counts reported for the segmentation models are N 5.1M, S 11.9M, M 21.2M, L 32.8M, and X 64.3M. This scaling preserves a common architecture while exposing a latency–accuracy design space suited to both GPU and edge deployment.

2. Lightweight mask head and mask generation

The mask head is described as Mask DINO–style in its query-to-mask mechanism but intentionally lighter and restricted to the encoder’s PAN outputs, with no stride-4 backbone feature. Let the HybridEncoder outputs be $F_8 \in \mathbb{R}^{C_8 \times H/8 \times W/8}$, $F_{16} \in \mathbb{R}^{C_{16} \times H/16 \times W/16}$, and $F_{32} \in \mathbb{R}^{C_{32} \times H/32 \times W/32}$. Each is projected to a common channel dimension, default $C = 256$:

$\tilde{F}_s = \mathrm{GN}(\mathrm{Conv}_{1\times1}(F_s)), \quad s \in \{8,16,32\}.$

The projected features are fused at stride 8 by bilinear upsampling of the coarser levels and summation:

$F_{\mathrm{fused}}^{1/8} = \tilde{F}_8 + \mathrm{Upsample}(\tilde{F}_{16}) + \mathrm{Upsample}(\tilde{F}_{32}).$

The fused feature is refined by a $3\times3$ conv + GN + ReLU at stride 8 and then bilinearly upsampled to stride 4, followed by another $3\times3$ conv + GN + ReLU, yielding $F_{\mathrm{mask}}^{1/4} \in \mathbb{R}^{C \times H/4 \times W/4}$ (Saakyan et al., 26 Feb 2026).
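The projection and stride-8 fusion can be illustrated with a small NumPy sketch that tracks shapes only. Several things here are assumptions rather than details from the paper: the input channel counts, the toy image size, the random weights, GroupNorm without learned affine parameters, and nearest-neighbour upsampling standing in for the bilinear upsampling described above. The two $3\times3$ refinement convolutions and the final upsampling to stride 4 are omitted.

```python
import numpy as np

def group_norm(x, groups=32, eps=1e-5):
    """GroupNorm without learned affine parameters; x has shape (C, H, W)."""
    c, h, w = x.shape
    g = x.reshape(groups, c // groups, h, w)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(c, h, w)

def project(feat, weight):
    """1x1 convolution as a channel-mixing product, followed by GroupNorm."""
    return group_norm(np.einsum("oc,chw->ohw", weight, feat))

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling (stand-in for bilinear upsampling)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

rng = np.random.default_rng(0)
H, W, C = 64, 96, 256                      # toy image size, common channel dim
channels = {8: 128, 16: 256, 32: 512}      # hypothetical C_8, C_16, C_32

feats = {s: rng.standard_normal((c, H // s, W // s)) for s, c in channels.items()}
weights = {s: rng.standard_normal((C, c)) * 0.01 for s, c in channels.items()}

proj = {s: project(feats[s], weights[s]) for s in channels}
fused = proj[8] + upsample_nearest(proj[16], 2) + upsample_nearest(proj[32], 4)
print(fused.shape)  # (256, 8, 12)
```

All three levels end up on the stride-8 grid with the same channel count, which is what makes plain summation possible.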

Mask generation is query-conditioned. For decoder layer $l$, with query hidden states $Q^{(l)} \in \mathbb{R}^{N \times D}$, each query embedding $q_i^{(l)} \in \mathbb{R}^{D}$ is projected by a 3-layer MLP into a mask embedding

$e_i^{(l)} = \mathrm{MLP}(q_i^{(l)}) \in \mathbb{R}^{C}.$

Reshaping $F_{\mathrm{mask}}^{1/4}$ to $\mathbb{R}^{C \times (H/4 \cdot W/4)}$ gives per-query mask logits

$M_i^{(l)} = (e_i^{(l)})^{\top} F_{\mathrm{mask}}^{1/4} \in \mathbb{R}^{H/4 \times W/4},$

which is operationally equivalent to a dynamic $1\times1$ convolution with $e_i^{(l)}$ as the instance-specific kernel. Final-layer masks are used for inference, while intermediate-layer masks support auxiliary supervision. Because the queries are shared between detection and segmentation, D-FINE-seg is a genuine multi-task architecture rather than a detector with a detached post hoc mask branch.
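The query-to-mask step amounts to a dot product between each query's mask embedding and the stride-4 mask feature at every pixel. The NumPy sketch below illustrates this with toy sizes and random tensors (both assumptions); the 3-layer MLP that produces the embeddings is not shown, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H4, W4 = 5, 16, 8, 12          # toy sizes: queries, channels, stride-4 grid

mask_embed = rng.standard_normal((N, C))        # stands in for the MLP output
mask_feat = rng.standard_normal((C, H4, W4))    # stands in for the stride-4 mask feature

# Flatten the spatial grid, take dot products, and restore the grid.
logits = (mask_embed @ mask_feat.reshape(C, H4 * W4)).reshape(N, H4, W4)

# Same result written as a dynamic 1x1 convolution: one C-channel kernel per query.
logits_conv = np.einsum("nc,chw->nhw", mask_embed, mask_feat)

probs = 1.0 / (1.0 + np.exp(-logits))           # sigmoid gives per-pixel probabilities
print(logits.shape, np.allclose(logits, logits_conv))  # (5, 8, 12) True
```

Because the mask feature is computed once per image and only the small embeddings differ per query, the cost of adding masks grows slowly with the number of queries.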

3. Segmentation-aware optimization and mask-aware matching

The training objective augments the original D-FINE criterion with ROI-cropped mask supervision. For a matched prediction–ground-truth pair, the predicted mask probabilities $\hat{m} = \sigma(M)$ are obtained by sigmoid on the mask logits $M$, and the ground-truth mask $m$ is resized to the mask-head resolution via bilinear interpolation. Let $\Omega$ be the pixel set inside the ground-truth bounding box mapped to the $1/4$-resolution grid. D-FINE-seg computes binary cross-entropy only inside this ROI:

$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{|\Omega|}\sum_{p \in \Omega}\left[m_p \log \hat{m}_p + (1 - m_p)\log(1 - \hat{m}_p)\right].$

The corresponding soft Dice loss is likewise box-cropped:

$\mathrm{Dice} = \frac{2\sum_{p \in \Omega}\hat{m}_p m_p + \epsilon}{\sum_{p \in \Omega}\hat{m}_p + \sum_{p \in \Omega} m_p + \epsilon},$

$\mathcal{L}_{\mathrm{Dice}} = 1 - \mathrm{Dice}.$

The two mask losses enter the objective with weights $\lambda_{\mathrm{BCE}}$ and $\lambda_{\mathrm{Dice}}$, whose values are reported in the paper (Saakyan et al., 26 Feb 2026).

Per decoder layer, the total loss is

$\mathcal{L}^{(l)} = \mathcal{L}_{\mathrm{det}}^{(l)} + \lambda_{\mathrm{BCE}}\,\mathcal{L}_{\mathrm{BCE}}^{(l)} + \lambda_{\mathrm{Dice}}\,\mathcal{L}_{\mathrm{Dice}}^{(l)},$

where $\mathcal{L}_{\mathrm{det}}^{(l)}$ collects the original D-FINE detection terms at layer $l$.

The full suite is applied to the final decoder layer, and analogous supervision is applied to intermediate layers as auxiliary losses. D-FINE-seg also extends denoising supervision to masks: denoising queries receive their own mask logits and are trained with the same cropped BCE and Dice losses.
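A minimal NumPy sketch of box-cropped BCE and soft Dice for a single matched pair is given below. It is an illustration under stated assumptions, not the repository's implementation: the function name, the end-exclusive integer box convention, the mean normalization of BCE, and the smoothing constant `eps` are all choices made here.

```python
import numpy as np

def cropped_mask_losses(logits, gt_mask, box, eps=1e-6):
    """Box-cropped BCE and soft Dice loss for one matched pair.

    logits, gt_mask: arrays of shape (H, W) on the mask-head grid.
    box: (x0, y0, x1, y1) ground-truth box in grid coordinates, end-exclusive.
    """
    x0, y0, x1, y1 = box
    z = logits[y0:y1, x0:x1]                 # only pixels inside the box count
    m = gt_mask[y0:y1, x0:x1].astype(float)
    p = 1.0 / (1.0 + np.exp(-z))
    p = np.clip(p, eps, 1.0 - eps)           # avoid log(0)
    bce = -np.mean(m * np.log(p) + (1.0 - m) * np.log(1.0 - p))
    dice = (2.0 * np.sum(p * m) + eps) / (np.sum(p) + np.sum(m) + eps)
    return bce, 1.0 - dice

gt = np.zeros((16, 16))
gt[4:8, 4:10] = 1
good = np.where(gt > 0, 10.0, -10.0)         # confident, correct logits
noisy = good.copy()
noisy[12:, 12:] = 10.0                       # false positives outside the box
box = (3, 3, 11, 9)

print(cropped_mask_losses(good, gt, box))
print(cropped_mask_losses(noisy, gt, box) == cropped_mask_losses(good, gt, box))  # True
```

The second print shows the consequence of cropping: activations outside the ground-truth box do not change either loss, which matches a postprocessing rule that discards those pixels anyway.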

Hungarian matching is modified to incorporate masks:

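$\mathcal{C} = \mathcal{C}_{\mathrm{det}} + \lambda_{\mathrm{Dice}}^{\mathrm{match}}\,\mathcal{C}_{\mathrm{Dice}} + \lambda_{\mathrm{mask}}^{\mathrm{match}}\,\mathcal{C}_{\mathrm{mask}},$

where $\mathcal{C}_{\mathrm{det}}$ denotes the original D-FINE matching cost.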

The two added mask costs are a Dice overlap cost and a sigmoid focal-like mask cost computed on the full mask-head resolution grid rather than the cropped ROI. This division of labor is deliberate: matching uses full-mask global similarity, whereas optimization uses box-cropped mask losses aligned with the downstream postprocessing rule that zeros mask pixels outside boxes.
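The effect of a mask term in the matching cost can be shown with a toy example. The sketch below computes only a pairwise Dice cost on the full grid and adds it to a placeholder detection cost; the focal-like mask cost is omitted, the weight is hypothetical, and an exhaustive search over assignments stands in for the Hungarian algorithm, which is only practical at toy sizes.

```python
import itertools
import numpy as np

def dice_cost(pred_probs, gt_masks, eps=1e-6):
    """Pairwise Dice cost on the full grid: (N,H,W) x (G,H,W) -> (N,G)."""
    p = pred_probs.reshape(len(pred_probs), -1)
    g = gt_masks.reshape(len(gt_masks), -1).astype(float)
    inter = p @ g.T
    denom = p.sum(axis=1)[:, None] + g.sum(axis=1)[None, :]
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def match(cost):
    """Exhaustive minimum-cost one-to-one assignment (N >= G); toy sizes only."""
    n, g = cost.shape
    best = min(itertools.permutations(range(n), g),
               key=lambda rows: sum(cost[r, c] for c, r in enumerate(rows)))
    return list(best)      # best[j] is the query index matched to ground truth j

gt = np.zeros((2, 8, 8))
gt[0, :4, :4] = 1
gt[1, 4:, 4:] = 1
pred = np.full((3, 8, 8), 0.05)
pred[0, 4:, 4:] = 0.9        # query 0 resembles ground truth 1
pred[2, :4, :4] = 0.9        # query 2 resembles ground truth 0
det_cost = np.zeros((3, 2))  # placeholder for the detector's class/box cost
w_dice = 1.0                 # hypothetical weight
total = det_cost + w_dice * dice_cost(pred, gt)
print(match(total))          # [2, 0]
```

With more than a handful of queries, a proper solver such as SciPy's `linear_sum_assignment` would replace the exhaustive search.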

4. Training configuration, dataset, and evaluation protocol

The primary evaluation dataset is TACO, described as containing approximately 1,500 images and 60 waste categories, of which one has no instances, leaving 59 effective classes. The split is 86% training and 14% validation, with partitioning by batch ID to avoid leakage. Polygon masks support segmentation training, and bounding boxes support detection training within the same codebase (Saakyan et al., 26 Feb 2026).

The reported fine-tuning setup uses the same input size for both D-FINE-seg and YOLO26 across all scales. YOLO26 is trained for 100 epochs, whereas D-FINE-seg is trained for 50 epochs and is described as converging faster. Batch size is 1. Backbone and detection components are initialized from COCO-pretrained D-FINE weights, while the mask head is initialized from scratch. Reported training features include grouped learning rates for backbone versus decoder, EMA for evaluation and checkpointing, gradient accumulation, DDP support for multi-GPU training, Mosaic and other augmentations, and mask-aware validation with F1, Precision, and Recall.

Two evaluation families are used. The deployment-oriented metrics employ fixed confidence thresholds (0.5 for D-FINE-seg and 0.25 for YOLO26) and one-to-one matching: a prediction whose IoU with a ground-truth instance meets the matching threshold and whose class is correct is a true positive; when multiple predictions match the same ground truth, only the highest-IoU prediction counts as a true positive and the rest count as false positives. The corresponding metrics are F1-score, Precision, Recall, and a penalized IoU in which true positives contribute their IoU and false positives and false negatives contribute 0. In parallel, COCO-style AP metrics report mAP@50–95 and mAP@50 with IoU thresholds from 0.5 to 0.95 and a confidence threshold as low as 0.01.
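The deployment-style metrics can be sketched for a single image as follows. The function name and input layout are hypothetical, the default `iou_thr=0.5` is an assumed value, and the penalized IoU is computed under one reading of the description: the sum of true-positive IoUs divided by the number of true positives, false positives, and false negatives.

```python
def deployment_metrics(ious, pred_cls, gt_cls, iou_thr=0.5):
    """One-to-one matching for a single image.

    ious[i][j] is the IoU between prediction i and ground truth j;
    predictions are assumed to be already filtered by confidence.
    """
    best = {}                                   # gt index -> (iou, pred index)
    for i, row in enumerate(ious):
        # Candidate ground truths: right class and enough overlap.
        cands = [(iou, j) for j, iou in enumerate(row)
                 if iou >= iou_thr and pred_cls[i] == gt_cls[j]]
        if not cands:
            continue
        iou, j = max(cands)
        if j not in best or iou > best[j][0]:
            best[j] = (iou, i)                  # keep only the highest-IoU match
    tp = len(best)
    fp = len(pred_cls) - tp
    fn = len(gt_cls) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    total = tp + fp + fn
    penalized_iou = sum(iou for iou, _ in best.values()) / total if total else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision,
            "recall": recall, "f1": f1, "iou": penalized_iou}

# Two predictions compete for ground truth 0; the third overlaps too little.
result = deployment_metrics(
    ious=[[0.8, 0.0], [0.6, 0.0], [0.0, 0.3]],
    pred_cls=[1, 1, 2],
    gt_cls=[1, 2],
)
print(result["tp"], result["fp"], result["fn"])  # 1 2 1
```

In this example precision is 1/3 and recall is 1/2, so duplicates and low-overlap predictions both lower F1, which is why these metrics are stricter than AP at a low confidence threshold.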

The benchmarking protocol is unusually explicit. Both D-FINE(-seg) and YOLO26(-seg) are fine-tuned on TACO, exported to TensorRT FP16, and measured on the same hardware: an NVIDIA RTX 5070 Ti with 16 GB, an Intel i5-12400F, CUDA 12.8, TensorRT 10.10.0.31, and Ultralytics 8.4.6. Latency includes preprocessing, forward pass, and postprocessing, while image loading is excluded. The procedure warms up on 10 samples and then times the full inference pipeline for each of 212 validation images.
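The timing procedure can be sketched as a small harness. Here `pipeline` is a hypothetical callable covering preprocessing, forward pass, and postprocessing, the images are assumed to be loaded beforehand so that loading stays outside the timed region, and reporting the mean is an assumption. On a GPU, the callable would need to block until results are available for the timings to be meaningful.

```python
import statistics
import time

def benchmark(pipeline, images, warmup=10):
    """Time the full pipeline per image after a warm-up phase."""
    for img in images[:warmup]:
        pipeline(img)                      # warm-up runs are not timed
    timings_ms = []
    for img in images:
        start = time.perf_counter()
        pipeline(img)                      # preprocessing + forward + postprocessing
        timings_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(timings_ms), timings_ms

# Dummy pipeline that only records calls, with 212 placeholder "images".
calls = []
mean_ms, timings = benchmark(calls.append, list(range(212)))
print(len(timings), len(calls))  # 212 222
```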

5. Accuracy–latency characteristics and deployment behavior

On TACO instance segmentation, D-FINE-seg is reported to outperform YOLO26-seg on most scales under both COCO-style and deployment-style measurements. For mask mAP@50–95, the values are 0.094 versus 0.041 at N, 0.177 versus 0.111 at S, 0.157 versus 0.195 at M, 0.212 versus 0.174 at L, and 0.242 versus 0.210 at X. For mask mAP@50, the corresponding values are 0.141 versus 0.058, 0.250 versus 0.165, 0.229 versus 0.270, 0.310 versus 0.242, and 0.340 versus 0.291. Averaged over scales, the paper characterizes D-FINE-seg as achieving approximately 41% higher mask mAP overall, with YOLO26-seg M as the only scale that slightly outperforms D-FINE-seg M. Under deployment-style segmentation metrics, the paper reports an average relative F1-score improvement of about 65% versus YOLO26-seg with roughly 10% latency overhead (Saakyan et al., 26 Feb 2026).
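The 41% figure can be checked against the per-scale numbers quoted above. One reading of "averaged over scales", assumed here, is the mean of the per-scale relative gains in mask mAP@50–95:

```python
d_fine_seg = {"N": 0.094, "S": 0.177, "M": 0.157, "L": 0.212, "X": 0.242}
yolo26_seg = {"N": 0.041, "S": 0.111, "M": 0.195, "L": 0.174, "X": 0.210}

# Relative gain of D-FINE-seg over YOLO26-seg at each scale.
gains = {k: d_fine_seg[k] / yolo26_seg[k] - 1.0 for k in d_fine_seg}
mean_gain = sum(gains.values()) / len(gains)

for scale, gain in gains.items():
    print(f"{scale}: {gain:+.1%}")
print(f"mean: {mean_gain:+.1%}")   # mean: +41.3%
```

Under this reading the average is dominated by the N scale, where the gain exceeds 100%, while the M scale contributes a negative term.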

At S scale, the deployment-oriented comparison is representative: D-FINE-seg S attains F1 0.263, IoU 0.125, Precision 0.339, Recall 0.215, end-to-end TensorRT FP16 latency 5.0 ms, and raw forward latency 2.2 ms. YOLO26-seg S attains F1 0.177, IoU 0.080, Precision 0.278, Recall 0.130, end-to-end latency 4.3 ms, and raw forward latency 1.4 ms. The same framework also supports detection-only D-FINE models; at S scale, D-FINE reports box mAP@50–95 of 0.202 and mAP@50 of 0.244 versus YOLO26 S at 0.098 and 0.124, and deployment-style F1 of 0.274 versus 0.170 at 3.6 ms versus 3.5 ms. The paper summarizes the detection side as roughly 49% higher box mAP and roughly 70% higher F1 with approximately 1% average overhead.

The deployment story is central rather than ancillary. ONNX export is provided for detection and segmentation variants, uses standard operators rather than custom ops, and supports dynamic and static shapes, although typical deployment uses a static input shape. TensorRT FP32 and FP16 engines are generated from ONNX, and segmentation postprocessing consists of confidence filtering, resizing masks from $H/4 \times W/4$ resolution to the original image size, thresholding masks, and zeroing pixels outside boxes. OpenVINO IR supports FP32, FP16, and INT8 with accuracy-aware quantization. On an Intel N150, D-FINE-seg S reaches F1 0.264 at 431.2 ms in FP32, F1 0.264 at 272.2 ms in FP16, and F1 0.243 at 205.0 ms in INT8, whereas YOLO26-seg S INT8 reaches F1 0.153 at 113.6 ms. For D-FINE-seg S specifically, the format conversion results are:

Format	F1	Latency
PyTorch FP32	0.263	20.4 ms
TensorRT FP32	0.264	6.5 ms
TensorRT FP16	0.263	5.0 ms

These figures support the paper’s conclusion that TensorRT FP16 yields approximately $4\times$ speedup relative to PyTorch FP32 with essentially identical accuracy. The implementation is released at https://github.com/ArgoHA/D-FINE-seg under Apache-2.0.
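The segmentation postprocessing steps described in this section (confidence filtering, mask resizing, thresholding, and zeroing pixels outside boxes) can be sketched in NumPy. This is an illustration under assumptions, not the released code: the function signature, the 0.5 mask threshold, the box format, and nearest-neighbour resizing are all choices made here, since the interpolation mode is not stated above.

```python
import numpy as np

def postprocess(scores, boxes, mask_logits, image_hw, conf_thr=0.5, mask_thr=0.5):
    """Filter by confidence, upsample masks, binarize, zero pixels outside boxes.

    scores: (N,), boxes: (N, 4) as x0, y0, x1, y1 in image pixels,
    mask_logits: (N, h, w) on the low-resolution mask grid.
    """
    keep = scores >= conf_thr
    scores, boxes, mask_logits = scores[keep], boxes[keep], mask_logits[keep]
    H, W = image_hw
    # Nearest-neighbour index maps from image pixels to mask-grid cells
    # (a stand-in; the interpolation mode is an assumption).
    rows = np.arange(H) * mask_logits.shape[1] // H
    cols = np.arange(W) * mask_logits.shape[2] // W
    probs = 1.0 / (1.0 + np.exp(-mask_logits[:, rows][:, :, cols]))
    masks = probs > mask_thr
    for mask, (x0, y0, x1, y1) in zip(masks, boxes.astype(int)):
        inside = np.zeros((H, W), dtype=bool)
        inside[y0:y1, x0:x1] = True
        mask &= inside                      # in-place: clears pixels outside the box
    return scores, boxes, masks

scores = np.array([0.9, 0.2])
boxes = np.array([[4, 4, 20, 20], [0, 0, 8, 8]], dtype=float)
mask_logits = np.full((2, 8, 8), 5.0)       # both masks "on" everywhere
kept_scores, kept_boxes, masks = postprocess(scores, boxes, mask_logits, (32, 32))
print(masks.shape, int(masks.sum()))        # (1, 32, 32) 256
```

Only the first detection survives the confidence filter, and its mask is reduced to the 16 by 16 pixel box region even though the raw mask covered the whole image.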

6. Nomenclature, scope, and limitations

The name D-FINE-seg can be confused with unrelated uses of “dynamic,” “focus-aware,” or “fine-grained” segmentation terminology. In particular, "Dynamic Focus-aware Positional Queries for Semantic Segmentation" introduces DFPQ, HRCA, and the full system FASeg; the acronym “D-FINE-seg” does not appear there, and that work addresses semantic segmentation by modifying Mask2Former rather than extending the D-FINE detector family (He et al., 2022). Likewise, FH-Seg is a full-scale hierarchical learning framework for fine-grained renal vasculature segmentation, and DFNet is a decision fusion network with perception fine-tuning for defect classification; neither uses the D-FINE-seg architecture or name (Long et al., 7 Feb 2025, Jiang et al., 2023). A recurrent misconception is therefore to treat D-FINE-seg as a generic label for any dynamic or fine-grained segmentation method; in the literature summarized here, it specifically denotes the D-FINE-based real-time detection and instance segmentation framework.

The paper also delineates several limitations. Results are reported mainly on TACO rather than on broader benchmarks such as COCO or Cityscapes. The mask head is not yet COCO-pretrained and is initialized from scratch, which the authors identify as a likely limiter of mask boundary quality and overall mask AP. Masks are predicted at $1/4$ resolution with no multi-scale refinement, so extremely small or thin objects may benefit from higher-resolution or more complex heads. The comparison with YOLO26 uses different default confidence thresholds (0.5 for D-FINE-seg and 0.25 for YOLO26), which the authors present as realistic but which complicates strict threshold-controlled cross-model comparisons. Finally, on the Intel N150 edge platform, D-FINE-seg INT8 remains slower than YOLO26-seg INT8, even though it is more accurate.

Within those bounds, D-FINE-seg occupies a specific niche: an export-oriented, transformer-based real-time instance segmentation framework that attempts to preserve D-FINE’s detector characteristics while adding a mask branch and mask-aware training with minimal architectural complexity.