
BiRefNet for High-Res Dichotomous Segmentation

Updated 4 July 2026
  • BiRefNet is a framework for high-resolution dichotomous image segmentation that decomposes the task into global semantic localization and fine-detail reconstruction.
  • It employs a dual guidance system—InRef using original image patches and OutRef using gradient maps—to preserve thin structures and precise boundaries.
  • Extensive benchmarks and ablation studies demonstrate its superior performance across tasks, including specialized adaptations like anime background removal in ToonOut.

BiRefNet, introduced in "Bilateral Reference for High-Resolution Dichotomous Image Segmentation", is a framework for high-resolution dichotomous image segmentation (DIS): the separation of a foreground object from background in large, detail-rich images where naive downsampling destroys the thin structures, curved boundaries, and small disconnected parts that determine mask quality (Zheng et al., 2024). The framework decomposes the problem into global semantic localization and high-resolution reconstruction, implemented as a Localization Module (LM) and a Reconstruction Module (RM) coupled by a bilateral reference mechanism that uses both original-image content and gradient-based guidance during decoding (Zheng et al., 2024). In subsequent work, BiRefNet also served as the base DIS model for ToonOut, a domain-specific fine-tuned system for anime background removal, where its dual supervision and reconstruction design were used to address hair strands, transparent elements, sharp contours, shadows, and ambiguous character-background boundaries (Muratori et al., 8 Sep 2025).

1. Problem setting and design rationale

BiRefNet was introduced for high-resolution dichotomous image segmentation, a regime in which foreground objects are often thin, complex, and fine-grained, and the input images are large enough that aggressive resizing suppresses precisely the details that matter most for accurate masks (Zheng et al., 2024). The core difficulty is not only detecting where the target object is, but also recovering the boundary geometry and small structural fragments that disappear under coarse feature extraction.

The framework addresses this by a two-stage architectural decomposition. The first stage performs semantic localization using global context, and the second stage reconstructs a high-resolution segmentation map while being continuously guided toward fine-detail regions (Zheng et al., 2024). This division reflects the paper’s claim that coarse localization alone is insufficient for HR segmentation, whereas reconstruction without robust semantic grounding risks amplifying local noise.

The same design rationale explains why BiRefNet was later selected as the basis for anime background removal in ToonOut. That work describes BiRefNet as a popular, high-performing bilateral reference network for DIS and emphasizes that its design is well matched to anime imagery, where segmentation must simultaneously preserve global object shape and tiny details such as hair, fingers, transparent edges, and clothing folds (Muratori et al., 8 Sep 2025). This suggests that BiRefNet’s original HR-DIS formulation was already structurally compatible with stylized foreground extraction, even though it was not introduced specifically for anime imagery.

2. Architectural decomposition: Localization Module and Reconstruction Module

BiRefNet is organized around a Localization Module (LM) and a Reconstruction Module (RM) (Zheng et al., 2024). Given a batch of HR images $\mathcal{I}\in \mathbb{R}^{N\times 3\times H\times W}$, a Swin Transformer encoder produces hierarchical features

$\mathcal{F}_1^e,\mathcal{F}_2^e,\mathcal{F}_3^e,\mathcal{F}^e$

at resolutions $\{H/k, W/k\}$ for $k=4,8,16,32$ (Zheng et al., 2024). The first three stage features are transferred to the decoder through lateral $1\times1$ convolutions, while the deepest feature is stacked and concatenated into $\mathcal{F}^e$.
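As a quick check on the feature hierarchy, the sketch below computes the spatial size of each encoder stage for strides 4, 8, 16, and 32. The function name and the 1024×1024 example input are illustrative assumptions, not values taken from the paper.

```python
def encoder_feature_sizes(height, width, strides=(4, 8, 16, 32)):
    """Return the (H/k, W/k) spatial size of each encoder stage."""
    sizes = []
    for k in strides:
        # Require exact divisibility so every stage has an integer size.
        if height % k or width % k:
            raise ValueError(f"{height}x{width} is not divisible by stride {k}")
        sizes.append((height // k, width // k))
    return sizes


# Illustrative input size (an assumption for this example).
sizes = encoder_feature_sizes(1024, 1024)
print(sizes)  # [(256, 256), (128, 128), (64, 64), (32, 32)]
```

The deepest stage is 32 times smaller per side than the input, which is why thin structures have to be recovered later in the decoder rather than read off the encoder output.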

Within the LM, the deepest feature is used in two ways. A classification head with global average pooling and a fully connected layer improves semantic representation for localization, and an ASPP module squeezes the feature into the decoder seed $\mathcal{F}^d$ (Zheng et al., 2024). The LM therefore provides rough target positions, category-aware semantics, and coarse structural guidance, but it does not itself recover the fine boundaries and small fragments characteristic of HR foreground extraction.

The RM then performs progressive refinement from coarse to full resolution. Rather than using a plain residual decoder, BiRefNet employs BiRef blocks, each built around a reconstruction block (RB) (Zheng et al., 2024). The RB combines deformable convolutions with receptive fields of $1\times1$, $3\times3$, and $7\times7$ and adaptive average pooling to extract multi-scale features, which are concatenated and then transformed into the block's output feature (Zheng et al., 2024). At each stage, decoder features are fused with the corresponding lateral features, and intermediate predictions are supervised before the final map is produced by a final convolution (Zheng et al., 2024).

The significance of this decoder design lies in its refusal to reconstruct all missing detail in a single step. Instead, structure is refined stage by stage at increasing resolution, which is the operational basis for the paper’s claim that BiRefNet preserves thin structures and curved boundaries better than prior approaches (Zheng et al., 2024).
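The stage-by-stage refinement can be illustrated with a toy decoder loop. This is a minimal sketch under stated assumptions: features are plain 2D grids, upsampling is nearest-neighbour, and fusion is element-wise addition. All three choices and all function names are hypothetical simplifications, and the reconstruction block itself is omitted.

```python
def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a 2D grid (list of rows)."""
    out = []
    for row in feat:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(decoder_feat, lateral_feat):
    """Element-wise addition (assumed fusion operator for this sketch)."""
    return [
        [d + l for d, l in zip(dec_row, lat_row)]
        for dec_row, lat_row in zip(decoder_feat, lateral_feat)
    ]


def decode(seed, laterals):
    """Refine `seed` through the lateral features, coarsest first.

    Returns the final feature and one intermediate output per stage,
    which is where multi-stage supervision would attach.
    """
    feat, intermediates = seed, []
    for lateral in laterals:
        feat = fuse(upsample2x(feat), lateral)
        intermediates.append(feat)
    return feat, intermediates


seed = [[1.0]]                                   # coarsest decoder seed
laterals = [
    [[0.0, 1.0], [1.0, 0.0]],                    # 2x2 lateral feature
    [[0.5] * 4 for _ in range(4)],               # 4x4 lateral feature
]
final, stages = decode(seed, laterals)
print(len(stages), len(final), len(final[0]))    # 2 4 4
```

Each pass doubles the resolution and injects encoder detail at that scale, so no single step has to reconstruct all missing structure at once.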

3. Bilateral reference: inward and outward guidance

The defining mechanism of BiRefNet is bilateral reference, which supplies two complementary forms of guidance during reconstruction: Inward reference (InRef) and Outward reference (OutRef) (Zheng et al., 2024). The bilateral formulation is intended to preserve raw image evidence while also directing the decoder toward the parts of the image most likely to encode fine structure.

InRef uses the original high-resolution image as a source reference. Instead of resizing the full image to match decoder stages, BiRefNet adaptively crops the original-resolution image into patches that match the spatial size of the corresponding decoder features, then stacks these patches with the decoder features before reconstruction (Zheng et al., 2024). This preserves HR detail that would otherwise be blurred by compression and is especially useful for thin boundaries and small structures.
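The cropping step can be sketched as follows, assuming a single-channel image stored as a list of rows and non-overlapping patches taken in row-major order. The function names and the exact patch ordering are assumptions made for illustration, not the paper's implementation.

```python
def crop_into_patches(image, patch_h, patch_w):
    """Split a 2D image into non-overlapping patch_h x patch_w patches."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patches.append(
                [row[left:left + patch_w] for row in image[top:top + patch_h]]
            )
    return patches


def stack_with_feature(patches, feature_channels):
    """Stack image patches and decoder channels along the channel axis."""
    return list(patches) + list(feature_channels)


image = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4x4 "HR image"
feature = [[[0.0, 0.0], [0.0, 0.0]]]                        # one 2x2 decoder channel
patches = crop_into_patches(image, 2, 2)
stacked = stack_with_feature(patches, feature)
print(len(patches), len(stacked))  # 4 5
```

Every original pixel survives in some patch, so the decoder sees unresized image content at the feature's own spatial size; the cost is a larger channel count at each stage.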

OutRef uses gradient maps as a target reference. The paper argues that many of the details that matter in DIS are strongly reflected in image gradients, because edges, contours, and subtle structures are high-gradient regions (Zheng et al., 2024). From the decoder features, the model predicts a gradient map, transforms it through a convolution and sigmoid into a gradient referring attention map, and uses that attention to modulate the decoder feature for the next stage (Zheng et al., 2024). In effect, the model is taught to look where detailed structure is concentrated.

To prevent strong background gradients from becoming distractors, BiRefNet applies masked gradient supervision. An intermediate prediction is processed by morphological operations and dilation to obtain a mask, and the ground-truth-like gradient map is multiplied by that mask to form the masked gradient target (Zheng et al., 2024). This auxiliary supervision sharpens the model's sensitivity to target-aware fine details rather than generic edge energy.
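A minimal sketch of the masking idea, under assumptions: the gradient is a forward-difference magnitude, the mask comes from thresholding the prediction at 0.5 and applying one 3×3 dilation, and all names are hypothetical. The paper's actual gradient operator and morphological settings may differ.

```python
def gradient_magnitude(img):
    """Forward-difference gradient |dx| + |dy| (zero at the far borders)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = img[y][x + 1] - img[y][x] if x + 1 < w else 0.0
            dy = img[y + 1][x] - img[y][x] if y + 1 < h else 0.0
            out[y][x] = abs(dx) + abs(dy)
    return out


def dilate(mask):
    """Binary dilation with a 3x3 square structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(
                mask[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ))
    return out


def masked_gradient(img, prediction, threshold=0.5):
    """Keep image gradients only near the predicted foreground."""
    mask = dilate([[int(p > threshold) for p in row] for row in prediction])
    grad = gradient_magnitude(img)
    return [[g * m for g, m in zip(g_row, m_row)]
            for g_row, m_row in zip(grad, mask)]


img = [[0, 1, 1, 0, 0, 0, 1]]               # object edges at x=0 and x=2, clutter edge at x=5
pred = [[0.0, 0.9, 0.9, 0.0, 0.0, 0.0, 0.0]]
print(masked_gradient(img, pred))           # [[1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]]
```

The edge at x=5 lies outside the dilated foreground mask and is zeroed, while both object edges are kept. The dilation matters: without it, the edge at x=0 would sit just outside the thresholded prediction and be lost.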

The ToonOut fine-tuning study later highlights closely related properties when explaining the choice of BiRefNet for anime segmentation. It identifies auxiliary gradient supervision as important for preserving fine details and ground-truth supervision as useful where foreground and background are similar in color or texture; it also emphasizes the value of BiRefNet’s Localization Module and Reconstruction Module for jointly handling global object shape and tiny details in anime art (Muratori et al., 8 Sep 2025). This is not a new architectural claim, but a domain-specific validation of the original bilateral-reference design.

4. Optimization objective and practical training strategies

BiRefNet uses a hybrid loss motivated by the claim that pure pixelwise BCE degrades detail quality in HR segmentation (Zheng et al., 2024). The final objective combines four terms: binary cross-entropy (BCE), intersection over union (IoU), structural similarity (SSIM), and cross-entropy (CE).

BCE provides pixel-level supervision, IoU supplies region-level supervision, SSIM penalizes structural mismatch and sharpens boundary quality, and CE supports semantic supervision through the classification head (Zheng et al., 2024). The gradient branch is supervised additionally with its own loss terms.
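To make the pixel-level versus region-level distinction concrete, the sketch below combines a BCE term with a soft IoU term on flattened masks. It is a simplified illustration: the SSIM and CE terms are omitted, and the weights `w_bce` and `w_iou` are placeholder assumptions, not the paper's coefficients.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flattened probabilities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)          # clamp to keep log() finite
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def iou_loss(pred, target, eps=1e-7):
    """Soft IoU loss: 1 - intersection / union on probabilities."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - (inter + eps) / (union + eps)


def hybrid_loss(pred, target, w_bce=1.0, w_iou=1.0):
    """Weighted sum of the pixel-level and region-level terms."""
    return w_bce * bce_loss(pred, target) + w_iou * iou_loss(pred, target)


target = [1.0, 0.0, 1.0, 0.0]
confident = [0.9, 0.1, 0.9, 0.1]
uncertain = [0.5, 0.5, 0.5, 0.5]
print(hybrid_loss(confident, target) < hybrid_loss(uncertain, target))  # True
```

BCE averages independent per-pixel errors, whereas the IoU term scores the mask as a whole region, which is why the two are treated as complementary.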

The paper also outlines several practical training strategies tailored for DIS. It states that long training helps fine details, because coarse localization converges relatively quickly whereas detailed structure continues improving much longer (Zheng et al., 2024). Multi-stage supervision (MSS) accelerates learning by supervising intermediate decoder outputs and is reported to reduce training time substantially in the ablation. Region-level loss fine-tuning (RLFT) improves binarization and practical metrics in the last training epochs, and context feature fusion (CFF) plus image pyramid input (IPT) further improve performance on HR data (Zheng et al., 2024).

The reported training configuration is: images resized to a fixed training resolution, horizontal flip augmentation only, Adam optimizer, DIS/HRSOD/COD trained for 800/120/120 epochs, the last 20 epochs fine-tuned with IoU loss, batch size 2 per GPU, and implementation in PyTorch on a single NVIDIA A100 40GB GPU (Zheng et al., 2024). These details are important because the paper presents BiRefNet not only as an architectural contribution but also as a practical training recipe for HR segmentation.
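The late switch to region-level fine-tuning can be expressed as a small epoch schedule. This sketch assumes 1-indexed epochs and that the final 20 epochs use the IoU term alone; the function name and the string labels are hypothetical.

```python
def active_losses(epoch, total_epochs, finetune_epochs=20):
    """Return the loss terms used at a 1-indexed epoch.

    Assumption: the last `finetune_epochs` epochs train with IoU only,
    and all earlier epochs use the full hybrid loss.
    """
    if not 1 <= epoch <= total_epochs:
        raise ValueError("epoch out of range")
    if epoch > total_epochs - finetune_epochs:
        return ("iou",)
    return ("bce", "iou", "ssim", "ce")


# With the 800-epoch DIS schedule, epochs 781-800 are the fine-tuning phase.
print(active_losses(780, 800))  # ('bce', 'iou', 'ssim', 'ce')
print(active_losses(781, 800))  # ('iou',)
```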

5. Benchmarks, metrics, and ablation evidence

BiRefNet is evaluated on four tasks: DIS, HRSOD, COD, and SOD (Zheng et al., 2024). The datasets include DIS5K-TR for DIS training with tests on DIS-TE1, TE2, TE3, TE4, and DIS-VD; combinations of HRSOD, UHRSD, and DUTS for HRSOD; CAMO-TR and COD10K-TR for COD; and supplementary SOD evaluation on DUTS-TE and DUT-OMRON (Zheng et al., 2024). The reported metrics include S-measure, max/mean/weighted F-measure, max/mean E-measure, MAE, and relaxed human correction efforts (HCE) (Zheng et al., 2024).
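Two of the simpler mask metrics can be sketched directly. The example below computes MAE and a single-threshold F-measure on flattened masks; the 0.5 threshold and the weighting `beta_sq=0.3` follow a common convention in salient object detection and are assumptions here, not settings confirmed by the source.

```python
def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth values."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """F-measure of the binarized prediction against a binary ground truth."""
    binary = [1 if p >= threshold else 0 for p in pred]
    tp = sum(1 for b, g in zip(binary, gt) if b == 1 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / sum(binary)
    recall = tp / sum(1 for g in gt if g == 1)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


pred = [0.9, 0.8, 0.2, 0.6]
gt = [1, 1, 0, 0]
print(round(mae(pred, gt), 3))        # 0.275
print(round(f_measure(pred, gt), 4))  # 0.7222
```

The max and mean variants reported in such benchmarks sweep the threshold instead of fixing it, and the weighted variant uses a different formulation that this sketch does not cover.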

The paper reports state-of-the-art results across all benchmarks, including on the combined DIS-TE(1–4) set (Zheng et al., 2024).

For HRSOD, the paper claims an average 2.0% improvement over prior SOTA, and for COD it reports an average 6.7% improvement over prior work (Zheng et al., 2024). The cross-task evaluation is used to argue that BiRefNet is not merely a DIS-specific construction but a generally useful framework for high-resolution, class-agnostic segmentation.

The ablation studies isolate the role of the main components. The RM alone improves over the baseline; InRef alone helps by injecting lossless HR source content; OutRef alone helps by steering attention toward gradient-rich details; and RM + InRef + OutRef gives the best results, indicating complementarity between the two references (Zheng et al., 2024). The paper further reports that CFF + IPT + RLFT yield additional gains and that MSS is particularly effective in reducing training time while preserving almost the same performance as long training (Zheng et al., 2024). This body of evidence is central to the framework’s identity: the model is not only a transformer encoder with a decoder, but specifically a reconstruction system whose gains derive from bilateral referencing and HR-specific training design.

6. Domain-specific adaptation: BiRefNet as the base of ToonOut

In ToonOut: Fine-tuned Background-Removal for Anime Characters, BiRefNet is not introduced as a new architecture but used as the base dichotomous image segmentation model for anime background removal (Muratori et al., 8 Sep 2025). The authors collected and annotated a custom dataset of 1,228 high-quality anime images split into 979 training, 123 validation, and 126 test images, corresponding to an 80/10/10 split (Muratori et al., 8 Sep 2025). The six sub-domains are Reference, Emotion, Pose, Factory, Action, and Items, intended to cover neutral portraits, expressive close-ups, dynamic full-body poses, idle full-body characters, character-object interactions, and standalone objects (Muratori et al., 8 Sep 2025).
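The split sizes can be checked against the stated 80/10/10 proportions with a few lines of arithmetic. Only the three counts come from the text; the variable names are illustrative.

```python
splits = {"train": 979, "val": 123, "test": 126}

total = sum(splits.values())
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)   # 1228
print(shares)  # {'train': 79.7, 'val': 10.0, 'test': 10.3}
```

The counts sum to the reported 1,228 images, and the shares match 80/10/10 to within rounding.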

The dataset was generated using Yamer’s Anime, an anime-specialized checkpoint of Stable Diffusion XL, then manually filtered to remove anatomical inconsistencies, unclear foreground-background boundaries, and artifacts that would make segmentation masks visually poor (Muratori et al., 8 Sep 2025). Each sample contains an original RGB image and a pixel-level ground-truth mask encoded as a grayscale alpha-style mask in which black is background, white is foreground, and intermediate gray denotes partially transparent pixels (Muratori et al., 8 Sep 2025). Training examples were selected with a difficulty-aware strategy: the baseline BiRefNet was first run on candidate images, and images where BiRefNet performed poorly were prioritized, while up to 20% of well-segmented images were retained to preserve balance (Muratori et al., 8 Sep 2025).
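The difficulty-aware selection can be sketched as a filter over baseline quality scores. This is one possible reading, stated as an assumption: every image the baseline handles poorly is kept, and a random 20% of the well-segmented images is retained. The score scale, the `good_threshold` value, and the function name are all hypothetical.

```python
import random


def select_training_images(scores, good_threshold=0.9, easy_fraction=0.2, seed=0):
    """Pick training images from baseline scores (higher is better).

    Keeps every image scoring below `good_threshold` and a random
    `easy_fraction` of the remaining well-segmented images.
    """
    hard = sorted(i for i, s in scores.items() if s < good_threshold)
    easy = sorted(i for i, s in scores.items() if s >= good_threshold)
    keep_easy = int(len(easy) * easy_fraction)
    rng = random.Random(seed)                    # fixed seed for repeatability
    return hard + sorted(rng.sample(easy, keep_easy))


# Hypothetical scores: 3 poorly segmented images and 10 well-segmented ones.
scores = {f"hard_{i}": 0.5 for i in range(3)}
scores.update({f"easy_{i}": 0.95 for i in range(10)})
selected = select_training_images(scores)
print(len(selected))  # 5
```

Keeping a slice of easy images is a guard against the fine-tuned model losing accuracy on cases the baseline already handled well.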

Fine-tuning starts from a BiRefNet checkpoint saved after 244 training epochs and proceeds for 46 epochs on 2 GeForce RTX 4090 GPUs with a batch size of 2, an initial learning rate of 1e-5 that is halved after 20 and 40 epochs, and gradient clipping at 100 (Muratori et al., 8 Sep 2025). The loss is a weighted combination of three terms, plus a binary cross-entropy loss that supervises the gradients (Muratori et al., 8 Sep 2025). The paper explicitly states that the improvement comes from domain-specific fine-tuning plus the custom data and loss setup, rather than from major architectural changes to BiRefNet itself.
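The step schedule for the fine-tuning learning rate can be written out explicitly. The sketch assumes 1-indexed epochs and that "after 20 and 40 epochs" means the halving takes effect from epochs 21 and 41; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-5, milestones=(20, 40), factor=0.5):
    """Learning rate at a 1-indexed epoch, halved after each milestone."""
    passed = sum(1 for m in milestones if epoch > m)
    return base_lr * factor ** passed


for epoch in (1, 20, 21, 40, 41, 46):
    print(epoch, learning_rate(epoch))
```

Under this reading, the 46-epoch run spends 20 epochs at 1e-5, 20 at 5e-6, and the final 6 at 2.5e-6.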

Evaluation on the 126-image test set compares ToonOut against Photoroom, Briaai2.0, and the original BiRefNet (Muratori et al., 8 Sep 2025). The main headline metrics are Pixel Accuracy (PA), Boundary IoU (BIoU), and Weighted F-measure (WF). PA is introduced as a perceptual metric for anime alpha masks: a pixel is correct when the absolute difference between predicted alpha and ground-truth alpha falls within a fixed tolerance; the error mask is eroded once to ignore tiny 1-pixel boundary artifacts; and "foreground pixels" are defined by an alpha threshold (Muratori et al., 8 Sep 2025). On the test set, BiRefNet reports 95.3% PA, 88.5% Mean Boundary IoU, and 97.8% Weighted F-measure, whereas ToonOut reports 99.5% PA, 95.6% Mean Boundary IoU, and 99.4% Weighted F-measure (Muratori et al., 8 Sep 2025). The largest category-level gain appears in Action, where PA rises from 76.8% to 99.0%, BIoU from 69.4% to 93.1%, and WF from 91.2% to 99.3% (Muratori et al., 8 Sep 2025).
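The effect of eroding the error mask can be seen in a small sketch of a tolerance-based pixel accuracy. Several details are assumptions made for illustration: the tolerance value, the 3×3 erosion with image borders treated as background, and scoring over all pixels rather than a foreground subset. It should not be read as the ToonOut implementation.

```python
def erode(mask):
    """Binary erosion with a 3x3 square; outside the image counts as 0."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(
                mask[y + dy][x + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            ))
    return out


def pixel_accuracy(pred, gt, tolerance=0.1):
    """Share of pixels not flagged as errors after one erosion."""
    error = [[int(abs(p - g) > tolerance) for p, g in zip(p_row, g_row)]
             for p_row, g_row in zip(pred, gt)]
    wrong = sum(map(sum, erode(error)))
    return 1.0 - wrong / (len(gt) * len(gt[0]))


gt = [[0.0] * 5 for _ in range(5)]
pred = [[0.0] * 5 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 4):
        pred[y][x] = 1.0          # a solid 3x3 error region
pred[0][4] = 1.0                  # an isolated 1-pixel error
print(round(pixel_accuracy(pred, gt), 2))  # 0.96
```

Only the centre of the solid error region survives erosion, and the isolated pixel is ignored, so the metric penalizes visible error areas rather than single-pixel boundary noise.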

The ToonOut study identifies recurring failure modes in anime segmentation—hair strands and hair volume, transparent or semi-transparent edges, clothing folds and limb-body junctions, shadows, and complex interactions with items and props—and argues that BiRefNet’s localization-plus-reconstruction design is particularly suited to those cases (Muratori et al., 8 Sep 2025). It also open-sources the code, fine-tuned model weights, and dataset at https://github.com/MatteoKartoon/BiRefNet (Muratori et al., 8 Sep 2025). A plausible implication is that BiRefNet’s most durable contribution is not only its original cross-task HR segmentation performance, but also its utility as a foundation model for specialized foreground extraction domains where fine structures dominate error perception.
