MitUNet: Hybrid Segmentation for Floor Plans

Updated 9 December 2025
  • MitUNet is a hybrid architecture that combines a Mix-Transformer encoder with a U-Net decoder to deliver high-precision semantic segmentation of wall structures.
  • It employs scSE attention modules and an asymmetric Tversky loss to enhance boundary recovery and counter class imbalance, preserving sensitivity to thin walls.
  • The design produces high-fidelity wall masks suitable for direct vectorization in automated 3D reconstruction workflows, reducing post-processing efforts.

MitUNet is a hybrid neural network architecture designed for high-precision semantic segmentation of wall structures in 2D floor plans, specifically targeting the semantic fidelity required by automated 3D reconstruction pipelines. The model synergistically combines hierarchical global context modeling through a Mix-Transformer encoder and precise boundary recovery via a U-Net decoder enhanced with spatial and channel-wise squeeze-and-excitation (scSE) attention. An asymmetric Tversky loss function is employed during optimization to strictly control boundary noise and maintain sensitivity to thin wall segments, yielding masks suitable for direct vectorization and further geometric processing (Parashchuk et al., 2 Dec 2025).

1. Architectural Overview

MitUNet integrates two core components: a hierarchical Mix-Transformer encoder and a U-Net-style decoder augmented with scSE attention modules. The encoder backbone is SegFormer’s MiT-b4, pretrained on ImageNet, which employs overlapping patch merging and multi-scale feature extraction across four resolution stages. Each stage stacks Mix-Transformer blocks (LayerNorm, efficient self-attention, an MLP, a depth-wise convolution, and residual connections) to perform both global token mixing and local spatial integration.

Decoder reconstruction operates in reverse, adopting the U-Net paradigm for progressive upsampling with skip connections at corresponding scales (1/32 → 1/4 of the input resolution). Each upsampling step fuses high-level semantic cues from the encoder with finer local features, followed by scSE attention refinement to maximize spatial boundary fidelity and preserve semantic context.
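
As a reference point, a comparable encoder/decoder pairing can be assembled with the segmentation_models_pytorch library, which provides MiT encoders and an scSE decoder-attention option. Whether MitUNet was built on this library or implemented from scratch is not stated in the source, so the sketch below is only an illustrative approximation.

import segmentation_models_pytorch as smp

# Approximation of the MitUNet layout: MiT-b4 encoder, U-Net decoder, scSE attention.
model = smp.Unet(
    encoder_name="mit_b4",           # hierarchical Mix-Transformer encoder
    encoder_weights="imagenet",      # ImageNet-pretrained backbone
    decoder_attention_type="scse",   # scSE refinement in each decoder block
    in_channels=3,
    classes=1,                       # binary wall mask
)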

2. Mix-Transformer Encoder Internals

The encoder accepts a resized RGB input of $512 \times 512$ pixels. Patch embedding is achieved through a $7 \times 7$ convolution with stride 4, yielding an initial $128 \times 128$ feature map with $C_1 = 64$ channels. Subsequent downsampling uses $3 \times 3$ convolutions with stride 2, producing hierarchical feature resolutions:

  • Stage 1: $128 \times 128$, $C_1 = 64$, 3 blocks
  • Stage 2: $64 \times 64$, $C_2 = 128$, 8 blocks
  • Stage 3: $32 \times 32$, $C_3 = 320$, 27 blocks
  • Stage 4: $16 \times 16$, $C_4 = 512$, 3 blocks
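
For reference, the same stage configuration can be written as a small Python dict; the key names are illustrative assumptions, while the depths and channel widths are those listed above.

# MiT-b4 stage configuration as described above (key names are assumptions).
mit_b4_stages = {
    "patch_kernels": [7, 3, 3, 3],         # overlapping patch-merging kernel per stage
    "strides":       [4, 2, 2, 2],         # 512 -> 128 -> 64 -> 32 -> 16
    "channels":      [64, 128, 320, 512],  # C_1 .. C_4
    "depths":        [3, 8, 27, 3],        # Mix-Transformer blocks per stage
}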

Within each block, LayerNorm precedes efficient self-attention (enabling long-range contextual mixing), an MLP transformation, a local depth-wise convolution, and residual addition. This design allows tokens at any position to attend globally throughout the hierarchy, encoding room topology and reliably distinguishing walls from visually similar artifacts.
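
A minimal PyTorch sketch of one such block is given below, following the published SegFormer design (spatial-reduction attention plus a Mix-FFN with a depth-wise convolution). The class names, head count, and reduction ratio are illustrative assumptions, not values from the paper.

import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Self-attention whose keys/values are spatially reduced to cut cost."""
    def __init__(self, dim, num_heads=1, sr_ratio=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio) if sr_ratio > 1 else nn.Identity()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)
        # Reduce the spatial resolution of keys/values before attention.
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class MixFFN(nn.Module):
    """MLP with a 3x3 depth-wise convolution for local spatial mixing."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):
        x = self.fc1(x)
        B, N, C = x.shape
        x = self.dwconv(x.transpose(1, 2).reshape(B, C, H, W)).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))

class MiTBlock(nn.Module):
    def __init__(self, dim, num_heads=1, sr_ratio=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = EfficientSelfAttention(dim, num_heads, sr_ratio)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = MixFFN(dim)

    def forward(self, x, H, W):
        x = x + self.attn(self.norm1(x), H, W)   # global token mixing
        x = x + self.ffn(self.norm2(x), H, W)    # local integration via depth-wise conv
        return x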

3. U-Net Decoder with scSE Attention

Departing from SegFormer’s lightweight MLP decoder, MitUNet utilizes a U-Net arrangement to reclaim spatial granularity lost during encoding. At each decoder stage:

  1. Upsample the feature map spatially by $2\times$.
  2. Concatenate with the matching encoder-stage feature.
  3. Apply a $3 \times 3$ convolution for fusion.
  4. Refine through an scSE block.

The scSE module comprises:

  • Channel Squeeze & Excitation (cSE): $\mathbf{s}_c = \sigma(\mathrm{FC}(\mathrm{GlobalAvgPool}(\mathbf{X}))) \in \mathbb{R}^{C \times 1 \times 1}$, applied as channel-wise modulation.
  • Spatial Squeeze & Excitation (sSE): $\mathbf{s}_s = \sigma(\mathrm{Conv}_{1 \times 1}(\mathbf{X})) \in \mathbb{R}^{1 \times H \times W}$, applied as spatial enhancement.
  • Final output: $\mathrm{scSE}(\mathbf{X}) = \mathbf{X}' + \mathbf{X}''$ with $\mathbf{X}' = \mathbf{X} \otimes \mathbf{s}_c$ and $\mathbf{X}'' = \mathbf{X} \otimes \mathbf{s}_s$.

This attention paradigm hierarchically weights feature maps carrying high-level semantics and pinpoints pixel locations likely to compose wall boundaries, suppressing spurious activations and enhancing mask regularity.
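
A minimal PyTorch sketch of this block, matching the cSE/sSE formulas above, is shown here and reused as scSE_block in the decoder pseudocode below; the channel-reduction ratio r is an assumption, not a value reported in the paper.

import torch
import torch.nn as nn

class SCSEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        # Channel squeeze & excitation: global pooling -> bottleneck -> per-channel sigmoid gate.
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial squeeze & excitation: 1x1 conv -> per-pixel sigmoid gate.
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        # scSE(X) = X * s_c + X * s_s
        return x * self.cse(x) + x * self.sse(x)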

Decoder + scSE Integration Pseudocode:

for i in (4, 3, 2, 1):               # iterate decoder stages from deepest to shallowest
    x = Upsample(x)                  # ×2 in H, W
    skip = E[i-1]                    # encoder feature at the same resolution
    x = Conv3x3(Concat(x, skip))     # fuse semantic and local features
    x = scSE_block(x)                # channel + spatial excitation

4. Tversky Loss for Class Imbalance and Boundary Control

Wall segmentation in floor plans is challenged by severe class imbalance and the need to minimize boundary artifacts. MitUNet’s optimization utilizes an asymmetric Tversky loss [Salehi et al., 2017]:

$$L_T(\alpha, \beta) = 1 - \frac{TP}{TP + \alpha\,FP + \beta\,FN}$$

where $TP$, $FP$, and $FN$ are pixel-wise tallies. The weight $\alpha$ penalizes false positives (over-segmentation) and $\beta$ penalizes false negatives (missed thin walls), so setting $\alpha > \beta$ makes the model more conservative. The best experimental results use $\alpha = 0.6$ and $\beta = 0.4$, maximizing precision on boundaries while maintaining adequate recall.
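
A minimal sketch of this loss for binary wall masks is given below, using soft (probability-weighted) counts of TP, FP, and FN; the smoothing term eps and the batch-mean reduction are assumptions.

import torch

def tversky_loss(logits, targets, alpha=0.6, beta=0.4, eps=1e-6):
    """logits, targets: tensors of shape (B, 1, H, W); targets contain 0/1 wall labels."""
    probs = torch.sigmoid(logits)
    tp = (probs * targets).sum(dim=(1, 2, 3))          # soft true positives
    fp = (probs * (1 - targets)).sum(dim=(1, 2, 3))    # soft false positives
    fn = ((1 - probs) * targets).sum(dim=(1, 2, 3))    # soft false negatives
    tversky = tp / (tp + alpha * fp + beta * fn + eps)
    return (1 - tversky).mean()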

5. Training Protocol and Implementation

Training utilizes the CubiCasa5k dataset (5,000 plans) for pre-training and a proprietary regional set (500 Russian/CIS-style plans) for domain-specific fine-tuning. Annotation masks are refined by:

  1. Extracting binary masks for doors and windows.
  2. Dilating the openings by $\sim 30$ px and subtracting them from the wall mask.
  3. Performing a $5 \times 5$ morphological closing.
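
One way to implement these three steps with OpenCV is sketched below; the structuring-element shapes and the helper name are assumptions, while the ~30 px dilation and 5×5 closing follow the description above.

import cv2
import numpy as np

def refine_wall_mask(wall_mask, door_mask, window_mask, dilate_px=30):
    """All masks are uint8 arrays with foreground as 255."""
    # 1. Combine the binary masks of the openings (doors and windows).
    openings = cv2.bitwise_or(door_mask, window_mask)
    # 2. Dilate the openings by ~30 px and subtract them from the wall mask.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * dilate_px + 1, 2 * dilate_px + 1))
    openings = cv2.dilate(openings, kernel)
    walls = cv2.bitwise_and(wall_mask, cv2.bitwise_not(openings))
    # 3. 5x5 morphological closing to seal small gaps left by the subtraction.
    walls = cv2.morphologyEx(walls, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))
    return walls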

Preprocessing includes resize/crop to $512 \times 512$ and normalization with ImageNet statistics. On-the-fly data augmentation via Albumentations incorporates geometric (random scaling, rotation, translation, perspective, elastic/grid deformation) and photometric (brightness/contrast, CLAHE, Gaussian/ISO noise) transformations.
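
A possible Albumentations pipeline covering the listed transform families is sketched below; the specific transform classes, probabilities, and parameter ranges are assumptions, since the paper does not enumerate exact settings.

import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    A.Resize(512, 512),
    # Geometric augmentations: scaling, rotation, translation, perspective, elastic/grid warps.
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=10, p=0.5),
    A.Perspective(p=0.3),
    A.OneOf([A.ElasticTransform(p=1.0), A.GridDistortion(p=1.0)], p=0.3),
    # Photometric augmentations: brightness/contrast, CLAHE, Gaussian/ISO noise.
    A.RandomBrightnessContrast(p=0.5),
    A.CLAHE(p=0.3),
    A.OneOf([A.GaussNoise(p=1.0), A.ISONoise(p=1.0)], p=0.3),
    # ImageNet normalization, then conversion to a tensor.
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])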

Optimization uses a batch size of 4 at 512² pixels, the Adam optimizer (initial LR $1 \times 10^{-4}$ for pre-training, $1 \times 10^{-5}$ for fine-tuning), and a ReduceLROnPlateau scheduler (factor 0.5, patience 3 on validation mIoU). Each experiment runs for 30 epochs with an 80/20 train/validation split (seed 42), and each setup is repeated three times to report averaged results.
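
In PyTorch, this optimizer and scheduler setup corresponds to the following sketch; model, train_one_epoch, and evaluate are hypothetical placeholders rather than names from the paper.

import torch

# Adam with the pre-training learning rate (use 1e-5 when fine-tuning); model is the MitUNet network.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# mode="max" because the monitored quantity (validation mIoU) should increase.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=3)

for epoch in range(30):
    train_one_epoch(model, optimizer)   # hypothetical training helper
    val_miou = evaluate(model)          # hypothetical validation helper returning mIoU
    scheduler.step(val_miou)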

6. Quantitative Evaluation and Ablations

MitUNet achieves high performance metrics on the proprietary regional dataset, with key metrics (Recall, Precision, Accuracy, mIoU, VRAM peak) summarized:

Model                          Recall    Precision   Accuracy   mIoU      Peak VRAM
MitUNet + Tversky(0.6/0.4)     92.25%    94.84%      98.85%     87.84%    1,751 MiB
UNet++ (Res50) + Lovasz        93.10%    93.17%      98.76%     87.15%    3,311 MiB
SegFormer (MiT-b4) + Lovasz    93.35%    92.11%      98.68%     86.43%    1,270 MiB

Key ablation findings:

  • The Mix-Transformer encoder improves mIoU by ~0.7–1.4 points over pure CNNs.
  • The U-Net decoder with scSE sharpens boundaries versus bilinear upsampling.
  • Tversky loss with $\alpha = 0.6$, $\beta = 0.4$ optimally balances high precision with recall for thin structures.

7. Limitations, Role, and Future Directions

MitUNet’s two-stage (pre-training + fine-tuning) pipeline necessitates extensive annotated data for domain adaptation. While the model exhibits high boundary fidelity, extreme wall geometries may induce staircasing artifacts. Future work includes integrating segmentation with contour-fitting for direct end-to-end vectorization and exploring learnable morphological operations to enforce topological regularity.

In automated 3D reconstruction workflows, MitUNet outputs crisp wall masks that interface directly with geometric extraction (e.g., Hough or graph-based), reducing post-processing demands in Scan-to-BIM environments. High boundary precision negates the need for substantial manual correction, streamlining the vectorization phase essential for 3D environment modeling (Parashchuk et al., 2 Dec 2025).
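
As a simple illustration of this hand-off, a wall mask can be converted to polygonal outlines with OpenCV contour extraction and Douglas-Peucker simplification; the exact vectorization method used downstream of MitUNet is not specified in the source, so this is only one plausible option.

import cv2

def mask_to_polygons(wall_mask, epsilon_px=2.0):
    """wall_mask: uint8 array with walls as 255; returns simplified wall outlines."""
    contours, _ = cv2.findContours(wall_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Douglas-Peucker simplification keeps corners while dropping staircase jitter.
    return [cv2.approxPolyDP(c, epsilon_px, True) for c in contours]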
