
MuraNet Multi-task Model

Updated 29 December 2025
  • The paper introduces MuraNet, a multi-task model that combines floor plan segmentation with object detection via an attention-based unified backbone.
  • It employs a novel MURA relation-attention module using parallel 3x3 convolutions to aggregate multi-scale context and enhance feature sharing.
  • Joint training of segmentation and detection tasks yields improved accuracy and faster convergence compared to U-Net and YOLOv3 baselines.

MuraNet is a multi-task neural network architecture designed for joint floor plan image segmentation and object detection, specifically targeting the recognition of walls, doors, and windows in architectural diagrams. The model integrates an attention-driven unified backbone—termed MURA (Multi-scale Relation Attention)—with specialized decoder branches for segmentation and detection. Through task co-training and attention-based feature sharing, MuraNet achieves improved accuracy, convergence speed, and sample efficiency over comparable single-task networks (Huang et al., 2023).

1. MURA Relation-Attention Backbone

The core architectural innovation in MuraNet is the four-stage MURA encoder, which hierarchically encodes input images via successive spatial down-sampling and multi-scale relational attention.

At each stage, input features of shape $X \in \mathbb{R}^{C \times H' \times W'}$ are down-sampled by a stride-2 $3 \times 3$ convolution followed by batch normalization (BN) and residual blocks integrating the MURA module. The encoder stages produce feature maps at spatial sizes $H/4 \times W/4$, $H/8 \times W/8$, $H/16 \times W/16$, and $H/32 \times W/32$.
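A minimal PyTorch-style sketch of one such encoder stage is shown below. The exact block count and the stem that reaches the initial $H/4 \times W/4$ resolution are not detailed in this summary, so `num_blocks` and `mura_block_cls` (the relation-attention block sketched under "MURA Module Details" below) are placeholders.

```python
import torch.nn as nn

class EncoderStage(nn.Module):
    """One MuraNet-style encoder stage (sketch): a stride-2 3x3 convolution
    with BN, followed by residual blocks that embed the MURA module."""

    def __init__(self, in_ch, out_ch, num_blocks, mura_block_cls):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(*[mura_block_cls(out_ch) for _ in range(num_blocks)])

    def forward(self, x):
        x = self.down(x)      # halves the spatial resolution
        return self.blocks(x)
```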

MURA Module Details

  • Queries, keys, and values are computed via $1 \times 1$ convolutions:

$$Q = W_q * X,\quad K = W_k * X,\quad V = W_v * X$$

with $W_q, W_k, W_v \in \mathbb{R}^{C \times C}$. The relation-attention map is then

$$A = \text{softmax}\left(\frac{\text{reshape}(Q) \cdot \text{reshape}(K)^{T}}{\sqrt{d_k}}\right)$$

where $d_k = C$ and reshaping aligns the channel dimensions.

  • Instead of a single global attention, the feature map $X$ is processed by parallel $3 \times 3$ convolutions, each with a different context (via depth or dilation), and fused using a skip-add scheme:

$$
\begin{aligned}
B_1 &= \text{ReLU}(\text{BN}(\text{Conv}_{3\times 3}(X))) \\
B_2 &= \text{ReLU}(\text{BN}(\text{Conv}_{3\times 3}(X + B_1))) \\
B_3 &= \text{ReLU}(\text{BN}(\text{Conv}_{3\times 3}(X + B_1 + B_2))) \\
Y &= X + B_1 + B_2 + B_3
\end{aligned}
$$

This design enables aggregation of multi-scale context and structural relations, avoiding the shape bias induced by large-strip convolutions.
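The following PyTorch sketch puts the two pieces above together: $1 \times 1$ projections for $Q$, $K$, $V$ with channel-wise relation attention, and the three $3 \times 3$ branches fused by skip-add. How the attended features and the convolutional aggregation $Y$ are merged is not pinned down by this summary, so the final addition is an assumption.

```python
import torch
import torch.nn as nn

class MURABlock(nn.Module):
    """Sketch of the MURA relation-attention block: 1x1 Q/K/V projections with
    channel-wise attention, plus three 3x3 branches fused by skip-add."""

    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, kernel_size=1)
        self.k = nn.Conv2d(channels, channels, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(3)
        )

    def forward(self, x):
        b, c, h, w = x.shape

        # Relation attention: A = softmax(Q K^T / sqrt(d_k)) over channels, d_k = C.
        q = self.q(x).flatten(2)                                        # (B, C, H*W)
        k = self.k(x).flatten(2)                                        # (B, C, H*W)
        v = self.v(x).flatten(2)                                        # (B, C, H*W)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)  # (B, C, C)
        attended = (attn @ v).view(b, c, h, w)

        # Skip-add branches: B_i = ReLU(BN(Conv3x3(X + B_1 + ... + B_{i-1}))).
        acc = x
        for branch in self.branches:
            acc = acc + branch(acc)
        # acc now equals Y = X + B_1 + B_2 + B_3.

        return acc + attended  # assumed merge of the attention and conv paths
```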

2. Task-Specific Decoder Branches

MuraNet decodes the shared MURA backbone representations into predictions for wall segmentation and object (door/window) detection, with distinct architectures per task.

Segmentation Decoder

  • Utilizes outputs from encoder stages 2–4 (excluding stage 1 to avoid low-level noise).
  • Employs a U-Net-inspired upsampling path, designated as a symmetric “Hamburger” decoder; each block comprises bilinear upsampling by a factor of 2, a $3 \times 3$ conv-BN-ReLU stack, and skip-add fusion with encoder features.
  • Yields $C$-way class probability maps via a final $1 \times 1$ convolution, with $C = 2$ (background, wall).
  • Loss: Weighted cross-entropy with class weights (a minimal sketch follows this list)

$$\omega_i = \frac{\hat{H} - N_i}{\sum_{j=0}^{C-1}(\hat{H} - N_j)}$$

where $N_i$ is the count of pixels of class $i$ in the training set and $\hat{H}$ is the total pixel count. An optional soft-IoU term is offered.
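A small sketch of how the class weights $\omega_i$ could be computed and plugged into a weighted cross-entropy loss; the pixel counts in the example are hypothetical.

```python
import torch
import torch.nn as nn

def class_weights_from_counts(pixel_counts):
    """w_i = (H_hat - N_i) / sum_j (H_hat - N_j), where N_i is the pixel count
    of class i over the training set and H_hat is the total pixel count."""
    counts = torch.as_tensor(pixel_counts, dtype=torch.float32)
    h_hat = counts.sum()
    return (h_hat - counts) / (h_hat - counts).sum()

# Hypothetical counts for (background, wall); real values come from the dataset.
weights = class_weights_from_counts([9_000_000, 1_000_000])
seg_criterion = nn.CrossEntropyLoss(weight=weights)  # weighted CE over C = 2 classes
```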

Detection Head

  • Based on YOLOX’s “decoupled head” (distinct regression and classification branches).
  • For each of stages 2, 3, and 4, channels are reduced to 256 via a $1 \times 1$ convolution.
  • Parallel branches: one for bounding box regression (4 coordinates plus 1 IoU-aware confidence), one for object classification; a sketch of this head follows the list.
  • Anchor settings are compatible with YOLOv3, optimized for the two relevant object classes.
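A PyTorch sketch of the per-stage decoupled head described above. The depth of each branch (a single conv-BN-ReLU here) is an assumption; only the channel reduction to 256 and the output splits follow the description.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """YOLOX-style decoupled head for one encoder stage (sketch): a 1x1 conv
    reduces channels to 256, then separate branches predict box regression
    (4 coordinates + 1 IoU-aware confidence) and class scores."""

    def __init__(self, in_ch, num_classes=2, width=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
        )

        def conv_block():
            return nn.Sequential(
                nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(width),
                nn.ReLU(inplace=True),
            )

        self.reg_branch = conv_block()
        self.cls_branch = conv_block()
        self.reg_out = nn.Conv2d(width, 4, kernel_size=1)            # x, y, w, h
        self.obj_out = nn.Conv2d(width, 1, kernel_size=1)            # IoU-aware confidence
        self.cls_out = nn.Conv2d(width, num_classes, kernel_size=1)  # door, window

    def forward(self, x):
        x = self.stem(x)
        reg = self.reg_branch(x)
        cls = self.cls_branch(x)
        return self.reg_out(reg), self.obj_out(reg), self.cls_out(cls)
```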

Losses:

  • Box regression: mean squared error on $(x, y, \sqrt{w}, \sqrt{h})$.
  • Objectness: binary cross-entropy.
  • Classification: binary cross-entropy at locations where objects are present.

The total detection loss is $L_{det} = L_{box} + L_{obj} + L_{cls}$.
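A sketch of this loss composition. Tensor layouts, target assignment, and the positive-location mask are assumptions; only the three terms and their forms follow the description above.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_box, pred_obj, pred_cls, tgt_box, tgt_obj, tgt_cls, pos_mask):
    """L_det = L_box + L_obj + L_cls; pos_mask marks locations containing an object."""
    # Box regression: MSE on (x, y, sqrt(w), sqrt(h)) at positive locations.
    pb = torch.cat([pred_box[..., :2], pred_box[..., 2:].clamp(min=0).sqrt()], dim=-1)
    tb = torch.cat([tgt_box[..., :2], tgt_box[..., 2:].sqrt()], dim=-1)
    l_box = F.mse_loss(pb[pos_mask], tb[pos_mask])

    # Objectness: binary cross-entropy over all locations.
    l_obj = F.binary_cross_entropy_with_logits(pred_obj, tgt_obj)

    # Classification: binary cross-entropy only where objects are present.
    l_cls = F.binary_cross_entropy_with_logits(pred_cls[pos_mask], tgt_cls[pos_mask])

    return l_box + l_obj + l_cls
```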

3. Joint Multi-Task Training Procedure

  • The total loss is a direct sum: $L_{total} = L_{seg} + L_{det}$.
  • Optimization is performed via SGD (momentum 0.937, weight decay $5 \times 10^{-4}$).
  • The learning rate follows cosine decay from $10^{-2}$ to $10^{-6}$ over 1000 epochs, with a linear warm-up from $10^{-4}$ over the first 50 epochs (see the sketch after this list).
  • Training is distributed on ten RTX 3090 GPUs with a batch size of 10; no dropout is used.
  • Models are trained from scratch, with no ImageNet pre-training.
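A sketch of the optimizer and learning-rate schedule from this list; `model` is a stand-in and the per-epoch data pass is elided.

```python
import math
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)  # stand-in for the MuraNet model

# SGD with the reported hyper-parameters: momentum 0.937, weight decay 5e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.937, weight_decay=5e-4)

def lr_at_epoch(epoch, warmup_epochs=50, total_epochs=1000,
                warmup_start=1e-4, base_lr=1e-2, final_lr=1e-6):
    """Linear warm-up from 1e-4 over the first 50 epochs, then cosine decay
    from 1e-2 to 1e-6 over the remaining epochs."""
    if epoch < warmup_epochs:
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t))

for epoch in range(1000):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
    # ... one pass over the data per epoch, minimizing L_total = L_seg + L_det ...
```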

4. Experimental Setup and Quantitative Results

Experiments are conducted on the CubiCasa5k dataset, comprising 5,000 floor-plan images annotated for walls (segmentation) and doors/windows (detection). Images are resized to $1536 \times 1536$ pixels using area interpolation for downsampling and bicubic interpolation for upsampling. The dataset includes 4,000 training, 500 validation, and 500 test samples. Data augmentation and further pre-training strategies are not reported.
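A small preprocessing sketch of this resizing rule, assuming OpenCV; the interpolation mode switches on whether the image is being shrunk or enlarged.

```python
import cv2

def resize_plan(image, size=1536):
    """Resize a floor-plan image to size x size: area interpolation when
    shrinking, bicubic interpolation when enlarging."""
    h, w = image.shape[:2]
    interp = cv2.INTER_AREA if max(h, w) > size else cv2.INTER_CUBIC
    return cv2.resize(image, (size, size), interpolation=interp)
```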

Performance Metrics

Wall Segmentation (IoU %) and Detection (AP):

| Model        | Wall IoU (%) | Epochs to converge | AP50 Doors | AP50 Windows | mAP@[.5:.95] |
|--------------|--------------|--------------------|------------|--------------|--------------|
| U-Net base   | 65.5         | 6                  | —          | —            | —            |
| MuraNet base | 78.4         | 8                  | 91.2       | 92.2         | 53.8         |
| YOLOv3 base  | —            | —                  | 89.2       | 90.1         | 49.5         |

Ablation studies demonstrate that:

  • Removal of MURA reduces wall IoU from 78.4 to 75.3.
  • Adding MURA to U-Net raises IoU from 65.5 to 75.1.
  • Replacing the decoupled detection head with a coupled version lowers AP50 from 91.7 to 90.0.

5. Architectural Analysis and Core Insights

The unified attention-based backbone exploits explicit geometric regularities in floor plan datasets, where walls, doors, and windows often co-occur in predictable arrangements. The skip-add aggregation in MURA enables learning of cross-scale and cross-channel relationships with a compact set of $3 \times 3$ convolutional kernels, mitigating the limitations of single global attention and avoiding the curvature bias of elongated convolutions. This enhances generalization for both segmentation and detection. Joint feature sharing in the backbone improves sample efficiency and accelerates convergence: MuraNet reaches 99.9% of its final accuracy in less training time than comparable single-task models.

The decoupled detection head, adapted from YOLOX, addresses the known conflict between bounding box regression and categorical classification by spatially separating these optimization targets, which results in higher detection AP and faster convergence.

6. Comparative Performance and Ablation

Results on CubiCasa5k establish consistent gains for MuraNet over baseline U-Net and YOLOv3 approaches. When comparing identical backbone depths and input modalities:

  • MuraNet outperforms U-Net in wall IoU by 1.3 to 12.9 points, depending on encoder stage depth.
  • MuraNet boosts detection mAP by 4.3 (AP50) and 4.3 (mAP@[.5:.95]) over YOLOv3.
  • The MURA module provides significant enhancement in segmentation for both U-Net and SegNeXt backbones when inserted, as evidenced by a 9.6 point gain for U-Net.

7. Limitations and Prospects for Future Research

The application of relation attention in MuraNet is confined to the backbone, leaving head architectures and loss computation agnostic to attention mechanisms. Future work could expand attention-based modeling to decoder branches or loss design (e.g., attention-guided IoU weighting). Additionally, the impact of customized data augmentation or domain-adapted pre-training strategies remains unexplored and offers a path to improved performance and generalization. MuraNet's architecture demonstrates that attention-augmented multi-task learning can resolve sample inefficiency and slow convergence associated with traditional single-task pipelines in symbolic-structural vision domains such as architectural floor plans (Huang et al., 2023).

References

  • Huang et al. (2023).
