
MuraNet Multi-task Model

Updated 29 December 2025
  • The paper introduces MuraNet, a multi-task model that combines floor plan segmentation with object detection via an attention-based unified backbone.
  • It employs a novel MURA relation-attention module using parallel 3x3 convolutions to aggregate multi-scale context and enhance feature sharing.
  • Joint training of segmentation and detection tasks yields improved accuracy and faster convergence compared to U-Net and YOLOv3 baselines.

MuraNet is a multi-task neural network architecture designed for joint floor plan image segmentation and object detection, specifically targeting the recognition of walls, doors, and windows in architectural diagrams. The model integrates an attention-driven unified backbone—termed MURA (Multi-scale Relation Attention)—with specialized decoder branches for segmentation and detection. Through task co-training and attention-based feature sharing, MuraNet achieves improved accuracy, convergence speed, and sample efficiency over comparable single-task networks (Huang et al., 2023).

1. MURA Relation-Attention Backbone

The core architectural innovation in MuraNet is the four-stage MURA encoder, which hierarchically encodes input images via successive spatial down-sampling and multi-scale relational attention.

At each stage, input features of shape $X \in \mathbb{R}^{C \times H' \times W'}$ are down-sampled by a stride-2 $3 \times 3$ convolution followed by batch normalization (BN) and residual blocks integrating the MURA module. The encoder stages produce feature maps at spatial sizes $H/4 \times W/4$, $H/8 \times W/8$, $H/16 \times W/16$, and $H/32 \times W/32$.
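A minimal PyTorch-style sketch of one such encoder stage is shown below. The exact block count and the stem that reaches the initial $H/4 \times W/4$ resolution are not detailed in this summary, so `num_blocks` and `mura_block_cls` (the relation-attention block sketched under "MURA Module Details" below) are placeholders.

```python
import torch.nn as nn

class EncoderStage(nn.Module):
    """One MuraNet-style encoder stage (sketch): a stride-2 3x3 convolution
    with BN, followed by residual blocks that embed the MURA module."""

    def __init__(self, in_ch, out_ch, num_blocks, mura_block_cls):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(*[mura_block_cls(out_ch) for _ in range(num_blocks)])

    def forward(self, x):
        x = self.down(x)      # halves the spatial resolution
        return self.blocks(x)
```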

MURA Module Details

  • Queries, keys, and values are computed via $1 \times 1$ convolutions:

$$Q = W_q * X,\quad K = W_k * X,\quad V = W_v * X$$

with $W_q, W_k, W_v \in \mathbb{R}^{C \times C}$. The relation-attention map is then

$$A = \text{softmax}\left(\frac{\text{reshape}(Q) \cdot \text{reshape}(K)^{T}}{\sqrt{d_k}}\right)$$

where $d_k = C$ and reshaping aligns the channel dimensions.

  • Instead of a single global attention, the feature map $X$ is processed by parallel $3 \times 3$ convolutions, each with a different context (via depth or dilation), and fused using a skip-add scheme:

$$
\begin{aligned}
B_1 &= \text{ReLU}(\text{BN}(\text{Conv}_{3\times 3}(X))) \\
B_2 &= \text{ReLU}(\text{BN}(\text{Conv}_{3\times 3}(X + B_1))) \\
B_3 &= \text{ReLU}(\text{BN}(\text{Conv}_{3\times 3}(X + B_1 + B_2))) \\
Y &= X + B_1 + B_2 + B_3
\end{aligned}
$$

This design enables aggregation of multi-scale context and structural relations, avoiding the shape bias induced by large-strip convolutions.
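The following PyTorch sketch puts the two pieces above together: $1 \times 1$ projections for $Q$, $K$, $V$ with channel-wise relation attention, and the three $3 \times 3$ branches fused by skip-add. How the attended features and the convolutional aggregation $Y$ are merged is not pinned down by this summary, so the final addition is an assumption.

```python
import torch
import torch.nn as nn

class MURABlock(nn.Module):
    """Sketch of the MURA relation-attention block: 1x1 Q/K/V projections with
    channel-wise attention, plus three 3x3 branches fused by skip-add."""

    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, kernel_size=1)
        self.k = nn.Conv2d(channels, channels, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(3)
        )

    def forward(self, x):
        b, c, h, w = x.shape

        # Relation attention: A = softmax(Q K^T / sqrt(d_k)) over channels, d_k = C.
        q = self.q(x).flatten(2)                                        # (B, C, H*W)
        k = self.k(x).flatten(2)                                        # (B, C, H*W)
        v = self.v(x).flatten(2)                                        # (B, C, H*W)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)  # (B, C, C)
        attended = (attn @ v).view(b, c, h, w)

        # Skip-add branches: B_i = ReLU(BN(Conv3x3(X + B_1 + ... + B_{i-1}))).
        acc = x
        for branch in self.branches:
            acc = acc + branch(acc)
        # acc now equals Y = X + B_1 + B_2 + B_3.

        return acc + attended  # assumed merge of the attention and conv paths
```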

2. Task-Specific Decoder Branches

MuraNet decodes the shared MURA backbone representations into predictions for wall segmentation and object (door/window) detection, with distinct architectures per task.

Segmentation Decoder

  • Utilizes outputs from encoder stages 2–4 (excluding stage 1 to avoid low-level noise).
  • Employs a U-Net-inspired upsampling path, designated as a symmetric “Hamburger” decoder; each block comprises bilinear upsampling by a factor of 2, a $3 \times 3$ conv-BN-ReLU stack, and skip-add fusion with encoder features.
  • Yields $C$-way class probability maps via a final $1 \times 1$ convolution, with $C = 2$ (background, wall).
  • Loss: Weighted cross-entropy with class weights (a minimal sketch follows this list)

$$\omega_i = \frac{\hat{H} - N_i}{\sum_{j=0}^{C-1}(\hat{H} - N_j)}$$

where $N_i$ is the count of pixels of class $i$ in the training set and $\hat{H}$ is the total pixel count. An optional soft-IoU term is offered.
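A small sketch of how the class weights $\omega_i$ could be computed and plugged into a weighted cross-entropy loss; the pixel counts in the example are hypothetical.

```python
import torch
import torch.nn as nn

def class_weights_from_counts(pixel_counts):
    """w_i = (H_hat - N_i) / sum_j (H_hat - N_j), where N_i is the pixel count
    of class i over the training set and H_hat is the total pixel count."""
    counts = torch.as_tensor(pixel_counts, dtype=torch.float32)
    h_hat = counts.sum()
    return (h_hat - counts) / (h_hat - counts).sum()

# Hypothetical counts for (background, wall); real values come from the dataset.
weights = class_weights_from_counts([9_000_000, 1_000_000])
seg_criterion = nn.CrossEntropyLoss(weight=weights)  # weighted CE over C = 2 classes
```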

Detection Head

  • Based on YOLOX’s “decoupled head” (distinct regression and classification branches).
  • For each of stages 2, 3, and 4, channels are reduced to 256 via a $1 \times 1$ convolution.
  • Parallel branches: one for bounding box regression (4 coordinates plus 1 IoU-aware confidence), one for object classification; a sketch of this head follows the list.
  • Anchor settings are compatible with YOLOv3, optimized for the two relevant object classes.
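A PyTorch sketch of the per-stage decoupled head described above. The depth of each branch (a single conv-BN-ReLU here) is an assumption; only the channel reduction to 256 and the output splits follow the description.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """YOLOX-style decoupled head for one encoder stage (sketch): a 1x1 conv
    reduces channels to 256, then separate branches predict box regression
    (4 coordinates + 1 IoU-aware confidence) and class scores."""

    def __init__(self, in_ch, num_classes=2, width=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
        )

        def conv_block():
            return nn.Sequential(
                nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(width),
                nn.ReLU(inplace=True),
            )

        self.reg_branch = conv_block()
        self.cls_branch = conv_block()
        self.reg_out = nn.Conv2d(width, 4, kernel_size=1)            # x, y, w, h
        self.obj_out = nn.Conv2d(width, 1, kernel_size=1)            # IoU-aware confidence
        self.cls_out = nn.Conv2d(width, num_classes, kernel_size=1)  # door, window

    def forward(self, x):
        x = self.stem(x)
        reg = self.reg_branch(x)
        cls = self.cls_branch(x)
        return self.reg_out(reg), self.obj_out(reg), self.cls_out(cls)
```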

Losses:

  • Box regression: mean squared error on $(x, y, \sqrt{w}, \sqrt{h})$.
  • Objectness: binary cross-entropy.
  • Classification: binary cross-entropy at locations where objects are present.

The total detection loss is $L_{det} = L_{box} + L_{obj} + L_{cls}$.
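A sketch of this loss composition. Tensor layouts, target assignment, and the positive-location mask are assumptions; only the three terms and their forms follow the description above.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_box, pred_obj, pred_cls, tgt_box, tgt_obj, tgt_cls, pos_mask):
    """L_det = L_box + L_obj + L_cls; pos_mask marks locations containing an object."""
    # Box regression: MSE on (x, y, sqrt(w), sqrt(h)) at positive locations.
    pb = torch.cat([pred_box[..., :2], pred_box[..., 2:].clamp(min=0).sqrt()], dim=-1)
    tb = torch.cat([tgt_box[..., :2], tgt_box[..., 2:].sqrt()], dim=-1)
    l_box = F.mse_loss(pb[pos_mask], tb[pos_mask])

    # Objectness: binary cross-entropy over all locations.
    l_obj = F.binary_cross_entropy_with_logits(pred_obj, tgt_obj)

    # Classification: binary cross-entropy only where objects are present.
    l_cls = F.binary_cross_entropy_with_logits(pred_cls[pos_mask], tgt_cls[pos_mask])

    return l_box + l_obj + l_cls
```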

3. Joint Multi-Task Training Procedure

  • The total loss is a direct sum: $L_{total} = L_{seg} + L_{det}$.
  • Optimization is performed via SGD (momentum 0.937, weight decay $5 \times 10^{-4}$).
  • The learning rate follows cosine decay from $10^{-2}$ to $10^{-6}$ over 1000 epochs, with a linear warm-up from $10^{-4}$ over the first 50 epochs (see the sketch after this list).
  • Training is distributed on ten RTX 3090 GPUs with a batch size of 10; no dropout is used.
  • Models are trained from scratch, with no ImageNet pre-training.
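A sketch of the optimizer and learning-rate schedule from this list; `model` is a stand-in and the per-epoch data pass is elided.

```python
import math
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)  # stand-in for the MuraNet model

# SGD with the reported hyper-parameters: momentum 0.937, weight decay 5e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.937, weight_decay=5e-4)

def lr_at_epoch(epoch, warmup_epochs=50, total_epochs=1000,
                warmup_start=1e-4, base_lr=1e-2, final_lr=1e-6):
    """Linear warm-up from 1e-4 over the first 50 epochs, then cosine decay
    from 1e-2 to 1e-6 over the remaining epochs."""
    if epoch < warmup_epochs:
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t))

for epoch in range(1000):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
    # ... one pass over the data per epoch, minimizing L_total = L_seg + L_det ...
```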

4. Experimental Setup and Quantitative Results

Experiments are conducted on the CubiCasa5k dataset, comprising 5,000 floor-plan images annotated for walls (segmentation) and doors/windows (detection). Images are resized to $1536 \times 1536$ pixels using area interpolation for downsampling and bicubic interpolation for upsampling. The dataset includes 4,000 training, 500 validation, and 500 test samples. Data augmentation and further pre-training strategies are not reported.
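A small preprocessing sketch of this resizing rule, assuming OpenCV; the interpolation mode switches on whether the image is being shrunk or enlarged.

```python
import cv2

def resize_plan(image, size=1536):
    """Resize a floor-plan image to size x size: area interpolation when
    shrinking, bicubic interpolation when enlarging."""
    h, w = image.shape[:2]
    interp = cv2.INTER_AREA if max(h, w) > size else cv2.INTER_CUBIC
    return cv2.resize(image, (size, size), interpolation=interp)
```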

Performance Metrics

Wall Segmentation (IoU %) and Detection (AP):

| Model        | Wall IoU (%) | Epochs to converge | AP50 Doors | AP50 Windows | mAP@[.5:.95] |
|--------------|--------------|--------------------|------------|--------------|--------------|
| U-Net base   | 65.5         | 6                  | —          | —            | —            |
| MuraNet base | 78.4         | 8                  | 91.2       | 92.2         | 53.8         |
| YOLOv3 base  | —            | —                  | 89.2       | 90.1         | 49.5         |

Ablation studies demonstrate that:

  • Removal of MURA reduces wall IoU from 78.4 to 75.3.
  • Adding MURA to U-Net raises IoU from 65.5 to 75.1.
  • Replacing the decoupled detection head with a coupled version lowers AP50 from 91.7 to 90.0.

5. Architectural Analysis and Core Insights

The unified attention-based backbone exploits explicit geometric regularities in floor plan datasets, where walls, doors, and windows often co-occur in predictable arrangements. The skip-add aggregation in MURA enables learning of cross-scale and cross-channel relationships with a compact set of $3 \times 3$ convolutional kernels, mitigating the limitations of single global attention and avoiding the curvature bias of elongated convolutions. This enhances generalization for both segmentation and detection. Joint feature sharing in the backbone improves sample efficiency and accelerates convergence: MuraNet reaches 99.9% of its final accuracy in less training time than comparable single-task models.

The decoupled detection head, adapted from YOLOX, addresses the known conflict between bounding box regression and categorical classification by spatially separating these optimization targets, which results in higher detection AP and faster convergence.

6. Comparative Performance and Ablation

Results on CubiCasa5k establish consistent gains for MuraNet over baseline U-Net and YOLOv3 approaches. When comparing identical backbone depths and input modalities:

  • MuraNet outperforms U-Net in wall IoU by 1.3 to 12.9 points, depending on encoder stage depth.
  • MuraNet boosts detection mAP by 4.3 (AP50) and 4.3 (mAP@[.5:.95]) over YOLOv3.
  • The MURA module provides significant enhancement in segmentation for both U-Net and SegNeXt backbones when inserted, as evidenced by a 9.6 point gain for U-Net.

7. Limitations and Prospects for Future Research

The application of relation attention in MuraNet is confined to the backbone, leaving head architectures and loss computation agnostic to attention mechanisms. Future work could expand attention-based modeling to decoder branches or loss design (e.g., attention-guided IoU weighting). Additionally, the impact of customized data augmentation or domain-adapted pre-training strategies remains unexplored and offers a path to improved performance and generalization. MuraNet's architecture demonstrates that attention-augmented multi-task learning can resolve sample inefficiency and slow convergence associated with traditional single-task pipelines in symbolic-structural vision domains such as architectural floor plans (Huang et al., 2023).

References

  • Huang et al. (2023).
