Multi-Mono-Hydra (M2H) Multi-Task Framework
- M2H is a multi-task learning framework that leverages a lightweight ViT-based DINOv2 backbone to perform complementary spatial predictions from a single image.
- It employs a window-based cross-task attention module to exchange localized features across tasks such as semantic segmentation, depth estimation, edge detection, and surface normal estimation.
- The design achieves real-time performance with low computational overhead, validated on benchmarks like NYUDv2, and supports applications in 3D scene graph construction and robotics.
Multi-Mono-Hydra (M2H) is a multi-task learning framework for efficient, complementary monocular spatial perception, integrating four dense prediction tasks (semantic segmentation, depth estimation, edge detection, and surface normal estimation) from a single image input. M2H employs a window-based cross-task attention module that orchestrates structured feature exchange among pixel-level tasks, improving consistency and accuracy while remaining computationally light enough for real-time deployment on edge hardware. Built on a lightweight Vision Transformer (ViT) DINOv2 backbone, M2H is validated across standard benchmarks and real-world settings, offering a scalable foundation for 3D scene graph construction and dynamic spatial understanding (Udugama et al., 20 Oct 2025).
1. Architectural Foundation and Framework Design
M2H leverages a lightweight ViT-based DINOv2 backbone to extract multi-scale token representations from the monocular input image. The backbone’s feature extraction is followed by Multi-Scale Token Reassembly (MSTR), which reconstitutes tokens into spatial feature maps, and Multi-Scale Fusion (MSF), which further refines these to generate preliminary features for each task.
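For intuition, here is a minimal sketch of the reassembly step in PyTorch: patch tokens from the ViT are reshaped back onto the patch grid. The function name and the 14-pixel patch size (matching DINOv2) are illustrative assumptions; the paper's MSTR/MSF blocks add multi-scale fusion beyond this single step.

```python
import torch

def reassemble_tokens(tokens: torch.Tensor, img_h: int, img_w: int,
                      patch: int = 14) -> torch.Tensor:
    """Reshape ViT patch tokens (B, N, C) into a spatial map (B, C, H/p, W/p).

    Assumes the class token has already been dropped, so N == (H/p) * (W/p).
    """
    b, n, c = tokens.shape
    gh, gw = img_h // patch, img_w // patch
    assert n == gh * gw, "token count must match the patch grid"
    return tokens.transpose(1, 2).reshape(b, c, gh, gw)
```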
A dual-path decoder structure is central to M2H:
- The local path, executed via the Window-Based Cross-Task Attention (WMCA) module, specializes in localized inter-task information exchange.
- In parallel, the Global Gated Feature Merging (GGFM) block aggregates global context through operations such as global average pooling and depth-wise convolution (a minimal sketch follows below).
Specialized decoder heads are deployed to produce outputs for each task, maintaining distinct task-specific representations alongside fused complementary features.
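The sketch below shows one plausible GGFM-style block, assuming the straightforward composition implied above: a depth-wise convolution for cheap spatial mixing, gated by a per-channel descriptor from global average pooling. The exact layer ordering and gating function in M2H are not specified here, so treat this as illustrative only.

```python
import torch
import torch.nn as nn

class GatedGlobalFusion(nn.Module):
    """Illustrative stand-in for GGFM (not M2H's exact block): depth-wise
    convolution for local mixing, modulated by a globally pooled gate."""

    def __init__(self, channels: int):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, groups=channels)  # depth-wise conv
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # global average pooling -> (B, C, 1, 1)
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),                            # per-channel gate in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gated residual fusion: broadcast the (B, C, 1, 1) gate over H x W.
        return x + self.dwconv(x) * self.gate(x)
```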
2. Window-Based Cross-Task Attention Mechanism
The WMCA module is designed for efficient interaction across tasks at a localized scale. For each task $t \in \{\text{edge}, \text{normal}, \text{semantic}, \text{depth}\}$, feature maps of dimension $H \times W \times C$ are partitioned into non-overlapping windows of size $M \times M$, yielding sequences of flattened tokens $X_t \in \mathbb{R}^{M^2 \times C}$ per window.
After Layer Normalization, the tokens from all four tasks within a window are concatenated, $X = [X_e;\, X_n;\, X_s;\, X_d] \in \mathbb{R}^{4M^2 \times C}$, then processed by a multi-head attention block:
- Attention calculation follows $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$.
- Residual connection integration: $X' = X + \mathrm{MHA}(\mathrm{LN}(X))$.
- FFN update: $X'' = X' + \mathrm{FFN}(\mathrm{LN}(X'))$.
Tokens are split post-attention and reshaped to yield enriched, localized feature maps for each task, now containing context derived from cross-task interactions while preserving details specific to their original stream.
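The PyTorch sketch below makes these mechanics concrete: per-task feature maps are window-partitioned, concatenated along the token axis, passed through one pre-norm attention-plus-FFN block, and split back into per-task maps. Module structure, head count, and window size are assumptions for illustration, not M2H's published configuration.

```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, m: int) -> torch.Tensor:
    """(B, C, H, W) -> (B * num_windows, M*M, C); H and W must divide by M."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // m, m, w // m, m)
    x = x.permute(0, 2, 4, 3, 5, 1)              # B, H/M, W/M, M, M, C
    return x.reshape(-1, m * m, c)

class WindowCrossTaskAttention(nn.Module):
    """Sketch of WMCA: tokens from the four task streams are concatenated
    inside each window and mixed by a single multi-head attention block."""

    def __init__(self, dim: int, heads: int = 4, window: int = 7):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, feats: list) -> list:
        m = self.window
        b, c, h, w = feats[0].shape              # all task maps share one shape
        # Concatenate the four task streams along the token axis per window:
        # X = [X_e; X_n; X_s; X_d], shape (B * num_windows, 4*M*M, C).
        x = torch.cat([window_partition(f, m) for f in feats], dim=1)
        q = self.norm1(x)
        x = x + self.attn(q, q, q, need_weights=False)[0]  # X' = X + MHA(LN(X))
        x = x + self.ffn(self.norm2(x))                    # X'' = X' + FFN(LN(X'))
        # Split back into per-task windows and restore the spatial layout.
        outs = []
        for chunk in x.chunk(len(feats), dim=1):
            y = chunk.reshape(b, h // m, w // m, m, m, c)
            y = y.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
            outs.append(y)
        return outs
```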
3. Backbone Efficiency and Real-Time Deployment
The adoption of the ViT-based DINOv2 small backbone delivers robust multi-scale feature extraction, supporting both granular detail and broad contextual capture critical across pixel-level spatial tasks. The backbone is optimized for real-time inference, enabling rapid feature generation with a reduced parameter count.
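As a point of reference, the small DINOv2 backbone can be loaded through torch.hub, and its intermediate layers expose the kind of multi-scale token features a multi-task decoder would consume. Whether M2H uses this exact entry point is an assumption here.

```python
import torch

# Illustrative loading of DINOv2 ViT-S/14 via torch.hub (M2H's exact
# integration path is not specified in this summary).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

img = torch.randn(1, 3, 224, 224)  # height/width must be multiples of 14
with torch.no_grad():
    # Token features from the last four blocks, reshaped to (B, C, H/14, W/14).
    feats = backbone.get_intermediate_layers(img, n=4, reshape=True)
print([f.shape for f in feats])    # 4 x (1, 384, 16, 16) for ViT-S/14
```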
The overall system, including WMCA and GGFM modules, achieves:
- Accuracy on par with, or better than, larger and more computationally demanding backbones.
- Suitable parameter and computational profiles for deployment on restricted hardware, including laptop-grade GPUs.
4. Task Integration and Multi-Task Prediction
M2H unifies:
- Semantic segmentation
- Depth estimation
- Edge detection
- Surface normal estimation
Through complementary learning, in which features derived from one task guide and enhance prediction for the others, M2H improves cross-task prediction consistency and reduces the redundancy common in independent or naively shared encoder-decoder models. The design yields higher mIoU for segmentation and lower RMSE for depth estimation, along with competitive results in edge and normal estimation.
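A common way to train such a unified set of heads is a weighted sum of per-task losses. The sketch below uses standard choices (cross-entropy, L1, BCE, cosine) with hypothetical weights; the paper's exact objective is not reproduced here.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(pred: dict, target: dict, w=None) -> torch.Tensor:
    """Weighted sum of per-task losses. Weights and loss choices are
    illustrative placeholders, not M2H's published objective."""
    w = w or {"sem": 1.0, "depth": 1.0, "edge": 0.5, "normal": 0.5}
    l_sem = F.cross_entropy(pred["sem"], target["sem"])      # (B,K,H,W) logits vs (B,H,W) ids
    l_depth = F.l1_loss(pred["depth"], target["depth"])      # (B,1,H,W)
    l_edge = F.binary_cross_entropy_with_logits(pred["edge"], target["edge"])
    # Surface normals: penalize angular deviation between unit vectors.
    l_norm = (1 - F.cosine_similarity(pred["normal"], target["normal"], dim=1)).mean()
    return (w["sem"] * l_sem + w["depth"] * l_depth
            + w["edge"] * l_edge + w["normal"] * l_norm)
```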
5. Benchmark Performance Analysis
The performance of M2H is established through evaluations on NYUDv2, Hypersim, and Cityscapes:
| Dataset | Semantic mIoU (Δ vs SOTA) | Depth RMSE Reduction (%) | Frame Rate (FPS) |
|---|---|---|---|
| NYUDv2 | +3.4% | 13% | ~30 (RTX 3080) |
| Hypersim | Improved | 33% (vs Scale Depth-NK) | — |
| Cityscapes | +1.2 (vs SwinMTL) | ~3.5% | — |
Performance surpasses prior multi-task models and leading single-task baselines, and M2H is validated for deployment on hardware subject to memory and compute constraints (Udugama et al., 20 Oct 2025).
6. Real-World Validation and Applicability
In practical scenarios, M2H underpins 3D scene graph construction from monocular imagery—its predictions feed into spatial perception systems operating in dynamic environments, including those with additional sensory streams (e.g., IMU data from the ITC dataset). Enhanced spatial predictions facilitate accurate mapping, supporting use cases in autonomous navigation, robotics, and augmented reality.
7. Computational Efficiency and Design Principles
M2H’s efficiency is achieved via:
- Feature-map window partitioning in WMCA, which confines attention computation to localized regions and avoids the quadratic scaling cost of global attention (made explicit in the sketch after this list).
- GGFM’s lightweight global fusion.
- Parameter reduction and GFLOPs minimization in variants such as M2H-small, preserving predictive quality within restricted computational budgets.
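The scaling argument can be made explicit. For $N = HW$ tokens of channel width $C$, global self-attention is quadratic in $N$, while windowing caps each attention computation at $M^2$ tokens per task:

```latex
\underbrace{\mathcal{O}(N^{2}C)}_{\text{global attention}}
\;\longrightarrow\;
\frac{N}{M^{2}}\cdot\mathcal{O}\!\big((M^{2})^{2}C\big)
= \underbrace{\mathcal{O}(N M^{2} C)}_{\text{windowed attention}}
```

For a fixed window size $M$ the cost is linear in image area, which is what keeps WMCA tractable at real-time resolutions; cross-task concatenation raises tokens per window from $M^2$ to $4M^2$, only a constant factor.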
This principled design ensures that M2H provides scalable multi-task spatial perception for real-time edge deployment, maintaining robust accuracy and throughput.