V-RGBX denotes RGB-X multimodal vision in settings where a visible RGB observation is paired with an auxiliary modality X and, in the video setting, the paired streams are modeled jointly for perception, restoration, grounding, or editing. In the cited literature, X includes thermal infrared, depth, event data, polarimetric sensing, and intrinsic scene-property maps such as albedo, normal, material, and irradiance. The term is also used as the title of an end-to-end intrinsic-aware video editing framework that unifies video inverse rendering, photorealistic video synthesis from intrinsic representations, and keyframe-based intrinsic-conditioned editing (Tu et al., 2023, Orfaig et al., 5 May 2025, Dirik et al., 19 Apr 2025, Fang et al., 12 Dec 2025, Wu et al., 31 Jan 2026).
1. Scope and nomenclature
Within this literature, V-RGBX is not a single task. It is a family of formulations in which RGB is treated as the primary visible modality and X supplies complementary evidence under conditions where RGB degrades, such as low light, glare, fog, rain, blur, fast motion, or geometric ambiguity. Thermal is the most recurrent modality in the surveyed work because it is repeatedly motivated as relatively insensitive to adverse illumination; depth is used for geometry and occlusion; event data for high dynamics and motion blur; polarimetric sensing for material and reflection cues; and intrinsic maps for physically interpretable editing and rendering control (Tu et al., 2023, Zhao et al., 31 Dec 2025, Jha et al., 30 May 2025, Orfaig et al., 5 May 2025, Wu et al., 31 Jan 2026, Fang et al., 12 Dec 2025).
The literature also shows that V-RGBX spans both discriminative and generative paradigms. Discriminative instances include RGB-thermal video object detection, RGB-TIR visual grounding, RGB-X object detection, RGB-T single-object tracking, and RGBX multimodal grounding with MLLMs. Generative instances include joint RGB-X diffusion modeling for intrinsic decomposition and conditional generation, and intrinsic-aware video editing from structured scene channels (Tu et al., 2023, Zhao et al., 31 Dec 2025, Tang et al., 2022, Orfaig et al., 5 May 2025, Dirik et al., 19 Apr 2025, Fang et al., 12 Dec 2025, Wu et al., 31 Jan 2026).
A recurring technical premise is asymmetry: RGB often contributes richer texture, color, and semantics, while the auxiliary modality supplies robustness or physical structure when visible appearance is unreliable. This suggests that V-RGBX is not merely “RGB plus another channel,” but a modality-aware design problem in which fusion, calibration, and conditioning strategy determine whether complementarity is actually realized (Tu et al., 2023, Tang et al., 2022, Jha et al., 30 May 2025).
2. Representative task formulations
The published formulations cover low-level, mid-level, and high-level vision. In some cases the task remains classical detection or tracking with an added modality; in others the formulation is explicitly redefined around RGB-X inputs.
| Task | X modality | Representative formulation |
|---|---|---|
| Video object detection | Thermal | Paired RGB and thermal streams predict object categories and bounding boxes for each frame (Tu et al., 2023) |
| Visual grounding | Thermal infrared | An aligned RGB-TIR pair plus a referring expression predicts the referent's bounding box (Zhao et al., 31 Dec 2025) |
| Low-light enhancement | Thermal | Low-light RGB plus corresponding thermal predicts an enhanced RGB image (Jha et al., 30 May 2025) |
| RGB-X object detection | Depth, polarimetric, infrared | Aligned RGB and X predict 2D boxes and classes (Orfaig et al., 5 May 2025) |
| Multimodal grounding with MLLMs | Thermal, depth, event | Query, RGB/X templates, and RGB/X search images predict target bounding boxes in the search images (Wu et al., 31 Jan 2026) |
| Intrinsic generation and editing | Albedo, normal, depth/disparity, irradiance, material | Joint RGB-X generation, decomposition, conditional generation, and video editing (Dirik et al., 19 Apr 2025, Fang et al., 12 Dec 2025) |
In RGB-thermal video object detection, the detector receives paired RGB and thermal frames, extracts modality-specific features, aggregates local temporal information, fuses the modalities, and predicts categories and bounding boxes for the frame being processed. The paper explicitly positions this as RGBT VOD, an extension of conventional VOD in which paired RGB and thermal streams replace RGB-only video (Tu et al., 2023).
In RGB-TIR grounding, the visual input is replaced by an aligned RGB-TIR image pair, while the grounding objective remains localization of the referent described by a natural-language expression. RGBT-VGNet formulates this with separate RGB and TIR visual encoders, a frozen CLIP text encoder, cross-modal interaction, and a learnable regression token whose MLP predicts the final box (Zhao et al., 31 Dec 2025).
In intrinsic generative modeling, the formulation shifts from sensor complementarity to joint scene-factor modeling. PRISM represents a scene as a joint set of image modalities; the final implementation uses five in total: RGB, albedo, surface normal, depth represented as disparity and replicated to three channels, and diffuse irradiance. The model supports text-to-RGBX generation, RGB-to-X decomposition, and X-to-RGBX conditional generation (Dirik et al., 19 Apr 2025). V-RGBX video editing uses a related but video-specific formulation with two stages: inverse rendering of the input video into albedo, normal, material, and irradiance videos, and forward rendering conditioned on edited keyframes and an interleaved intrinsic-conditioning stream (Fang et al., 12 Dec 2025).
3. Fusion principles and architectural patterns
The dominant architectural pattern is middle fusion with modality-specific encoders followed by learned interaction. EINet is a dual-stream, multi-frame, middle-fusion detector built on YOLOX with Darknet53; RT-X Net extracts self-attended RGB and thermal features before multi-head cross-attention fusion; RGBX-DiffusionDet uses separate ResNet-50 and FPN branches for RGB and X, then fuses paired feature levels with DCR-CBAM and aggregates proposal features with DMLAB; and RGBT-VGNet encodes RGB and TIR separately, adapts them asymmetrically, refines them under language guidance, and fuses them through cross-attention (Tu et al., 2023, Jha et al., 30 May 2025, Orfaig et al., 5 May 2025, Zhao et al., 31 Dec 2025).
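The cross-attention step that recurs in these designs can be sketched as follows. This is a minimal single-head illustration, not the implementation of any cited model: the function names are hypothetical, learned query/key/value projections and multiple heads are omitted, and the choice of RGB tokens as queries with auxiliary tokens as keys and values is an assumption made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of token vectors."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


def fuse_rgb_with_x(rgb_tokens, x_tokens):
    """Assumed layout: RGB tokens query the auxiliary tokens, then a residual add."""
    attended = cross_attention(rgb_tokens, x_tokens, x_tokens)
    return [[r + a for r, a in zip(rt, at)] for rt, at in zip(rgb_tokens, attended)]


rgb_tokens = [[1.0, 0.0]]
x_tokens = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_rgb_with_x(rgb_tokens, x_tokens)
```

The residual addition keeps the RGB feature intact when the auxiliary stream contributes little, which is one plausible reading of why middle fusion is preferred over plain channel concatenation.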
A second recurrent pattern is asymmetric cross-modal interaction. EINet’s most distinctive mechanism is erasure-based interaction: rather than treating RGB and thermal symmetrically, it uses thermal to erase noisy RGB activations. The paper defines the standard SiLU and a negative variant, and uses inactive thermal features plus CBAM-derived foreground/background attention to suppress RGB background noise while retaining robustness through residual fusion. The design is explicitly one-directional: thermal helps erase RGB noise, but RGB does not erase thermal (Tu et al., 2023). RGBT-VGNet adopts a different asymmetry: because CLIP is pretrained on RGB, the thermal branch receives greater adaptation capacity through asymmetric LoRA ranks in its Asymmetric Modality Adaptation module (Zhao et al., 31 Dec 2025). RT-X Net likewise states that thermal-induced feature maps refine and guide the RGB feature space in nighttime enhancement (Jha et al., 30 May 2025).
Late fusion remains important when paired multimodal supervision is limited. In RGB-T tracking, the DFAT study compares pixel-level, feature-level, and decision-level fusion under a SiamRPN++ pipeline with a ResNet-50 backbone and concludes that decision-level fusion is strongest. The proposed strategy fuses RGB and TIR outputs before softmax, uses adaptive classification weights computed from positive response statistics, and combines this with a linear template update. The reported ordering on VOT-RGBT2019 is decision-level > feature-level > pixel-level (Tang et al., 2022).
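A rough sketch of decision-level fusion before softmax is given below. The weighting rule is an assumption chosen for illustration: each branch is weighted by the mean of its positive responses, which is only one possible reading of "positive response statistics" and is not claimed to be the DFAT formula. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def positive_mean(scores):
    """Mean of the strictly positive responses, or 0.0 if there are none."""
    positives = [s for s in scores if s > 0]
    return sum(positives) / len(positives) if positives else 0.0


def adaptive_weights(rgb_scores, tir_scores, eps=1e-8):
    """Assumed rule: weight each branch by its share of positive response mass."""
    rgb_stat = positive_mean(rgb_scores)
    tir_stat = positive_mean(tir_scores)
    total = rgb_stat + tir_stat
    if total < eps:
        return 0.5, 0.5  # fall back to equal weights when neither branch responds
    return rgb_stat / total, tir_stat / total


def fuse_before_softmax(rgb_scores, tir_scores):
    """Combine raw classification scores first, then normalize once."""
    w_rgb, w_tir = adaptive_weights(rgb_scores, tir_scores)
    fused = [w_rgb * r + w_tir * t for r, t in zip(rgb_scores, tir_scores)]
    return softmax(fused)


rgb_scores = [2.0, -1.0, 0.5]
tir_scores = [0.5, -0.5, 0.1]
probabilities = fuse_before_softmax(rgb_scores, tir_scores)
```

Fusing raw scores rather than per-branch probabilities lets a confident branch dominate the combined response instead of being flattened by its own softmax first.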
Generative V-RGBX uses different interaction mechanisms. PRISM concatenates latent tokens from all modalities into a single sequence for a shared Diffusion Transformer and conditions on arbitrary modality subsets by overriding conditioned modality tokens during diffusion (Dirik et al., 19 Apr 2025). V-RGBX video editing introduces an interleaved conditioning mechanism in which one intrinsic modality is sampled per frame and then disambiguated by a temporal-aware intrinsic embedding (Fang et al., 12 Dec 2025).
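Conditioning by token overriding can be illustrated with a toy sampling loop. This sketch assumes an inpainting-style scheme in which the tokens of conditioned modalities are reset to their known clean latents after every denoising step; whether PRISM uses clean or re-noised latents for the override is not stated here, and the function names and the toy `toy_step` denoiser are hypothetical.

```python
def joint_sequence(tokens, order):
    """Concatenate per-modality token lists into one sequence for a shared model."""
    return [t for name in order for t in tokens[name]]


def override_conditioned(tokens, clean_tokens, conditioned):
    """Replace tokens of conditioned modalities with their known clean latents."""
    return {
        name: list(clean_tokens[name]) if name in conditioned else list(values)
        for name, values in tokens.items()
    }


def sample_with_conditioning(init_tokens, clean_tokens, conditioned, step_fn, num_steps):
    """Run a denoising loop, re-imposing the conditioned modalities at each step."""
    tokens = override_conditioned(init_tokens, clean_tokens, conditioned)
    for _ in range(num_steps):
        tokens = step_fn(tokens)
        tokens = override_conditioned(tokens, clean_tokens, conditioned)
    return tokens


def toy_step(tokens):
    """Stand-in for a denoiser: halves every token value."""
    return {name: [0.5 * v for v in values] for name, values in tokens.items()}


result = sample_with_conditioning(
    init_tokens={"rgb": [1.0, 1.0], "depth": [4.0, 4.0]},
    clean_tokens={"depth": [0.5, 0.5]},
    conditioned={"depth"},
    step_fn=toy_step,
    num_steps=3,
)
sequence = joint_sequence(result, ["rgb", "depth"])
```

Because any subset of modalities can be placed in `conditioned`, the same loop covers decomposition (RGB conditioned), conditional generation (some X conditioned), and unconditional generation (empty set).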
At the reasoning level, RGBX-R1 does not describe a new low-level fusion backbone in the visible text. Its main contribution is instead a Visual Modality Chain-of-Thought generated by an Understand–Associate–Validate prompting strategy, followed by Cold-Start Supervised Fine-Tuning and Spatio-Temporal Reinforcement Fine-Tuning. This shifts V-RGBX from feature fusion toward modality-aware reasoning and sequential grounding (Wu et al., 31 Jan 2026).
4. Benchmarks and evaluation regimes
The benchmark landscape is heterogeneous and strongly task-dependent. VT-VOD50 is presented as the first dedicated benchmark for RGBT video object detection; it contains 50 pairs of RGB-thermal video sequences, totaling 9449 RGBT image pairs, collected in real traffic scenarios, with 38 paired sequences for training and 12 for testing. The object taxonomy contains seven road-scene classes—car, van, electromobile, person, bus, truck, bicycle—and evaluation uses AP50, AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05, FPS, and, in ablations, parameter count and GFLOPs (Tu et al., 2023).
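The threshold sweep behind these AP metrics can be sketched as follows. The example only shows box IoU and the ten thresholds from 0.5 to 0.95; a full AP computation additionally needs score-ranked matching and precision-recall integration, which is omitted. The function names and the `(x1, y1, x2, y2)` box layout are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred_box, gt_box):
    """Thresholds at which a prediction would count as a true positive."""
    overlap = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


half_overlap = matched_thresholds((0, 0, 10, 10), (0, 0, 10, 5))
full_overlap = matched_thresholds((0, 0, 10, 10), (0, 0, 10, 10))
```

A box that only just reaches IoU 0.5 contributes to AP50 but to none of the stricter thresholds, which is why AP is consistently lower than AP50 in the results reported later.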
RGBT-Ground is the first large-scale RGB-TIR visual grounding benchmark for complex real-world scenarios. It consists of 21,535 RGBT image pairs and 38,760 grounding instances derived from RefFLIR, RefM3FD, and RefMFAD. It includes scene-level labels over 13 scene types, environment-level labels for 4 illumination conditions and 4 weather conditions, object-level labels for size and occlusion, and three challenge-oriented test partitions: TestA for normal-size targets in normal/strong light, TestB for nighttime/weak-light/very-weak-light, and TestC for small-size targets. The reported metric is Acc@0.5 (Zhao et al., 31 Dec 2025).
For low-light enhancement, LLVIP is used with synthetic low-light generation from visible images paired with thermal, and V-TIEE is introduced as a real-world visible-thermal enhancement evaluation dataset. V-TIEE contains 50 indoor and outdoor scenes captured with co-located visible and thermal cameras combined through a gold dichroic mirror, with final spatial alignment refined using a homography. LLVIP evaluation uses PSNR and SSIM, while V-TIEE uses LPIPS and SSIM (Jha et al., 30 May 2025).
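For reference, PSNR follows the standard definition 10·log10(MAX² / MSE). The sketch below assumes images stored as nested lists of floats in [0, 1]; the function name and data layout are illustrative and not taken from the cited evaluation code.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    ref_pixels = [p for row in reference for p in row]
    est_pixels = [p for row in estimate for p in row]
    if len(ref_pixels) != len(est_pixels):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref_pixels, est_pixels)) / len(ref_pixels)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[0.0, 1.0]]
estimate = [[0.1, 0.9]]
score = psnr(reference, estimate)
```

A uniform error of 0.1 on a unit-range image gives an MSE of 0.01 and therefore about 20 dB, which helps calibrate how to read the 26 to 28 dB values reported for LLVIP.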
RGBX-DiffusionDet evaluates on three multimodal detection settings: KITTI RGB-Depth, an RGB-polarimetric dataset with added bounding-box annotations, and M3FD RGB-Infrared. The method assumes aligned paired 2D modalities; KITTI depth is projected and completed, the polarimetric measurements are encoded into an image-like representation, and M3FD infrared is replicated across three channels for backbone compatibility (Orfaig et al., 5 May 2025).
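Channel replication for backbone compatibility amounts to copying a single-channel map into each of the three input channels expected by an RGB-pretrained backbone. The sketch below assumes a height-by-width nested list as input and a channels-last output; the function name is hypothetical.

```python
def replicate_channels(image, channels=3):
    """Turn an H x W single-channel image into H x W x channels by copying values."""
    return [[[pixel] * channels for pixel in row] for row in image]


infrared = [[1, 2], [3, 4]]
replicated = replicate_channels(infrared)
```

The replicated input carries no new information, but it lets the auxiliary branch reuse pretrained three-channel weights without changing the first convolution.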
RGBX-Grounding extends the benchmark space to MLLM-based sequential grounding. It is built from VisEvent, DepthTrack, RGBD2022, RGBT234, and LasHeR, contains 7,432 samples and more than 58k images (the provided descriptions give both 58k and 59k), and defines two evaluation settings: modality-known and modality-unknown. The grounding task is formulated over a query, RGB/X templates, and RGB/X search images, and is evaluated using [email protected] averaged across frames (Wu et al., 31 Jan 2026).
For intrinsic generative video, V-RGBX is trained on an internal synthetic dataset rendered from 127 Evermotion interior scenes totaling 171K frames, then evaluated on 85 videos from unseen Evermotion scenes and 85 videos from RealEstate10K using PSNR, SSIM, LPIPS, FVD, and Smoothness from VBench (Fang et al., 12 Dec 2025). PRISM, although image-based rather than video-based, uses InteriorVerse, HyperSim, and an internal commercial RGB-only interior dataset to study joint RGB-X generation, decomposition, and conditional synthesis (Dirik et al., 19 Apr 2025).
5. Empirical findings and canonical lessons
A central empirical lesson is that auxiliary modality input helps most when used through modality-aware interaction rather than naive fusion. In RGBT video object detection, baseline RGB with Darknet53 reaches AP50 41.28 and AP 21.27, baseline thermal reaches AP50 27.73 and AP 11.93, and full EINet reaches AP50 46.32, AP 23.96, at 92.59 FPS; the RGB-only version without multimodal interaction reaches AP50 44.04, AP 22.55, at 204.2 FPS. The same study reports that naive multimodal modifications of DFF, FGFA, SELSA, Temporal ROI Align, and TransVOD++ cause performance drops, and that the chosen three-frame local window is the speed/accuracy sweet spot: a five-frame window reaches AP50 44.65 but drops to 112 FPS, from 204.2 FPS for the three-frame setting (Tu et al., 2023).
Visual grounding yields a parallel result. RGBT-Ground reports that multimodal RGB+TIR input improves over uni-modal RGB or TIR with an average gain of about 10% Acc@50, and RGBT-VGNet achieves strong results, particularly on hard subsets. On the RefM3FD test set it reaches 74.34 against 72.35 for the best RGB+TIR adapted baseline, and on TestB it reaches 81.93, exceeding the best RGB-only model by 8.91 and the best adapted RGB+TIR baseline by 2.50. The component ablation shows the centrality of modality adaptation: on the RefM3FD test set, the variant with neither AMA nor LAVS scores 57.57, AMA alone scores 72.53, and AMA plus LAVS reaches 74.34 (Zhao et al., 31 Dec 2025).
In nighttime enhancement, RT-X Net outperforms visible-only low-light baselines on both LLVIP and V-TIEE. On LLVIP it reaches 27.75 PSNR and 0.85 SSIM, compared with 26.59 and 0.79 for Retinexformer. On V-TIEE it reaches 0.12 LPIPS and 0.71 SSIM, compared with 0.14 and 0.66 for Retinexformer. Its ablation establishes a graded fusion effect: RGB-only self-attention gives 26.42 / 0.73, thermal concatenation gives 27.15 / 0.80, and cross-attention gives 27.75 / 0.85 (Jha et al., 30 May 2025).
RGBX-DiffusionDet shows that a modality-general encoder can improve a diffusion detector without changing decoder complexity. AP rises from 67.0 to 69.2 on KITTI RGB-D, from 52.8 to 54.2 on RGB-P, and from 54.1 to 58.1 on M3FD RGB-IR. On M3FD, the method reaches 58.1 AP and 88.8 AP50, exceeding the compared fusion methods in the table. Under RGB corruption in RGB-D, gains remain large, including 48.6 to 56.3 AP under black occlusion and 43.9 to 52.7 under salt-and-pepper noise (Orfaig et al., 5 May 2025).
Tracking studies reinforce the importance of fusion placement. DFAT reports, on VOT-RGBT2019, a best pixel-level EAO of 0.3481, a best feature-level EAO of 0.3788, and a best decision-level EAO of 0.3986. The method also reports RGB-only EAO 0.3189, TIR-only 0.2826, and RGB+TIR baseline 0.3433, indicating that multimodal benefit exists but depends on calibrated decision fusion (Tang et al., 2022).
Generative V-RGBX exhibits a different but related pattern: joint modeling improves consistency and controllability. PRISM’s full joint model outperforms its unimodal variants on HyperSim, and its albedo × irradiance reconstruction experiment gives RMSE 0.0849, PSNR 22.38, and LPIPS 0.15, compared with RMSE 0.1299, PSNR 19.94, and LPIPS 0.16 for a single-channel variant. It also achieves the best reported FID when conditioning on depth or normal maps on the pseudo-labeled ImageNet validation split from ControlVAR (Dirik et al., 19 Apr 2025). V-RGBX video editing reports large gains over RGBX and DiffusionRenderer in forward rendering: RGBX gives PSNR 16.53, SSIM 0.7154, LPIPS 0.2417, and FVD 1037.15; V-RGBX without reference gives 21.48, 0.7908, 0.2064, and 401.62; and full V-RGBX gives 22.42, 0.7952, 0.1930, and 367.89. In RGB→X→RGB cycle consistency, it reaches PSNR 22.57 and FVD 367.61 on Evermotion, and PSNR 17.88 and FVD 633.76 on RealEstate10K (Fang et al., 12 Dec 2025).
At the MLLM level, RGBX-R1 argues that box supervision alone does not sufficiently teach non-RGB modality understanding. On RGBX-Grounding under modality-known evaluation, RGBX-R1-7B reaches 46.53 average [email protected], compared with 31.04 for Qwen2.5-VL-7B-sft. In the X-only setting, Qwen2.5-VL-7B-sft reaches 12.08 total, while RGBX-R1 reaches 24.55. The full MuST reward outperforms variants without modality-understanding reward or with mean-IoU reward replacing the spatio-temporal term, and the stage-wise study shows 19.64 for the cold-start model, 23.19 for RL alone, 32.74 for conventional SFT+ST-RFT, and 46.53 for full RGBX-R1-7B (Wu et al., 31 Jan 2026).
6. Limitations, misconceptions, and open directions
A common misconception is that adding an auxiliary modality automatically improves performance. The cited evidence does not support this. In RGBT video object detection, naive multimodal injection degrades several mainstream VOD methods; in low-light enhancement, thermal channel concatenation is weaker than cross-attention; and in RGB-T tracking, handcrafted pixel fusion and learned feature fusion remain weaker than calibrated decision fusion (Tu et al., 2023, Jha et al., 30 May 2025, Tang et al., 2022). A plausible implication is that V-RGBX is less a question of modality count than of modality interaction.
A second misconception is that the auxiliary modality can replace RGB. The detection literature explicitly contradicts this: with Darknet53 on VT-VOD50, RGB-only strongly exceeds thermal-only, and the grounding literature reports that RGB-only dominates in strong or normal light while TIR-only degrades less in weak or very weak light. The recurring conclusion is complementarity rather than substitution (Tu et al., 2023, Zhao et al., 31 Dec 2025).
Most systems depend on paired and reasonably aligned modalities. EINet relies on manually aligned RGB-thermal frames; RT-X Net assumes co-located and approximately aligned RGB-T inputs refined by homography; RGBX-DiffusionDet assumes aligned paired 2D modalities; and RGBX-Grounding relies on spatially and temporally aligned RGB/X pairs from source tracking datasets (Tu et al., 2023, Jha et al., 30 May 2025, Orfaig et al., 5 May 2025, Wu et al., 31 Jan 2026). The lack of dedicated misalignment robustness studies is therefore a cross-cutting limitation.
Dataset scale and domain breadth remain uneven. VT-VOD50 contains 50 sequence pairs, and its authors explicitly state plans for a larger dataset with more than 500 pairs of RGBT videos and richer challenge annotations (Tu et al., 2023). V-TIEE contains only 50 scenes (Jha et al., 30 May 2025). RGBT-Ground is built from converted RGB-T detection datasets rather than native grounding collection, though it substantially enlarges the task space (Zhao et al., 31 Dec 2025). V-RGBX video editing is trained only on indoor synthetic data rendered from Evermotion scenes and explicitly notes limitations from one-modality-per-frame conditioning and WAN backbone scalability (Fang et al., 12 Dec 2025). RGBX-R1 studies only thermal, depth, and event modalities, and leaves low-level handling of non-RGB images under-specified in the visible text (Wu et al., 31 Jan 2026).
Open directions in the cited work point toward broader multimodal coverage and richer control. RGBT-Ground’s appendix suggests referring expression segmentation, VQA, and cross-modal retrieval (Zhao et al., 31 Dec 2025). RT-X Net states future work will reduce model complexity for video processing and optimize real-time applications such as autonomous driving and robotics (Jha et al., 30 May 2025). V-RGBX editing suggests that more complex multi-attribute edits and longer videos remain challenging under its current interleaved conditioning scheme (Fang et al., 12 Dec 2025). More broadly, the surveyed literature suggests two durable research trajectories: modality-aware reasoning that explicitly models reliability and complementarity, and joint RGB-X generative formulations that treat auxiliary channels as first-class structured representations rather than secondary side inputs (Dirik et al., 19 Apr 2025, Wu et al., 31 Jan 2026).