Synesthesia of Vehicles (SoV)
- Synesthesia of Vehicles (SoV) is a cross-modal machine learning paradigm that generates one sensor modality from another to enhance vehicular perception.
- It employs techniques like latent diffusion, self-supervised student–teacher models, and LLM-based fusion to achieve precise predictive synthesis.
- SoV improves safety and communications by offering proactive sensing, redundancy in sensor data, and effective V2I channel modeling.
Synesthesia of Vehicles (SoV) designates a family of machine learning frameworks that establish predictive or generative relationships between heterogeneous sensing modalities aboard vehicles, inspired by the neurological phenomenon of human synesthesia. In vehicular contexts, SoV encompasses methods that, for example, synthesize tactile road excitations from visual data, infer communications channel parameters from multi-modal sensor inputs, or localize vehicles solely from audio through mappings learned from co-occurring video. These approaches transcend traditional sensor fusion by targeting cross-modal generation, prediction, or inference, typically leveraging deep generative models, cross-modal alignment, and unsupervised or self-supervised learning regimes. SoV methodologies contribute to proactive vehicle perception, redundancy in safety-critical systems, and enable advanced vehicle-to-infrastructure communications in AI-native vehicular platforms (Wang et al., 2 Feb 2026, Huang et al., 18 Sep 2025, Gan et al., 2019).
1. Foundational Principles and Formal Definitions
The SoV paradigm generalizes the “Synesthesia of Machines” (SoM) principle, wherein multi-modal sensors (visual, tactile, auditory, RF) and communication channels are treated as mutually informative modalities, and models are learned to map between their latent representations. Formally, for a set of raw sensing modalities $\{x_m\}_{m=1}^{M}$, the goal is to learn shared latent embeddings $z_m = f_m(x_m)$, from which target modality outputs $\hat{y}$ (e.g., tactile excitations, multipath channel states) are decoded as $\hat{y} = g(z_1, \dots, z_M)$ (Huang et al., 18 Sep 2025).
This cross-modal mapping may be used for:
- Prospective generation of tactile signals from images (Wang et al., 2 Feb 2026)
- Channel state prediction for wireless communications from multi-modal sensor suites (Huang et al., 18 Sep 2025)
- Inferring visual scene understanding using only audio (Gan et al., 2019)
The operational objective is not simply to correlate heterogeneous inputs, but to enable “hallucination” or synthesis of one modality from another, closing perception–action loops far in advance of direct sensor contact or in the absence of physical sensors.
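The encoder–decoder structure above can be illustrated with a toy numpy sketch. The random projection weights stand in for learned encoders $f_m$ and decoder $g$; all names and dimensions are illustrative, not taken from the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality encoders f_m: project each raw modality into a shared
# 8-dimensional latent space (random weights stand in for learned ones).
def make_encoder(in_dim, latent_dim=8):
    W = rng.standard_normal((in_dim, latent_dim)) / np.sqrt(in_dim)
    return lambda x: np.tanh(x @ W)

# Decoder g: map the fused latent to the target modality (here, a
# 16-sample tactile excitation vector).
def make_decoder(latent_dim=8, out_dim=16):
    W = rng.standard_normal((latent_dim, out_dim)) / np.sqrt(latent_dim)
    return lambda z: z @ W

f_vision = make_encoder(in_dim=64)   # e.g., flattened image features
f_audio = make_encoder(in_dim=32)    # e.g., spectrogram features
g_tactile = make_decoder()

x_vision = rng.standard_normal(64)
x_audio = rng.standard_normal(32)

# Fuse the per-modality latents (simple averaging) and decode the
# target modality: y_hat = g(z_vision, z_audio).
z = 0.5 * (f_vision(x_vision) + f_audio(x_audio))
y_hat = g_tactile(z)
print(y_hat.shape)  # (16,)
```

In a trained SoV system the encoders and decoder are deep networks optimized jointly; the point here is only the data flow from heterogeneous inputs through a shared latent to a synthesized target modality.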
2. Core Methodologies and Alignment Techniques
2.1. Cross-Modal Spatiotemporal Alignment
High-fidelity SoV requires precise linking of heterogeneous data streams. The SoV framework for tactile-from-vision synthesis (Wang et al., 2 Feb 2026) implements a four-stage alignment:
- Key-frame extraction: Visual frames are sampled at 30 fps, each indexed by its capture time $t_k$.
- Road-segment marking: Pixels associated with the road are mapped to world coordinates relative to the vehicle location $p_k$ obtained from RTK GNSS.
- Temporal indexing: Forward speed $v$ is used to compute entry/exit times for each segment: $t_{\text{in}} = t_k + d_{\text{near}}/v$, $t_{\text{out}} = t_k + d_{\text{far}}/v$, where $d_{\text{near}}$ and $d_{\text{far}}$ are the distances from the vehicle to the segment's near and far edges.
- Tactile extraction and resampling: Raw accelerometer signals (500 Hz) are time-aligned to $[t_{\text{in}}, t_{\text{out}}]$ and interpolated to obtain fixed-length, speed-normalized tactile vectors matched to each key frame.
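The temporal-indexing and resampling stages can be sketched in a few lines of numpy. The segment distances, speed, and vector length below are hypothetical stand-ins, not values from the paper:

```python
import numpy as np

# Hypothetical inputs: a road segment whose near/far edges lie
# d_near / d_far metres ahead of the vehicle at key-frame time t_k,
# crossed at forward speed v.
t_k, v = 12.0, 10.0            # key-frame timestamp [s], speed [m/s]
d_near, d_far = 5.0, 9.0       # segment boundaries ahead [m]
t_in = t_k + d_near / v        # time the wheels enter the segment
t_out = t_k + d_far / v        # time the wheels exit the segment

# Raw 500 Hz accelerometer stream; white noise as a stand-in signal.
fs = 500.0
t_acc = np.arange(0.0, 20.0, 1.0 / fs)
acc = np.random.default_rng(1).standard_normal(t_acc.size)

# Slice the tactile signal to [t_in, t_out] and resample it to a fixed
# length, so vectors are comparable across speeds (speed-normalized).
mask = (t_acc >= t_in) & (t_acc <= t_out)
n_fixed = 256
t_fixed = np.linspace(t_in, t_out, n_fixed)
tactile_vec = np.interp(t_fixed, t_acc[mask], acc[mask])
print(tactile_vec.shape)  # (256,)
```

Resampling to a fixed length decouples the tactile representation from vehicle speed: a segment crossed quickly or slowly yields the same-sized vector aligned to the same key frame.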
2.2. Generative and Predictive Modeling
SoV implementations generally deploy advanced generative or predictive architectures:
- Latent Diffusion (VTSyn): For the vision-to-tactile task (Wang et al., 2 Feb 2026), VTSyn employs a latent-diffusion model with VAE encoding for tactile data, conditioned on ResNet-18 image features. The denoiser, a U-Net backbone, predicts the noise added in the forward diffusion process, optimizing a noise MSE loss $\mathcal{L} = \mathbb{E}_{z_0, \epsilon, t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert^2\right]$, where $c$ denotes the image conditioning.
- Self-supervised Student–Teacher Models: In the audio–vision context (Gan et al., 2019), a vision “teacher” (e.g., YOLOv2) provides pseudo-labels for an audio “student” network, trained with a combined loss $\mathcal{L} = \mathcal{L}_{\text{det}} + \lambda\,\mathcal{L}_{\text{align}}$ that pairs the object-detection objective with a cross-modal feature-alignment term pulling student audio features toward teacher visual features.
- Pretrained LLM-based Multimodal Fusion: In RF channel modeling (Huang et al., 18 Sep 2025), multimodal features (from RGB-D, LiDAR, radar) are fused and mapped into an LLM (LLaMA 3.2) using low-rank adaptation and prompt engineering. Decoders then produce LoS/NLoS classifications and regressions over channel parameters.
These architectures enable SoV systems to predict or synthesize target modalities with strong generalization and sample efficiency.
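The diffusion training objective used by architectures of the first kind can be made concrete with a toy numpy sketch. The 8-dimensional latent, schedule value, and random linear "denoiser" are illustrative stand-ins; in VTSyn the denoiser is a U-Net conditioned on image features:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy forward-diffusion step on a tactile latent z0 (stand-in for the
# VAE encoding of a tactile vector):
#   z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps
z0 = rng.standard_normal(8)
abar = 0.6                      # cumulative noise-schedule value at step t
eps = rng.standard_normal(8)    # the Gaussian noise to be predicted
z_t = np.sqrt(abar) * z0 + np.sqrt(1.0 - abar) * eps

# Stand-in "denoiser": a random linear map, used only to illustrate the
# training target (noise prediction), not a real network.
W = rng.standard_normal((8, 8)) / np.sqrt(8)
eps_hat = z_t @ W

# Noise-prediction MSE, the quantity minimized during training.
loss = float(np.mean((eps - eps_hat) ** 2))
print(loss >= 0.0)  # True
```

At inference time the trained denoiser is applied iteratively to pure noise, conditioned on the image features, to synthesize a tactile latent that the VAE decoder maps back to a vibration signal.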
3. Applications and Datasets
3.1. Proactive AV Perception via Visual–Tactile Synesthesia
The SoV platform described in (Wang et al., 2 Feb 2026) collected a real-vehicle, multi-modal dataset using a Geely Geometry E platform equipped with a ZED 2 stereo camera (30 fps), an intelligent tire with an ADXL375 accelerometer (500 Hz), and an RTK GNSS unit (20 Hz). The ≈10,000 paired samples span six road types (asphalt, concrete, brick, gravel, dirt, muddy) under both day and night conditions. Each sample provides a spatially and temporally aligned visual–tactile pair for model training and evaluation.
3.2. Multi-modal V2I Channel Generation
SynthSoM-V2I (Huang et al., 18 Sep 2025) provides 211,395 snapshots of RGB-D images, LiDAR, mmWave radar, and Sionna RT channel data (up to 6 multipath components, LoS/NLoS labels), sampled at 33.33 ms intervals across four scenarios (urban/suburban, low/high traffic, sub-6 GHz and 60 GHz bands). The dataset supports knowledge transfer and few-shot learning for channel modeling under diverse V2I conditions.
3.3. Audio-Visual Vehicle Tracking
The Auditory Vehicle Tracking dataset (Gan et al., 2019) includes 3,243 annotated clips (≈3 h) with stereo audio (48 kHz) and video (24 fps, 1280×720) from a smartphone camera and a Shure MV88 microphone. Camera pose is randomly varied; bounding boxes are annotated for quantitative evaluation. Key metrics include AP@0.5: 57.47% for the SoV-trained audio system vs. 79.54% for the vision oracle; robustness to low light is demonstrably superior for the audio model.
| SoV Context | Sensor Modalities | Target Outputs | Example Model |
|---|---|---|---|
| Tactile Synthesis | Camera, Accelerometer | Vertical excitation signals | VTSyn (Diffusion) |
| Channel Modeling | RGB-D, LiDAR, Radar | Multipath channel (power, delay) | LLM4MG (LLaMA) |
| Object Localization | Stereo Audio, Video | Image-plane bounding boxes | Audio Student |
4. Quantitative Evaluation and Comparative Performance
4.1. Generation and Downstream Classification
In the visual-tactile task (Wang et al., 2 Feb 2026), VTSyn achieves the lowest Fréchet Inception Distance (FID: 53.27) and highest spectral similarity (freq_ssim: 0.7865) among baselines, with robust downstream performance in road-type classification (Accuracy: 0.641, F1: 0.604). Frequency-domain overlap is near-perfect below 20 Hz, the critical vibration band for safety applications.
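The sub-20 Hz frequency-domain overlap can be checked directly from FFT power spectra. The sketch below uses synthetic signals (shared 5 Hz content, differing high-frequency content) rather than the paper's data, and a simple band-power ratio rather than the paper's freq_ssim metric:

```python
import numpy as np

fs = 500.0                       # tactile sampling rate [Hz]
t = np.arange(0.0, 2.0, 1.0 / fs)

# Synthetic "real" vs "generated" tactile signals that share their
# low-frequency (road-vibration) content but differ above 20 Hz.
real = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 60 * t)
gen = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 90 * t)

def band_power(x, fs, f_lo, f_hi):
    """Total spectral power of x within [f_lo, f_hi) Hz."""
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= f_lo) & (freqs < f_hi)
    return float(psd[band].sum())

# Ratio of sub-20 Hz band powers: near 1.0 when the critical
# vibration band matches between real and generated signals.
p_real = band_power(real, fs, 0.0, 20.0)
p_gen = band_power(gen, fs, 0.0, 20.0)
overlap = min(p_real, p_gen) / max(p_real, p_gen)
print(overlap > 0.99)  # True: sub-20 Hz content is identical here
```

Because the two signals differ only above 20 Hz, their band powers below 20 Hz coincide, mirroring the near-perfect low-band overlap reported for VTSyn.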
Comparative Results (Tactile Synthesis)
| Model | RMSE | FID | freq_ssim | Road Class F1 |
|---|---|---|---|---|
| CGAN | 0.0755 | 280.75 | 0.7693 | 0.172 |
| CVAE | 0.0740 | 950.37 | 0.3920 | 0.035 |
| DiffWave | 0.0879 | 78.20 | 0.7725 | 0.599 |
| VTSyn | 0.1115 | 53.27 | 0.7865 | 0.604 |
Channel Modeling
LLM4MG achieves LoS/NLoS classification at 92.76% accuracy (vs. <85% for non-SoV models). Channel-parameter NMSE for power/delay prediction reaches 0.099/0.032, supporting channel-capacity estimates accurate to 96.20% of the ray-tracing simulation. Baselines require full retraining to generalize; LLM4MG adapts with <1.6% of target-domain samples (Huang et al., 18 Sep 2025).
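The NMSE figures above follow the standard definition (prediction error normalized by the power of the ground truth). A minimal sketch, with illustrative multipath powers rather than the paper's data:

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean-squared error: E[(y - y_hat)^2] / E[y^2]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2) / np.mean(y_true ** 2))

# Illustrative multipath power predictions (arbitrary units): small
# relative errors on each path give a correspondingly small NMSE.
true_power = np.array([1.0, 0.5, 0.25, 0.125])
pred_power = np.array([0.95, 0.55, 0.24, 0.13])
print(round(nmse(true_power, pred_power), 4))  # 0.0039
```

Normalizing by the true signal power makes the metric scale-free, so NMSE values are comparable across paths, bands, and scenarios with very different absolute power levels.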
4.2. Audio-Only Localization
The Audio Student in (Gan et al., 2019) attains AP@0.5 = 57.47%, with further gains (AP@0.5 = 60.70%) after tracking post-processing, notably outperforming vision under adverse lighting (YOLOv2 AP@0.5 = 6.78%, StereoSoundNet = 30.88%). Feature alignment improves performance by ≈20 points over naive regression.
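The AP@0.5 operating point counts a predicted box as a true positive when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. A minimal sketch of that check, with hypothetical boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A heavily overlapping prediction clears the 0.5 threshold and would
# count as a true positive when computing AP@0.5.
gt = (100, 100, 200, 200)
pred = (120, 110, 210, 205)
print(iou(gt, pred) >= 0.5)  # True
```

Average precision then summarizes precision over recall across confidence thresholds; the 0.5 in AP@0.5 refers only to this IoU matching criterion.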
5. Implications for Vehicle Intelligence, Redundancy, and System Integration
SoV augments proactive perception, enabling vehicles to anticipate and adapt to environmental events before sensor contact (e.g., “seeing” vertical excitation from road geometry ahead by up to 20 m, or inferring RF fading events from multi-modal perception) (Wang et al., 2 Feb 2026, Huang et al., 18 Sep 2025). Synthetic streams can serve as virtual sensors for redundancy, online validation, or calibration of inertial estimators when hardware is absent or fails. In communications, SoV enhances channel-adaptive algorithms and supports design of robust, high-capacity vehicular links by providing fine-grained, sample-efficient multipath predictions conditioned on real-world sensor observations.
A plausible implication is that as vehicle sensor suites diversify, SoV can provide cross-modal continuity and resilience, allowing continuous operation under sensor-specific degradations (e.g., visual occlusions, microphone failures) and more generalizable transfer across new domains with minimal retraining.
6. Limitations, Open Challenges, and Future Directions
Current SoV frameworks exhibit several limitations:
- Environmental diversity: Datasets omit adverse weather conditions (rain, snow), off-camber geometry, or aggressive maneuvers (Wang et al., 2 Feb 2026).
- Alignment: Camera calibration errors and lack of explicit spatial transformation (beyond geometric segment selection) can induce misalignments.
- Noise and environment: Audio-based SoV models degrade in highly noisy or multi-vehicle scenarios; reverberation and platform noise are not explicitly modeled (Gan et al., 2019).
- Generalization: Expanding to new domains (e.g., off-road, slippery terrain, novel V2I settings) still requires strategic data collection and fine-tuning.
Future directions include:
- Extending real-vehicle datasets to encompass diverse weather, lighting, and terrain (Wang et al., 2 Feb 2026).
- Tightly integrating SoV-generated signals into predictive control stacks, with direct evaluation on energy, comfort, and safety benchmarks.
- Real-time, closed-loop adaptation by leveraging sparse ground-truth readings for online fine-tuning and validation.
- Generalizing to additional modalities (e.g., thermal, ultrasound) and richer multi-agent settings (e.g., vehicle–pedestrian, vehicle–bicycle interactions).
- Deeper cross-modal architectures enabling joint inference and mutual compensation, particularly under partial sensor failures or adversarial environments (Gan et al., 2019).
7. Related Research Paradigms and Broader Context
SoV is situated at the intersection of cross-modal machine learning, generative modeling, and intelligent transportation systems. Its core premise of “modal synthesis” offers a complement to sensor fusion, by creating robust, context-adaptive information streams for perception and control. SoV also directly supports requirements posed by AI-native 6G vehicular communications, where scale, fidelity, and adaptability of channel modeling exceed the capacity of traditional ray tracing or statistical models (Huang et al., 18 Sep 2025).
A key distinction from classical fusion approaches is SoV's focus on generative cross-modal prediction. For instance, it can supply “virtual” tactile data before actual wheel–road interaction, or compensate for camera failures with audio-based localization using learned synesthetic mappings.
While already proven effective in automotive domains, SoV concepts may plausibly extend to robotics, surveillance, or any cyber-physical system where redundancy, proactivity, and multi-modal reasoning are critical under uncertainty.