MindCube: Interactive Emotional Sonification
- MindCube is a compact interactive device that studies and modulates emotional states through embodied musical interaction with advanced sensor inputs.
- It integrates a diverse sensor suite with low-latency BLE streaming to drive both handcrafted modular synthesis mappings and AI-driven generative audio pipelines.
- The platform pairs real-time sensor data with two audio synthesis methods, serving as a testbed for hybrid sonification aimed at responsive emotion regulation.
The MindCube is a palm-sized (3.3 cm³) interactive device engineered to study and modulate emotional states through embodied musical interaction. Inspired by commercially available “fidget” cubes, it packs a dense array of motion, tactile, and haptic components into a form factor familiar from stress-relief tools. Its hardware foundation, real-time multi-sensor data streaming, and dual sonification engines (one relying on modular synthesis with hand-crafted mappings, the other on deep generative models steered by user input) position it as a versatile platform for investigating emotion regulation and responsive musical systems (Liu et al., 22 Jun 2025).
1. Hardware Architecture and Data Flow
The MindCube’s architecture is centered around a Nordic nRF52832 BLE System-on-Chip (ARM Cortex-M4 core) executing Arduino-based firmware. The embedded sensor suite includes:
- 9-DoF ICM-20948 IMU (3-axis accelerometer, gyroscope, and magnetometer, sampled via I²C)
- Four mechanical tactile switches (debounced electronically and arranged on a single face)
- Two-axis miniature joystick
- Thumb-wheel rolling disk (motion read via quadrature mouse-wheel encoder)
- Slide power switch, Li-Po charging port, and firmware programming port
- Integrated 100 mAh Li-Po battery and linear vibration motor (PWM-driven for haptic outputs)
Every 50 ms (20 Hz), the microcontroller aggregates all sensor signals, debounces/discretizes the state-change inputs, COBS-encodes the data packet, and transmits via BLE. A Python client establishes a host-side BLE connection, decodes incoming streams, and routes data to either the modular synthesis stack (VCV Rack) or to the AI sonification server for further processing (Liu et al., 22 Jun 2025).
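The COBS framing step mentioned above can be sketched as follows. This is a minimal, generic implementation of Consistent Overhead Byte Stuffing, not the MindCube firmware or client code; the function names `cobs_encode` and `cobs_decode` are illustrative, and the actual packet layout inside the frame is not specified in the source.

```python
def cobs_encode(data: bytes) -> bytes:
    """Encode data so the result contains no 0x00 bytes (no delimiter is appended)."""
    out = bytearray([0])      # placeholder for the first code byte
    code_idx = 0              # index of the code byte for the current block
    code = 1                  # distance from the code byte to the next zero
    for byte in data:
        if byte == 0:
            out[code_idx] = code
            code_idx = len(out)
            out.append(0)     # placeholder for the next code byte
            code = 1
        else:
            out.append(byte)
            code += 1
            if code == 0xFF:  # block is full: 254 non-zero bytes
                out[code_idx] = code
                code_idx = len(out)
                out.append(0)
                code = 1
    out[code_idx] = code
    return bytes(out)


def cobs_decode(encoded: bytes) -> bytes:
    """Invert cobs_encode; raises ValueError on a malformed frame."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        code = encoded[i]
        if code == 0:
            raise ValueError("zero code byte inside COBS frame")
        block = encoded[i + 1 : i + code]
        if len(block) != code - 1:
            raise ValueError("truncated COBS frame")
        out += block
        i += code
        # A code below 0xFF stands for a zero in the original data,
        # except when it closes the final block of the frame.
        if code < 0xFF and i < len(encoded):
            out.append(0)
    return bytes(out)


# Example: the zero byte in the payload is replaced by block-length codes.
frame = cobs_encode(bytes([0x11, 0x22, 0x00, 0x33]))
print(frame.hex())  # 0311220233
```

Because an encoded frame never contains 0x00, a sender can append a single zero byte as an unambiguous packet delimiter, which lets the receiver resynchronize after a dropped or corrupted packet.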
2. Modular Synthesis: Hand-Crafted Mapping
The non-AI sonification approach employs a Python BLE→TCP bridge, forwarding 20 Hz CSV-formatted packets to a custom C++ module in the VCV Rack ecosystem. Each sensor signal is normalized to the –5 V to +5 V “CV” range and is mapped—often with lightweight sensor fusion or elementary arithmetic—onto musical synthesis parameters.
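As a rough illustration of the normalization step, the sketch below parses one CSV packet and scales each field to the ±5 V range. The channel names, their order, and the raw value ranges in `CHANNEL_RANGES` are assumptions made for the example; the source does not document the packet layout.

```python
import math


def to_cv(value: float, lo: float, hi: float) -> float:
    """Linearly map value from [lo, hi] to -5 V .. +5 V, clamping out-of-range input."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clamped = min(max(value, lo), hi)
    return -5.0 + 10.0 * (clamped - lo) / (hi - lo)


# Hypothetical channel order and raw ranges (not taken from the source).
CHANNEL_RANGES = {
    "joy_x": (-1.0, 1.0),
    "joy_y": (-1.0, 1.0),
    "roll": (-math.pi, math.pi),
}


def parse_packet(line: str) -> dict[str, float]:
    """Turn one CSV line into a {channel: control voltage} mapping."""
    fields = [float(f) for f in line.strip().split(",")]
    if len(fields) != len(CHANNEL_RANGES):
        raise ValueError("unexpected number of fields in packet")
    return {
        name: to_cv(value, lo, hi)
        for (name, (lo, hi)), value in zip(CHANNEL_RANGES.items(), fields)
    }


cv = parse_packet("0.0,1.0,0.0\n")  # joystick centered in x, fully deflected in y
```

Clamping before scaling keeps a noisy or miscalibrated sensor from pushing a voltage outside the range that downstream modules expect.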
Mapping examples:
| Sensor/Input | Derived Musical Parameter |
|---|---|
| Roll angle θ(t), estimated with a complementary filter: θ_acc = arctan2(a_y, a_z); θ_gyro ← θ_gyro + ω_x Δt; θ = α·θ_gyro + (1–α)·θ_acc | Pitch modulation |
| θ remapped linearly | Filter cutoff frequency fc(t): fc(t) = f_min + (θ(t)/π)·(f_max–f_min) |
| Tilt φ(t) | LFO rate r_LFO(t) |
| Joystick j_x, j_y | Stereo pan: p(t) = tanh(β·j_x); modulation index: M(t) = M₀ + κ·j_y |
| Button i | Envelope gate: G_i(t) ∈ {0,1} |
| Wheel encoder N(t) | Step-sequencer advance: (step_prev + N(t)) mod N_steps |
Additional mappings include amplitude control and spatialization angle (Liu et al., 22 Jun 2025). The system affords low-latency (<10 ms), continuous control, and a “hands-on” workflow familiar to modular synthesizer users.
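The mappings in the table can be sketched in a few lines of Python. This is an illustrative transcription, not the authors' VCV Rack module. It assumes the common recursive form of the complementary filter, in which the gyro integration starts from the previous fused estimate. The constants (`alpha`, `beta`, `f_min`, `f_max`, `n_steps`) are placeholder values, and the cutoff is clamped to its range because the table's formula would otherwise go below `f_min` for negative angles.

```python
import math


def complementary_roll(theta_prev: float, a_y: float, a_z: float,
                       omega_x: float, dt: float, alpha: float = 0.98) -> float:
    """Fuse the accelerometer roll angle with the integrated gyro rate."""
    theta_acc = math.atan2(a_y, a_z)        # drift-free but noisy
    theta_gyro = theta_prev + omega_x * dt  # smooth but drifting
    return alpha * theta_gyro + (1.0 - alpha) * theta_acc


def filter_cutoff(theta: float, f_min: float = 100.0, f_max: float = 8000.0) -> float:
    """fc = f_min + (theta / pi) * (f_max - f_min), clamped to [f_min, f_max]."""
    fc = f_min + (theta / math.pi) * (f_max - f_min)
    return min(max(fc, f_min), f_max)


def stereo_pan(j_x: float, beta: float = 2.0) -> float:
    """p = tanh(beta * j_x): saturates smoothly toward -1 (left) and +1 (right)."""
    return math.tanh(beta * j_x)


def advance_step(step_prev: int, n_ticks: int, n_steps: int = 16) -> int:
    """Advance the step sequencer by the encoder tick count, wrapping around."""
    return (step_prev + n_ticks) % n_steps


# One 50 ms update with the cube held still at a 45 degree roll.
theta = complementary_roll(0.0, a_y=1.0, a_z=1.0, omega_x=0.0, dt=0.05)
cutoff_hz = filter_cutoff(theta)
```

With `alpha` close to 1 the estimate follows the gyro over short time scales, while the accelerometer term slowly pulls it back toward the gravity-referenced angle and so cancels gyro drift.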
3. Generative AI Sonification Pipeline
The AI-driven sonification leverages a two-stage pipeline: variational autoencoding (VAE) of audio segments followed by latent diffusion modeling (LDM). The user’s micro-movements, encoded as multidimensional sensor time series, serve as real-time conditioning vectors in the generative chain.
Model Structure and Training
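- Audio Representation: Short clips from the Free Music Archive (8,000 tracks of 30 s each) are encoded into a 4D latent using a VAE (β-VAE style objective, 177 training epochs).
- VAE Loss: a β-VAE style objective, i.e., a reconstruction term plus a β-weighted KL divergence between the encoder's approximate posterior and the latent prior.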
- Latent Diffusion: Latent vectors (length 512) further drive a Latent Diffusion Model, trained for 700 epochs using ε-prediction loss. Classifier-Free Guidance (CFG) conditions the synthesis on the root-mean-square (RMS) “activity” level of the MindCube sensors.
- Diffusion Step: at each denoising step, CFG combines the model's conditional and unconditional noise predictions.
Guidance at generation is modulated by an activity level computed each second from sensor RMS: a weighted combination of the moving-window standard deviation of each sensor, with empirically chosen weights prioritizing IMU, joystick, button, and encoder signals.
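For reference, the objectives named in this subsection can be written in their standard textbook forms. These are the generic β-VAE, ε-prediction, and classifier-free guidance expressions, not formulas taken from the paper. The symbols are assumptions introduced here: $x$ is an audio clip, $z$ its latent, $z_t$ the noised latent at step $t$, $c$ the sensor-activity condition, $\varnothing$ the dropped (null) condition, and $s$ the guidance scale. The last line shows one plausible form of the activity level as a weighted sum of per-sensor standard deviations $\sigma_i$ with weights $w_i$; the exact expression used by the authors may differ.

```latex
\begin{align}
\mathcal{L}_{\mathrm{VAE}} &= \mathbb{E}_{q_\phi(z \mid x)}\big[-\log p_\theta(x \mid z)\big]
  + \beta \, D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \\
\mathcal{L}_{\mathrm{LDM}} &= \mathbb{E}_{z_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}
  \big[\, \lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2 \,\big] \\
\hat{\epsilon} &= \epsilon_\theta(z_t, t, \varnothing)
  + s \,\big(\epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \varnothing)\big) \\
c &= \sum_i w_i \, \sigma_i
\end{align}
```

Setting $s = 1$ recovers the purely conditional prediction, while larger $s$ pushes generation more strongly toward the conditioned activity level.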
Real-time Inference
A rolling one-second window of sensor data is transformed into the conditioning vector. Generation comprises a series of diffusion steps (with outpainting via tail seeding), and the resulting latent is decoded by the VAE into a ≈23 s audio segment. Latency on an M3-Max MacBook Pro is approximately 1.05 s per segment (0.90 s for diffusion, 0.15 s for decoding), imposing a practical sensor reading and musical update rate of roughly 0.95 Hz (Liu et al., 22 Jun 2025).
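A rolling-window activity estimate of this kind might look like the sketch below. It assumes the activity level is a weighted sum of per-channel standard deviations over the most recent second of 20 Hz samples; the class name `ActivityWindow`, the channel names, and the weight values are hypothetical and only illustrate the idea.

```python
from collections import deque
from statistics import pstdev


class ActivityWindow:
    """Keep the latest window of samples per channel and score overall activity."""

    def __init__(self, weights: dict[str, float], rate_hz: int = 20, seconds: float = 1.0):
        self.weights = weights
        window_len = int(rate_hz * seconds)
        # deque(maxlen=...) silently discards the oldest sample once full.
        self.buffers = {name: deque(maxlen=window_len) for name in weights}

    def push(self, sample: dict[str, float]) -> None:
        """Append one reading per channel."""
        for name, buf in self.buffers.items():
            buf.append(sample[name])

    def activity(self) -> float:
        """Weighted sum of per-channel standard deviations over the window."""
        total = 0.0
        for name, buf in self.buffers.items():
            if len(buf) >= 2:  # a single sample carries no variation
                total += self.weights[name] * pstdev(buf)
        return total


# Hypothetical weights that favor the IMU over the joystick.
window = ActivityWindow({"imu": 0.6, "joystick": 0.4})
for k in range(20):
    window.push({"imu": 1.0 if k % 2 == 0 else -1.0, "joystick": 0.0})
level = window.activity()
```

Since each segment takes about 1.05 s to generate (0.90 s + 0.15 s), `activity()` would be read roughly once per segment, which is where the ≈0.95 Hz update rate comes from (1 / 1.05 s ≈ 0.95 Hz).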
4. Implementation, Optimization, and System Integration
System integration workflow:
| Stage | Technology/Protocol | Purpose |
|---|---|---|
| MindCube → Host | BLE (COBS-encoded packets) | Wireless sensor data transmission |
| Host → Processing | Python bridge | Packet decoding and routing |
| AI Server | PyTorch (LDM + VAE decoder) | Audio generation and synthesis |
| Output | System audio | Real-time musical output |
Optimization specifics: AdamW optimizer with learning rates ≈1e–4 and minibatches of 32; 50% conditional dropout augments CFG robustness. “Outpainting” of audio generations is realized by seeding the head of each new diffusion run with the trailing latent samples of the preceding segment. The custom firmware and BLE protocol enable 20 Hz streaming for over 3 hours on a single 100 mAh charge, according to bench tests (Liu et al., 22 Jun 2025).
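Two of the ideas above, conditional dropout during training and tail seeding between segments, can be illustrated with plain Python lists standing in for latent tensors. This is a schematic sketch under stated assumptions, not the authors' PyTorch code: the helper names and the `n_tail` parameter are hypothetical, and a real pipeline would also hold the seeded region fixed (or re-noise it to the current timestep) during denoising.

```python
import random


def maybe_drop_condition(cond, p_drop: float = 0.5, rng=random):
    """Training-time conditional dropout: replace the condition with None
    (the unconditional case) with probability p_drop."""
    return None if rng.random() < p_drop else cond


def seed_from_tail(prev_latent, new_noise, n_tail: int):
    """Return a copy of new_noise whose first n_tail frames are the last
    n_tail frames of prev_latent, so the new segment continues the old one."""
    if not 0 < n_tail <= min(len(prev_latent), len(new_noise)):
        raise ValueError("n_tail must fit inside both sequences")
    head = [list(frame) for frame in prev_latent[-n_tail:]]
    rest = [list(frame) for frame in new_noise[n_tail:]]
    return head + rest


# Toy example: three previous frames, four fresh noise frames, seed two.
previous = [[1.0], [2.0], [3.0]]
noise = [[9.0], [9.0], [9.0], [9.0]]
seeded = seed_from_tail(previous, noise, n_tail=2)  # [[2.0], [3.0], [9.0], [9.0]]
```

Dropping the condition for half of the training examples is what gives the model the unconditional prediction that classifier-free guidance needs at inference time.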
5. Evaluation, Observed Behaviors, and System Capabilities
Bench testing demonstrated system robustness: 20 Hz BLE streaming was sustained for over 3 hours per charge, non-AI sonification maintained sub-10 ms round-trip latency, and AI mappings yielded ≈1 s end-to-end latency perceived as musically responsive. Informal trials indicate:
- Users could intentionally steer the generative AI toward higher or lower RMS (energy) outputs by modulating the intensity of device manipulation.
- Hand-crafted mappings enabled precise, repeatable sculpting of synthesis parameters such as filter cutoff and rhythmic sequences.
Neither the modular nor the AI mapping alone produced reliable emotion regulation outcomes; the research suggests the greatest promise may lie in hybridizing both pipelines to combine immediacy with the richness of latent generative guidance (Liu et al., 22 Jun 2025).
6. Open Problems and Future Directions
Planned research directions for the MindCube platform include:
- Controlled user studies to correlate RMS sensor metrics with subjective or physiological emotional state (“RMS-emotion hypothesis”).
- Augmenting the conditioning signal with explicit affective labels from self-report or physiological sensors, pursuing truly emotion-aware generation.
- Investigating on-device model compression, such as quantized diffusion, to further reduce audio generation latency below 500 ms.
- Expanding hybrid sonification paradigms: integrating AI-driven modulations into modular synthesis environments (e.g., live injection into VCV Rack).
A plausible implication is that by combining the MindCube’s multimodal control streams with modern generative models, future instruments might achieve a higher degree of responsiveness to users’ felt internal states (Liu et al., 22 Jun 2025).