ROMA: Real-Time Omni-Multimodal Assistant
- ROMA is a unified real-time streaming system that synchronizes audio, video, and text to overcome temporal granularity mismatches.
- It features a decoupled speak head that autonomously determines when to initiate responses, supporting proactive alerts and reactive QAs.
- The two-stage curriculum training and chunked temporal alignment contribute to its state-of-the-art performance across 12 benchmarks.
ROMA (Real-time Omni-Multimodal Assistant) is a unified streaming audio-video-text system designed for both reactive and proactive interaction in real time. It addresses three stated obstacles in streaming omni-multimodal understanding: the temporal granularity mismatch between dense audio and discrete video frames, incomplete modality support in prior streaming systems, and the absence of an autonomous mechanism for deciding when to intervene during an unfolding stream. ROMA processes continuous inputs as synchronized multimodal units, introduces a decoupled speak head for response initiation, and is trained with a two-stage curriculum over curated streaming data. Experiments across 12 benchmarks report state-of-the-art performance on proactive tasks and competitive performance in reactive question answering (Tian et al., 15 Jan 2026).
1. Problem formulation and system scope
ROMA is defined around a unified setting in which a single model must both answer user queries and autonomously monitor live streams for alerting or narration. The problem is not limited to post hoc understanding of recorded clips. Instead, the model operates causally over a stream and must support online decision making under latency constraints. The paper frames the difficulty in terms of three concrete issues: audio is dense and continuous while video arrives as discrete frames; many existing systems are either video-centric or speech-centric and therefore not truly unified; and most systems remain reactive rather than deciding on their own when to speak (Tian et al., 15 Jan 2026).
The model’s target behaviors are organized into two modes. In reactive interaction, ROMA answers questions over streaming audio-video inputs. In proactive interaction, it produces event-driven alerts or real-time narration without waiting for an explicit query. This joint formulation is central to the system design, because timing and content generation are treated as separate but coordinated problems. The architecture therefore does not reduce streaming multimodal understanding to token generation alone; it also includes an explicit mechanism for response triggering.
A plausible implication is that ROMA treats streaming multimodal assistance as a control problem as much as a representation problem. The paper’s design choices—synchronized multimodal units, causal KV caching, and a separate timing head—are all organized around this premise rather than around offline captioning or standard multimodal QA.
2. Streaming representation and temporal alignment
ROMA builds on an omni-modal LLM backbone with Qwen2.5-Omni format compatibility, while the vision and audio encoders are frozen during fine-tuning (Tian et al., 15 Jan 2026). The stream is segmented into one-second units. Within each unit, video and audio are packed together on a shared timeline using special BOS/EOS markers: frames are sampled at 2 fps, each frame is resized so that its total pixel count is ≤ 65,536, and audio is chunked at 40 ms resolution.
The key alignment mechanism is Chunked TMRoPE (Time-aligned Multimodal RoPE). ROMA adapts Qwen2.5-Omni’s TMRoPE to streaming by assigning cumulative positional IDs across units. Video tokens inside a one-second unit share the same temporal ID, whereas audio tokens retain fine-grained temporal IDs at 40 ms steps. Writing $b_k$ for the base position ID of unit $k$, the conceptual positional scheme is
$$t^{\mathrm{vid}}_{k} = b_k$$
for a video token in unit $k$, and
$$t^{\mathrm{aud}}_{k,j} = b_k + j$$
for audio token $j$ within the same unit, where $j$ counts 40 ms steps from the start of the unit (Tian et al., 15 Jan 2026). Boundary alignment is enforced by giving vision_bos and audio_bos the same base position ID, after which subsequent units continue the global timeline from the previous unit’s maximum.
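The chunked position scheme described above can be illustrated with a small sketch. It assumes that each unit holds 25 audio steps (1000 ms / 40 ms), that audio IDs increase by one per step from the shared base, and that the next unit starts at the previous maximum plus one; the `+ 1` offset and all function names are illustrative assumptions, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import List

AUDIO_STEP_MS = 40                                 # audio resolution stated in the text
UNIT_MS = 1000                                     # one-second streaming unit
AUDIO_STEPS_PER_UNIT = UNIT_MS // AUDIO_STEP_MS    # 25 audio steps per unit


@dataclass
class UnitPositions:
    base: int               # shared ID for vision_bos and audio_bos
    video_ids: List[int]    # every video token reuses the base ID
    audio_ids: List[int]    # one ID per 40 ms audio step


def assign_unit_positions(base: int, n_video_tokens: int) -> UnitPositions:
    """Assign temporal IDs for one unit (hypothetical helper)."""
    video_ids = [base] * n_video_tokens
    audio_ids = [base + j for j in range(AUDIO_STEPS_PER_UNIT)]
    return UnitPositions(base, video_ids, audio_ids)


def assign_stream_positions(video_token_counts: List[int], start: int = 0) -> List[UnitPositions]:
    """Continue the global timeline across units."""
    units = []
    base = start
    for n_video in video_token_counts:
        unit = assign_unit_positions(base, n_video)
        units.append(unit)
        # Assumption: the next unit begins one step after the previous maximum.
        base = max(unit.video_ids + unit.audio_ids) + 1
    return units


units = assign_stream_positions([4, 4])
print(units[1].base, units[1].audio_ids[:3])
```

In this sketch the video tokens of a unit collapse onto one temporal ID while audio keeps its 40 ms ordering, which is the granularity split the text describes.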
This representation is intended to preserve cross-modal correspondence despite granularity mismatch. Multi-frame video is temporally aggregated by the vision encoder, while audio retains its finer temporal structure. A persistent KV cache maintains stream history, and at each step the model encodes only the current unit while attending to cached past states. The paper states that this preserves causality, avoids future access, and enables low-latency operation (Tian et al., 15 Jan 2026).
3. Decoupled response initiation: the speak head
ROMA’s most distinctive architectural component is the speak head, a lightweight two-layer MLP that runs in parallel to the language-model head. Its purpose is to decide when the system should start speaking, independently of the problem of what it should say. The paper formulates this decoupling as a means of avoiding inference conflicts between timing and generation (Tian et al., 15 Jan 2026).
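The speak head receives a learnable weighted combination of the last $K$ hidden layers (the paper fixes a default $K$):
$$\tilde{h}_t = \sum_{i=1}^{K} \alpha_i \, h_t^{(L-K+i)},$$
and produces a trigger probability
$$p_t = \sigma\!\left(W_2\,\phi(W_1 \tilde{h}_t + b_1) + b_2\right),$$
where $\alpha_i$ are the learnable layer weights, $h_t^{(\ell)}$ is the hidden state of layer $\ell$ in an $L$-layer backbone, $\sigma$ is the sigmoid, $\phi$ is an activation such as GELU, and $p_t$ is the trigger probability at time step $t$, identified with the unit index (Tian et al., 15 Jan 2026). ROMA speaks when $p_t > \tau$; otherwise it remains silent and continues ingesting the stream. The threshold $\tau$ is task-dependent. The paper reports thresholds around 0.97–0.985 for narration and task-specific values for alert settings, sometimes combined with sliding-window smoothing to reduce transient fluctuations.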
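A minimal pure-Python sketch of this trigger path may help make the data flow concrete. It assumes softmax-normalized layer weights, a scalar sigmoid output, and a moving-average form of sliding-window smoothing; these choices, the weight shapes, and the names `combine_layers`, `speak_probability`, and `SpeakGate` are illustrative assumptions rather than the paper's implementation.

```python
import math
from collections import deque
from typing import List


def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x: float) -> float:
    # Numerically stable form for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def combine_layers(hidden_layers: List[List[float]], layer_logits: List[float]) -> List[float]:
    """Weighted sum of the last K hidden states (softmax weights are an assumption)."""
    m = max(layer_logits)
    exps = [math.exp(v - m) for v in layer_logits]
    weights = [e / sum(exps) for e in exps]
    dim = len(hidden_layers[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_layers)) for d in range(dim)]


def speak_probability(h, W1, b1, w2, b2) -> float:
    """Two-layer MLP: GELU hidden layer, then a sigmoid over a scalar logit."""
    hidden = [gelu(sum(wi * hi for wi, hi in zip(row, h)) + b) for row, b in zip(W1, b1)]
    return sigmoid(sum(w * z for w, z in zip(w2, hidden)) + b2)


class SpeakGate:
    """Threshold on a moving average of recent probabilities (assumed smoothing)."""

    def __init__(self, threshold: float, window: int = 1):
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def step(self, p: float) -> bool:
        self.recent.append(p)
        smoothed = sum(self.recent) / len(self.recent)
        return smoothed > self.threshold


gate = SpeakGate(threshold=0.97, window=3)
decisions = [gate.step(p) for p in [0.99, 0.5, 0.99, 0.99, 0.99]]
print(decisions)
```

With a window of 3, a single low probability suppresses triggering until it leaves the window, which is the kind of transient damping that smoothing is meant to provide.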
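Timing supervision is framed as binary classification over stream time steps. With ground-truth trigger labels $y_t \in \{0, 1\}$ and predicted trigger probabilities $p_t$ over $T$ time steps, the timing loss is a weighted BCE:
$$\mathcal{L}_{\text{speak}} = -\frac{1}{T}\sum_{t=1}^{T}\left[ w_{+}\, y_t \log p_t + (1 - y_t)\log(1 - p_t) \right].$$
The positive weight $w_{+}$ is used to mitigate label sparsity, and the paper reports a single default value for it in most experiments (Tian et al., 15 Jan 2026).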
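As a worked illustration of the weighted BCE, the sketch below computes the loss for a short sequence of trigger probabilities. Averaging over time steps, the clipping constant `eps`, and the `pos_weight` values used in the example are assumptions chosen for illustration; the paper's default positive weight is not reproduced here.

```python
import math
from typing import Sequence


def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 pos_weight: float, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy with positive (trigger) steps up-weighted."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # clip to keep the logs finite
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# One trigger step among mostly silent steps: raising pos_weight increases
# the penalty for missing the sparse positive label.
probs = [0.1, 0.2, 0.3, 0.1]
labels = [0, 0, 1, 0]
print(weighted_bce(probs, labels, pos_weight=1.0))
print(weighted_bce(probs, labels, pos_weight=5.0))
```

Because trigger steps are rare in a stream, an unweighted loss can be minimized by predicting silence everywhere; the positive weight counteracts that tendency.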
This design separates temporal triggering from token generation rather than trying to infer both from a single decoding signal. The ablation evidence is correspondingly strong. On dynamic proactive alert tasks, removing the speak head reduces StreamingBench PO from 53.60 to 12.00, OVO-Bench REC from 33.81 to 6.46, and OVO-Bench CRR from 35.42 to 0.00. On YouCook2 narration, removing the speak head reduces F1 from 35.21 to 9.25, which the paper presents as demonstrating the necessity of decoupled timing (Tian et al., 15 Jan 2026).
4. Training data and curriculum
ROMA is trained on a curated streaming corpus that unifies proactive and reactive formats. The paper reports three principal data sources: Online proactive (27K), Online narration (109K), and Reactive QA (540K) (Tian et al., 15 Jan 2026). Online proactive data is built from DiDeMo, OOPS, and Charades-STA, reformulated into prompts such as “Alert me when [event] happens.” Online narration data comes from MMDuetIT, COIN, YouCook2, and ActivityNet, with training configured to generate only at segment transitions rather than as dense captioning. Reactive QA draws from InternVid, CogStream, and other datasets including Egoplan, AVQA, TimeChat-Online, and ViSpeak. To unify modality, text queries are synthesized into speech and aligned with streaming units.
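Training proceeds in two stages. Stage 1 (Streaming template alignment) adapts the model to the multimodal unit format using reactive QA data. For an answer token sequence $a = (a_1, \dots, a_N)$, the language-model loss is
$$\mathcal{L}_{\text{LM}} = -\sum_{n=1}^{N} \log P_\theta\!\left(a_n \mid a_{<n}, x\right),$$
where $x$ denotes the streamed multimodal context and query. All encoders are frozen; only the remaining parameters $\theta$ are fine-tuned (Tian et al., 15 Jan 2026).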
Stage 2 (Time-aware decision making) trains the speak head with the weighted BCE timing objective while mixing a small portion of Stage-1 QA samples to stabilize language generation. The joint objective is
$$\mathcal{L} = \mathcal{L}_{\text{speak}} + \lambda\, \mathcal{L}_{\text{LM}},$$
where $\mathcal{L}_{\text{speak}}$ is the weighted BCE timing loss, $\mathcal{L}_{\text{LM}}$ is the language-model loss computed only on the mixed QA samples, and $\lambda$ balances timing and generation (Tian et al., 15 Jan 2026). Proactive samples are formatted as multi-turn dialogues so that multiple triggers can occur in a single stream.
The implementation details are unusually explicit. Training uses LLaMA-Factory, sequence length 32K, 32 H20 GPUs, and global batch size 512 (Tian et al., 15 Jan 2026). These details place ROMA within the current lineage of LLM-based multimodal fine-tuning rather than within bespoke streaming architectures that retrain encoders or build specialized memory modules from scratch.
5. Evaluation suite, protocols, and empirical results
A substantial part of ROMA’s contribution is evaluative standardization. The paper reorganizes fragmented streaming benchmarks into two settings—proactive streaming interaction and reactive QA—so that alerting, narration, and QA can be compared under a unified protocol (Tian et al., 15 Jan 2026).
| Setting | Benchmarks | Metrics |
|---|---|---|
| Proactive alert, static | QVHighlights; Charades-STA | mAP, HIT@1; R@0.5, R@0.7 |
| Proactive alert, dynamic | OmniMMI (PA); StreamingBench (PO); OVO-Bench (CRR, REC) | success if trigger time falls within ground-truth interval |
| Real-time narration | YouCook2; OVO-Bench SSR | F1, BERTScore, GPT-4o-based scoring |
| Reactive QA | OVO-Bench; StreamingBench; Video-MME (no subtitles); EgoSchema | accuracy; GPT-4o exact-match protocol |
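The success criterion listed for dynamic proactive alerts can be sketched as a simple interval check. The sketch assumes one trigger time per sample (or `None` when the model never speaks), inclusive interval boundaries, and a percentage score; the function names and example values are hypothetical and do not come from any benchmark's official scorer.

```python
from typing import Optional, Sequence, Tuple

Interval = Tuple[float, float]


def is_success(trigger_time: Optional[float], interval: Interval) -> bool:
    """True when the trigger falls inside the ground-truth interval (inclusive, assumed)."""
    if trigger_time is None:          # the model stayed silent
        return False
    start, end = interval
    return start <= trigger_time <= end


def success_rate(trigger_times: Sequence[Optional[float]],
                 intervals: Sequence[Interval]) -> float:
    """Percentage of samples whose trigger lands in the ground-truth interval."""
    hits = sum(is_success(t, iv) for t, iv in zip(trigger_times, intervals))
    return 100.0 * hits / len(intervals)


# Made-up example: two of four triggers land inside their intervals.
triggers = [3.0, None, 12.5, 7.0]
gt = [(2.0, 4.0), (0.0, 5.0), (10.0, 12.0), (7.0, 9.0)]
print(success_rate(triggers, gt))
```

Under this criterion a late trigger and a missing trigger are penalized equally, so the score reflects timing accuracy rather than the content of the response.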
On proactive alerts (static), ROMA reports 53.7 mAP / 53.0 HIT@1 on QVHighlights, outperforming MMDuet (31.3 / 49.6) and VTG-LLM (16.5 / 33.5). On Charades-STA, it reports 44.3 R@0.5 / 19.9 R@0.7, improving over MMDuet (42.4 / 18.0) (Tian et al., 15 Jan 2026).
On dynamic proactive alerts, the reported scores are 37.50 on OmniMMI PA, 53.60 on StreamingBench PO, 35.42 on OVO-Bench CRR, and 33.81 on OVO-Bench REC. On real-time narration, ROMA reaches F1 35.21, BERTScore 0.83, GPT-4o score 0.39 on YouCook2, and F1 14.54, BERTScore 0.83, GPT-4o score 0.42 on OVO-Bench SSR (Tian et al., 15 Jan 2026).
Reactive QA remains competitive rather than uniformly dominant. The paper states that ROMA leads in several OVO-Bench Real-time Visual Perception categories, including OCR 63.09, ATR 68.10, FPD 69.31, and OJR 58.15, and is strong in Backward Tracing, including EPM 55.89 and ASI 47.30. On full-modality QA with spoken queries, ROMA reports 33.30 on Video-MME (no subtitles), outperforming Qwen2.5-Omni (20.50), VITA-1.5 (28.56), and MiniCPM-o (19.37), while scoring 55.40 on EgoSchema, competitive with Qwen2.5-Omni 58.40 (Tian et al., 15 Jan 2026).
The throughput profile is also part of the system argument. Under the streaming protocol, the average encoding time per unit is 0.3697 s. The paper uses a pipelined approximation in which unit $k$ is processed while unit $k+1$ is being acquired, and generation is capped at 25 tokens per segment (~1 s), with longer outputs continued in subsequent units (Tian et al., 15 Jan 2026). This suggests that the model’s real-time claim is grounded not only in accuracy metrics but also in a concrete latency budget.
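The latency argument can be checked with a small simulation of the pipelined schedule. It assumes that each one-second unit becomes available at the end of its second, that encoding takes the reported average of 0.3697 s, and that units are encoded one at a time; generation time is deliberately left out, and the function name is illustrative.

```python
from typing import List

ENCODE_S = 0.3697   # average encoding time per unit, as reported in the text
UNIT_S = 1.0        # duration of one streaming unit


def pipeline_lags(n_units: int, encode_s: float = ENCODE_S, unit_s: float = UNIT_S) -> List[float]:
    """Delay between a unit finishing acquisition and finishing encoding."""
    lags = []
    encoder_free_at = 0.0
    for k in range(n_units):
        arrived = (k + 1) * unit_s               # unit k is fully acquired here
        start = max(arrived, encoder_free_at)    # wait if the encoder is still busy
        encoder_free_at = start + encode_s
        lags.append(encoder_free_at - arrived)
    return lags


print(pipeline_lags(5))                 # stays flat while encode_s < unit_s
print(pipeline_lags(5, encode_s=1.5))   # grows once encoding is slower than acquisition
```

The lag stays constant as long as per-unit encoding time is below the unit duration, which is the condition the reported 0.3697 s satisfies; once encoding is slower than acquisition, the backlog grows with every unit.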
6. Positioning, limitations, and significance
ROMA is situated against two neighboring lines of work. Relative to video-centric streaming LLMs such as VideoLLM-Online, Flash-VStream, and memory/KV-cache methods, it adds synchronized audio and a proactive timing mechanism rather than focusing only on long-horizon memory or efficiency. Relative to omni-modal models such as MiniCPM-o, Qwen2.5/3-Omni, and Stream-Omni, it explicitly models the proactive “when to speak” decision through the speak head and aligns streaming audio with video at chunk-level temporal positions (Tian et al., 15 Jan 2026). The paper also emphasizes that its evaluation suite standardizes proactive and reactive modes that prior benchmarks often treat inconsistently.
The stated limitations are equally important. ROMA remains susceptible to signal degradation and audio-video asynchrony. Extremely long-horizon dependencies (hours) are still constrained by context windows and memory budgets. The paper further notes unresolved efficiency vs quality trade-offs under strict resource limits, mentioning adaptive sampling and smarter KV management as directions for further optimization (Tian et al., 15 Jan 2026).
The ethical section is brief but specific: proactive monitoring poses privacy risks, the model is intended for research, and human oversight is necessary given possible hallucinations and biases (Tian et al., 15 Jan 2026). This is not ancillary, because a system whose central capability is autonomous intervention in live streams raises questions that do not arise in purely reactive multimodal QA.
In research terms, ROMA’s central claim is not simply that one more omni-modal assistant can process audio, video, and text. Rather, it shows that unified real-time omni-modal understanding can be organized around three coupled mechanisms: synchronized multimodal units for alignment, a decoupled speak head for timing, and a streaming curriculum that separately adapts format and responsiveness. The empirical pattern—state-of-the-art proactive results together with competitive reactive performance—suggests that explicit timing decisions are a structural requirement for real-world streaming assistants rather than a secondary engineering detail (Tian et al., 15 Jan 2026).