
SyncFusion Module for Multimodal Synchrony

Updated 2 January 2026
  • SyncFusion Module is a system that ensures precise audio-video-text synchrony through large-scale, multi-turn instruction dialogues and GPT-4o guided curation.
  • It applies rigorous alignment protocols, using overlapping temporal segments and human-in-the-loop verification to maintain logical and cross-modal consistency.
  • Integration of the module in training pipelines leads to measurable improvements in AV comprehension accuracy and synchrony alignment, validated by improved FVD and synchrony scores.

JavisInst-Omni denotes a large-scale, multimodal instruction dataset for spatio-temporally synchronized audio–video–text reasoning and generation, constructed to support the training and evaluation of models such as JavisGPT for joint audio-video (JAV) comprehension and generation tasks (Liu et al., 28 Dec 2025). JavisInst-Omni comprises over 200,000 multi-turn dialogue trajectories (roughly 600,000 single-turn samples) spanning single- and multi-turn audio, video, and caption-based tasks. The dataset is curated via GPT-4o prompt generation and extensive human-in-the-loop verification to ensure logical correctness, cross-modal relevance, and alignment.

1. Dataset Scope and Modalities

JavisInst-Omni captures a comprehensive spectrum of multimodal instruction types across audio, video, image, and text. The main annotation types are:

  • Audio + Text QA (55 K samples)
  • Video + Text QA (60 K samples)
  • Image + Text QA (20 K samples)
  • Audio-Video + Text QA (95 K samples)
  • Audio-Video Captioning (20 K samples)
  • Text→Audio-Video Generation (150 K samples)

These annotation types are organized into two subsets: JavisInst-Und (synchrony-aware AV-QA; 110 K samples) and JavisInst-Gen (AV generation and multi-turn dialogues; 90 K samples).

Distinct categories encompass single-turn QA (unimodal and joint), instruction-guided generation, and composite multi-turn dialogues, supporting entity-, relation-, and global-level synchrony comprehension. Tasks span entity alignment, temporal/causal reasoning, AV extension/editing, and conditional generation, thereby enabling a high-entropy, diverse testbed for spatio-temporal multimodal modeling.
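
To illustrate how roughly 200 K multi-turn trajectories expand to about 600 K single-turn samples, the following is a minimal sketch that flattens one dialogue trajectory into per-turn training examples. The field names used here (dialogue, instruction, response) are hypothetical and do not reflect the released schema.

# Illustrative only: flatten a multi-turn dialogue trajectory into
# single-turn (context, instruction, response) samples. Field names
# are hypothetical, not the released JavisInst-Omni schema.
from typing import Dict, List


def flatten_trajectory(trajectory: Dict) -> List[Dict]:
    """Expand one multi-turn dialogue into per-turn training samples."""
    samples = []
    context = []  # accumulated dialogue history
    for turn in trajectory["dialogue"]:
        samples.append({
            "id": f"{trajectory['id']}_turn{len(samples)}",
            "context": list(context),          # all turns seen so far
            "instruction": turn["user"],       # current user request
            "response": turn["agent"],         # target model output
            "modality_inputs": trajectory.get("modality_inputs", {}),
        })
        context.append(turn)
    return samples


# Example: a 3-turn trajectory yields 3 single-turn samples, consistent
# with ~200 K trajectories expanding to ~600 K single-turn samples.
demo = {
    "id": "traj_0001",
    "modality_inputs": {"video_url": ".../vid.mp4", "audio_url": ".../aud.wav"},
    "dialogue": [
        {"user": "What instrument is playing?", "agent": "A piano."},
        {"user": "Does it start before the scene changes?", "agent": "Yes."},
        {"user": "Generate a similar clip with violin.", "agent": "<av_output>"},
    ],
}
print(len(flatten_trajectory(demo)))  # -> 3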

2. Construction Pipeline and Verification

The dataset construction leverages the following curation protocol:

  • Source Data: Generation and captioning pairs are primarily sourced from TAVGBench (1.5 M captioned AV clips). QA samples are adapted from VideoLLaMA2, LLaVA-Video-178K, LLaVA-OneVision, and InstV2V (with FoleyCrafter-synthesized audio).
  • Synchrony Preservation: AV pairs are kept temporally aligned; AV-extension samples are validated through overlapping segment sampling with a 1–2 s overlap (see the sketch after this list).
  • GPT-4o Generation: Every instruction and multi-turn scenario is synthesized via few-shot GPT-4o templates, spanning 10 QA and 11 AV-generation types (approx. 3,000 instruction templates for text-to-AV tasks, with 20% additional paraphrasing by GPT-4o-mini).
  • Human-in-the-Loop: ≥95% of outputs are spot-checked for logic, modality match, and synchrony. Invalid or ambiguous cases are either rejected or re-generated, ensuring high data fidelity.
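
A minimal sketch of the overlapping-segment protocol for AV-extension pairs follows. The 1–2 s overlap comes from the description above; the segment length and random sampling policy are illustrative assumptions.

# Sketch of overlapping temporal segment sampling for AV-extension pairs.
# The 1-2 s overlap matches the protocol described above; segment length
# and the uniform sampling policy are illustrative assumptions.
import random


def sample_extension_segments(clip_duration: float,
                              segment_len: float = 4.0,
                              min_overlap: float = 1.0,
                              max_overlap: float = 2.0):
    """Return (context_segment, extension_segment) sharing a 1-2 s overlap."""
    overlap = random.uniform(min_overlap, max_overlap)
    # Latest start time so that both segments fit inside the clip.
    max_start = clip_duration - (2 * segment_len - overlap)
    if max_start < 0:
        raise ValueError("Clip too short for the requested segments.")
    t0 = random.uniform(0.0, max_start)
    context = (t0, t0 + segment_len)
    extension = (t0 + segment_len - overlap, t0 + 2 * segment_len - overlap)
    return context, extension


ctx, ext = sample_extension_segments(clip_duration=12.0)
# e.g. ctx = (2.1, 6.1), ext = (4.6, 8.6) -> 1.5 s shared region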

3. Annotation Schema

Representative JSON records illustrate the annotation schema:

JavisInst-Und (Synchrony-Aware QA):

{
  "id": "und_0000123",
  "modality_inputs": {
    "video_url": ".../vid123.mp4",
    "audio_url": ".../aud123.wav",
    "timestamps": [0.0, 4.0]
  },
  "category": "relation‐temporal",
  "instruction": "Does the honking sound occur before the pedestrian crosses the street?",
  "options": {"A":"Yes","B":"No"},
  "answer": "A",
  "explanation": "The horn blares at 1.2 s, the crossing happens at 1.8 s."
}
JavisInst-Gen (Proactive Generation):

{
  "id": "gen_0000456",
  "prior_dialogue": [
    {"speaker":"User", "text":"What kind of music do you like?"},
    {"speaker":"Agent","text":"I prefer calm piano tracks."}
  ],
  "instruction": "Now make a 5 s video of a painter at work with soft piano audio.",
  "modality_inputs": {},
  "target": {
    "video_url": ".../gen456.mp4",
    "audio_url": ".../gen456.wav"
  }
}
Annotations specify temporally aligned inputs (with timestamps when questions reference explicit events), dialogue context for multi-turn samples, granular “category” fields (e.g., “entity-alignment,” “relation-spatial”), and in-depth explanations for synchrony-aware queries.
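
A minimal loading-and-validation sketch for a JavisInst-Und record is given below, based on the example above. The required fields follow that example; the specific checks are assumptions rather than an official validator.

# Minimal validation sketch for a JavisInst-Und record, based on the
# example schema above. Required fields and checks are assumptions,
# not an official validator.
import json

REQUIRED_UND_FIELDS = {"id", "modality_inputs", "category",
                       "instruction", "options", "answer"}


def validate_und_record(raw: str) -> dict:
    record = json.loads(raw)
    missing = REQUIRED_UND_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # The answer key must point at one of the multiple-choice options.
    if record["answer"] not in record["options"]:
        raise ValueError("answer is not a valid option key")
    # Timestamps, when present, should be a [start, end] pair with start <= end.
    ts = record["modality_inputs"].get("timestamps")
    if ts is not None and (len(ts) != 2 or ts[0] > ts[1]):
        raise ValueError("timestamps must be [start, end] with start <= end")
    return record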

4. Quantitative Properties and Diversity

JavisInst-Omni achieves high scenario and instruction diversity by near-uniform sampling across rich subcategories.

Subset          Category                  Samples
JavisInst-Und   Existence                 11,000
...             ...                       ...
JavisInst-Und   Theme                     11,000
JavisInst-Gen   11 types (~8 K per type)  ~90,000

Instruction entropy for JavisInst-Und is $H \approx \log 10 \approx 2.3$ nats ($\approx 3.3$ bits) given ten subcategories and near-uniform $p_i$. This broad coverage ensures extensive exposure to cross-modal phenomena, ranging from fine-grained counting and alignment to global atmosphere attribution.
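
As a quick sanity check of this figure, the snippet below computes the Shannon entropy of a near-uniform distribution over ten subcategories (the per-category count of 11,000 is taken from the table above):

# Shannon entropy of ten near-uniform subcategories (counts from the table above).
import math

counts = [11_000] * 10                          # Existence ... Theme
probs = [c / sum(counts) for c in counts]

h_nats = -sum(p * math.log(p) for p in probs)   # ~ ln(10)   ~ 2.30 nats
h_bits = -sum(p * math.log2(p) for p in probs)  # ~ log2(10) ~ 3.32 bits
print(f"{h_nats:.2f} nats, {h_bits:.2f} bits")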

During model training, explicit alignment objectives are imposed:

$\mathcal{L}_{\mathrm{align}}^{s} = \|\hat{s} - s\|_2^2$

where $\hat{s}$ and $s$ denote the predicted and ground-truth spatio-temporal embeddings, reflecting JavisInst-Omni’s emphasis on temporal synchrony.
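
A minimal sketch of this objective as a batch-averaged squared-error loss over spatio-temporal embeddings follows; the PyTorch framing and the embedding shape are assumptions for illustration, as the source only specifies the squared L2 form.

# Squared-error alignment loss between predicted and ground-truth
# spatio-temporal embeddings, following the formula above. The embedding
# shape (batch, time, dim) and batch averaging are illustrative assumptions.
import torch


def alignment_loss(s_hat: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """L_align^s = ||s_hat - s||_2^2 per sample, averaged over the batch."""
    return ((s_hat - s) ** 2).flatten(1).sum(dim=1).mean()


s_hat = torch.randn(8, 16, 256)  # predicted embeddings
s = torch.randn(8, 16, 256)      # ground-truth embeddings
loss = alignment_loss(s_hat, s)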

5. Integration in Model Development and Benchmarks

JavisInst-Omni is the core dataset for the three-stage instruction-tuning pipeline in JavisGPT (a configuration sketch follows the list):

  1. MM-PreTrain: Audio/text and AV caption pre-alignment using JavisDiT (600 K audio–text, 1.5 M captions).
  2. AV-FineTune: Synchronized AV comprehension and generation with 720 K AV-labeled samples.
  3. MM-InstTune: Large-scale instruction tuning exclusively on JavisInst-Omni (≈600 K instructions).
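
A hypothetical configuration sketch of these three stages is shown below. The stage names and sample counts come from the list above; the keys and structure are assumptions, not the released training recipe.

# Hypothetical three-stage tuning configuration mirroring the list above.
# Stage names and sample counts are from the text; everything else is an
# illustrative assumption, not the released training configuration.
TRAINING_STAGES = [
    {
        "name": "MM-PreTrain",
        "goal": "audio/text and AV caption pre-alignment",
        "data": {"audio_text_pairs": 600_000, "av_captions": 1_500_000},
    },
    {
        "name": "AV-FineTune",
        "goal": "synchronized AV comprehension and generation",
        "data": {"av_labeled_samples": 720_000},
    },
    {
        "name": "MM-InstTune",
        "goal": "instruction tuning on JavisInst-Omni only",
        "data": {"instructions": 600_000},
    },
]

for stage in TRAINING_STAGES:
    print(stage["name"], stage["data"])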

Ablation analyses confirm JavisInst-Omni’s criticality: exclusion of JavisInst-Und reduces AV comprehension accuracy (AVQA, MU-AVQA) by ~1–2 points, while removal of JavisInst-Gen drops JavisScore synchrony on JavisBench-mini from 0.157 to ~0.135. Joint comprehension and generation training with JavisInst-Omni yields +2.3 FVD improvement and +0.018 JavisScore compared to modality-disjoint tuning.

6. Significance and Research Impact

JavisInst-Omni represents a pivotal resource for synchrony-aware MLLM research, providing instruction diversity, robust cross-modal alignment, and scalable annotation quality. Its scale—over 200 K dialogues / 600 K samples—uniquely supports both comprehension and generation benchmarking, particularly in complex, temporally synchronized settings. The dataset’s integration in model development directly improves synchrony alignment and overall multimodal generation quality, as quantified by FVD and JavisScore on both standard and mini-benchmark subsets (Liu et al., 28 Dec 2025).

A plausible implication is that the curated approach—combining large-scale GPT-4o prompting with human-in-the-loop verification—enables consistent logic and modality matching required for advanced instruction-following MLLMs.

7. Data Availability and Usage

JavisInst-Omni’s annotated formats, categorization scheme, and collection recipes support direct integration with existing MLLM pipelines. Its consistency with TAVGBench and alignment with widely used AV-QA, video, and audio datasets facilitate benchmarking, ablation, and generative research. The dataset’s instructional and scenario richness enables its adoption in both baseline and state-of-the-art system development for multimodal understanding and generation tasks (Liu et al., 28 Dec 2025).
