JavisInst-Omni: Multi-Modal Instruction Dataset
- JavisInst-Omni is a multimodal instruction dataset with over 200K dialogue trajectories supporting temporally coherent audio-video reasoning.
- It spans audio, video, and text, with dialogues synthesized via GPT-4o to cover fine-grained, synchrony-aware comprehension and generation tasks.
- The dataset underpins the JavisGPT framework, driving substantial performance improvements on multi-modal large language model benchmarks.
JavisInst-Omni is a large-scale, high-quality, multi-modal instruction dataset designed to advance the capabilities of unified multimodal LLMs (MLLMs) on temporally coherent joint audio-video (JAV) understanding and generation tasks. Developed and curated to support the JavisGPT framework, it provides over 200,000 GPT-4o-synthesized audio-video-text dialogues covering diverse, fine-grained, synchrony-aware comprehension, generation, and multi-turn conversational scenarios. Its design and construction methodology directly inform state-of-the-art approaches in audio-video multi-modal machine learning and benchmarking (Liu et al., 28 Dec 2025).
1. Dataset Composition and Scope
JavisInst-Omni comprises more than 200,000 multimodal dialogue trajectories, resulting in approximately 600,000 single-turn samples encompassing unimodal (audio-, video-, or image-only), bimodal (audio-text, video-text), and joint audio-video-text instruction types. The dataset includes:
- Audio + Text QA: 55,000 samples
- Video + Text QA: 60,000 samples
- Image + Text QA: 20,000 samples
- Audio-Video + Text QA: 95,000 samples
- Audio-Video captioning: 20,000 samples
- Text-to-Audio-Video generation: 150,000 samples
- JavisInst-Und (synchrony-aware AV-QA): 110,000 samples
- JavisInst-Gen (generation & multi-turn): 90,000 samples
Instruction types span comprehension tasks (entity/relation/global-level AV-QA, single-turn and joint), conditional and proactive AV generation (text→AV, V→A, A→V, etc.), captioning, and rich multi-turn dialogues, including composite QA and understand-then-generate sessions; the headline counts are tallied in the sketch below. This broad coverage supports robust evaluation and training for multimodal generative and reasoning models.
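To make the composition figures above easy to reuse in analysis scripts, the following sketch simply records them as Python dictionaries; the counts are the ones reported in this section, while the variable and key names are illustrative rather than taken from the dataset release.

```python
# Composition figures quoted above; key names are illustrative, not official.
single_turn_counts = {
    "audio_text_qa": 55_000,
    "video_text_qa": 60_000,
    "image_text_qa": 20_000,
    "audio_video_text_qa": 95_000,
    "audio_video_captioning": 20_000,
    "text_to_audio_video_generation": 150_000,
}

dialogue_splits = {
    "JavisInst-Und": 110_000,  # synchrony-aware AV-QA
    "JavisInst-Gen": 90_000,   # generation & multi-turn
}

# The two named splits together account for the ~200K dialogue trajectories.
print(sum(dialogue_splits.values()))  # 200000
```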
2. Data Construction, Curation, and Verification
Underlying audio-video pairs are sourced from benchmark datasets such as TAVGBench (1.5 million captioned AV clips) for pretraining and fine-tuning, with QA samples drawn from VideoLLaMA2 (95,000 AV-QA samples), LLaVA-Video-178K (60,000 video QA samples), and related datasets. Editing pairs leverage InstV2V with additional synthetic audio via FoleyCrafter. All clips maintain inherent audio-video synchronization; AV-extend samples use 1–2 s overlaps for temporal alignment.
Dialogues, instructions, and QA are synthesized by GPT-4o using a comprehensive set of 3,000+ templates spanning ten synchrony-aware QA and eleven AV generation types. Approximately 20% of text samples are paraphrased with GPT-4o-mini for linguistic diversity. Human verification is performed on at least 95% of the data, with annotators checking for logical correctness, modality matching, and explicit synchrony. Erroneous/ambiguous items are either eliminated or rewritten. This rigorous curation ensures both scale and fidelity in synchrony-aware, semantically diverse multi-modal data.
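As a rough illustration of the curation flow described in this section, the sketch below strings the stages together; `synthesize_qa`, `paraphrase`, and `human_verify` are hypothetical stand-ins for GPT-4o template-driven synthesis, GPT-4o-mini paraphrasing, and manual review, not functions shipped with JavisInst-Omni.

```python
import random

def curate_sample(av_pair, synthesize_qa, paraphrase, human_verify):
    """Hypothetical curation loop mirroring the pipeline described above."""
    sample = synthesize_qa(av_pair)            # template-guided GPT-4o synthesis
    if random.random() < 0.20:                 # ~20% paraphrased for linguistic diversity
        sample["instruction"] = paraphrase(sample["instruction"])
    verdict = human_verify(sample)             # checks logic, modality match, synchrony
    if verdict == "reject":
        return None                            # erroneous/ambiguous items dropped or rewritten
    return sample
```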
3. Annotation Schema and Task Typology
Each sample adopts a structured, schema-driven JSON format. Key fields include input modalities (audio/video URLs, first frames, timestamps), instruction text, answer or generation target, category/type tag, optional choices, and natural language explanations. Two schemas predominate:
- JavisInst-Und (synchrony-aware AV-QA): contains fields for video/audio URLs, queried time intervals, an explicit reasoning category (e.g., "relation-temporal"), multiple-choice options, the ground-truth answer, and a rationale.
- JavisInst-Gen (generation/multi-turn): carries prior dialogue, textual instructions for AV creation or editing, and target AV clip references.
Categories are explicitly labeled for synchrony demands: entity alignment, relation (spatial/temporal/causal), global atmosphere, emotion, theme, etc. Generation instructions employ both formal and colloquial phrasing, with conditional AV manipulations (e.g., video→audio, audio→video extensions) and multi-turn sequences.
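For concreteness, a JavisInst-Und record following the schema above could look like the sketch below; the field names and values are illustrative assumptions, not copied from the released JSON files.

```python
import json

# Hypothetical JavisInst-Und record; keys mirror the schema described above,
# but exact field names and values are illustrative assumptions.
und_sample = {
    "video_url": "https://example.com/clip_000123.mp4",
    "audio_url": "https://example.com/clip_000123.wav",
    "time_interval": [3.0, 7.5],      # queried segment in seconds
    "category": "relation-temporal",  # explicit reasoning category
    "instruction": "Does the dog start barking before or after the door slams?",
    "choices": ["before", "after", "at the same time"],
    "answer": "after",
    "rationale": "The door slam is heard around 4 s; barking begins around 5 s.",
}
print(json.dumps(und_sample, indent=2))
```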
4. Quantitative Characteristics and Diversity Measures
The dataset features balanced sampling across comprehension and generation sub-tasks. For JavisInst-Und, each of the ten categories holds approximately 11,000 examples, as tabulated below; JavisInst-Gen encompasses eleven instruction types (six conditional, three multi-turn, and two register variants).
| Und Category | Samples |
|---|---|
| Existence | 11,000 |
| Alignment | 11,000 |
| Grounding | 11,000 |
| Counting | 11,000 |
| Spatial relation | 11,000 |
| Temporal relation | 11,000 |
| Causal relation | 11,000 |
| Emotion | 11,000 |
| Atmosphere | 11,000 |
| Theme | 11,000 |
| Total | 110,000 |
Entropy of the instruction-type distribution is measured as $H = -\sum_i p_i \ln p_i$, with $p_i \approx 1/10$ for the near-uniform split, giving $H \approx \ln 10 \approx 2.30$ nats ($\approx 3.32$ bits) for JavisInst-Und. This suggests substantial scenario diversity. Generation types in JavisInst-Gen exhibit similar uniformity, supporting robust generalization and evaluation.
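The entropy figure can be verified directly from the category table; the short check below computes it in both nats and bits.

```python
import math

# Entropy of the JavisInst-Und category distribution (10 categories x 11,000 samples).
counts = [11_000] * 10
total = sum(counts)
probs = [c / total for c in counts]

h_nats = -sum(p * math.log(p) for p in probs)   # = ln(10)   ~ 2.30 nats
h_bits = -sum(p * math.log2(p) for p in probs)  # = log2(10) ~ 3.32 bits
print(f"{h_nats:.2f} nats, {h_bits:.2f} bits")
```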
5. Benchmarking Role and Empirical Impact
JavisInst-Omni underpins the three-stage JavisGPT training regime (see the schematic sketch after this list):
- MM-PreTrain: audio–text alignment, then alignment of audio-text representations with vision (600K audio-text pairs, 1.5M captions).
- AV-FineTune: synchronized AV comprehension and generation (360K captions, 360K triplets).
- MM-InstTune: instruction tuning on JavisInst-Omni (≈600K instructions).
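A compact way to view the curriculum is as an ordered list of stage configurations; the data volumes below are the figures quoted above, while the dictionary layout itself is an illustrative assumption rather than the actual JavisGPT training configuration.

```python
# Three-stage curriculum; volumes are the figures quoted above,
# and the config layout is an illustrative assumption.
training_stages = [
    {"stage": "MM-PreTrain", "data": {"audio_text_pairs": 600_000, "captions": 1_500_000}},
    {"stage": "AV-FineTune", "data": {"captions": 360_000, "av_triplets": 360_000}},
    {"stage": "MM-InstTune", "data": {"instructions": 600_000}},  # JavisInst-Omni
]
for stage in training_stages:
    print(stage["stage"], stage["data"])
```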
Ablation studies demonstrate critical dependence on JavisInst-Omni:
- Omitting JavisInst-Und reduces AV comprehension accuracy by ~1–2 points.
- Omitting JavisInst-Gen lowers JavisScore synchrony from 0.157 to ~0.135.
- Joint comprehension + generation training (with JavisInst-Omni) improves FVD by 2.3 and JavisScore by 0.018 relative to separate training streams.
A plausible implication is that the fine-grained synchrony-aware and multi-turn design of JavisInst-Omni directly translates into enhanced MLLM performance for temporally coherent, multi-modal understanding and generation (Liu et al., 28 Dec 2025).
6. Significance and Usage Outlook
JavisInst-Omni represents the first GPT-4o–curated instruction corpus at scale for cross-modal, synchrony-grounded audio-video reasoning and generation. Its balanced scenario-type coverage, rigorous curation, and explicit annotation of temporal, spatial, and causal alignment position it as a standard for both supervised model development and fair evaluation in audio-video MLLM research. Typical uses span benchmark evaluation, instruction tuning of emerging MLLMs, and serving as a foundation for further dataset expansion targeting even more complex, temporally entangled multi-modal reasoning workflows.