
JavisInst-Omni: Multi-Modal Instruction Dataset

Updated 2 January 2026
  • JavisInst-Omni is a multimodal instruction dataset with over 200K dialogue trajectories supporting temporally coherent audio-video reasoning.
  • It combines audio, video, and text, with dialogues synthesized via GPT-4o to cover fine-grained, synchrony-aware comprehension and generation tasks.
  • The dataset underpins the JavisGPT framework, driving substantial performance improvements in multi-modal large language model benchmarks.

JavisInst-Omni is a large-scale, high-quality multi-modal instruction dataset designed to advance the capabilities of unified multimodal LLMs (MLLMs) for temporally coherent joint audio-video (JAV) understanding and generation tasks. Developed and curated to support the JavisGPT framework, it provides over 200,000 GPT-4o-synthesized audio-video-text dialogues covering diverse, fine-grained, synchrony-aware comprehension, generation, and multi-turn conversational scenarios. Its design and curation methodology directly inform state-of-the-art approaches to audio-video multi-modal machine learning and benchmarking (Liu et al., 28 Dec 2025).

1. Dataset Composition and Scope

JavisInst-Omni comprises more than 200,000 multimodal dialogue trajectories, resulting in approximately 600,000 single-turn samples encompassing unimodal (audio-, video-, or image-only), bimodal (audio-text, video-text), and joint audio-video-text instruction types. The dataset includes:

  • Audio + Text QA: 55,000 samples
  • Video + Text QA: 60,000 samples
  • Image + Text QA: 20,000 samples
  • Audio-Video + Text QA: 95,000 samples
  • Audio-Video captioning: 20,000 samples
  • Text-to-Audio-Video generation: 150,000 samples
  • JavisInst-Und (synchrony-aware AV-QA): 110,000 samples
  • JavisInst-Gen (generation & multi-turn): 90,000 samples

Instruction types span comprehension tasks (entity/relation/global-level AV-QA, single-turn and joint), conditional and proactive AV generation (text→AV, V→A, A→V, etc.), captioning, and rich multi-turn dialogues, including composite QA or understand-then-generate sessions. This broad coverage ensures robust evaluation and training for multimodal generative and reasoning models.

2. Data Construction, Curation, and Verification

Underlying audio-video pairs are sourced from benchmark data such as TAVGBench (1.5 million captioned AV clips) for pretraining/fine-tuning, with QA samples drawn from VideoLLaMA2 (95,000 AV-QA), LLaVA-Video-178K (60,000 video QA), and related datasets. Editing pairs leverage InstV2V, with additional synthetic audio via FoleyCrafter. All clips maintain inherent audio-video synchronization; AV-extend samples use 1–2 s overlaps for temporal alignment.

Dialogues, instructions, and QA are synthesized by GPT-4o using a comprehensive set of 3,000+ templates spanning ten synchrony-aware QA and eleven AV generation types. Approximately 20% of text samples are paraphrased with GPT-4o-mini for linguistic diversity. Human verification is performed on at least 95% of the data, with annotators checking for logical correctness, modality matching, and explicit synchrony. Erroneous/ambiguous items are either eliminated or rewritten. This rigorous curation ensures both scale and fidelity in synchrony-aware, semantically diverse multi-modal data.
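
As a rough illustration of this template-driven synthesis and verification loop, the sketch below outlines one possible implementation in Python; the template text, the `call_gpt4o` helper, and the filtering logic are hypothetical stand-ins, not the authors' actual pipeline.

```python
import random

# Hypothetical template pool; the real corpus uses 3,000+ templates spanning
# ten synchrony-aware QA types and eleven AV generation types.
QA_TEMPLATES = {
    "relation-temporal": [
        "Does the {sound} occur before or after the {visual_event} in this clip?",
    ],
    # ... nine further synchrony-aware QA types would go here
}

def call_gpt4o(prompt: str) -> str:
    """Placeholder for a GPT-4o API call (assumed helper, not a real client)."""
    raise NotImplementedError

def synthesize_qa(av_caption: str, qa_type: str) -> dict:
    """Ask GPT-4o to turn an AV caption into a templated, synchrony-aware QA item."""
    template = random.choice(QA_TEMPLATES[qa_type])
    prompt = (
        f"Audio-video caption:\n{av_caption}\n\n"
        f"Write one {qa_type} question following this template:\n{template}\n"
        "Return the question, four answer choices, the correct answer, and a rationale."
    )
    return {"type": qa_type, "raw_response": call_gpt4o(prompt)}

def keep_sample(sample: dict) -> bool:
    """Stand-in for human verification (logic, modality match, explicit synchrony).

    In the described pipeline, at least 95% of items are checked and erroneous or
    ambiguous ones are removed or rewritten; here we simply accept everything.
    """
    return True
```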

3. Annotation Schema and Task Typology

Each sample adopts a structured, schema-driven JSON format. Key fields include input modalities (audio/video URLs, first frames, timestamps), instruction text, answer or generation target, category/type tag, optional choices, and natural language explanations. Two schemas predominate:

  • JavisInst-Und (synchrony-aware AV-QA): fields for video/audio URLs, the queried time interval, an explicit reasoning category (e.g., "relation-temporal"), multiple-choice options, the ground-truth answer, and a rationale.
  • JavisInst-Gen (generation/multi-turn): prior dialogue context, textual instructions for AV creation or editing, and target AV clip references.

Categories are explicitly labeled for synchrony demands: entity alignment, relation (spatial/temporal/causal), global atmosphere, emotion, theme, etc. Generation instructions employ both formal and colloquial phrasing, with conditional AV manipulations (e.g., video→audio, audio→video extensions) and multi-turn sequences.
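
To make the two schemas concrete, the following Python dictionaries sketch what individual records might look like; the field names and values are illustrative assumptions derived from the field descriptions above, not verbatim samples from the dataset.

```python
# Hypothetical JavisInst-Und record (synchrony-aware AV-QA); URLs and values
# are placeholders, not actual dataset entries.
und_sample = {
    "video_url": "https://example.com/clip_0001.mp4",
    "audio_url": "https://example.com/clip_0001.wav",
    "time_interval": [3.0, 7.5],              # queried segment, in seconds
    "category": "relation-temporal",          # explicit reasoning category
    "question": "Does the dog bark before or after it jumps off the couch?",
    "choices": ["Before", "After", "At the same time", "No bark is audible"],
    "answer": "After",
    "rationale": "The bark is heard about one second after the jump is seen.",
}

# Hypothetical JavisInst-Gen record (generation / multi-turn).
gen_sample = {
    "dialogue_history": [
        {"role": "user", "content": "Describe the clip."},
        {"role": "assistant", "content": "A street musician plays violin at dusk."},
    ],
    "instruction": "Extend the clip by two seconds, adding distant traffic noise.",
    "target_av": "https://example.com/clip_0001_extended.mp4",
}
```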

4. Quantitative Characteristics and Diversity Measures

The dataset features balanced sampling across comprehension and generation sub-tasks. For JavisInst-Und, each of ten categories holds approximately 11,000 examples (tabulated below). JavisInst-Gen encompasses eleven instruction types (six conditional, three multi-turn, and two register variants).

| Und Category | Samples |
|---|---|
| Existence | 11,000 |
| Alignment | 11,000 |
| Grounding | 11,000 |
| Counting | 11,000 |
| Spatial relation | 11,000 |
| Temporal relation | 11,000 |
| Causal relation | 11,000 |
| Emotion | 11,000 |
| Atmosphere | 11,000 |
| Theme | 11,000 |
| Total | 110,000 |

Entropy of instruction type is measured as $H = -\sum_{i=1}^{C} p_i \log p_i$ with $p_i \approx 1/C$ for a near-uniform split, giving $H \approx \log 10 \approx 2.3$ nats ($\approx 3.3$ bits) for JavisInst-Und. This suggests substantial scenario diversity. Generation types in JavisInst-Gen exhibit similar uniformity, supporting robust generalization and evaluation.
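
As a quick check of this arithmetic, the short Python snippet below computes the entropy of the JavisInst-Und category distribution from the table above (the category counts are the only inputs; everything else is standard library):

```python
import math

# Per-category sample counts for JavisInst-Und (from the table above).
counts = {
    "existence": 11_000, "alignment": 11_000, "grounding": 11_000,
    "counting": 11_000, "spatial relation": 11_000, "temporal relation": 11_000,
    "causal relation": 11_000, "emotion": 11_000, "atmosphere": 11_000,
    "theme": 11_000,
}

total = sum(counts.values())
probs = [c / total for c in counts.values()]

# Shannon entropy H = -sum(p_i * log p_i); a uniform split over 10 classes gives log(10).
h_nats = -sum(p * math.log(p) for p in probs)
h_bits = h_nats / math.log(2)

print(f"H = {h_nats:.3f} nats = {h_bits:.3f} bits")  # ~2.303 nats, ~3.322 bits
```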

5. Benchmarking Role and Empirical Impact

JavisInst-Omni underpins the three-stage JavisGPT training regime (a configuration sketch follows the list):

  1. MM-PreTrain: audio/text and audio–text/vision alignment (600 K audio-text, 1.5 M captions).
  2. AV-FineTune: synchronized AV comprehension and generation (360 K captions, 360 K triplets).
  3. MM-InstTune: instruction-tuning on JavisInst-Omni (≈600 K instructions).
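
For orientation only, the following Python sketch summarizes this staged schedule as a configuration object; the class, dataset labels, and objective strings are assumptions for illustration and do not reproduce the authors' actual training configuration.

```python
from dataclasses import dataclass

@dataclass
class TrainingStage:
    """One stage of the staged training schedule (illustrative only)."""
    name: str
    datasets: dict[str, int]   # dataset label -> approximate sample count
    objective: str

# Hypothetical summary of the three stages listed above.
JAVISGPT_SCHEDULE = [
    TrainingStage("MM-PreTrain",
                  {"audio_text_pairs": 600_000, "av_captions": 1_500_000},
                  "cross-modal alignment"),
    TrainingStage("AV-FineTune",
                  {"av_captions": 360_000, "av_triplets": 360_000},
                  "synchronized AV comprehension and generation"),
    TrainingStage("MM-InstTune",
                  {"javisinst_omni_instructions": 600_000},
                  "instruction following on JavisInst-Omni"),
]

if __name__ == "__main__":
    for stage in JAVISGPT_SCHEDULE:
        print(f"{stage.name}: ~{sum(stage.datasets.values()):,} samples "
              f"({stage.objective})")
```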

Ablation studies demonstrate critical dependence on JavisInst-Omni:

  • Omitting JavisInst-Und reduces AV comprehension accuracy by ~1–2 points.
  • Omitting JavisInst-Gen lowers JavisScore synchrony from 0.157 to ~0.135.
  • Joint comprehension + generation training with JavisInst-Omni improves FVD by 2.3 and JavisScore by 0.018 versus separate training streams.

A plausible implication is that the fine-grained synchrony-aware and multi-turn design of JavisInst-Omni directly translates into enhanced MLLM performance for temporally coherent, multi-modal understanding and generation (Liu et al., 28 Dec 2025).

6. Significance and Usage Outlook

JavisInst-Omni represents the first GPT-4o-curated instruction corpus at scale for cross-modal, synchrony-grounded audio-video reasoning and generation. Its balanced scenario-type coverage, rigorous curation, and explicit annotation of temporal, spatial, and causal alignment position it as a standard for both supervised model development and fair evaluation in audio-video MLLM research. Typical uses span benchmark evaluation, instruction tuning of emerging MLLMs, and serving as a foundation for further dataset expansion targeting even more complex, temporally entangled multi-modal reasoning workflows.
