OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

Published 9 Apr 2026 in cs.CV | (2604.08209v1)

Abstract: To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.


Authors (10)

Summary

The paper introduces a self-supervised RL post-training method that employs a proxy jigsaw task to restore the temporal order of shuffled video-audio clips.
It compares three modality orchestration strategies and finds that clip-level modality masking overcomes the bi-modal shortcut and improves benchmark performance.
Ablation studies highlight that rigorous data filtering and accuracy-dependent reward adjustments are critical for robust omni-modal reasoning.

Motivation and Context

The transition from unimodal (typically textual or visual) LLMs to omni-modal paradigms, where models must reason simultaneously over temporally and semantically entangled video and audio streams, presents formidable data and supervision challenges. While reinforcement learning (RL) post-training has advanced complex reasoning in LLMs, the lack of scalable, high-quality annotated omni-modal data fundamentally limits the transfer of these advances. "OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering" (2604.08209) addresses this directly, proposing a self-supervised RL post-training method for omni-modal models that exploits large volumes of unannotated video-audio data through a proxy task: chronological reordering of shuffled clips. The approach is reinforced by a modality orchestration framework that overcomes key limitations of naive multimodal proxy tasks, notably the bi-modal shortcut phenomenon.

Framework and Methodology

OmniJigsaw structures its self-supervised post-training objective around restoring the original temporal order of shuffled video-audio clips. The framework adapts the classic jigsaw permutation task to the omni-modal domain, deploying the following pipeline:

Data Preprocessing and Filtering: Raw videos are segmented temporally into non-overlapping synchronized clips. To maximize task informativeness, a two-stage data filtering pipeline is introduced:
- Signal-based heuristic filtering ensures integrity and sufficient dynamism in both visual and acoustic streams by pruning samples with excessive staticity, silence, or poor signal-to-noise ratio.
- MLLM-based semantic screening (using Chain-of-Thought reasoning) discards instances lacking irreversible temporal transitions or coherent narrative flow.
  Figure 1: Data filtering pipeline for efficient adaptation of OmniJigsaw. Raw videos are subjected to signal-based filtering to ensure omni-modal integrity and dynamism, followed by semantic-based screening incorporating CoT reasoning for the assessment of narrative logic and state transitions.
Modality-Orchestration Strategies: To balance cross-modal integration and avoid trivial shortcut solutions, three distinct strategies are deployed:
- Joint Modality Integration (JMI): Provides all visual and acoustic data for each clip, naively expecting the model to leverage joint signals.
- Sample-level Modality Selection (SMS): Selects a global dominant modality per sample, suppressing the non-informative stream based on model-driven arbitration.
- Clip-level Modality Masking (CMM): Enforces an information bottleneck by masking less salient modalities at the clip level via an adaptive selection process, driving the model to perform fine-grained cross-modal integration during temporal reordering.
  Figure 2: Performance comparison of JMI, CMM, and uni-modal Jigsaw across video, audio, and omni-modal benchmarks. CMM's consistent superiority and JMI's performance degradation relative to uni-modal Jigsaw baselines compellingly support the "bi-modal shortcut phenomenon".
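
The puzzle construction and the three orchestration strategies above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Clip` container, the per-clip `(video, audio)` saliency scores, and the tie-breaking rule are all hypothetical stand-ins for the model-driven arbitration and adaptive selection the paper describes.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Clip:
    index: int             # original chronological position (0-based)
    video: Optional[str]   # placeholder for visual content; None when masked
    audio: Optional[str]   # placeholder for audio content; None when masked


def mask_clip(clip: Clip, video_saliency: float, audio_saliency: float) -> Clip:
    """Keep only the more salient modality of one clip (assumption: ties keep video)."""
    if video_saliency >= audio_saliency:
        return Clip(clip.index, clip.video, None)
    return Clip(clip.index, None, clip.audio)


def build_puzzle(
    clips: Sequence[Clip],
    saliency: Sequence[Tuple[float, float]],  # hypothetical (video, audio) score per clip
    strategy: str = "CMM",
    seed: int = 0,
) -> Tuple[List[Clip], List[int]]:
    """Return the shuffled clips and the slot order that restores chronology."""
    if strategy == "JMI":      # joint integration: both modalities for every clip
        prepared = list(clips)
    elif strategy == "SMS":    # one dominant modality chosen for the whole sample
        keep_video = sum(v for v, _ in saliency) >= sum(a for _, a in saliency)
        prepared = [
            Clip(c.index, c.video if keep_video else None, None if keep_video else c.audio)
            for c in clips
        ]
    elif strategy == "CMM":    # modality chosen independently for each clip
        prepared = [mask_clip(c, v, a) for c, (v, a) in zip(clips, saliency)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    rng = random.Random(seed)
    order = list(range(len(prepared)))
    rng.shuffle(order)
    while len(order) > 1 and order == sorted(order):  # avoid a trivially solved puzzle
        rng.shuffle(order)
    shuffled = [prepared[i] for i in order]
    # target[k] is the shuffled slot that holds the k-th clip in chronological order
    target = sorted(range(len(shuffled)), key=lambda pos: shuffled[pos].index)
    return shuffled, target
```

Under these assumptions the only difference between SMS and CMM is where the masking decision is made: once per sample or once per clip. That is the granularity distinction the comparison in this section turns on.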

Empirical Results

The framework is instantiated on Qwen3-Omni-30B-A3B-Instruct and evaluated on fifteen benchmarks: eight for video, four for audio, and three for collaborative omni-modal reasoning. The results point to the importance of fine-grained data curation and modality orchestration.

Key findings:

CMM outperforms both JMI and SMS across all domains, especially on complex benchmarks such as MLVU-Test (+4.38), MMAR (+2.50), and OmniVideoBench (+1.70), indicating that enforced cross-modal bottlenecks catalyze mutual modality synergy and robust timeline reasoning.
JMI reveals a "bi-modal shortcut phenomenon", as performance degrades compared to uni-modal Jigsaw baselines (Figure 2). The model tends to default to the most informative single modality, under-utilizing complementary cues and impairing weaker modality representation learning.
Ablations demonstrate that data quality and reward function granularity are critical. The two-stage data filter eliminates samples where reordering is ill-posed, yielding higher downstream gains, and an accuracy-dependent reward discount pushes the model toward fully correct sequence restoration instead of settling on partially correct orderings.
Figure 3: Comparison of CoT reasoning between CMM and JMI at training step 800. CMM (left) compels the model to jointly analyze visual and auditory cues by masking less salient modalities (dashed boxes) to create an information bottleneck, while JMI (right) exhibits a bi-modal shortcut by solely relying on linguistic cues and bypassing the necessary visual analysis.

Figure 4: Sub-capability performance comparison between CMM and SMS across fine-grained dimensions. CMM's predominant superiority over SMS highlights the efficacy of clip-level orchestration in capturing temporally non-uniform audio-visual cues, whereas sample-level arbitration often misses local high-value modal information.
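
To make the first, signal-based stage of the data filter (described under Data Preprocessing and Filtering) more concrete, here is a rough sketch in plain Python. The function names, the frame and waveform representations, and every threshold are assumptions chosen for illustration. The paper's actual heuristics are not reproduced here, and the signal-to-noise check and the MLLM-based semantic stage are omitted entirely.

```python
import math
from typing import Sequence


def motion_score(frames: Sequence[Sequence[float]]) -> float:
    """Mean absolute pixel change between consecutive frames.

    Each frame is a flat sequence of intensities in [0, 1].
    """
    if len(frames) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, cur))
        count += len(cur)
    return total / count if count else 0.0


def silence_ratio(waveform: Sequence[float], window: int = 1600,
                  rms_floor: float = 0.01) -> float:
    """Fraction of fixed-size windows whose RMS energy is below rms_floor."""
    n_windows = len(waveform) // window
    if n_windows == 0:
        return 1.0  # too short to judge; treat as silent
    silent = 0
    for k in range(n_windows):
        chunk = waveform[k * window:(k + 1) * window]
        rms = math.sqrt(sum(x * x for x in chunk) / window)
        silent += rms < rms_floor
    return silent / n_windows


def passes_signal_filter(frames: Sequence[Sequence[float]],
                         waveform: Sequence[float],
                         min_motion: float = 0.01,    # hypothetical threshold
                         max_silence: float = 0.5) -> bool:  # hypothetical threshold
    """Keep a sample only if it is neither visually static nor mostly silent."""
    return (motion_score(frames) >= min_motion
            and silence_ratio(waveform) <= max_silence)
```

A cheap check of this kind would run before the more expensive semantic screening, which is the coarse-to-fine ordering the abstract describes.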

Analysis of Modality-Orchestration Paradigms

The ablation and breakdown reveal several high-impact design observations:

Cross-Modal Bottlenecks as Regularizers: The CMM strategy's performance dominance is attributed to its ability to force the model beyond shortcut reasoning, compelling joint inference over temporally fragmented, partially masked signals. This aligns with the empirical finding that indiscriminate modal fusion (JMI) actually dilutes reasoning robustness for non-dominant modalities.
Granularity of Arbitration: Sample-level arbitration (SMS) yields gains in high SNR samples but falls short in cases where modal dominance shifts within a sample. Clip-level orchestration (CMM) offers finer control, enabling more context-sensitive exploitation of local cues.
Reward Schedule Engineering: The use of an accuracy-dependent discount factor in the reward function encourages persistent search for perfect timeline restoration, avoiding premature convergence.
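
One way to picture an accuracy-dependent reward discount is sketched below. The summary does not state the paper's reward formula, so the form here is an assumption: full reward for an exact restoration, and a constant `discount` applied to the position-wise accuracy otherwise. The function names are hypothetical.

```python
from typing import Sequence


def positional_accuracy(pred: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where the predicted order matches the target.

    Returns 0.0 if the prediction is not a permutation of the target.
    """
    if not target or len(pred) != len(target) or sorted(pred) != sorted(target):
        return 0.0
    return sum(p == t for p, t in zip(pred, target)) / len(target)


def jigsaw_reward(pred: Sequence[int], target: Sequence[int],
                  discount: float = 0.5) -> float:
    """Assumed reward shape: 1.0 for a perfect order, discounted accuracy otherwise."""
    acc = positional_accuracy(pred, target)
    return 1.0 if acc == 1.0 else discount * acc


# A half-correct ordering earns 0.5 * 0.5 = 0.25, well below the 1.0 for a perfect one.
print(jigsaw_reward([2, 0, 1, 3], [2, 0, 3, 1]))
```

The gap between a nearly correct and a fully correct ordering is what keeps a policy from settling on partially correct answers. The size of that gap (`discount` here) would be a tuning choice.
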
Figure 5: Case 1: Indistinct State Changes.

Figure 6: Case 2: Disjointed Narrative.

Figure 7: Qualitative example of Sub-Scene Captioning. Comparison between the Qwen3-Omni-30B-A3B-Instruct baseline and its OmniJigsaw (CMM)-post-trained variant.

Implications and Future Directions

The results establish that a properly designed, annotation-free proxy task can drive non-trivial gains in omni-modal collaborative reasoning. Several implications arise:

Theoretical: These findings reinforce that information redundancy in multimodal proxy tasks can undermine cross-modal representation quality unless mitigated by active orchestration and dynamic information bottlenecks.
Practical: The two-stage filtering pipeline, paired with modality-dynamic masking, provides a scalable blueprint for self-supervised omni-modal post-training that is agnostic to costly manual annotations.
Future Work: Open avenues include curriculum and capability-aware data curation, exploration of more sophisticated reward shaping functions, extension to other proxy puzzles (e.g., variable-length or overlapping segments, spatio-temporal reordering), and systematization across additional model families and architectures.

Conclusion

OmniJigsaw delivers a scalable, annotation-free methodology for post-training omni-modal models, validated by consistent gains across a suite of challenging video, audio, and collaborative reasoning benchmarks. Its data curation pipeline and clip-level modality masking together counteract the bi-modal shortcut phenomenon, strengthen temporal and cross-modal reasoning, and offer design insights for the development of next-generation self-supervised omni-modal AI systems.
