Kairos: A Native World Model Stack for Physical AI

Published 15 Jun 2026 in cs.AI and cs.CV | (2606.16533v2)

Abstract: World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.

Authors (24)

Summary

The paper introduces a unified physical AI architecture that integrates world understanding, generation, and prediction for robust long-horizon modeling.
It employs a hybrid linear attention mechanism combining SWA, DSWA, and GLA to achieve efficient inference and maintain high fidelity over extended sequences.
It leverages a cross-embodiment data curriculum that progressively aligns open-world observations, human demonstrations, and robotic traces for self-evolutionary learning.

Motivation and Overview

Kairos introduces a unified world-model stack purpose-built for Physical AI, advancing from traditional generative models toward infrastructure capable of self-evolutionary learning and real-world deployment. The system is architected to address four entrenched bottlenecks in world modeling: (1) integrating heterogeneous data for broad physical intelligence; (2) persistent, long-horizon state maintenance; (3) cross-domain perception-action grounding for embodied control; and (4) deployment under strict hardware constraints. The core objective is to synthesize robust, transfer-ready physical understanding directly into the backbone of a natively scalable world-action model.

Figure 1: Motivation for Kairos as an operational infrastructure for future self-evolving Physical AI, beyond traditional generative paradigms.

Unified Architecture: Understanding, Generation, Prediction

Kairos departs from disjointed modular systems and implements a natively unified backbone integrating three principal interactive modules:

World Understanding: Uses a large-scale vision-language model (VLM), here built on the Qwen family, to extract high-resolution semantic representations from diverse multimodal input streams (videos, human demos, robot interactions). These representations serve as the substrate for downstream reasoning and action.
World Generation: Employs a diffusion transformer (DiT) conditioned on multimodal semantic embeddings, with hybrid linear temporal attention, to generate photorealistic sequences. The temporal attention is factorized into three parts: sliding-window attention (SWA) for local spatiotemporal dynamics, dilated SWA (DSWA) for mid-range dependencies, and gated linear attention (GLA) for persistent global memory.
World Prediction: Models the co-evolution of visual and action trajectories with a Mixture-of-Transformers (MoT) stack—comprising Video DiT and a lighter Action DiT—jointly optimized to align trajectory forecasting and control synthesis. A custom attention masking scheme allows efficient, test-time action-only rollout, circumventing the sampling overhead of explicit video generation.

Figure 2: Kairos system framework: a tightly coupled understanding-generation-prediction stack instrumented for native long-horizon reasoning and actionable policy rollout.

Hybrid Linear Attention and Theoretical Justification

High-fidelity, long-horizon generation necessitates scalable temporal modeling. Kairos implements a hybrid linear attention regime, where:

SWA encodes short-range motion patterns with efficient locality.
DSWA (dilated variant) extends the receptive field for mid-horizon dependencies.
GLA, based on Gated Delta Networks (GDN), provides linear-complexity global memory. Its contractive, gated delta-rule update controls state propagation and suppresses error drift.
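
The two windowed components in the list above can be pictured as boolean attention masks over frames. The following is a minimal sketch, not the paper's implementation: it assumes causal, frame-level attention, and the function name `sliding_window_mask` and the window and dilation values are hypothetical choices for illustration.

```python
import numpy as np

def sliding_window_mask(num_frames: int, window: int, dilation: int = 1) -> np.ndarray:
    """Boolean mask M[i, j]: True where query frame i may attend to key frame j.

    Assumed causal: frame i sees itself and up to `window - 1` earlier
    frames, spaced `dilation` frames apart.
    """
    i = np.arange(num_frames)[:, None]
    j = np.arange(num_frames)[None, :]
    offset = i - j
    causal = offset >= 0                  # never look at future frames
    in_span = offset < window * dilation  # stay inside the (dilated) window
    on_grid = offset % dilation == 0      # keep only every `dilation`-th frame
    return causal & in_span & on_grid

# Same number of keys per query, very different temporal reach.
swa = sliding_window_mask(16, window=4)               # dense, local
dswa = sliding_window_mask(16, window=4, dilation=4)  # sparse, mid-range
print(np.flatnonzero(swa[15]))   # frames visible to the last query under SWA
print(np.flatnonzero(dswa[15]))  # frames visible to the last query under DSWA
```

With these settings each query attends to at most four frames in both cases, so the cost stays linear in sequence length, while the dilated variant spans 13 frames instead of 4.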

The paper supports this design with two theoretical results: an information-theoretic lower bound showing that persistent latent states are necessary for long-horizon prediction, and an explicit risk bound showing that the hybrid temporal decomposition is sufficient when the global memory is contractive.
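
A generic worked bound helps show why contractivity matters for error accumulation. This is a textbook-style illustration, not the paper's theorem. The symbols are assumptions introduced here: $e_t$ is the state error after $t$ steps, $\rho$ is the contraction factor of the memory update, and $\varepsilon$ is the fresh error injected at each step.

```latex
\begin{aligned}
e_{t+1} &\le \rho\, e_t + \varepsilon, \qquad 0 \le \rho < 1, \\
e_T &\le \rho^{T} e_0 + \varepsilon \sum_{s=0}^{T-1} \rho^{s}
     = \rho^{T} e_0 + \varepsilon\, \frac{1 - \rho^{T}}{1 - \rho}
     \le \rho^{T} e_0 + \frac{\varepsilon}{1 - \rho}.
\end{aligned}
```

Under these assumptions the error stays bounded by a constant independent of the horizon $T$. For example, $\rho = 0.9$ and $\varepsilon = 0.01$ give an asymptotic bound of $0.1$. With $\rho = 1$ (no contraction) the same recursion only yields $e_T \le e_0 + T\varepsilon$, which grows linearly with the horizon.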

Figure 4: DiT block architecture with hybrid linear attention: integrating SWA, DSWA, and GLA for multi-scale temporal modeling.

Figure 6: Gated Delta Network: implements the gated linear attention module with delta-rule memory update, supporting efficient, bounded-length global memory.
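
The delta-rule memory update can be sketched as a simple recurrence over a fixed-size state matrix. The code below uses the commonly published form of the gated delta rule, $S_t = \alpha_t S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$. Whether Kairos uses exactly this parameterisation is an assumption, and the function name `gated_delta_scan` is hypothetical. It is a sequential reference loop for clarity, not an efficient kernel.

```python
import numpy as np

def gated_delta_scan(q, k, v, alpha, beta):
    """Run a gated delta-rule memory over a sequence.

    q, k: (T, d_k) queries and keys (keys assumed L2-normalised).
    v: (T, d_v) values.
    alpha: (T,) decay gates in (0, 1]; beta: (T,) write strengths in [0, 1].
    Returns the outputs (T, d_v) and the final state (d_v, d_k).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))  # memory size is independent of T
    out = np.empty((T, d_v))
    for t in range(T):
        # Decay the old memory and erase what is stored under key k_t ...
        S = alpha[t] * (S - beta[t] * np.outer(S @ k[t], k[t]))
        # ... then write the new value under that key.
        S = S + beta[t] * np.outer(v[t], k[t])
        out[t] = S @ q[t]
    return out, S

# Writing twice under the same key overwrites instead of accumulating.
keys = np.array([[1.0, 0.0], [1.0, 0.0]])
vals = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
gates = np.ones(2)
out, state = gated_delta_scan(keys, keys, vals, gates, gates)
print(out[1])       # value read back after the second write
print(state.shape)  # state shape does not depend on sequence length
```

For unit-norm keys and write strengths in [0, 1], the transition matrix $\alpha_t (I - \beta_t k_t k_t^\top)$ has spectral norm at most $\alpha_t$, which is the sense in which the update is contractive when $\alpha_t < 1$.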

Data Curriculum and Native Pre-training

Rather than fine-tuning generic video models after the fact, Kairos implements a Cross-Embodiment Data Curriculum (CEDC) for scalable, native physical intelligence. The pre-training trajectory is strictly hierarchical and proceeds through three data tiers:

Physical Knowledge: Massive-scale open-world video observation imparts physics priors and universal regularities.
Human-centric Behavior: Human demonstration data enables structured task understanding and causal intervention modeling.
Robotic Action (Embodiment): Scarce but critical robot traces inject perception-action alignment, mechanically grounding the world model for real execution.

This curriculum is realized via multi-stage pre-training, progressive fine-tuning (domain-specific SFT, model merging), and RL-based preference alignment.
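
One simple way to picture a staged curriculum is as a schedule of sampling weights over the three data sources. The sketch below is purely illustrative: the stage boundaries, the mixing weights, and the names `STAGES` and `mixture_at` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical schedule: (end of stage as a fraction of training, sampling weights).
STAGES: List[Tuple[float, Dict[str, float]]] = [
    (0.5, {"open_world_video": 0.8, "human_behavior": 0.2, "robot_interaction": 0.0}),
    (0.8, {"open_world_video": 0.4, "human_behavior": 0.4, "robot_interaction": 0.2}),
    (1.0, {"open_world_video": 0.2, "human_behavior": 0.3, "robot_interaction": 0.5}),
]

def mixture_at(progress: float) -> Dict[str, float]:
    """Return the sampling weights for a training progress value in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    for stage_end, weights in STAGES:
        if progress <= stage_end:
            return weights
    return STAGES[-1][1]

# Robot data is absent early and dominates late in this made-up schedule.
print(mixture_at(0.1))
print(mixture_at(0.9))
```

The design choice illustrated here is that scarce robot data enters only after broad physical and behavioral priors are in place, matching the ordering described above.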

Figure 3: Cross-Embodiment Data Curriculum: progressively aligning open-world, human demonstration, and robot embodiment data for robust native pre-training.

Efficiency, Inference, and Deployment

Kairos incorporates deployment-aware optimization throughout:

Scalable inference: Linear-complexity attention and DiT-cache optimizations ensure sub-millisecond per-frame generation on NVIDIA A800 and RTX 5090 hardware.
Hardware-awareness: Mixed precision, adaptive quantization (FP8, INT4), and tile-based streaming facilitate real-time inference even on consumer GPUs.
Timestep distillation: Leveraging flow-matching and distribution-matching distillation compresses multi-step diffusion into efficient 4-step generators with negligible quality loss.
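
To make the 4-step generation concrete, the sketch below integrates a flow-matching velocity field with four uniform Euler steps. It is a toy under stated assumptions: `few_step_sample` and `toy_velocity` are hypothetical names, and the hand-written velocity field stands in for a distilled network, which is not available here.

```python
import numpy as np
from typing import Callable

def few_step_sample(velocity: Callable[[np.ndarray, float], np.ndarray],
                    x0: np.ndarray, num_steps: int = 4) -> np.ndarray:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with uniform Euler steps."""
    x = x0.astype(float).copy()
    h = 1.0 / num_steps
    for step in range(num_steps):
        t = step * h  # t takes the values 0, h, ..., 1 - h
        x = x + h * velocity(x, t)
    return x

# Toy velocity field whose trajectories are straight lines toward `target`.
target = np.array([1.0, -2.0, 0.5])

def toy_velocity(x: np.ndarray, t: float) -> np.ndarray:
    return (target - x) / (1.0 - t)

noise = np.zeros(3)
sample = few_step_sample(toy_velocity, noise)
print(sample)
```

Because the toy trajectories are perfectly straight, four Euler steps land on the target up to floating-point error. Distillation aims to make a learned velocity field straight enough that a few steps lose little quality.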

Against large-scale competitors (Cosmos, Lingbot, Wan), Kairos achieves superior compute and memory efficiency and the lowest latency across resolutions and generation durations.

Figure 5: Performance comparison: Kairos achieves SOTA results on action-model and embodied world-model benchmarks, scales linearly with generation duration, and is more efficient than larger baselines.

Self-Evolutionary Learning and Closed-Loop Operation

Kairos is architected for self-improving learning cycles: understanding, generation, and prediction are mutually accessible in a closed loop. During deployment, the model natively supports rollout-evaluation-refinement cycles for self-evolution, including prompt optimization agents and task-centric policy refinement. Internal reward and selection mechanisms, driven by the understanding module, enable automated, continual optimization without human intervention.
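
The rollout-evaluation-refinement cycle can be sketched as a small control loop. This is a schematic under stated assumptions, not the paper's procedure: `self_evolve` and its three callables (`rollout`, `score`, `refine`) are hypothetical interfaces, and the toy stand-ins below replace the generation, understanding, and prompt-optimization components.

```python
from typing import Callable, Tuple

def self_evolve(prompt: str,
                rollout: Callable[[str], str],
                score: Callable[[str], float],
                refine: Callable[[str, str, float], str],
                rounds: int = 3) -> Tuple[str, float]:
    """Repeat rollout -> evaluation -> refinement, keeping the best-scoring prompt."""
    best_prompt, best_score = prompt, float("-inf")
    for _ in range(rounds):
        result = rollout(prompt)   # generate a candidate rollout
        reward = score(result)     # internal reward, no human in the loop
        if reward > best_score:
            best_prompt, best_score = prompt, reward
        prompt = refine(prompt, result, reward)  # propose the next prompt
    return best_prompt, best_score

# Toy stand-ins so the loop runs end to end.
best = self_evolve(
    "go",
    rollout=lambda p: p.upper(),
    score=lambda r: float(len(r)),
    refine=lambda p, r, s: p + "!",
)
print(best)
```

Keeping the best-scoring candidate rather than the most recent one guards against a refinement step that makes things worse.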

Figure 7: Self-evolution framework: closed-loop rollout-evaluation-refinement cycle enables continuous self-improvement in deployed environments.

Benchmarking and Empirical Results

Kairos-4B delivers SOTA or high-ranking results on diverse, rigorous benchmarks:

WorldModelBench-robot, DreamGen Bench, PAI-Bench-robot: Highest or co-highest scores with strong instruction following and physics adherence; surpasses models with 3–7x more parameters.
LIBERO-Plus, RoboTwin-2.0: Highly competitive or first-place performance in embodied control and long-horizon reasoning, validating the joint world-action model stack.
Human Evaluation: Outperforms larger models in direct preference studies on visual fidelity, physical plausibility, and task correctness.
Long-horizon stability: Maintains temporal and physical consistency in 15 s unrolled sequences, where baseline models drift or degrade.
Ablations: Adding the human-centric curriculum stage and using stronger VLMs both translate directly into higher instruction adherence and model robustness.

Figure 10: Human evaluation: Kairos leads in subjective quality, plausibility, and adherence across tasks against larger baselines.

Practical and Theoretical Implications

Practical Impact: Kairos is the most complete demonstration to date of a native, scalable world–action infrastructure for Physical AI, simultaneously addressing data heterogeneity, long-horizon consistency, efficient deployment, and continual adaptation. The ability to operationalize on consumer hardware and the closed-loop self-evolution pathway make it deployable for both research and real-world robotics.

Theoretical Impact: The formal results on information-theoretic necessity and architectural sufficiency define foundational limits for long-horizon world modeling. The hybrid attention design and curriculum-based pretraining strategy set a new standard for future world-action architectures.

Future Directions: Ongoing work focuses on fully autonomous self-evolution (recursive imagination and policy update from real-world closed loops), scaling to universal action spaces for heterogeneous embodiments, and extending the world model substrate to more diverse physical domains and sensor modalities.

Conclusion

Kairos establishes a new paradigm for world models in Physical AI: an endogenously unified, curriculum-driven, and theoretically grounded system that learns, maintains, and deploys robust world knowledge across long horizons and embodiment axes. The principled architecture, together with strong empirical results and system scalability, positions Kairos as a foundational substrate for next-generation, self-evolving physical intelligence.

Reference: "Kairos: A Native World Model Stack for Physical AI" (2606.16533)
