Youtu-LLM: Video-Language Reasoning
- Youtu-LLM is a family of models that fuses language and video comprehension using compact Transformer designs and streaming multimodal processing.
- It employs low-rank multi-head latent compression and dense token interleaving to enable efficient long-context modeling and fine-grained video-speech alignment.
- A four-stage, curriculum-based pre-training pipeline, along with agentic behavioral conditioning, delivers state-of-the-art performance in reasoning and real-time video QA.
Youtu-LLM is a family of language and video-LLMs specifically designed for high-performance, agentic reasoning and streaming multimodal understanding in the YouTube video domain. This suite comprises both pure LLMs for native agentic intelligence and large video-LLMs for fine-grained video and speech comprehension with streaming capabilities. Youtu-LLM synthesizes advances in compact Transformer architectures, curriculum-based pre-training, large-scale video-transcript alignment, and agentic behavioral conditioning, enabling robust, long-horizon reasoning, planning, and real-time video-language interaction—even on resource-constrained devices.
1. Model Architecture and Technical Innovations
Youtu-LLM (base model, 1.96B parameters) is built on a dense Multi-head Latent Attention (MLA) Transformer backbone, replacing conventional full-rank self-attention with the low-rank latent compression scheme introduced in DeepSeek-V2/V3. Token representations are down-projected into low-rank latents from which each head's keys and values are reconstructed, so the per-layer key–value cache stores compressed latents rather than full per-head key/value states. This cache compression is crucial for enabling a 128K-token context window within a minimal memory budget; with the chosen latent and head dimensions, per-layer memory requirements drop to roughly 1/16 of standard attention (Lu et al., 31 Dec 2025).
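A minimal PyTorch sketch of the latent-compression idea follows; the layer sizes, module names, and the omission of rotary-embedding handling and causal masking are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Minimal multi-head latent attention sketch: keys/values are cached as a
    compressed latent of size d_latent instead of full per-head K/V states."""
    def __init__(self, d_model=2048, n_heads=16, d_head=128, d_latent=256):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Down-project hidden states into a shared low-rank latent (this is what gets cached).
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)
        # Up-project the latent back into per-head keys and values at attention time.
        self.k_up = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.v_up = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.out_proj = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        latent = self.kv_down(x)                      # (b, t, d_latent): the only KV state kept
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), latent             # return the latent as the new cache
```

During decoding only the (batch, tokens, d_latent) latent tensor is cached rather than per-head keys and values, which is where the large cache saving at 128K context comes from.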
The tokenizer is a three-stage byte-level BPE with a 128,256-token vocabulary, specialized STEM and code tokens, strict splitting of CJK scripts, and explicit avoidance of multi-digit number tokens. On pre-training data it achieves a 1.15× higher compression rate than the Llama 3 tokenizer, with an additional ~10% gain on reasoning data.
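As a hedged illustration of those rules (single-digit splitting, CJK isolation, byte-level BPE), the sketch below uses the Hugging Face tokenizers library; the special tokens, CJK regex, and corpus file are placeholders rather than the actual Youtu-LLM recipe.

```python
from tokenizers import Tokenizer, Regex, models, pre_tokenizers, trainers, decoders

# Byte-level BPE; vocab size matches the 128,256 figure cited above.
tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),                 # no multi-digit number tokens
    pre_tokenizers.Split(Regex(r"[\u4e00-\u9fff]"), "isolated"),   # crude CJK isolation (illustrative)
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=128_256,
                              special_tokens=["<|pad|>", "<|eot|>"])  # placeholder specials
tok.train(files=["pretrain_corpus.txt"], trainer=trainer)             # hypothetical corpus file
print(tok.encode("Solve 12345 + 67").tokens)                          # digits appear one per token
```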
For video-language tasks, Youtu-LLM adopts a streaming multimodal backbone: a Vision Transformer (ViT) dynamically encodes video frames into 1-D visual tokens, which are projected via a lightweight adapter into the LLM's embedding space. The architecture supports densely interleaved input sequences of the form [Context] → [Frame tokens] → [ASR/Subtitle tokens], enabling temporally aligned, fine-grained video–language modeling at scale (Chen et al., 22 Apr 2025).
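A minimal sketch of the adapter step, with hypothetical widths (the actual ViT and LLM embedding dimensions are not specified here):

```python
import torch
import torch.nn as nn

class FrameAdapter(nn.Module):
    """Project ViT frame tokens into the LLM embedding space (sizes are illustrative)."""
    def __init__(self, vit_dim=1024, llm_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vit_dim, llm_dim),
                                  nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, frame_tokens):      # (batch, n_frames, n_patches, vit_dim)
        return self.proj(frame_tokens)    # (batch, n_frames, n_patches, llm_dim)

# Toy usage: two frames of 196 ViT patch tokens become LLM-space visual tokens.
adapter = FrameAdapter()
visual_tokens = adapter(torch.randn(1, 2, 196, 1024))
print(visual_tokens.shape)  # torch.Size([1, 2, 196, 2048])
```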
2. Pre-Training Corpus and Curriculum
Youtu-LLM undergoes a four-stage curriculum-based pre-training on a filtered 10.84T-token corpus:
- Stage 1: 8.16T tokens emphasizing commonsense and high-quality web/encyclopedia data (75%) with a STEM/code supplement (25%). Sequence length is 8K tokens.
- Stage 2: 1.28T tokens, raising STEM + code share to 60%.
- Stage 3: 1.2T tokens, further raising the STEM/code share (~65%) and extending sequence lengths up to 128K.
- Stage 4 (Agentic Mid-training): 200B tokens of structured agentic trajectories, including explicit planning, reflection, tool-use, mathematical, coding, and research tasks.
The agentic training set is partitioned into five domains, each leveraging atomic skills and XML-tagged structures (e.g., <Analysis>, <Plan>, <Action>, <Reflection>, <Summary>). This progression ensures substantive internalization of planning, decomposition, and corrective reflection, rather than mere surface alignment.
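A hypothetical trajectory illustrating the tag structure (the tags are the ones named above; the task content is invented purely for illustration):

```xml
<Analysis>The user asks for the median response time in a large log file that cannot be read at once.</Analysis>
<Plan>1. Stream the file and collect the response-time column. 2. Compute the median. 3. Report the value with units.</Plan>
<Action>run_python("import statistics, csv; ...")</Action>
<Reflection>The first pass failed on a malformed row; wrap parsing in error handling and re-run.</Reflection>
<Summary>Median response time is 212 ms after skipping 3 malformed rows.</Summary>
```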
For video-language Youtu-LLM, the primary corpus is constructed from YouTube-derived sources (HD-VILA, YT-Temporal, VidChapters, HowTo100M), with strict filtering for English language, clip informativeness, and absence of "talking-head" content. The final "Live-CC-5M" set contains 5.7M clips for pre-training; an SFT set ("Live-WhisperX-526K") is generated by WhisperX ASR with precise word-level alignment and additional prompt annotation (Chen et al., 22 Apr 2025).
3. Streaming Multimodal Pre-Training and Token Interleaving
Youtu-LLM's streaming video-LLMs employ a dense interleaving strategy for joint training and inference:
- Frames are sampled at 2 FPS; ASR tokens are timestamp-aligned in 1-second intervals.
- Each interval's frames ("<F>") and ASR words ("<W>") are alternated:
  [Context] <F_{t:t+k}> <W_{t:t+k}> <F_{t+k:t+2k}> <W_{t+k:t+2k}> ⋯
- The model autoregressively predicts next tokens conditioned on the joint context, with an optional alignment loss to encourage feature-level synchronization between visual and textual representations.
- Inference leverages efficient key–value caching, windowed attention (truncating visual tokens older than 240s), and micro-batch streaming for real-time performance.
- The batching strategy and "..." ellipsis token address variable-length and silent intervals, optimizing throughput and robustness to speech rate variations.
This dense, temporally-aligned interleaving allows Youtu-LLM to finely model the evolving multimodal context, supporting both video QA and real-time video commentary (Chen et al., 22 Apr 2025).
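A minimal Python sketch of this per-second interleaving, assuming word-level ASR timestamps are available; the token strings and function name are placeholders:

```python
def interleave_stream(frames_per_sec, asr_words, ellipsis="..."):
    """Densely interleave frame tokens with time-aligned ASR words, one 1-second interval at a time.

    frames_per_sec : list where entry t holds the visual tokens for second t (two frames at 2 FPS)
    asr_words      : list of (word, timestamp_s) pairs from word-level ASR alignment
    Silent intervals emit the ellipsis token so the stream keeps a steady cadence.
    """
    sequence = []
    for t, frame_tokens in enumerate(frames_per_sec):
        words = [w for w, ts in asr_words if t <= ts < t + 1]
        sequence.extend(frame_tokens)                     # <F_{t:t+1}>
        sequence.extend(words if words else [ellipsis])   # <W_{t:t+1}>, or "..." when nobody speaks
    return sequence

# Toy usage: 3 seconds of video, 2 frames per second, sparse speech.
frames = [["<f0>", "<f1>"], ["<f2>", "<f3>"], ["<f4>", "<f5>"]]
asr = [("the", 0.3), ("player", 0.7), ("shoots", 2.1)]
print(interleave_stream(frames, asr))
# ['<f0>', '<f1>', 'the', 'player', '<f2>', '<f3>', '...', '<f4>', '<f5>', 'shoots']
```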
4. Agentic Capabilities and Benchmark Results
Through principled curriculum and structured trajectory pre-training, Youtu-LLM demonstrates strong native agentic intelligence. The 1.96B model closes more than 80% of the gap to Qwen3 4B on the APTBench probe of agentic skills across the code, deep-research, math, and tool-use domains. Instruct-tuned versions set a new state of the art among sub-2B models on practical agent and reasoning benchmarks (GAIA: 33.9%, xBench: 19.5%, SWE-Bench-Verified: 17.7%, EnConda: 21.5%), with competitive results against much larger open models (Lu et al., 31 Dec 2025).
On video-language tasks, Youtu-LLM achieves a 41.5% win rate versus GPT-4o commentary in real-time mode on the LiveSports-3K benchmark (Qwen2.5-VL-7B-Instruct: 17.3%, LLaVA-Video-7B: 27.1%). It delivers a state-of-the-art 64.1% overall accuracy on VideoMME QA (without subtitles, comparable to Qwen2-VL-72B) and a frame-wise response latency of 0.17 s, an order of magnitude faster than previous 7B–72B video LLMs (Chen et al., 22 Apr 2025).
On language-centric STEM, reasoning, and code benchmarks, Youtu-LLM 2B Base matches or exceeds all open models below 7B and challenges 4B-scale baselines (MMLU-Pro 48.4%, GSM8K 77.6%, HumanEval 64.6%).
5. Implementation Details and Practical Deployment
Youtu-LLM's pre-training uses a batch size of 512, distributed across 128 A100 GPUs. Learning rates are 2e-5 for pre-training and 1e-5 for SFT. Context windows during pre-training support 480 frames (240s), with text contexts up to 2K tokens.
KV-cache streaming decoding and Flash-Attention/Flash-Decoding optimizations reduce inference latency, with recommended truncation of visual history beyond 240s. For further speed and efficiency, lightweight frame encoders (e.g., MorphMLP) and aggressive KV-caching are advised. The system supports batching of real-time streaming requests across users (Chen et al., 22 Apr 2025).
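The 240 s visual-history truncation can be sketched as simple cache bookkeeping; the per-token metadata layout below is an assumption, and a real runtime (paged or Flash-Decoding caches) would gather the cached tensors at the returned indices:

```python
def prune_visual_kv(cache_meta, now_s, window_s=240):
    """Given per-token metadata (timestamp_s, is_visual) for a KV cache, return the
    indices to keep: all text tokens, plus visual tokens no older than window_s seconds."""
    return [i for i, (ts, is_visual) in enumerate(cache_meta)
            if not is_visual or now_s - ts <= window_s]

# Toy usage: visual tokens from t=0 are evicted once the stream passes the 240 s window.
meta = [(0, True), (0, False), (100, True), (250, True)]
print(prune_visual_kv(meta, now_s=250))   # -> [1, 2, 3]
```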
Batching and data curation pipelines emphasize grouping similar-length clips, silence detection, pruning of low-informativeness sequences, and domain-adaptive filtering. The architecture and training recipes generalize to other video domains with task-specific fine-tuning.
6. Limitations and Prospects for Extension
Current limitations include potential degradation on very long contexts, challenges in precise celebrity and fine-grained entity recognition, and limited robustness to adversarial thumbnail or frame perturbations. Real-time deployment is constrained by visual-encoder and GPU throughput; further latency reductions can be obtained via hardware-aware model selection and micro-batching.
Future directions proposed include:
- World-model integration for explicit environment simulation during planning
- Exploration of diffusion-based LLM decoders for long-sequence efficiency
- Multimodal agentic pre-training covering vision, audio, and OCR tools
- Continual and user-personalized adaptation via lightweight, on-device fine-tuning
- Real-time, on-upload moderation for platforms, and cross-platform adaptation (e.g., TikTok, Instagram Reels)
These developments position Youtu-LLM as a model family breaking the tradeoff between size and native agentic, multimodal reasoning ability, and as a blueprint for further advances in streaming, long-horizon video-language intelligence (Lu et al., 31 Dec 2025, Chen et al., 22 Apr 2025).