
Youtu-LLM: Video-Language Reasoning

Updated 2 January 2026
  • Youtu-LLM is a family of models that fuses language and video comprehension using compact Transformer designs and streaming multimodal processing.
  • It employs low-rank multi-head latent compression and dense token interleaving to enable efficient long-context modeling and fine-grained video-speech alignment.
  • A four-stage, curriculum-based pre-training pipeline, along with agentic behavioral conditioning, delivers state-of-the-art performance in reasoning and real-time video QA.

Youtu-LLM is a family of language and video-LLMs specifically designed for high-performance, agentic reasoning and streaming multimodal understanding in the YouTube video domain. This suite comprises both pure LLMs for native agentic intelligence and large video-LLMs for fine-grained video and speech comprehension with streaming capabilities. Youtu-LLM synthesizes advances in compact Transformer architectures, curriculum-based pre-training, large-scale video-transcript alignment, and agentic behavioral conditioning, enabling robust, long-horizon reasoning, planning, and real-time video-language interaction—even on resource-constrained devices.

1. Model Architecture and Technical Innovations

Youtu-LLM (base model, 1.96B parameters) is built on a dense Multi-head Latent Attention (MLA) Transformer backbone, replacing conventional full-rank self-attention with a low-rank, multi-head latent compression scheme as introduced in DeepSeek-V2/V3. Each token sequence $X \in \mathbb{R}^{L \times d}$ is projected into multiple ($H = 16$) latent spaces, where each head leverages low-dimensional key/value projections ($r = 128$, $d = 2048$), reducing per-layer memory usage from $O(Ld)$ to $O(Lr)$. This key–value cache compression is crucial for enabling a 128K-token context window within a minimal memory budget; for $d = 2048$ and $r = 128$, per-layer memory requirements drop to roughly $1/16$ of those of standard attention (Lu et al., 31 Dec 2025).
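
As a rough illustration of this compression, the sketch below contrasts the cached footprint of full-width key/value states with that of a low-rank latent. The module layout (a single shared latent decompressed into per-head keys and values) and all names are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions from the text; the module layout is an assumption,
# not the released Youtu-LLM implementation.
L, d, r, H = 4096, 2048, 128, 16

class LatentKVCompression(nn.Module):
    """Compress hidden states into a low-rank latent that the KV cache stores."""
    def __init__(self, d_model: int, rank: int, n_heads: int):
        super().__init__()
        self.rank, self.n_heads = rank, n_heads
        self.down = nn.Linear(d_model, rank, bias=False)          # d -> r (this is what gets cached)
        self.up_k = nn.Linear(rank, n_heads * rank, bias=False)   # decompress latent to per-head keys
        self.up_v = nn.Linear(rank, n_heads * rank, bias=False)   # decompress latent to per-head values

    def forward(self, x: torch.Tensor):
        # x: (L, d). Only `latent` of shape (L, r) needs to live in the KV cache.
        latent = self.down(x)
        k = self.up_k(latent).view(-1, self.n_heads, self.rank)
        v = self.up_v(latent).view(-1, self.n_heads, self.rank)
        return latent, k, v

latent, k, v = LatentKVCompression(d, r, H)(torch.randn(L, d))
print(latent.shape, k.shape, v.shape)  # (4096, 128), (4096, 16, 128), (4096, 16, 128)

# Per-layer cache footprint: O(L*d) for full-width cached states vs O(L*r) for the latent.
print((L * r) / (L * d))  # 0.0625 == 1/16, matching the reduction quoted above
```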

The tokenizer is a three-stage byte-level BPE that culminates in a 128,256-token vocabulary with specialized STEM and code tokens, strict splitting of CJK scripts, and explicit avoidance of multi-digit number tokens. On pre-training data it achieves 1.15× better compression than the Llama 3 tokenizer, with an additional ~10% gain on reasoning data.
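
The exact pre-tokenization rules are not public; the pattern below is only a hedged illustration of what single-digit splitting and per-character CJK splitting could look like before byte-level BPE merges are applied.

```python
import regex as re  # third-party `regex` package: supports Unicode script properties like \p{Han}

# Illustrative pre-tokenization rules only -- not the actual Youtu-LLM tokenizer.
# Multi-digit numbers are split into single digits and CJK characters are split
# individually before BPE merges are learned/applied.
PRETOKENIZE = re.compile(
    r"\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}"          # one piece per CJK character
    r"|\d"                                                   # one piece per digit (no multi-digit tokens)
    r"|[^\s\d\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]+"    # runs of other non-space characters
    r"|\s+"                                                  # whitespace runs
)

def pretokenize(text: str) -> list[str]:
    return PRETOKENIZE.findall(text)

print(pretokenize("GSM8K score 77.6 在2025年"))
# ['GSM', '8', 'K', ' ', 'score', ' ', '7', '7', '.', '6', ' ', '在', '2', '0', '2', '5', '年']
```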

For video-language tasks, Youtu-LLM adopts a streaming multimodal backbone: a Vision Transformer (ViT) dynamically encodes video frames into 1-D visual tokens, which are projected via a lightweight adapter into the LLM's embedding space. The architecture supports densely interleaved input sequences of the form [Context] → [Frame tokens] → [ASR/Subtitle tokens], enabling temporally aligned, fine-grained video–language modeling at scale (Chen et al., 22 Apr 2025).
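
A minimal sketch of such an adapter, assuming illustrative dimensions and a two-layer MLP projector (the actual module design is not specified in the source):

```python
import torch
import torch.nn as nn

# Illustrative dimensions; the adapter architecture is an assumption.
D_VIT, D_LLM = 1024, 2048

class FrameAdapter(nn.Module):
    """Project ViT frame features into the LLM embedding space."""
    def __init__(self, d_vit: int, d_llm: int):
        super().__init__()
        # Two-layer MLP projector, a common choice for lightweight vision-language adapters.
        self.proj = nn.Sequential(nn.Linear(d_vit, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_visual_tokens, d_vit) produced by the ViT for one frame
        return self.proj(frame_feats)  # -> (num_visual_tokens, d_llm)

visual_tokens = FrameAdapter(D_VIT, D_LLM)(torch.randn(16, D_VIT))
print(visual_tokens.shape)  # torch.Size([16, 2048]): ready to interleave with text embeddings
```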

2. Pre-Training Corpus and Curriculum

Youtu-LLM undergoes a four-stage curriculum-based pre-training on a filtered 10.84T-token corpus:

  • Stage 1: 8.16T tokens emphasizing commonsense and high-quality web/encyclopedia data (75%) with a STEM/code supplement (25%). Sequence length is 8K tokens.
  • Stage 2: 1.28T tokens, raising STEM + code share to 60%.
  • Stage 3: 1.2T tokens, further increasing STEM/coding incidence (~65%), extending sequence lengths up to 128K.
  • Stage 4 (Agentic Mid-training): 200B tokens of structured agentic trajectories, including explicit planning, reflection, tool-use, mathematical, coding, and research tasks.

The agentic training set is partitioned into five domains, each leveraging atomic skills and XML-tagged structures (e.g., <Analysis>, <Plan>, <Action>, <Reflection>, <Summary>). This progression ensures substantive internalization of planning, decomposition, and corrective reflection, rather than mere surface alignment.
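
A hypothetical trajectory in this tagged format (task content invented purely for illustration), together with a small helper for extracting individual skill segments:

```python
import re

# Hypothetical example of an XML-tagged agentic trajectory in the style described above;
# the task and its content are invented for illustration only.
trajectory = """
<Analysis>The user wants the failing unit test in utils/date.py fixed.</Analysis>
<Plan>1. Reproduce the failure. 2. Inspect the parsing logic. 3. Patch and re-run tests.</Plan>
<Action>run_tests(path="tests/test_date.py")</Action>
<Reflection>The test fails on leap years; the month-length table is wrong for February.</Reflection>
<Summary>Corrected the leap-year handling and confirmed all tests pass.</Summary>
"""

def extract(tag: str, text: str) -> list[str]:
    """Return the contents of every <tag>...</tag> span in a trajectory."""
    return re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)

print(extract("Plan", trajectory)[0])
```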

For video-language Youtu-LLM, the primary corpus is constructed from YouTube-derived sources (HD-VILA, YT-Temporal, VidChapters, HowTo100M), with strict filtering for English language, clip informativeness, and absence of "talking-head" content. The final "Live-CC-5M" set contains 5.7M clips for pre-training; an SFT set ("Live-WhisperX-526K") is generated by WhisperX ASR with precise word-level alignment and additional prompt annotation (Chen et al., 22 Apr 2025).

3. Streaming Multimodal Pre-Training and Token Interleaving

Youtu-LLM's streaming video-LLMs employ a dense interleaving strategy for joint training and inference:

  • Frames are sampled at 2 FPS; ASR tokens are timestamp-aligned in 1-second intervals.
  • Each interval's frames ("<F>") and ASR words ("<W>") are alternated: [Context] <F_{t:t+k}> <W_{t:t+k}> <F_{t+k:t+2k}> <W_{t+k:t+2k}> ⋯.
  • The model autoregressively predicts next tokens conditioned on the joint context, with an optional alignment loss to encourage feature-level synchronization between visual and textual representations:

$$\mathcal{L}_\mathrm{align} = \sum_{i,j} \left\| f(v_i) - g(w_j) \right\|^2$$

  • Inference leverages efficient key–value caching, windowed attention (truncating visual tokens older than 240s), and micro-batch streaming for real-time performance.
  • The batching strategy and the "..." ellipsis token handle variable-length and silent intervals, improving throughput and robustness to variations in speech rate.

This dense, temporally-aligned interleaving allows Youtu-LLM to finely model the evolving multimodal context, supporting both video QA and real-time video commentary (Chen et al., 22 Apr 2025).
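
A compact sketch of the per-second interleaving and the optional alignment term follows; the tensor shapes, projection heads, silent-interval handling, and the pairing of indices $(i, j)$ are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

# Sketch of dense interleaving at 2 FPS with 1-second ASR intervals, plus the optional
# alignment loss. All shapes and module choices here are assumptions for illustration.
D = 2048
f = nn.Linear(D, 256)          # projection head for visual tokens
g = nn.Linear(D, 256)          # projection head for ASR word tokens
ELLIPSIS = torch.zeros(1, D)   # stand-in embedding for the "..." token in silent intervals

def interleave(context, frames_per_sec, words_per_sec):
    """[Context] <F_t:t+1> <W_t:t+1> <F_t+1:t+2> <W_t+1:t+2> ... as one embedding sequence."""
    pieces = [context]
    for frames, words in zip(frames_per_sec, words_per_sec):
        pieces.append(frames)                                # this second's visual tokens (2 frames)
        pieces.append(words if words.numel() else ELLIPSIS)  # "..." when the interval is silent
    return torch.cat(pieces, dim=0)

def alignment_loss(visual_tokens, word_tokens):
    """L_align = sum_{i,j} ||f(v_i) - g(w_j)||^2 over tokens paired within an interval."""
    v = f(visual_tokens).unsqueeze(1)   # (Nv, 1, 256)
    w = g(word_tokens).unsqueeze(0)     # (1, Nw, 256)
    return ((v - w) ** 2).sum()

seq = interleave(
    torch.randn(32, D),                                          # prompt/context embeddings
    [torch.randn(2 * 16, D) for _ in range(3)],                  # 2 FPS x 16 tokens per frame
    [torch.randn(4, D), torch.empty(0, D), torch.randn(6, D)],   # second interval is silent
)
loss = alignment_loss(torch.randn(32, D), torch.randn(4, D))
print(seq.shape, loss.item())  # torch.Size([139, 2048]) and a scalar loss value
```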

4. Agentic Capabilities and Benchmark Results

Through principled curriculum and structured trajectory pre-training, Youtu-LLM demonstrates strong native agentic intelligence. The 1.96B model closes more than 80% of the gap to Qwen3 4B on the APTBench probe of agentic skills across the code, deep-research, math, and tool-use domains. Instruct-tuned versions achieve a new state of the art among sub-2B models on practical agent and reasoning benchmarks (GAIA: 33.9%, xBench: 19.5%, SWE-Bench-Verified: 17.7%, EnConda: 21.5%), with competitive results against much larger open models (Lu et al., 31 Dec 2025).

On video-language tasks, Youtu-LLM achieves a 41.5% win rate against GPT-4o commentary in real-time mode on the LiveSports-3K benchmark (Qwen2.5-VL-7B-Instruct: 17.3%, LLaVA-Video-7B: 27.1%). It delivers a state-of-the-art 64.1% overall accuracy on VideoMME QA (without subtitles, comparable to Qwen2-VL-72B) and a frame-wise response latency of 0.17 s, an order of magnitude faster than previous 7B–72B video LLMs (Chen et al., 22 Apr 2025).

On language-centric STEM, reasoning, and code benchmarks, Youtu-LLM 2B Base matches or exceeds all open models below 7B and challenges 4B-scale baselines (MMLU-Pro 48.4%, GSM8K 77.6%, HumanEval 64.6%).

5. Implementation Details and Practical Deployment

Youtu-LLM's pre-training uses a batch size of 512, distributed across 128 A100 GPUs. Learning rates are 2e-5 for pre-training and 1e-5 for SFT. Context windows during pre-training support 480 frames (240s), with text contexts up to 2K tokens.
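
For reference, these reported hyperparameters can be collected into a simple configuration sketch; the field names are illustrative and not tied to any particular training framework.

```python
# Hypothetical config collecting the hyperparameters reported above.
train_config = {
    "global_batch_size": 512,
    "gpus": 128,               # A100s
    "lr_pretrain": 2e-5,
    "lr_sft": 1e-5,
    "max_video_frames": 480,   # 2 FPS x 240 s of video context
    "max_video_seconds": 240,
    "max_text_tokens": 2048,
}

per_gpu_batch = train_config["global_batch_size"] // train_config["gpus"]
print(per_gpu_batch)  # 4 sequences per GPU before any gradient accumulation
```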

KV-cache streaming decoding and Flash-Attention/Flash-Decoding optimizations reduce inference latency, with recommended truncation of visual history beyond 240s. For further speed and efficiency, lightweight frame encoders (e.g., MorphMLP) and aggressive KV-caching are advised. The system supports batching of real-time streaming requests across users (Chen et al., 22 Apr 2025).
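
A minimal sketch of the 240-second visual-history truncation during streaming decoding; the cache entry structure and eviction policy are assumptions for illustration.

```python
from collections import deque

# Keep cached visual KV entries only for the most recent 240 seconds of video.
VISUAL_WINDOW_S = 240.0

class StreamingVisualCache:
    def __init__(self, window_s: float = VISUAL_WINDOW_S):
        self.window_s = window_s
        self.entries = deque()  # (timestamp_s, kv_block) in arrival order

    def add(self, timestamp_s: float, kv_block) -> None:
        self.entries.append((timestamp_s, kv_block))
        # Evict visual tokens older than the attention window.
        while self.entries and timestamp_s - self.entries[0][0] > self.window_s:
            self.entries.popleft()

    def blocks(self):
        return [kv for _, kv in self.entries]

cache = StreamingVisualCache()
for t in range(0, 600, 10):      # 10-minute stream, one visual KV block every 10 s
    cache.add(float(t), f"kv@{t}s")
print(len(cache.blocks()))       # 25 blocks: only the last 240 s (plus the current block) retained
```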

Batching and data curation pipelines emphasize grouping similar-length clips, silence detection, pruning of low-informativeness sequences, and domain-adaptive filtering. The architecture and training recipes generalize to other video domains with task-specific fine-tuning.

6. Limitations and Prospects for Extension

Current limitations include potential degradation on very long contexts, challenges in precise celebrity and fine-grained entity recognition, and limited robustness to adversarial thumbnail or frame perturbations. Real-time deployment is constrained by visual-encoder and GPU throughput; further latency reductions can be obtained via hardware-aware model selection and micro-batching.

Future directions proposed include:

  • World-model integration for explicit environment simulation during planning
  • Exploration of diffusion-based LLM decoders for long-sequence efficiency
  • Multimodal agentic pre-training covering vision, audio, and OCR tools
  • Continual and user-personalized adaptation via lightweight, on-device fine-tuning
  • Real-time, on-upload moderation for platforms, and cross-platform adaptation (e.g., TikTok, Instagram Reels)

These developments position Youtu-LLM as a model family breaking the tradeoff between size and native agentic, multimodal reasoning ability, and as a blueprint for further advances in streaming, long-horizon video-language intelligence (Lu et al., 31 Dec 2025, Chen et al., 22 Apr 2025).
