LiveStar: Advanced Live Video Systems
- LiveStar names two live-video systems: a cross-domain recommendation engine and a streaming Video-LLM assistant, both built for dynamic real-time responsiveness.
- The recommendation engine leverages multi-task modeling, sequence search, and contrastive alignment to predict right-moment click-through rates.
- The Video-LLM assistant employs frame-level embedding, Peak-End Memory Compression, and streaming causal attention to boost semantic accuracy and inference speed.
LiveStar refers to two distinct yet technically advanced systems for live video platforms: (1) a real-time cross-domain recommendation engine designed and deployed at Kuaishou to predict right-moment click-through rates (CTR) for live-streams (Cao et al., 11 Aug 2024), and (2) a live-streaming assistant based on video large language models (Video-LLMs) for online video understanding and narration, capable of frame-by-frame context alignment, adaptive response-silence decoding, and memory-aware acceleration (Yang et al., 7 Nov 2025). Both instances of LiveStar address the complexity and temporal dynamics of live streaming, though with different algorithmic focuses and application domains.
1. System Architectures
LiveStar (Kuaishou Recommendation Engine)
LiveStar’s serving pipeline encompasses both offline training and online inference. Offline, a data-streaming engine aggregates real-time logs from short-video and live-streaming services, producing feature vectors per user–live pair every 30 s. These include up to 10,000 historical item IDs per user, detailed with timestamps, author, multimodal tags, and categories. Trainer modules perform embedding lookup for >400 raw fields, followed by sequence search and aggregation (using General Search Units, GSUs, and Exact Search Units, ESUs) and multi-task modeling (CTR/LVTR/LTR, etc.) through variants such as Progressive Layered Extraction (PLE) or Customized Gate Control (CGC), optimizing a sum of the “Moment” and cross-domain contrastive losses.
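As a hedged illustration of how these pieces might combine (the weights $\lambda_k$, $\alpha$, $\beta$ are assumptions, not values reported for the deployed system), the overall objective can be read as a weighted sum of per-task losses plus the Moment and cross-domain contrastive terms:

```latex
% Illustrative composition only; the task weights are hypothetical.
\mathcal{L}_{\text{total}}
  = \sum_{k \in \{\mathrm{CTR},\,\mathrm{LVTR},\,\mathrm{LTR},\,\dots\}} \lambda_k\,\mathcal{L}_k
  \;+\; \alpha\,\mathcal{L}_{\text{Moment}}
  \;+\; \beta\,\mathcal{L}_{\text{Cross}}
```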
Online, for each recommendation request, the engine: (1) retrieves real-time features, (2) queries GSUs for historical embeddings ($V_{\text{short}}$, $V_{\text{long}}$, etc.), (3) computes ESU attention summaries, (4) concatenates user, item, context, and ESU outputs, and (5) forwards through PLE (or MMoE) to estimate $\hat{y}_{\text{CTR}}$, $\hat{y}_{\text{LVTR}}$, $\hat{y}_{\text{LTR}}$, etc., producing scores and ranking candidates.
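The following toy NumPy sketch illustrates steps (2)–(5) with stand-in components; the helper names (`gsu_topL`, `esu_attention`, `multitask_head`), toy dimensions, and the linear multi-task head are illustrative assumptions, not the deployed implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gsu_topL(history, query, L=8):
    """General Search Unit stand-in: retrieve top-L history embeddings by inner product with the query."""
    scores = history @ query
    idx = np.argsort(-scores)[:L]
    return history[idx]

def esu_attention(query, keys):
    """Exact Search Unit stand-in: scaled dot-product attention summary over retrieved embeddings."""
    d = query.shape[-1]
    logits = keys @ query / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ keys

def multitask_head(x, W):
    """Stand-in for the PLE/MMoE towers: one sigmoid output per task."""
    return 1.0 / (1.0 + np.exp(-(W @ x)))

# Toy request: 64-d embeddings, two history domains of very different sizes.
d = 64
item_emb = rng.normal(size=d)
short_hist = rng.normal(size=(10_000, d))   # short-video history
live_hist = rng.normal(size=(500, d))       # sparser live-stream history
user_ctx = rng.normal(size=d)

esu_short = esu_attention(item_emb, gsu_topL(short_hist, item_emb))
esu_long = esu_attention(item_emb, gsu_topL(live_hist, item_emb))

# (4) concatenate user, item, and ESU outputs; (5) score the three tasks.
x = np.concatenate([user_ctx, item_emb, esu_short, esu_long])
W = rng.normal(size=(3, x.size)) * 0.01     # 3 hypothetical tasks: CTR, LVTR, LTR
y_ctr, y_lvtr, y_ltr = multitask_head(x, W)
print(f"CTR={y_ctr:.3f}  LVTR={y_lvtr:.3f}  LTR={y_ltr:.3f}")
```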
LiveStar (Video-LLM Assistant)
The assistant version comprises a multi-component pipeline (a toy sketch of the resulting streaming loop follows the list):
- Vision Encoder (InternViT) samples at 1–4 FPS, converting frames into 16-token patch embeddings.
- MLP Projector maps these frame embeddings into the LLM token space.
- LLM (InternLM2.5-7B) supports autoregressive text generation for captions, answers, or silence decisions.
- Streaming Causal Attention Masks (SCAM) enforce incremental frame-to-caption attention constraints during training.
- Streaming Verification Decoding (SVeD) provides an algorithmic gating mechanism that evaluates the need for new narration using forward-pass perplexity tests.
- Peak-End Memory Compression and streaming key-value (KV) cache optimize memory for long sequences, supporting frame-level context retention and speedup over baseline inference.
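As referenced above, the sketch below shows a minimal streaming loop these components could form. All functions are stubs (the real encoder is InternViT, the projector an MLP, the LLM InternLM2.5-7B), and the perplexity-threshold gate is one plausible reading of SVeD rather than its exact criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- stand-ins for the real components ---
def encode_frame(frame):
    """Vision-encoder stub: one frame -> 16 patch-token embeddings."""
    return rng.normal(size=(16, 4096))

def project(frame_tokens):
    """MLP-projector stub: map visual tokens into the LLM embedding space."""
    return frame_tokens  # identity, for the sketch only

def caption_perplexity(context_tokens, draft_caption):
    """LLM forward-pass stub returning a perplexity for a draft caption given the context."""
    return float(rng.uniform(1.0, 10.0))

def generate_caption(context_tokens):
    return "<new narration>"

def streaming_assistant(frames, ppl_threshold=4.0, window=64):
    """Per-frame encoding, SVeD-style response/silence gating, and a bounded context buffer."""
    context, narration = [], []
    for t, frame in enumerate(frames):
        context.extend(project(encode_frame(frame)))
        context = context[-window * 16:]   # crude stand-in for Peak-End / KV-cache compression
        # Gate: only narrate when the draft caption is sufficiently probable (low perplexity).
        if caption_perplexity(context, generate_caption(context)) < ppl_threshold:
            narration.append((t, generate_caption(context)))
        # otherwise: remain silent for this frame
    return narration

print(streaming_assistant(frames=range(20))[:3])
```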
2. Core Algorithms and Training Strategies
Moment Module (Kuaishou)
The Moment module in LiveStar targets temporally precise response prediction by leveraging a 30 s real-time reporting window and “first-only” label masking. The moment loss restricts supervision to the initial positive event of each behavior and to all exit events (both positive and negative), masking intermediate positives; temporal recency and freshness are encoded as time-aware terms that modulate the final prediction score.
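A minimal LaTeX sketch of one plausible form, assuming a masked binary cross-entropy over the retained events $\mathcal{M}$ (first positives plus all exits) with an exponential recency weight; the paper's exact formulation may differ:

```latex
% Illustrative only: the masked cross-entropy form and recency weight are assumptions.
\mathcal{L}_{\text{Moment}}
  = -\sum_{(u,i,t)\,\in\,\mathcal{M}} w(\Delta t_{u,i})
      \Big[ y_{u,i,t}\log \hat{y}_{u,i,t} + (1 - y_{u,i,t})\log\big(1 - \hat{y}_{u,i,t}\big) \Big],
\qquad w(\Delta t) = \exp(-\Delta t / \tau)
```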
Cross Module (Kuaishou)
Domain transfer leverages embeddings from short-video and live-stream histories, using GSUs to extract the top-L most relevant interactions. Contrastive alignment is then applied by minimizing losses between mixed-history sequences and domain-specific vectors.
ESU-based attention fuses these vectors into the main model input using scaled dot-product mechanisms.
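As an illustration only, an InfoNCE-style alignment between a user's mixed-history representation $h^{\mathrm{mix}}_u$ and a domain-specific vector $h^{\mathrm{live}}_u$ (symbols, negatives, and the temperature $\tau$ are assumptions, not the paper's exact contrastive loss) could be written as:

```latex
% Illustrative InfoNCE-style alignment; in-batch users u' serve as negatives.
\mathcal{L}_{\text{align}}
  = -\log \frac{\exp\!\big(\mathrm{sim}(h^{\mathrm{mix}}_u,\, h^{\mathrm{live}}_u)/\tau\big)}
               {\sum_{u'} \exp\!\big(\mathrm{sim}(h^{\mathrm{mix}}_u,\, h^{\mathrm{live}}_{u'})/\tau\big)}
```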
Streaming Alignment & Attention (Video-LLM)
LiveStar’s Video-LLM training objective generalizes classic video–text alignment to the streaming setting.
SCAM imposes clip-level constraints: caption tokens can attend only to prior frames, the last caption tokens of previous clips, and the local prefix tokens of the current caption. This enforces non-trivial frame–caption associations and guards against leakage of future information.
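A toy NumPy construction of such a mask is sketched below; `scam_mask` is a hypothetical helper, and the precise rules (e.g., whether frames attend bidirectionally within their own clip) are assumptions rather than the exact SCAM specification.

```python
import numpy as np

def scam_mask(clips):
    """
    Build a boolean attention mask in the spirit of SCAM (sketch only).
    `clips` is a list of (n_frame_tokens, n_caption_tokens) per clip; tokens are ordered
    [frames_0, caption_0, frames_1, caption_1, ...]. mask[i, j] == True means token i may attend to j.
    """
    spans, pos = [], 0
    for k, (nf, nc) in enumerate(clips):
        spans.append(("frame", k, pos, pos + nf)); pos += nf
        spans.append(("cap", k, pos, pos + nc)); pos += nc
    mask = np.zeros((pos, pos), dtype=bool)
    for kind_i, ki, si, ei in spans:
        for kind_j, kj, sj, ej in spans:
            if kind_j == "frame" and kj <= ki:
                mask[si:ei, sj:ej] = True              # attend to current and prior frames
            elif kind_j == "cap" and kj < ki:
                mask[si:ei, ej - 1:ej] = True          # only the last token of earlier captions
            elif kind_j == "cap" and kj == ki and kind_i == "cap":
                # causal prefix within the current caption
                mask[si:ei, sj:ej] = np.tril(np.ones((ei - si, ej - sj), dtype=bool))
    return mask

# Two toy clips with 4 frame tokens and 3 caption tokens each.
print(scam_mask([(4, 3), (4, 3)]).astype(int))
```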
3. Serving, Memory Efficiency, and Scalability
Recommendation Engine (Kuaishou)
Real-time feature refresh is executed every 30 s. Model updates occur through streaming training jobs on the 30 s data, with full retraining every two hours and incremental hourly refreshes. Serving infrastructure utilizes parameter-server embedding lookup and FPGA/GPU micro-VMs to keep p99 latency under 30 ms per shard at production request rates. Feature engineering comprises 400+ fields with 64–256-dimensional embeddings.
Streaming Assistant (Video-LLM)
Efficient online inference is maintained through Peak-End Memory Compression: semantically salient “peak” frames and “end” captions are retained, while others are pruned probabilistically within a sliding window. Dual-level streaming KV caches prevent redundant recomputation across transformer blocks, increasing FPS from $2.50$ to $3.82$ for 5-minute videos (a roughly 1.5× speedup). Uniform dropout and FIFO forgetting strategies prove empirically suboptimal, whereas Peak-End preserves semantic accuracy and low timing difference.
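A toy sketch of Peak-End-style pruning is given below; `peak_end_compress`, the deterministic top-k selection, and the keep ratios are illustrative assumptions (the described mechanism prunes probabilistically using the model's own saliency signals).

```python
import numpy as np

def peak_end_compress(frame_feats, saliency, keep_peaks=4, keep_end=8):
    """
    Toy Peak-End-style pruning over a buffered window of frames.
    Keeps the `keep_peaks` most salient frames plus the most recent `keep_end` frames,
    drops the rest, and returns the retained indices in temporal order.
    """
    n = len(frame_feats)
    end_idx = set(range(max(0, n - keep_end), n))                    # "end": always retain the recent tail
    peak_idx = set(np.argsort(-np.asarray(saliency))[:keep_peaks])   # "peak": most salient frames
    kept = sorted(int(i) for i in (end_idx | peak_idx))
    return kept, [frame_feats[i] for i in kept]

# Toy usage: 60 buffered frames with random saliency scores.
rng = np.random.default_rng(0)
feats = [rng.normal(size=16 * 64) for _ in range(60)]
sal = rng.uniform(size=60)
kept, compressed = peak_end_compress(feats, sal)
print(f"kept {len(compressed)} of {len(feats)} frames:", kept)
```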
4. Evaluation Datasets and Metrics
Kuaishou LiveStar (CTR Model)
Empirical assessment employs full-rank AUC and GAUC on held-out logs. PLE(+Moment+Cross) attains CTR AUC $82.75$ ($65.57$ GAUC) and gift AUC $96.32$ ($74.71$ GAUC). Removing the GSUs produces a measurable GAUC drop; ablating the cross-domain module degrades online metrics (WatchTime, GiftCount, Likes, Comments, Follows), with the effect most pronounced for low-gift users.
OmniStar Benchmark (Video-LLM)
The OmniStar dataset comprises 20,137 annotated live video streams across 15 scenarios and 46 sub-categories, structured into five online tasks: Real-time Narration Generation (RNG), Online Temporal Grounding (OTG), Frame-level Dense QA (FDQ), Contextual Online QA (COQ), and Multi-turn Interactive QA (MIQ). Metrics include Semantic Correctness (SemCor), Timing Difference (TimDiff), and Frames Per Second (FPS):
| Method | SemCor (↑) | TimDiff (↓) | FPS (↑) |
|---|---|---|---|
| MMDuet | 5.62 | 2.32 s | 0.91 |
| LiveStar | 6.69 | 1.90 s | 3.82 |
LiveStar exhibits higher SemCor, lower TimDiff, and higher FPS than prior baselines.
5. Data Imbalance, Limitations, and Prospective Improvements
Data Imbalance (Kuaishou)
Short-video exposures outnumber live exposures by a factor of nine, yielding sparsity in live data. The cross-domain module bridges this by introducing richly aligned short-video signals, with multi-task loss weighting and contrastive alignment (rather than explicit re-sampling) mitigating imbalance.
Limitations and Future Directions (Video-LLM)
Current implementation restricts frame encoding to 16 tokens, limiting fine-grained motion capture. Multimodal extensions (audio integration), adaptive memory window sizing, self-supervised pre-training, and hierarchical memory for hour-long interactions are highlighted as potential improvements.
A plausible implication is that memory compression mechanisms like Peak-End may generalize to other long-context streaming modalities beyond video.
6. Practical Significance and Broader Impact
LiveStar illustrates two prevailing trends in live-video system design: (1) domain-adaptive, temporally sensitive recommendation with multi-task objectives and cross-history attention (Kuaishou); and (2) streaming online video understanding with incremental multimodal alignment, silence-decoding, and memory-aware acceleration (Video-LLMs). Both approaches achieve substantial improvements in real-time responsiveness, semantic capture, throughput, and monetization. The framework generalizes to similar platforms with dynamic content, delayed feedback, and heterogeneous data sources (Cao et al., 11 Aug 2024, Yang et al., 7 Nov 2025).