Vidur-Search: Video and LLM Optimization Framework
- Vidur-Search is a modular framework that fuses multimodal features for indexing video content and optimizing LLM deployment configurations.
- It employs automated pipelines for audio (ASR), visual text (OCR), and image captioning, embedding these features with transformer models for contextual retrieval.
- The system uses exhaustive simulation and cost-aware search to enhance LLM performance, demonstrating significant improvements in throughput and resource efficiency.
Vidur-Search is a modular framework for video and LLM workload indexing, retrieval, and configuration optimization. The system encompasses large-scale multimodal content discovery for video archives as well as exhaustive, simulation-driven search over system deployment configurations for LLM inference. Distinct subcomponents ingest raw media, extract and structure rich metadata, implement semantic fusion for contextual search, and in the LLM context, simulate and optimize model deployment under cost and performance constraints.
1. Multimodal Feature Extraction and Serialization
Vidur-Search processes video inputs using pipelines for audio, visual text, and image context:
- Audio stream: End-to-end ASR models (near-human-parity commercial services such as Azure or AWS) extract transcripts, yielding time-stamped word and sub-word tokens at roughly 150 words per minute of speech (Nir et al., 2024).
- Visual text: OCR extraction employs Florence OCR, with frames sampled at 1 fps or at scene changes. Variant detections are clustered with multiple-sequence alignment to consolidate duplicate strings.
- Frame context: Image captioning uses Florence captioner (transformer-based), sampled at 2 sec intervals or by shot detection, outputting semantic frame descriptions.
All extracted features are serialized into tagged text streams using modality-specific tags ([ASR], [OCR], [CAP]). Scene-based segmentation groups consecutive tokens into segments of at most 4096 tokens, maintaining temporal ordering and content fidelity.
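The serialization step above can be sketched as follows. This is a minimal illustration, not the published implementation: the `Token` record, the tag-on-modality-change convention, and the word-level token cap are assumptions for clarity.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    t: float        # timestamp in seconds
    modality: str   # "ASR", "OCR", or "CAP"

def serialize(tokens, max_len=4096):
    """Merge modality streams into tagged, temporally ordered segments."""
    tokens = sorted(tokens, key=lambda tok: tok.t)
    segments, current, prev_mod = [], [], None
    for tok in tokens:
        if tok.modality != prev_mod:          # emit a tag when modality changes
            current.append(f"[{tok.modality}]")
            prev_mod = tok.modality
        current.append(tok.text)
        if len(current) >= max_len:           # close the segment at the token cap
            segments.append(" ".join(current))
            current, prev_mod = [], None
    if current:
        segments.append(" ".join(current))
    return segments
```

Sorting by timestamp before tagging preserves the temporal interleaving of speech, on-screen text, and captions inside each segment.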
For metadata-driven educational video indexing, targeted introductory video slices are obtained via ffmpeg trimming, followed by keyframe extraction using ffprobe I-frame selection. OCR (EasyOCR) is employed to obtain metadata terms with >85% extraction accuracy for attributes including Institute, Publisher, and Professor (Kumbham et al., 2023).
2. Semantic Fusion and Embedding Strategies
Vidur-Search employs late fusion: modalities are concatenated in tagged textual sequences, which are then embedded by transformer models (Nir et al., 2024). The process avoids raw tensor fusion or cross-modal attention outside the encoder.
- Supervised Embeddings: DeBERTa-v3-base (86M backbone parameters, hidden dimension 768) is fine-tuned with multilabel classification heads for topic inference over video segments. Sliding-window mean pooling over the segment embeddings e_1, …, e_N produces the final video embedding v = (1/N) Σ_{i=1}^{N} e_i.
- Learning-free Embeddings: OpenAI text-embedding-ada-002 embeds each segment directly into ℝ^1536, with mean pooling for full-video vectors.
Cosine similarity between query and segment embeddings ranks content for retrieval. Segment-max pooling determines a video's match score: score(V, q) = max_{s ∈ V} cos(e_q, e_s), where e_s ranges over the segment embeddings of video V. Rankings are produced by sorting these scores in descending order.
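Segment-max pooling over cosine similarities can be expressed compactly with NumPy. A minimal sketch, assuming embeddings are already computed; the function name and the dict layout of per-video segment matrices are illustrative.

```python
import numpy as np

def rank_videos(query_emb, video_segments):
    """Rank videos by segment-max cosine similarity to the query.

    query_emb: (d,) query embedding
    video_segments: dict video_id -> (n_i, d) array of segment embeddings
    """
    q = query_emb / np.linalg.norm(query_emb)
    scores = {}
    for vid, segs in video_segments.items():
        segs = segs / np.linalg.norm(segs, axis=1, keepdims=True)
        scores[vid] = float(np.max(segs @ q))   # segment-max pooling
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A single highly relevant segment thus suffices to rank a long video highly, which is the intended behavior of max pooling over segments.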
3. Topic Ontology, Indexing, and Retrieval
Vidur-Search constructs semantic topic ontologies for guided exploratory search (Nir et al., 2024):
- Topic taxonomies (TED.com, IPTC Media Topics) are embedded with the selected encoder, projected in 2D (t-SNE/UMAP) for visualization.
- Topics-Map is an interactive UI element where topic nodes are color-coded by cosine similarity to the active query. Selection, lasso, and query expansion via GPT-4 are supported.
Indexing is performed with FAISS or ScaNN vector indices. All segment embeddings are indexed offline, enabling efficient k-NN retrieval and video-level re-ranking with segment-max pooling.
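For illustration, the role the vector index plays can be shown with a brute-force inner-product index; FAISS's flat inner-product index performs the same exact k-NN search with optimized kernels, and ScaNN or FAISS IVF variants approximate it at scale. The class below is a stand-in sketch, not the FAISS API.

```python
import numpy as np

class ExactIPIndex:
    """Brute-force inner-product index (stand-in for a FAISS/ScaNN index)."""

    def __init__(self, dim):
        self.dim = dim
        self.vecs = np.empty((0, dim), dtype=np.float32)

    def add(self, x):
        # Append a batch of segment embeddings, shape (n, dim).
        self.vecs = np.vstack([self.vecs, x.astype(np.float32)])

    def search(self, q, k):
        # Inner products against all stored segments, top-k by score.
        sims = self.vecs @ q
        top = np.argsort(-sims)[:k]
        return sims[top], top
```

With unit-normalized embeddings, inner product equals cosine similarity, so the same index serves the cosine ranking described above.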
Retrieval evaluation employs Mean Reciprocal Rank (MRR), Recall@k, and Precision@k, with F1 metrics for multilabel topic assignments, demonstrating that multimodal fusion and fine-tuned embeddings substantially raise retrieval performance (e.g., MRR = 1.00 for multimodal fusion with OpenAI embeddings; F1_micro = 65% on TED topics).
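The two ranking metrics are standard; for concreteness, a minimal reference implementation (function names and input layout are illustrative):

```python
def mrr(ranked_lists, relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit.

    ranked_lists: one ranked list of item ids per query
    relevant: one set of relevant ids per query
    """
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        for i, item in enumerate(ranking, start=1):
            if item in rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant, k):
    """Fraction of relevant items retrieved in the top k, averaged over queries."""
    hits = [len(set(r[:k]) & rel) / len(rel)
            for r, rel in zip(ranked_lists, relevant)]
    return sum(hits) / len(hits)
```

An MRR of 1.00 means every query placed a relevant video at rank 1.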
4. LLM Deployment Configuration Search and Optimization
Vidur-Search incorporates a configuration-search tool built atop the Vidur LLM inference simulator (Agrawal et al., 2024). The process models LLM inference workloads and system configurations, optimizing deployment for cost, throughput, and latency.
Core Components:
- Configuration Enumerator: Enumerates deployment configurations C = (g, r, tp, pp, b, s, θ), where g is the GPU SKU, r is the replica count, tp/pp are the tensor- and pipeline-parallel degrees, b is the batch size, s is the scheduling policy, and θ denotes scheduler-specific knobs.
- Capacity Finder: For each configuration, event-driven simulation determines sustainable QPS (binary search on arrival rates), constrained by maximum P99 scheduling delay.
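The capacity-finding step above reduces to a one-dimensional binary search. A minimal sketch: `p99_delay_at` stands in for a run of the event-driven simulator at a given arrival rate and is assumed monotonically non-decreasing in QPS; the bracketing and tolerance values are illustrative.

```python
def find_capacity(p99_delay_at, slo_ms, lo=0.0, hi=1000.0, tol=0.1):
    """Binary-search the highest sustainable QPS whose simulated P99
    scheduling delay stays under the SLO."""
    # Expand hi until the SLO is violated so the boundary is bracketed
    # (assumes delay eventually exceeds the SLO as load grows).
    while p99_delay_at(hi) <= slo_ms:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if p99_delay_at(mid) <= slo_ms:
            lo = mid        # feasible: push capacity up
        else:
            hi = mid        # SLO violated: back off
    return lo
```

Each probe costs one simulation run, so capacity is found in O(log(range/tol)) simulations per configuration.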
- Cost/Metric Calculator: Computes Cost(C) (hardware cost) and QPS(C) (sustainable throughput), and records latency metrics (e.g., time-to-first-token and per-token latency percentiles) from simulation traces.
- Optimizer and Visualizer: Selects the highest throughput-per-dollar configuration satisfying SLO constraints, outputs Pareto-front plots.
Optimization Formulation:
Maximize throughput per dollar:
C* = argmax_C QPS(C) / Cost(C), subject to latency SLO constraints.
Alternatively, minimize cost at a target QPS:
C* = argmin_C Cost(C) subject to QPS(C) ≥ Q_target.
Exhaustive configuration enumeration combined with binary search over QPS yields full evaluation of the feasible deployment space in a matter of hours on a 96-core CPU (12.5 CPU hours in the reported experiments).
5. System Architecture, Workflow, and Implementation
The Vidur-Search design incorporates subsystems for both video retrieval and LLM configuration optimization (Kumbham et al., 2023, Nir et al., 2024, Agrawal et al., 2024):
Video Search Workflow:
| Stage | Functionality | Key Tools |
|---|---|---|
| Ingestion | Raw video capture & slicing | ffmpeg, youtube-dl |
| Feature Extraction | ASR transcript, OCR, frame captioning | Florence OCR, EasyOCR |
| Serialization | Merge/tag modality outputs, segment into fixed-length tokens | Python |
| Embedding | Transformer encoding of tagged segments | DeBERTa/OpenAI |
| Indexing | Vector DB for segment-level representations | FAISS, ScaNN |
| Ontology | Topic taxonomy embedding & layout | t-SNE/UMAP, D3 |
| Query/Visualization | UI for topics, GPT-4 expansion, retrieval, results | React, FastAPI |
LLM Deployment Workflow:
| Stage | Functionality | Key Tools |
|---|---|---|
| Config Enumeration | Generate all feasible hardware and workload configurations | Vidur |
| Simulation | Event-driven performance estimation | Vidur |
| Optimization | Objective computation, Pareto selection | Python |
| Visualization | Pareto fronts, SLO compliance plotting | Matplotlib |
Implementation leverages Python 3.9, PyTorch, HuggingFace, FAISS, React+D3, and standard REST APIs. GPUs are used for encoder fine-tuning; CPUs are sufficient for simulation and indexing.
6. Evaluation, Performance, and Open Challenges
Video Retrieval (Nir et al., 2024):
- Datasets: TED Talks (5,439 videos, ~150 topics), Augmented TDT2 (200k), MSR-VTT (10k).
- Metrics: Precision, Recall, F1; MRR, Recall@k; Perceived Precision@5 (user study).
- Results: Fine-tuned DeBERTa-v3 yields F1_micro = 65% on TED topic assignment; multimodal fusion with OpenAI embeddings achieves MRR = 1.00 on TED. A user study finds perceived Precision@5 = 82% versus TED.com and YouTube variants.
Metadata Indexing (Kumbham et al., 2023):
- Attribute extraction accuracy: Publisher (88.03%), Institute (88.88%), Department (82.47%), Professor (85.89%) on held-out NPTEL subset.
- Search performance: Sub-millisecond query intersection for attribute-level inverted indices.
- No throughput, latency, or storage burden figures were measured; recommended experimental setup involves a 10,000 video, 8-core/32GB RAM baseline.
LLM Configuration Search (Agrawal et al., 2024):
- Models: LLaMA-2 (7B, 70B), InternLM-20B, Qwen-72B.
- Cost savings: ~35k configurations on GPU (1.14M GPU hours, \$1.14M) vs. Vidur-Search simulation (12.5 CPU hours, \$125).
- Simulator fidelity: Median/P95 request time error ≤3.3%, P99 end-to-end latency error ≤5% near 85% load.
Limitations and Open Challenges
- Content extraction pipelines assume quality inputs; OCR and ASR degrade on low-fidelity sources (Kumbham et al., 2023, Nir et al., 2024).
- Subject/topic extraction in educational video requires dedicated classifiers; no audio-based metadata in base pipeline.
- Large vocabularies and domain drift challenge fuzzy matching; potential remedies include pre-clustering or MinHash-LSH.
- LLM configuration search employs exhaustive enumeration; space expansion necessitates Bayesian/genetic search or multi-objective optimization (Agrawal et al., 2024).
- Near-capacity deployment presents sensitivity to simulation error; live telemetry or online adaptation are suggested extensions.
7. Context, Related Systems, and Implications
Vidur-Search integrates state-of-the-art methods from multimodal fusion, semantic search, and scalable LLM workload simulation. Unlike early fusion systems, which apply joint modeling of raw modalities, Vidur-Search’s concatenation-and-encode paradigm leverages transformer architectures to synthesize contextually rich representations (Nir et al., 2024). The system supports both static (offline) and dynamic (online) evaluations.
A plausible implication is that Vidur-Search forms a bridge between content-based video retrieval and workload-sensitive LLM deployment, offering a generalizable architecture for both information retrieval and large-scale performance optimization in educational and generic media contexts. Further, its cost-saving simulation-driven design marks a substantial advance over brute-force hardware experimentation. Extensions to distributed and multi-objective settings will further increase its scope and practical impact.