Vidur-Search: Video and LLM Optimization Framework
- Vidur-Search is a modular framework that fuses multimodal features for indexing video content and optimizing LLM deployment configurations.
- It employs automated pipelines for audio (ASR), visual text (OCR), and image captioning, embedding these features with transformer models for contextual retrieval.
- The system uses exhaustive simulation and cost-aware search to enhance LLM performance, demonstrating significant improvements in throughput and resource efficiency.
Vidur-Search is a modular framework for video and LLM workload indexing, retrieval, and configuration optimization. The system encompasses large-scale multimodal content discovery for video archives as well as exhaustive, simulation-driven search over system deployment configurations for LLM inference. Distinct subcomponents ingest raw media, extract and structure rich metadata, implement semantic fusion for contextual search, and in the LLM context, simulate and optimize model deployment under cost and performance constraints.
1. Multimodal Feature Extraction and Serialization
Vidur-Search processes video inputs using pipelines for audio, visual text, and image context:
- Audio stream: End-to-end ASR models (near-human-parity commercial services such as Azure or AWS) extract transcripts, yielding time-stamped word and sub-word tokens at roughly 150 words per minute of speech (Nir et al., 2024).
- Visual text: OCR extraction employs Florence OCR, with frames sampled at 1 fps or at scene changes. Variant detections are clustered with multiple-sequence alignment to consolidate duplicate strings.
- Frame context: Image captioning uses Florence captioner (transformer-based), sampled at 2 sec intervals or by shot detection, outputting semantic frame descriptions.
All extracted features are serialized into tagged text streams using modality-specific tags ([ASR], [OCR], [CAP]). Scene-based segmentation groups consecutive tokens into segments of at most 4096 tokens, maintaining temporal ordering and content fidelity.
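The serialization step above can be sketched as follows. This is a minimal illustration, not the published implementation: the `Token` record, the tag-on-modality-change convention, and the word-level token cap are assumptions for clarity.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    t: float        # timestamp in seconds
    modality: str   # "ASR", "OCR", or "CAP"

def serialize(tokens, max_len=4096):
    """Merge modality streams into tagged, temporally ordered segments."""
    tokens = sorted(tokens, key=lambda tok: tok.t)
    segments, current, prev_mod = [], [], None
    for tok in tokens:
        if tok.modality != prev_mod:          # emit a tag when modality changes
            current.append(f"[{tok.modality}]")
            prev_mod = tok.modality
        current.append(tok.text)
        if len(current) >= max_len:           # close the segment at the token cap
            segments.append(" ".join(current))
            current, prev_mod = [], None
    if current:
        segments.append(" ".join(current))
    return segments
```

Sorting by timestamp before tagging preserves the temporal interleaving of speech, on-screen text, and captions inside each segment.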
For metadata-driven educational video indexing, targeted introductory video slices are obtained via ffmpeg trimming, followed by keyframe extraction using ffprobe I-frame selection. OCR (EasyOCR) is employed to obtain metadata terms with >85% extraction accuracy for attributes including Institute, Publisher, and Professor (Kumbham et al., 2023).
2. Semantic Fusion and Embedding Strategies
Vidur-Search employs late fusion: modalities are concatenated in tagged textual sequences, which are then embedded by transformer models (Nir et al., 2024). The process avoids raw tensor fusion or cross-modal attention outside the encoder.
- Supervised Embeddings: DeBERTa-v3-base (86M backbone parameters, hidden dimension 768) is fine-tuned with multilabel classification heads for topic inference over video segments. Sliding-window mean pooling over the segment embeddings e_1, …, e_N produces the final video embedding v = (1/N) Σ_{i=1}^{N} e_i.
- Learning-free Embeddings: OpenAI text-embedding-ada-002 embeds each segment directly into ℝ^1536, with mean pooling for full-video vectors.
Cosine similarity between query and segment embeddings ranks content for retrieval. Segment-max pooling determines a video's match score: score(V, q) = max_{s ∈ V} cos(e_q, e_s), where e_s ranges over the segment embeddings of video V. Rankings are produced by sorting these scores in descending order.
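Segment-max pooling over cosine similarities can be expressed compactly with NumPy. A minimal sketch, assuming embeddings are already computed; the function name and the dict layout of per-video segment matrices are illustrative.

```python
import numpy as np

def rank_videos(query_emb, video_segments):
    """Rank videos by segment-max cosine similarity to the query.

    query_emb: (d,) query embedding
    video_segments: dict video_id -> (n_i, d) array of segment embeddings
    """
    q = query_emb / np.linalg.norm(query_emb)
    scores = {}
    for vid, segs in video_segments.items():
        segs = segs / np.linalg.norm(segs, axis=1, keepdims=True)
        scores[vid] = float(np.max(segs @ q))   # segment-max pooling
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A single highly relevant segment thus suffices to rank a long video highly, which is the intended behavior of max pooling over segments.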
3. Topic Ontology, Indexing, and Retrieval
Vidur-Search constructs semantic topic ontologies for guided exploratory search (Nir et al., 2024):
- Topic taxonomies (TED.com, IPTC Media Topics) are embedded with the selected encoder, projected in 2D (t-SNE/UMAP) for visualization.
- Topics-Map is an interactive UI element where topic nodes are color-coded by cosine similarity to the active query. Selection, lasso, and query expansion via GPT-4 are supported.
Indexing is performed with FAISS or ScaNN vector indices. All segment embeddings are indexed offline, enabling efficient k-NN retrieval and video-level re-ranking with segment-max pooling.
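For illustration, the role the vector index plays can be shown with a brute-force inner-product index; FAISS's flat inner-product index performs the same exact k-NN search with optimized kernels, and ScaNN or FAISS IVF variants approximate it at scale. The class below is a stand-in sketch, not the FAISS API.

```python
import numpy as np

class ExactIPIndex:
    """Brute-force inner-product index (stand-in for a FAISS/ScaNN index)."""

    def __init__(self, dim):
        self.dim = dim
        self.vecs = np.empty((0, dim), dtype=np.float32)

    def add(self, x):
        # Append a batch of segment embeddings, shape (n, dim).
        self.vecs = np.vstack([self.vecs, x.astype(np.float32)])

    def search(self, q, k):
        # Inner products against all stored segments, top-k by score.
        sims = self.vecs @ q
        top = np.argsort(-sims)[:k]
        return sims[top], top
```

With unit-normalized embeddings, inner product equals cosine similarity, so the same index serves the cosine ranking described above.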
Retrieval evaluation employs Mean Reciprocal Rank (MRR), Recall@k, and Precision@k, with F1 metrics for multilabel topic assignments, demonstrating that multimodal fusion and fine-tuned embeddings substantially raise retrieval performance (e.g., MRR = 1.00 for multimodal fusion with OpenAI embeddings; F1_micro = 65% on TED topics).
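The two ranking metrics are standard; for concreteness, a minimal reference implementation (function names and input layout are illustrative):

```python
def mrr(ranked_lists, relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit.

    ranked_lists: one ranked list of item ids per query
    relevant: one set of relevant ids per query
    """
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        for i, item in enumerate(ranking, start=1):
            if item in rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant, k):
    """Fraction of relevant items retrieved in the top k, averaged over queries."""
    hits = [len(set(r[:k]) & rel) / len(rel)
            for r, rel in zip(ranked_lists, relevant)]
    return sum(hits) / len(hits)
```

An MRR of 1.00 means every query placed a relevant video at rank 1.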
4. LLM Deployment Configuration Search and Optimization
Vidur-Search incorporates a configuration-search tool built atop the Vidur LLM inference simulator (Agrawal et al., 2024). The process models LLM inference workloads and system configurations, optimizing deployment for cost, throughput, and latency.
Core Components:
- Configuration Enumerator: Enumerates deployment configurations C = (g, r, tp, pp, b, s, θ), where g is the GPU SKU, r is the replica count, tp/pp are the tensor- and pipeline-parallel degrees, b is the batch size, s is the scheduling policy, and θ denotes scheduler-specific knobs.
- Capacity Finder: For each configuration, event-driven simulation determines sustainable QPS (binary search on arrival rates), constrained by maximum P99 scheduling delay.
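The capacity-finding step above reduces to a one-dimensional binary search. A minimal sketch: `p99_delay_at` stands in for a run of the event-driven simulator at a given arrival rate and is assumed monotonically non-decreasing in QPS; the bracketing and tolerance values are illustrative.

```python
def find_capacity(p99_delay_at, slo_ms, lo=0.0, hi=1000.0, tol=0.1):
    """Binary-search the highest sustainable QPS whose simulated P99
    scheduling delay stays under the SLO."""
    # Expand hi until the SLO is violated so the boundary is bracketed
    # (assumes delay eventually exceeds the SLO as load grows).
    while p99_delay_at(hi) <= slo_ms:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if p99_delay_at(mid) <= slo_ms:
            lo = mid        # feasible: push capacity up
        else:
            hi = mid        # SLO violated: back off
    return lo
```

Each probe costs one simulation run, so capacity is found in O(log(range/tol)) simulations per configuration.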
- Cost/Metric Calculator: Computes Cost(C) (hardware cost) and QPS(C) (sustainable throughput), and records latency metrics (e.g., time-to-first-token and per-token latency percentiles) from simulation traces.
- Optimizer and Visualizer: Selects the highest throughput-per-dollar configuration satisfying SLO constraints, outputs Pareto-front plots.
Optimization Formulation:
Maximize throughput per dollar:
C* = argmax_C QPS(C) / Cost(C), subject to latency SLO constraints.
Alternatively, minimize cost at a target QPS:
C* = argmin_C Cost(C) subject to QPS(C) ≥ Q_target.
Exhaustive configuration enumeration combined with binary search over QPS yields full evaluation of the feasible deployment space in a matter of hours on a 96-core CPU (12.5 CPU hours in the reported experiments).
5. System Architecture, Workflow, and Implementation
The Vidur-Search design incorporates subsystems for both video retrieval and LLM configuration optimization (Kumbham et al., 2023, Nir et al., 2024, Agrawal et al., 2024):
Video Search Workflow:
| Stage | Functionality | Key Tools |
|---|---|---|
| Ingestion | Raw video capture & slicing | ffmpeg, youtube-dl |
| Feature Extraction | ASR transcript, OCR, frame captioning | Florence OCR, EasyOCR |
| Serialization | Merge/tag modality outputs, segment into fixed-length tokens | Python |
| Embedding | Transformer encoding of tagged segments | DeBERTa/OpenAI |
| Indexing | Vector DB for segment-level representations | FAISS, ScaNN |
| Ontology | Topic taxonomy embedding & layout | t-SNE/UMAP, D3 |
| Query/Visualization | UI for topics, GPT-4 expansion, retrieval, results | React, FastAPI |
LLM Deployment Workflow:
| Stage | Functionality | Key Tools |
|---|---|---|
| Config Enumeration | Generate all feasible hardware and workload configurations | Vidur |
| Simulation | Event-driven performance estimation | Vidur |
| Optimization | Objective computation, Pareto selection | Python |
| Visualization | Pareto fronts, SLO compliance plotting | Matplotlib |
Implementation leverages Python 3.9, PyTorch, HuggingFace, FAISS, React+D3, and standard REST APIs. GPUs are used for encoder fine-tuning; CPUs are sufficient for simulation and indexing.
6. Evaluation, Performance, and Open Challenges
Video Retrieval (Nir et al., 2024):
- Datasets: TED Talks (5,439 videos, ~150 topics), Augmented TDT2 (200k), MSR-VTT (10k).
- Metrics: Precision, Recall, F1; MRR, Recall@k; Perceived Precision@5 (user study).
- Results: Fine-tuned DeBERTa-v3 yields F1_micro = 65% on TED topic assignment; multimodal fusion with OpenAI embeddings achieves MRR = 1.00 on TED. A user study finds perceived Precision@5 = 82% versus TED.com and YouTube variants.
Metadata Indexing (Kumbham et al., 2023):
- Attribute extraction accuracy: Publisher (88.03%), Institute (88.88%), Department (82.47%), Professor (85.89%) on held-out NPTEL subset.
- Search performance: Sub-millisecond query intersection for attribute-level inverted indices.
- No throughput, latency, or storage burden figures were measured; recommended experimental setup involves a 10,000 video, 8-core/32GB RAM baseline.
LLM Configuration Search (Agrawal et al., 2024):
- Models: LLaMA-2 (7B, 70B), InternLM-20B, Qwen-72B.
- Cost savings: ~35k configurations on GPU (1.14M GPU hours, \$1.14M) vs. Vidur-Search simulation (12.5 CPU hours, \$125).
- Simulator fidelity: Median/P95 request time error ≤3.3%, P99 end-to-end latency error ≤5% near 85% load.
Limitations and Open Challenges
- Content extraction pipelines assume quality inputs; OCR and ASR degrade on low-fidelity sources (Kumbham et al., 2023, Nir et al., 2024).
- Subject/topic extraction in educational video requires dedicated classifiers; no audio-based metadata in base pipeline.
- Large vocabularies and domain drift challenge fuzzy matching; potential remedies include pre-clustering or MinHash-LSH.
- LLM configuration search employs exhaustive enumeration; space expansion necessitates Bayesian/genetic search or multi-objective optimization (Agrawal et al., 2024).
- Near-capacity deployment presents sensitivity to simulation error; live telemetry or online adaptation are suggested extensions.
7. Context, Related Systems, and Implications
Vidur-Search integrates state-of-the-art methods from multimodal fusion, semantic search, and scalable LLM workload simulation. Unlike early fusion systems, which apply joint modeling of raw modalities, Vidur-Search’s concatenation-and-encode paradigm leverages transformer architectures to synthesize contextually rich representations (Nir et al., 2024). The system supports both static (offline) and dynamic (online) evaluations.
A plausible implication is that Vidur-Search forms a bridge between content-based video retrieval and workload-sensitive LLM deployment, offering a generalizable architecture for both information retrieval and large-scale performance optimization in educational and generic media contexts. Further, its cost-saving simulation-driven design marks a substantial advance over brute-force hardware experimentation. Extensions to distributed and multi-objective settings will further increase its scope and practical impact.