
Vidur-Search: Video and LLM Optimization Framework

Updated 20 January 2026
  • Vidur-Search is a modular framework that fuses multimodal features for indexing video content and optimizing LLM deployment configurations.
  • It employs automated pipelines for audio (ASR), visual text (OCR), and image captioning, embedding these features with transformer models for contextual retrieval.
  • The system uses exhaustive simulation and cost-aware search to enhance LLM performance, demonstrating significant improvements in throughput and resource efficiency.

Vidur-Search is a modular framework for video and LLM workload indexing, retrieval, and configuration optimization. The system encompasses large-scale multimodal content discovery for video archives as well as exhaustive, simulation-driven search over system deployment configurations for LLM inference. Distinct subcomponents ingest raw media, extract and structure rich metadata, implement semantic fusion for contextual search, and in the LLM context, simulate and optimize model deployment under cost and performance constraints.

1. Multimodal Feature Extraction and Serialization

Vidur-Search processes video inputs using pipelines for audio, visual text, and image context:

  • Audio stream: End-to-end ASR models (human-parity Azure/AWS services) produce transcripts with time-stamped word and sub-word tokens at ~150 wpm (Nir et al., 2024).
  • Visual text: OCR extraction employs Florence OCR, with frames sampled at 1 fps or at scene changes. Variant detections are clustered via multiple-sequence alignment to consolidate duplicate strings.
  • Frame context: Image captioning uses Florence captioner (transformer-based), sampled at 2 sec intervals or by shot detection, outputting semantic frame descriptions.

All extracted features are serialized into tagged text streams using modality-specific tags ([ASR], [OCR], [CAP]). Scene-based segmentation groups consecutive tokens into ≤4,096-token segments, maintaining temporal ordering and content fidelity.
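The tagging-and-segmentation step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, and whitespace tokenization stands in for the real tokenizer's token counts.

```python
def serialize_segments(events, max_tokens=4096):
    """Merge time-ordered (timestamp, modality, text) events into tagged,
    fixed-budget text segments, preserving temporal order.
    Whitespace tokenization is a stand-in for the real tokenizer."""
    tags = {"asr": "[ASR]", "ocr": "[OCR]", "cap": "[CAP]"}
    segments, current, used = [], [], 0
    for ts, modality, text in sorted(events, key=lambda e: e[0]):
        piece = f"{tags[modality]} {text}"
        n = len(piece.split())
        # Start a new segment when the token budget would be exceeded
        if used + n > max_tokens and current:
            segments.append(" ".join(current))
            current, used = [], 0
        current.append(piece)
        used += n
    if current:
        segments.append(" ".join(current))
    return segments
```

Keeping segments under the budget lets each one be embedded in a single encoder pass without truncation.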

For metadata-driven educational video indexing, targeted introductory video slices are obtained via ffmpeg trimming, followed by keyframe extraction using ffprobe I-frame selection. OCR (EasyOCR) is employed to obtain metadata terms with >85% extraction accuracy for attributes including Institute, Publisher, and Professor (Kumbham et al., 2023).
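The trimming and keyframe-probing steps can be expressed as command lines like the following. This sketch only constructs the argument lists (to be executed with, e.g., `subprocess.run`); the helper names and the specific trim window are illustrative, while the ffmpeg/ffprobe flags are standard.

```python
def trim_cmd(src, dst, start_s=0.0, duration_s=60.0):
    # ffmpeg stream-copy trim of the introductory slice (no re-encode)
    return ["ffmpeg", "-ss", str(start_s), "-t", str(duration_s),
            "-i", src, "-c", "copy", dst]

def iframe_probe_cmd(src):
    # ffprobe listing of per-frame picture type and timestamp; rows with
    # pict_type "I" are then filtered from the CSV output as keyframes
    return ["ffprobe", "-v", "error", "-select_streams", "v:0",
            "-show_entries", "frame=pict_type,pts_time",
            "-of", "csv=p=0", src]
```

The extracted keyframes are then passed to EasyOCR for attribute-term recognition.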

2. Semantic Fusion and Embedding Strategies

Vidur-Search employs late fusion: modalities are concatenated in tagged textual sequences, which are then embedded by transformer models (Nir et al., 2024). The process avoids raw tensor fusion or cross-modal attention outside the encoder.

  • Supervised Embeddings: DeBERTa-v3-base (86M params, hidden_dim=768) is fine-tuned with multilabel classification heads for topic inference over video segments. Sliding-window mean pooling produces the final video embedding: $v_v = \mathrm{mean}_{w \in \mathrm{windows}}\, \mathrm{hidden}_w \in \mathbb{R}^{768}$.
  • Learning-free Embeddings: OpenAI text-embedding-ada-002 embeds segments directly in $\mathbb{R}^{1536}$, with mean pooling for full-video vectors.

Cosine similarity between query and segment embeddings ranks content for retrieval. Segment-max pooling determines a video's match score: $\mathrm{score}(v; q) = \max_{i \in \mathrm{segments}(v)} \cos(e_q, e_i)$. Rankings are produced by sorting these scores in descending order.
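The segment-max scoring rule is easy to state in code. A minimal dependency-free sketch (the function names are illustrative; a production system would use vectorized operations over the index):

```python
import math

def cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def rank_videos(query_emb, video_segments):
    """video_segments: {video_id: [segment embedding, ...]}.
    Score each video by its best-matching segment (segment-max pooling),
    then sort descending to produce the ranking."""
    scores = {v: max(cos(query_emb, e) for e in segs)
              for v, segs in video_segments.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Max pooling over segments means one strongly matching passage is enough to surface a long video, which suits topic-level retrieval.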

3. Topic Ontology, Indexing, and Retrieval

Vidur-Search constructs semantic topic ontologies for guided exploratory search (Nir et al., 2024):

  • Topic taxonomies (TED.com, IPTC Media Topics) are embedded with the selected encoder and projected to 2D (t-SNE/UMAP) for visualization.
  • Topics-Map is an interactive UI element where topic nodes are color-coded by cosine similarity to the active query. Selection, lasso, and query expansion via GPT-4 are supported.

Indexing is performed with FAISS or ScaNN vector indices. All segment embeddings are indexed offline, enabling efficient k-NN retrieval and video-level re-ranking with segment-max pooling.

Retrieval evaluation employs Mean Reciprocal Rank (MRR), Recall@k, and Precision@k, with F1 metrics for multilabel topic assignments, demonstrating that multimodal fusion and fine-tuned embeddings substantially raise retrieval performance (e.g., MRR = 1.00 for multimodal fusion with OpenAI embeddings; $\mathrm{F1}_{\mathrm{micro}} = 65\%$ on TED topics).
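For reference, the two rank-based metrics can be computed as below. This is a generic sketch of the standard definitions, not the paper's evaluation harness; names are illustrative.

```python
def mrr(ranked_lists, relevant):
    """Mean Reciprocal Rank.
    ranked_lists: per-query ranked document ids; relevant: per-query sets."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant):
        for i, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / i  # reciprocal rank of first relevant hit
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant, k):
    """Fraction of relevant items retrieved in the top k, averaged over queries."""
    per_query = (len(set(r[:k]) & rel) / len(rel)
                 for r, rel in zip(ranked_lists, relevant))
    return sum(per_query) / len(ranked_lists)
```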

4. LLM Deployment Configuration Search and Optimization

Vidur-Search incorporates a configuration-search tool built atop the Vidur LLM inference simulator (Agrawal et al., 2024). The process models LLM inference workloads and system configurations, optimizing deployment for cost, throughput, and latency.

Core Components:

  • Configuration Enumerator: Enumerates deployment configurations $\mathbf{x} = (s,\, r,\, p_{\mathrm{tp}},\, p_{\mathrm{pp}},\, b,\, \sigma,\, \phi)$, where $s$ is the GPU SKU, $r$ the replica count, $p_{\mathrm{tp}}$ and $p_{\mathrm{pp}}$ the tensor- and pipeline-parallel degrees, $b$ the batch size, $\sigma$ the scheduling policy, and $\phi$ scheduler-specific knobs.
  • Capacity Finder: For each configuration, event-driven simulation determines sustainable QPS (binary search on arrival rates), constrained by maximum P99 scheduling delay.
  • Cost/Metric Calculator: Computes $C(\mathbf{x})$ (hardware cost) and $T(\mathbf{x})$ (throughput), and records latency metrics $L_{\mathrm{TTFT}}$ and $L_{\mathrm{TBT}}$ from simulation traces.
  • Optimizer and Visualizer: Selects the highest throughput-per-dollar configuration satisfying SLO constraints, outputs Pareto-front plots.
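The Capacity Finder's binary search can be sketched as follows. Here `p99_delay_at` is a hypothetical stand-in for a full event-driven simulation run at a given arrival rate; the bounds and iteration count are illustrative defaults.

```python
def find_capacity(p99_delay_at, max_p99_s=2.0, lo=0.0, hi=512.0, iters=30):
    """Binary search for the highest sustainable QPS such that the simulated
    P99 scheduling delay stays under the SLO.
    p99_delay_at(qps) -> simulated P99 scheduling delay in seconds."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if p99_delay_at(mid) <= max_p99_s:
            lo = mid   # feasible: push the arrival rate higher
        else:
            hi = mid   # infeasible: back off
    return lo
```

Because P99 scheduling delay is monotone in arrival rate, the search converges to the capacity boundary in ~30 simulation runs per configuration.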

Optimization Formulation:

Maximize throughput per dollar:

$$\max_{\mathbf{x}} \frac{T(\mathbf{x})}{C(\mathbf{x})} \quad \text{s.t.} \quad L_{\mathrm{TTFT}}(\mathbf{x}) \le L_{\max},\; L_{\mathrm{TBT}}(\mathbf{x}) \le T_{\max},\; r \le R_{\max}$$

Alternatively, minimize cost at a target QPS:

$$\min_{\mathbf{x}} C(\mathbf{x}) \quad \text{s.t.} \quad T(\mathbf{x}) \ge T_{\min},\; L_{\mathrm{TTFT}}(\mathbf{x}) \le L_{\max},\; L_{\mathrm{TBT}}(\mathbf{x}) \le T_{\max}$$

Exhaustive configuration enumeration combined with binary search for sustainable QPS yields full evaluation of the feasible deployment space in $O(1\,\text{hr})$ on a 96-core CPU.
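The throughput-per-dollar formulation reduces to a filter-and-argmax over the enumerated space. A minimal sketch, assuming the metric callables (`capacity`, `cost`, `ttft`, `tbt`) wrap simulator outputs; all names and the reduced configuration tuple are illustrative:

```python
from itertools import product

def best_config(skus, tps, pps, capacity, cost, ttft, tbt, l_max, t_max):
    """Exhaustively enumerate (sku, tp, pp) configurations, discard those
    violating the latency SLOs, and keep the best throughput-per-dollar."""
    best, best_val = None, float("-inf")
    for x in product(skus, tps, pps):
        if ttft(x) > l_max or tbt(x) > t_max:
            continue  # SLO violation: configuration is infeasible
        val = capacity(x) / cost(x)  # the T(x)/C(x) objective
        if val > best_val:
            best, best_val = x, val
    return best, best_val
```

The min-cost-at-target-QPS variant swaps the objective for `cost(x)` and adds a `capacity(x) >= t_min` feasibility check.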

5. System Architecture, Workflow, and Implementation

The Vidur-Search design incorporates subsystems for both video retrieval and LLM configuration optimization (Kumbham et al., 2023, Nir et al., 2024, Agrawal et al., 2024):

Video Search Workflow:

| Stage | Functionality | Key Tools |
|---|---|---|
| Ingestion | Raw video capture & slicing | ffmpeg, youtube-dl |
| Feature Extraction | ASR transcript, OCR, frame captioning | Florence OCR, EasyOCR |
| Serialization | Merge/tag modality outputs, segment into fixed-length tokens | Python |
| Embedding | Transformer encoding of tagged segments | DeBERTa/OpenAI |
| Indexing | Vector DB for segment-level representations | FAISS, ScaNN |
| Ontology | Topic taxonomy embedding & layout | t-SNE/UMAP, D3 |
| Query/Visualization | UI for topics, GPT-4 expansion, retrieval, results | React, FastAPI |

LLM Deployment Workflow:

| Stage | Functionality | Key Tools |
|---|---|---|
| Config Enumeration | Generate all feasible hardware/workload configurations | Vidur |
| Simulation | Event-driven performance estimation | Vidur |
| Optimization | Objective computation, Pareto selection | Python |
| Visualization | Pareto fronts, SLO compliance plotting | Matplotlib |

Implementation leverages Python 3.9, PyTorch, HuggingFace, FAISS, React+D3, and standard REST APIs. GPUs are used for encoder fine-tuning; CPUs are sufficient for simulation and indexing.

6. Evaluation, Performance, and Open Challenges

  • Datasets: TED Talks (5,439 videos, ~150 topics), Augmented TDT2 (200k), MSR-VTT (10k).
  • Metrics: Precision$_{\mathrm{micro}}$, Recall$_{\mathrm{micro}}$, F1$_{\mathrm{micro}}$; MRR, Recall@k; Perceived Precision@5 (user study).
  • Results: Fine-tuned DeBERTa-v3 yields F1$_{\mathrm{micro}} \approx 65\%$; multimodal fusion with OpenAI embeddings achieves MRR = 1.00 on TED. A user study finds perceived Precision@5 = 82% versus TED.com and YouTube variants.
  • Attribute extraction accuracy: Publisher (88.03%), Institute (88.88%), Department (82.47%), Professor (85.89%) on held-out NPTEL subset.
  • Search performance: Sub-millisecond query intersection for attribute-level inverted indices.
  • No throughput, latency, or storage-overhead figures were measured; the recommended experimental setup is a 10,000-video corpus on an 8-core/32 GB RAM baseline.
  • Models: LLaMA-2 (7B, 70B), InternLM-20B, Qwen-72B.
  • Cost savings: Evaluating ~35k configurations on real GPUs would require ~1.14M GPU hours (~$1.14M); Vidur-Search simulation covers the same space in 12.5 CPU hours (~$125).
  • Simulator fidelity: Median/P95 request time error ≤3.3%, P99 end-to-end latency error ≤5% near 85% load.

Limitations and Open Challenges

  • Content extraction pipelines assume quality inputs; OCR and ASR degrade on low-fidelity sources (Kumbham et al., 2023, Nir et al., 2024).
  • Subject/topic extraction in educational video requires dedicated classifiers; no audio-based metadata in base pipeline.
  • Large vocabularies and domain drift challenge fuzzy matching; potential remedies include pre-clustering or MinHash-LSH.
  • LLM configuration search employs exhaustive enumeration; space expansion necessitates Bayesian/genetic search or multi-objective optimization (Agrawal et al., 2024).
  • Near-capacity deployment presents sensitivity to simulation error; live telemetry or online adaptation are suggested extensions.

Vidur-Search integrates state-of-the-art methods from multimodal fusion, semantic search, and scalable LLM workload simulation. Unlike early fusion systems, which apply joint modeling of raw modalities, Vidur-Search’s concatenation-and-encode paradigm leverages transformer architectures to synthesize contextually rich representations (Nir et al., 2024). The system supports both static (offline) and dynamic (online) evaluations.

A plausible implication is that Vidur-Search forms a bridge between content-based video retrieval and workload-sensitive LLM deployment, offering a generalizable architecture for both information retrieval and large-scale performance optimization in educational and generic media contexts. Further, its cost-saving simulation-driven design marks a substantial advance over brute-force hardware experimentation. Extensions to distributed and multi-objective settings will further increase its scope and practical impact.
