ImagerySearch: Adaptive Video Generation
- ImagerySearch is an adaptive video generation strategy that dynamically adjusts the candidate sampling and reward functions based on the semantic structure of input prompts.
- It incorporates a semantic distance metric to tailor the search space, thereby enhancing coherence and managing long-distance dependencies in video synthesis.
- Evaluated on LDT-Bench, ImagerySearch outperforms standard models in accurately depicting video elements, ensuring temporal alignment, and reducing visual anomalies.
ImagerySearch is an adaptive test-time search strategy developed to address the limitations of current video generation models, especially their inability to reliably generate coherent and plausible videos for imaginative prompts involving rarely co-occurring concepts and long-distance semantic relationships. By dynamically adjusting both the inference search space and the reward function based on the semantic structure of a prompt, ImagerySearch achieves marked improvements in visual coherence and semantic alignment, particularly for prompts that lie outside conventional training distributions. This strategy is evaluated using LDT-Bench, the first benchmark dedicated to long-distance semantic video generation scenarios, and outperforms strong test-time scaling methods and baseline models on this suite as well as on established general video benchmarks.
1. Overview and Motivation
ImagerySearch was developed to overcome the deficiency of existing video generation frameworks in handling imaginative or compositionally novel prompts. Video models typically excel in realistic scenarios—those with objects and actions that co-occur frequently in training data—but their performance degrades substantially for prompts involving rare, semantically distant, or compositionally challenging entity relationships. Such prompts require bridging long semantic distances between objects and actions, a task to which static inference procedures and unchanging reward schemes are ill-suited. ImagerySearch remedies this by integrating prompt-guided, adaptive controls into the inference pipeline to expand or contract the candidate search space and apply context-sensitive reward functions, directly informed by the semantic composition of the input prompt (Wu et al., 16 Oct 2025).
2. Semantic Distance-Aware Adaptive Inference
A key innovation of ImagerySearch is the introduction of a semantic distance metric to guide both the breadth of candidate sampling during generation and the weighting of reward signals. For any prompt $P$, the average semantic distance between key entities is defined as
$$ d(P) = \frac{1}{|\mathcal{E}|} \sum_{(e_i, e_j) \in \mathcal{E}} \operatorname{dist}\big(f(e_i), f(e_j)\big), $$
where $f$ is a prompt encoder (e.g., a T5 encoder) and $\mathcal{E}$ is the set of all relevant (object/object, object/action, or action/action) entity pairs. Based on this measure, the number of candidate samples at each selected denoising timestep is dynamically determined as
$$ N(P) = N_0 + \big\lceil \alpha \, d(P) \big\rceil, $$
where $N_0$ is the base number of candidates and $\alpha$ is a scaling factor. For prompts with large semantic distances, this increases the candidate pool, allocating more computation to challenging cases; conversely, prompts with semantically close or common entities use a reduced sampling effort. The "Imagery Schedule" designates the specific pivotal denoising steps at which this adaptive expansion occurs (Wu et al., 16 Oct 2025).
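The following Python sketch illustrates one way such a semantic-distance computation and candidate-count rule could be implemented. The T5 checkpoint, the cosine-distance choice, the example entity list, and the constants `n_base` and `alpha` are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal sketch (not the authors' code): average pairwise semantic distance
# between prompt entities via a T5 encoder, and a candidate-count rule that
# grows with that distance. Constants and the distance function are assumed.
from itertools import combinations
import math

import torch
from transformers import T5EncoderModel, T5Tokenizer


def encode(texts, tokenizer, encoder):
    """Mean-pooled T5 embeddings for a list of short entity strings."""
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, D)


def semantic_distance(entities, tokenizer, encoder):
    """Average cosine distance over all entity pairs (assumed dist function)."""
    emb = torch.nn.functional.normalize(encode(entities, tokenizer, encoder), dim=-1)
    pairs = list(combinations(range(len(entities)), 2))
    dists = [1.0 - float(emb[i] @ emb[j]) for i, j in pairs]
    return sum(dists) / len(dists)


def candidate_count(d_prompt, n_base=4, alpha=8.0):
    """Larger semantic distance -> larger candidate pool (assumed linear rule)."""
    return n_base + math.ceil(alpha * d_prompt)


tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")
entities = ["a giraffe", "playing the violin", "on an iceberg"]  # hypothetical parse
d = semantic_distance(entities, tokenizer, encoder)
print(d, candidate_count(d))
```

In this sketch, a semantically distant pairing yields a larger cosine distance and therefore a larger candidate pool at the scheduled denoising steps.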
3. Adaptive Reward Formulation
Beyond adaptive exploration, ImagerySearch introduces an Adaptive Imagery Reward (AIR) that reweights the composite reward according to semantic prompt difficulty:
$$ \mathrm{AIR}(P) = \alpha_1\,\mathrm{MQ} + \alpha_2\,\mathrm{TA} + \alpha_3\,\mathrm{VQ} + \alpha_4\,R_{\text{ext}}. $$
Here, MQ is a measure of motion quality, TA refers to temporal alignment, VQ reflects visual quality, $R_{\text{ext}}$ is an extensible additional metric, and the $\alpha_i$ are scaling coefficients modulated by the prompt's semantic distance. This adaptive weighting ensures that for imaginative scenarios (with high semantic distance), reward emphasis is shifted towards the criteria crucial for creative composition, thereby guiding selection towards outputs that better match the rare, long-range dependencies present in the prompt (Wu et al., 16 Oct 2025).
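As a rough illustration, the snippet below sketches how a difficulty-dependent weighting of such reward terms might be applied when selecting among candidates; the specific weight schedule, criterion names, and normalization are assumptions, not the paper's exact AIR formulation.

```python
# Hypothetical sketch of an adaptively weighted composite reward: per-criterion
# weights shift with the prompt's (normalized) semantic distance.
def adaptive_imagery_reward(scores, d_prompt, d_max=1.0):
    """scores: dict of per-candidate criteria, e.g. 'MQ', 'TA', 'VQ' in [0, 1]."""
    t = min(max(d_prompt / d_max, 0.0), 1.0)   # normalized difficulty in [0, 1]
    weights = {
        "MQ": 1.0 - 0.5 * t,   # de-emphasize motion quality for harder prompts
        "VQ": 1.0 - 0.5 * t,   # de-emphasize visual quality for harder prompts
        "TA": 1.0 + 1.0 * t,   # emphasize alignment for harder prompts
    }
    # Any extensible extra metric receives a fixed weight in this sketch.
    return sum(weights.get(name, 0.5) * value for name, value in scores.items())


# Example: pick the best candidate video at a pivotal denoising step.
candidates = [
    {"MQ": 0.81, "TA": 0.55, "VQ": 0.78},
    {"MQ": 0.74, "TA": 0.70, "VQ": 0.75},
]
best = max(candidates, key=lambda s: adaptive_imagery_reward(s, d_prompt=0.9))
```

With a high semantic distance, the second candidate wins despite its lower motion and visual scores, mirroring the intended shift of emphasis toward alignment for imaginative prompts.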
4. LDT-Bench: Benchmarking Long-Distance Semantic Generation
To assess progress in imaginative video generation, LDT-Bench (Long-Distance Test-Bench) was introduced. LDT-Bench comprises 2,839 concept pairs—object–action and action–action—to systematically stress-test models on long-distance semantic compositions. It is constructed from pairings across canonical datasets (ImageNet-1K, COCO, Kinetics-600), focusing on combinations absent or rare in standard training data. Automated evaluation is performed using the ImageryQA protocol, which assesses:
- ElementQA: Correct presence and depiction of target entities and actions
- AlignQA: Image quality and aesthetic alignment
- AnomalyQA: Consistency and detection of abnormal or incoherent frames
The comprehensive nature of LDT-Bench ensures that improvements measured by ImagerySearch translate to genuine gains in creative and compositional video generation skill (Wu et al., 16 Oct 2025).
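For concreteness, a minimal sketch of how ImageryQA-style sub-scores could be aggregated over an evaluation set is shown below; the field names and the equal-weight average are assumptions, since the paper's exact aggregation is not reproduced here.

```python
# Hypothetical aggregation of ImageryQA-style sub-scores over LDT-Bench videos.
from statistics import mean

def imagery_qa_score(per_video_results):
    """per_video_results: list of dicts with 'element', 'align', 'anomaly' in [0, 1]."""
    element = mean(r["element"] for r in per_video_results)   # ElementQA
    align = mean(r["align"] for r in per_video_results)       # AlignQA
    anomaly = mean(r["anomaly"] for r in per_video_results)   # AnomalyQA (higher = fewer anomalies)
    return {"ElementQA": element, "AlignQA": align, "AnomalyQA": anomaly,
            "ImageryQA": mean([element, align, anomaly])}     # assumed equal weighting
```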
5. Empirical Results and Comparative Analysis
Experiments demonstrate that ImagerySearch consistently outperforms both general video generation baselines (e.g., Wan2.1, CogVideoX) and strong test-time scaling methods (e.g., Video-T1, EvoSearch) on LDT-Bench and VBench. Quantitative gains are observed across:
- ElementQA: Improved ability to depict all required concepts and relations from prompts
- AlignQA: Higher scores for temporal and visual coherence
- AnomalyQA: Lower incidence of inconsistent or artifact-laden frames
Aggregate ImageryQA scores objectively reflect these improvements, and error distribution analyses show ImagerySearch's relative insensitivity to prompt difficulty—yielding more stable and reliable video synthesis in the imaginative regime (Wu et al., 16 Oct 2025).
6. Limitations and Future Directions
Current limitations include the reliance on accurate prompt analysis for semantic-distance computation and the potential for increased computational overhead in extremely high-difficulty scenarios due to expanded search spaces. Future research directions outlined in the paper include development of more nuanced, flexible reward functions that further improve alignment with user intent in imaginative prompts; exploration of additional adaptive search strategies at inference; and integration of other semantic or perceptual metrics. Scalability to even longer sequences or more complex compositional queries remains an open area, as does extension to applications in narrative scene composition or multimedia content creation that require balancing realism and imagination (Wu et al., 16 Oct 2025).
ImagerySearch provides a methodological advance for video generation, introducing prompt-sensitive adaptive controls over both the candidate space and the reward structure, supported by a dedicated benchmark for evaluating imaginative scenarios. This framework enables video generators to better accommodate compositional creativity and long-range semantic dependencies that static, one-size-fits-all inference approaches handle poorly.