ImagerySearch: Adaptive Video Generation
- ImagerySearch is an adaptive video generation strategy that dynamically adjusts the candidate sampling and reward functions based on the semantic structure of input prompts.
- It incorporates a semantic distance metric to tailor the search space, thereby enhancing coherence and managing long-distance dependencies in video synthesis.
- Evaluated on LDT-Bench, ImagerySearch outperforms standard models in accurately depicting video elements, ensuring temporal alignment, and reducing visual anomalies.
ImagerySearch is an adaptive test-time search strategy developed to address the limitations of current video generation models, especially their inability to reliably generate coherent and plausible videos for imaginative prompts involving rarely co-occurring concepts and long-distance semantic relationships. By dynamically adjusting both the inference search space and the reward function based on the semantic structure of a prompt, ImagerySearch achieves marked improvements in visual coherence and semantic alignment, particularly for prompts that lie outside conventional training distributions. This strategy is evaluated using LDT-Bench, the first benchmark dedicated to long-distance semantic video generation scenarios, and outperforms strong test-time scaling methods and baseline models on this suite as well as on established general video benchmarks.
1. Overview and Motivation
ImagerySearch was developed to overcome the deficiency of existing video generation frameworks in handling imaginative or compositionally novel prompts. Video models typically excel in realistic scenarios—those with objects and actions that co-occur frequently in training data—but their performance degrades substantially for prompts involving rare, semantically distant, or compositionally challenging entity relationships. Such prompts require bridging long semantic distances between objects and actions, a task to which static inference procedures and unchanging reward schemes are ill-suited. ImagerySearch remedies this by integrating prompt-guided, adaptive controls into the inference pipeline to expand or contract the candidate search space and apply context-sensitive reward functions, directly informed by the semantic composition of the input prompt (Wu et al., 16 Oct 2025).
2. Semantic Distance-Aware Adaptive Inference
A key innovation of ImagerySearch is the introduction of a semantic distance metric to guide both the breadth of candidate sampling during generation and the weighting of reward signals. For any prompt $P$, the average semantic distance between key entities is defined as
$$ d(P) = \frac{1}{|\mathcal{E}|} \sum_{(e_i, e_j) \in \mathcal{E}} \operatorname{dist}\big(f(e_i), f(e_j)\big), $$
where $f$ is a prompt encoder (e.g., a T5 encoder) and $\mathcal{E}$ is the set of all relevant (object/object, object/action, or action/action) entity pairs. Based on this measure, the number of candidate samples at each selected denoising timestep is dynamically determined as
$$ N(P) = N_0 + \big\lceil \alpha \, d(P) \big\rceil, $$
where $N_0$ is the base number of candidates and $\alpha$ is a scaling factor. For prompts with large semantic distances, this increases the candidate pool, allocating more computation to challenging cases; conversely, prompts with semantically close or common entities use a reduced sampling effort. The "Imagery Schedule" designates the specific pivotal denoising steps at which this adaptive expansion occurs (Wu et al., 16 Oct 2025).
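The following Python sketch illustrates one way such a semantic-distance computation and candidate-count rule could be implemented. The T5 checkpoint, the cosine-distance choice, the example entity list, and the constants `n_base` and `alpha` are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal sketch (not the authors' code): average pairwise semantic distance
# between prompt entities via a T5 encoder, and a candidate-count rule that
# grows with that distance. Constants and the distance function are assumed.
from itertools import combinations
import math

import torch
from transformers import T5EncoderModel, T5Tokenizer


def encode(texts, tokenizer, encoder):
    """Mean-pooled T5 embeddings for a list of short entity strings."""
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, D)


def semantic_distance(entities, tokenizer, encoder):
    """Average cosine distance over all entity pairs (assumed dist function)."""
    emb = torch.nn.functional.normalize(encode(entities, tokenizer, encoder), dim=-1)
    pairs = list(combinations(range(len(entities)), 2))
    dists = [1.0 - float(emb[i] @ emb[j]) for i, j in pairs]
    return sum(dists) / len(dists)


def candidate_count(d_prompt, n_base=4, alpha=8.0):
    """Larger semantic distance -> larger candidate pool (assumed linear rule)."""
    return n_base + math.ceil(alpha * d_prompt)


tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")
entities = ["a giraffe", "playing the violin", "on an iceberg"]  # hypothetical parse
d = semantic_distance(entities, tokenizer, encoder)
print(d, candidate_count(d))
```

In this sketch, a semantically distant pairing yields a larger cosine distance and therefore a larger candidate pool at the scheduled denoising steps.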
3. Adaptive Reward Formulation
Beyond adaptive exploration, ImagerySearch introduces an Adaptive Imagery Reward (AIR) that reweights the composite reward according to semantic prompt difficulty:
$$ \mathrm{AIR}(P) = \alpha_1\,\mathrm{MQ} + \alpha_2\,\mathrm{TA} + \alpha_3\,\mathrm{VQ} + \alpha_4\,R_{\text{ext}}. $$
Here, MQ is a measure of motion quality, TA refers to temporal alignment, VQ reflects visual quality, $R_{\text{ext}}$ is an extensible additional metric, and the $\alpha_i$ are scaling coefficients modulated by the prompt's semantic distance. This adaptive weighting ensures that for imaginative scenarios (with high semantic distance), reward emphasis is shifted towards the criteria crucial for creative composition, thereby guiding selection towards outputs that better match the rare, long-range dependencies present in the prompt (Wu et al., 16 Oct 2025).
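As a rough illustration, the snippet below sketches how a difficulty-dependent weighting of such reward terms might be applied when selecting among candidates; the specific weight schedule, criterion names, and normalization are assumptions, not the paper's exact AIR formulation.

```python
# Hypothetical sketch of an adaptively weighted composite reward: per-criterion
# weights shift with the prompt's (normalized) semantic distance.
def adaptive_imagery_reward(scores, d_prompt, d_max=1.0):
    """scores: dict of per-candidate criteria, e.g. 'MQ', 'TA', 'VQ' in [0, 1]."""
    t = min(max(d_prompt / d_max, 0.0), 1.0)   # normalized difficulty in [0, 1]
    weights = {
        "MQ": 1.0 - 0.5 * t,   # de-emphasize motion quality for harder prompts
        "VQ": 1.0 - 0.5 * t,   # de-emphasize visual quality for harder prompts
        "TA": 1.0 + 1.0 * t,   # emphasize alignment for harder prompts
    }
    # Any extensible extra metric receives a fixed weight in this sketch.
    return sum(weights.get(name, 0.5) * value for name, value in scores.items())


# Example: pick the best candidate video at a pivotal denoising step.
candidates = [
    {"MQ": 0.81, "TA": 0.55, "VQ": 0.78},
    {"MQ": 0.74, "TA": 0.70, "VQ": 0.75},
]
best = max(candidates, key=lambda s: adaptive_imagery_reward(s, d_prompt=0.9))
```

With a high semantic distance, the second candidate wins despite its lower motion and visual scores, mirroring the intended shift of emphasis toward alignment for imaginative prompts.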
4. LDT-Bench: Benchmarking Long-Distance Semantic Generation
To assess progress in imaginative video generation, LDT-Bench (Long-Distance Test-Bench) was introduced. LDT-Bench comprises 2,839 concept pairs—object–action and action–action—to systematically stress-test models on long-distance semantic compositions. It is constructed from pairings across canonical datasets (ImageNet-1K, COCO, Kinetics-600), focusing on combinations absent or rare in standard training data. Automated evaluation is performed using the ImageryQA protocol, which assesses:
- ElementQA: Correct presence and depiction of target entities and actions
- AlignQA: Image quality and aesthetic alignment
- AnomalyQA: Consistency and detection of abnormal or incoherent frames
The comprehensive nature of LDT-Bench ensures that improvements measured by ImagerySearch translate to genuine gains in creative and compositional video generation skill (Wu et al., 16 Oct 2025).
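For concreteness, a minimal sketch of how ImageryQA-style sub-scores could be aggregated over an evaluation set is shown below; the field names and the equal-weight average are assumptions, since the paper's exact aggregation is not reproduced here.

```python
# Hypothetical aggregation of ImageryQA-style sub-scores over LDT-Bench videos.
from statistics import mean

def imagery_qa_score(per_video_results):
    """per_video_results: list of dicts with 'element', 'align', 'anomaly' in [0, 1]."""
    element = mean(r["element"] for r in per_video_results)   # ElementQA
    align = mean(r["align"] for r in per_video_results)       # AlignQA
    anomaly = mean(r["anomaly"] for r in per_video_results)   # AnomalyQA (higher = fewer anomalies)
    return {"ElementQA": element, "AlignQA": align, "AnomalyQA": anomaly,
            "ImageryQA": mean([element, align, anomaly])}     # assumed equal weighting
```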
5. Empirical Results and Comparative Analysis
Experiments demonstrate that ImagerySearch consistently outperforms both general video generation baselines (e.g., Wan2.1, CogVideoX) and strong test-time scaling methods (e.g., Video-T1, EvoSearch) on LDT-Bench and VBench. Quantitative gains are observed across:
- ElementQA: Improved ability to depict all required concepts and relations from prompts
- AlignQA: Higher scores for temporal and visual coherence
- AnomalyQA: Lower incidence of inconsistent or artifact-laden frames
Aggregate ImageryQA scores objectively reflect these improvements, and error distribution analyses show ImagerySearch's relative insensitivity to prompt difficulty—yielding more stable and reliable video synthesis in the imaginative regime (Wu et al., 16 Oct 2025).
6. Limitations and Future Directions
Current limitations include the reliance on accurate prompt analysis for semantic-distance computation and the potential for increased computational overhead in extremely high-difficulty scenarios due to expanded search spaces. Future research directions outlined in the paper include development of more nuanced, flexible reward functions that further improve alignment with user intent in imaginative prompts; exploration of additional adaptive search strategies at inference; and integration of other semantic or perceptual metrics. Scalability to even longer sequences or more complex compositional queries remains an open area, as does extension to applications in narrative scene composition or multimedia content creation that require balancing realism and imagination (Wu et al., 16 Oct 2025).
ImagerySearch provides a methodological advance for video generation, introducing prompt-sensitive adaptive controls over both the candidate space and the reward structure, supported by a dedicated benchmark for evaluating imaginative scenarios. This framework enables video generators to better accommodate compositional creativity and long-range semantic dependencies that static, one-size-fits-all inference approaches handle poorly.