
Contextual Knowledge Pursuit for Faithful Visual Synthesis (2311.17898v3)

Published 29 Nov 2023 in cs.CV, cs.CL, and cs.LG

Abstract: Modern text-to-vision generative models often hallucinate when the prompt describing the scene to be generated is underspecified. In LLMs, a prevalent strategy to reduce hallucinations is to retrieve factual knowledge from an external database. While such retrieval augmentation strategies have great potential to enhance text-to-vision generators, existing static top-K retrieval methods explore the knowledge pool only once, missing the broader context necessary for high-quality generation. Furthermore, LLMs internally possess rich world knowledge learned during large-scale training (parametric knowledge) that could reduce the need for external data retrieval. This paper proposes Contextual Knowledge Pursuit (CKPT), a framework that leverages the complementary strengths of external and parametric knowledge to help generators produce reliable visual content. Instead of a one-time retrieval of facts from an external database to improve a given prompt, CKPT uses (1) an LLM to decide whether to seek external knowledge or to self-elicit descriptions from LLM parametric knowledge, (2) a knowledge pursuit process to contextually seek and sequentially gather the most relevant facts, (3) a knowledge aggregator for prompt enhancement with the gathered fact context, and (4) a filtered fine-tuning objective to improve visual synthesis with richer prompts. We evaluate CKPT across multiple text-driven generative tasks (image, 3D rendering, and video) on datasets of rare objects and daily scenarios. Our results show that CKPT is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising data source for zero-shot synthesis and filtered fine-tuning of text-to-vision generative models.
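
To make the difference between one-time top-K retrieval and sequential, context-aware gathering concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the word-overlap score stands in for whatever relevance model CKPT actually uses, the "no overlapping fact found" rule stands in for the LLM's decision between external and parametric knowledge, and all names (`pursue_knowledge`, `enhance_prompt`, `self_elicit`, `STOPWORDS`) are hypothetical. Step (4), filtered fine-tuning, is not covered.

```python
from typing import Callable, List, Set

# Hypothetical stopword list so that trivial words do not count as relevance.
STOPWORDS: Set[str] = {"a", "an", "the", "of", "is", "in"}


def tokens(text: str) -> Set[str]:
    """Lowercased content words of a string."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def overlap_score(context: str, fact: str) -> int:
    """Toy relevance: number of content words shared by context and fact."""
    return len(tokens(context) & tokens(fact))


def pursue_knowledge(prompt: str, pool: List[str], steps: int = 2) -> List[str]:
    """Gather facts one at a time, rescoring the pool against the growing context."""
    context = prompt
    remaining = list(pool)
    chosen: List[str] = []
    for _ in range(steps):
        if not remaining:
            break
        best = max(remaining, key=lambda f: overlap_score(context, f))
        if overlap_score(context, best) == 0:
            break  # nothing relevant left in the pool
        chosen.append(best)
        remaining.remove(best)
        context = context + " " + best  # later picks see earlier facts
    return chosen


def enhance_prompt(prompt: str, pool: List[str],
                   self_elicit: Callable[[str], str], steps: int = 2) -> str:
    """Aggregate gathered facts into the prompt; fall back to self-elicitation.

    Assumption: falling back when no fact overlaps is a stand-in for the
    LLM-based decision described in the abstract.
    """
    facts = pursue_knowledge(prompt, pool, steps)
    if not facts:
        facts = [self_elicit(prompt)]
    return prompt + ". " + " ".join(facts)


pool = [
    "the kakapo is a flightless parrot",
    "flightless parrot species have green plumage",
    "paris is in france",
]
print(enhance_prompt("a photo of a kakapo", pool, lambda p: "no external facts"))
```

In this toy pool, the second fact shares no content word with the original prompt, so a single retrieval pass scored against the prompt alone would have no reason to prefer it over the unrelated fact about Paris. It becomes relevant only after the first fact has been added to the context, which is the intuition behind gathering facts sequentially.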

Authors (4)
  1. Jinqi Luo (13 papers)
  2. Kwan Ho Ryan Chan (15 papers)
  3. Dimitris Dimos (1 paper)
  4. René Vidal (154 papers)
Citations (1)
