CAESURA: Teaching Language Models to Plan Multi-Modal Queries

This presentation introduces CAESURA, a GPT-4-based system that extends query planning beyond traditional SQL databases to handle multi-modal data like images and text. We explore how language models can automatically discover relevant data, construct logical query plans, and execute them across diverse modalities—demonstrating a new paradigm for querying modern data lakes where structured tables are just one piece of a much richer information landscape.
Script
Traditional databases live in a tidy world of tables and SQL, but modern data lakes are messy, filled with images, documents, and text that no query planner knows how to touch. CAESURA teaches language models to build query plans that work across all of it.
The system works in three coordinated phases. Discovery identifies which data items and columns matter for your query. Planning constructs a logical sequence of steps to answer the question. Mapping and execution choose physical operators for each step, then feed results back so the language model can adjust the next operator based on what it just learned.
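To make the three phases concrete, here is a minimal sketch, not CAESURA's actual implementation. Every name in it is an assumption made for illustration: the catalog, the operator labels (`sql`, `visual_qa`, `text_qa`), and the helper functions. Plain keyword matching stands in for the language model calls that would drive discovery and planning.

```python
from dataclasses import dataclass

# Hypothetical catalog of a small data lake: item -> {column: modality}.
CATALOG = {
    "paintings": {"title": "table", "year": "table", "image": "image"},
    "game_reports": {"game_id": "table", "report": "text"},
}

# Hypothetical physical operator names, one per modality.
OPERATOR_FOR = {"table": "sql", "image": "visual_qa", "text": "text_qa"}


@dataclass
class LogicalStep:
    item: str
    column: str
    modality: str
    description: str


def discover(query, catalog):
    """Phase 1: keep columns named in the query (keyword stand-in for the LLM)."""
    q = query.lower()
    return [
        (item, col, mod)
        for item, cols in catalog.items()
        for col, mod in cols.items()
        if col in q
    ]


def plan(relevant):
    """Phase 2: emit one logical step per relevant column (stand-in for the LLM)."""
    return [
        LogicalStep(item, col, mod, f"inspect {item}.{col}")
        for item, col, mod in relevant
    ]


def map_step(step):
    """Phase 3: choose a physical operator for a logical step by its modality."""
    return OPERATOR_FOR[step.modality]


query = "Count paintings by year whose image shows a horse"
steps = plan(discover(query, CATALOG))
physical = [(s.description, map_step(s)) for s in steps]
```

In this toy run, discovery keeps only the `year` and `image` columns of `paintings`, and mapping assigns a table operator to the first and an image operator to the second. A real system would replace each stand-in with a prompt to the language model.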
That feedback loop is the key innovation. By interleaving execution with planning, the language model sees intermediate outputs and uses them to pick the next operator and refine selection conditions, catching mistakes before they cascade through the entire plan.
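The interleaving can be sketched as a loop that picks each operator only after looking at the rows the previous step produced. This is a simplified illustration under stated assumptions: `choose_operator` is a hand-written rule standing in for the language model, `FAKE_IMAGE_LABELS` replaces a real vision model, and the operator names are hypothetical.

```python
# Canned labels standing in for a vision model's answers (assumption).
FAKE_IMAGE_LABELS = {"a.png": "horse", "b.png": "boat", "c.png": "horse"}


def sql_filter(rows, step):
    """Hypothetical table operator: filter rows on a column value."""
    return [r for r in rows if step["predicate"](r[step["column"]])]


def visual_qa(rows, step):
    """Hypothetical image operator: filter rows on the label of the referenced image."""
    return [
        r for r in rows
        if step["predicate"](FAKE_IMAGE_LABELS[r[step["column"]]])
    ]


OPERATORS = {"sql_filter": sql_filter, "visual_qa": visual_qa}


def choose_operator(step, rows):
    """Stand-in for the LLM: inspect the intermediate output before choosing."""
    if not rows:
        return "sql_filter"
    sample = rows[0][step["column"]]
    if isinstance(sample, str) and sample.endswith(".png"):
        return "visual_qa"
    return "sql_filter"


def run_interleaved(steps, rows):
    """Execute step by step, feeding each result into the next decision."""
    trace = []
    for step in steps:
        op = choose_operator(step, rows)  # decision uses the latest output
        rows = OPERATORS[op](rows, step)  # execute, then carry the result forward
        trace.append((op, len(rows)))
    return rows, trace


rows = [
    {"title": "A", "year": 1889, "image": "a.png"},
    {"title": "B", "year": 1905, "image": "b.png"},
    {"title": "C", "year": 1850, "image": "c.png"},
]
steps = [
    {"column": "year", "predicate": lambda y: y < 1900},
    {"column": "image", "predicate": lambda label: label == "horse"},
]
result, trace = run_interleaved(steps, rows)
```

Because the operator for the second step is chosen only after the first step has run, the chooser sees that the surviving `image` values are file names and routes them to the image operator rather than a table filter. That is the kind of mistake a fully up-front plan could miss.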
On multi-modal datasets that mix artwork metadata with images and sports game reports, CAESURA with GPT-4 correctly translated 87.5 percent of queries. That success rate shows language models can reason about complex query structures even when the data spans modalities traditional planners cannot handle.
Of course, challenges remain. Plan execution can still fail when operators are misapplied, and there is no cost-based optimization yet to ensure queries run efficiently. Security is another open question, since language models might generate plans that expose sensitive data in unexpected ways.
CAESURA shows that language models can serve as the bridge between natural language questions and the messy, multi-modal reality of modern data. If you want to explore how reasoning at the query planning layer unlocks new possibilities, visit EmergentMind.com to dive deeper and create your own video summaries.