Greedy Join-Aware Retrieval Algorithm
- The paper presents a greedy algorithm that iteratively selects tables by balancing semantic relevance, comprehensive query coverage, and structural joinability.
- It employs an iterative search framework using coarse and fine-grained relevance signals derived from cosine similarities across table and column embeddings.
- Empirical evaluations show that the approach achieves competitive recall metrics with significant speedups compared to NP-hard MIP-based retrieval methods.
The Greedy Join-Aware Retrieval algorithm addresses the challenge of multi-table retrieval in open-domain question answering over datalakes. The algorithm seeks to select a coherent subset of tables that are (a) semantically relevant to a user's natural-language query, (b) collectively cover all sub-query requirements, and (c) exhibit structural joinability, enabling valid PK–FK joins. Prior exact approaches such as Mixed-Integer Programming (MIP) provide optimality guarantees but are computationally prohibitive; more naive greedy heuristics fail to ensure the required structural composability. The Greedy Join-Aware Retrieval approach formalizes this selection as an iterative, utility-driven search that balances multiple soft signals for scalable and composition-aware retrieval (Boutaleb et al., 17 Nov 2025).
1. Problem Formalization and Objectives
Consider a corpus of candidate tables. Given a natural-language query decomposed into sub-queries (using an LLM), the goal is to select a subset of bounded size that maximizes (a) table-level semantic relevance, (b) coverage of all fine-grained query concepts, and (c) the propensity of the selected tables to form a valid join graph, as measured by precomputed pairwise joinability.
The dual-layer relevance structure comprises:
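- Coarse-grained relevance: a table-level score for the query, derived from cosine similarity over table embeddings.
- Fine-grained relevance: a score for each sub-query and table, derived from cosine similarity over column embeddings.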
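Query coverage is formalized as a vector with one entry per sub-query, where each entry records how well the selected tables cover that sub-query.
Structural coherence is encoded in a precomputed join-compatibility matrix, with one entry per pair of tables scoring their pairwise joinability.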
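The relevance and coverage signals described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the max-pooling over columns, and the max aggregation in the coverage vector are all assumptions made for the example, and the embeddings are toy two-dimensional vectors.

```python
import numpy as np

def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def coarse_relevance(query_emb, table_embs):
    """Shape (N,): similarity of the whole query to each table embedding."""
    return cosine_matrix(query_emb[None, :], table_embs)[0]

def fine_relevance(subquery_embs, column_embs_per_table):
    """Shape (N, m): best-matching column per (table, sub-query) pair.
    Max-pooling over columns is an assumption of this sketch."""
    return np.stack([cosine_matrix(subquery_embs, cols).max(axis=1)
                     for cols in column_embs_per_table])

def coverage(fine, selected):
    """Shape (m,): best fine-grained score per sub-query over selected tables.
    Max aggregation is an assumption; an empty selection covers nothing."""
    if not selected:
        return np.zeros(fine.shape[1])
    return fine[list(selected)].max(axis=0)

# Toy data (hypothetical): 2 tables, 2 sub-queries, 2-d embeddings.
query = np.array([1.0, 0.0])
tables = np.array([[1.0, 0.0], [0.0, 1.0]])
subqueries = np.array([[1.0, 0.0], [0.0, 1.0]])
columns = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0], [1.0, 1.0]])]

rel = coarse_relevance(query, tables)
fine = fine_relevance(subqueries, columns)
cov_one = coverage(fine, [0])
cov_both = coverage(fine, [0, 1])
```

In this toy setup, selecting only the first table leaves the second sub-query uncovered, while selecting both tables covers each sub-query fully.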
2. Iterative-Search Framework
The selection process is orchestrated by an iterative greedy search, where each round updates a context state and augments the candidate subset.
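At each step, the search maintains:
- The selected graph, whose nodes are the currently selected tables and whose edges represent pairwise joinable table pairs.
- The coverage vector, with one entry per sub-query tracking the coverage achieved by the current selection.
Selection and update proceed until the selection budget is reached or the coverage threshold is met:
- Selection: Pick the not-yet-selected candidate table with the highest utility, where the utility function scores a candidate's global utility given the current context.
- Update: Augment the selected set, the join edges, and the coverage vector accordingly, reflecting new entries and improved coverage.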
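The select-and-update loop can be sketched generically, with the utility function passed in as a parameter. The names `greedy_select`, `budget`, and `is_covered` are hypothetical, and the sketch tracks only the selected set; a fuller version would also record join edges and the coverage vector in the context.

```python
from typing import Callable, List

def greedy_select(
    num_tables: int,
    budget: int,
    utility: Callable[[int, List[int]], float],
    is_covered: Callable[[List[int]], bool] = lambda selected: False,
) -> List[int]:
    """Repeatedly add the highest-utility unselected table.

    Stops when the budget is reached, when is_covered reports that the
    coverage threshold is met, or when no candidates remain.
    """
    selected: List[int] = []
    while len(selected) < budget and not is_covered(selected):
        candidates = [t for t in range(num_tables) if t not in selected]
        if not candidates:
            break
        # Selection step: argmax of utility given the current context.
        best = max(candidates, key=lambda t: utility(t, selected))
        # Update step: extend the context with the chosen table.
        selected.append(best)
    return selected

# Toy usage with a context-free utility (hypothetical scores).
scores = [0.2, 0.9, 0.5]
picked = greedy_select(3, 2, lambda t, selected: scores[t])
```

Because the utility receives the current selection, context-dependent terms such as coverage gain or join gain plug in without changing the loop.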
3. Utility Function and Algorithmic Instantiation
The Greedy Join-Aware Retrieval algorithm specifies a utility function designed to holistically balance relevance, query coverage, and structural joinability.
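Given the current context and a candidate table, compute:
- Coarse relevance gain: the candidate's coarse-grained relevance to the query.
- Coverage gain: the improvement in sub-query coverage that adding the candidate would yield.
- Join gain (for steps after the first): the candidate's joinability with the tables already selected, read from the join-compatibility matrix.
The overall utility combines these three gains for every step after the first. At the initial step the selection is empty, so the join gain is undefined and the utility relies on the relevance and coverage gains alone.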
Balancing signals:
- The relevance term prioritizes semantically relevant tables.
- The coverage term encourages coverage of unmet sub-queries.
- The join term promotes the addition of joinable tables to preserve composability.
Ablation shows coarse relevance is the dominant signal, with coverage and joinability successively less impactful.
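One way to instantiate such a utility is sketched below. The weighted linear combination, the weights `w_rel`, `w_cov`, and `w_join`, the clipped coverage gain, and the mean joinability aggregation are assumptions chosen for illustration; the source does not specify these forms. The scores are made-up toy values.

```python
import numpy as np

def utility(candidate, selected, coarse, fine, join,
            w_rel=1.0, w_cov=1.0, w_join=1.0):
    """Score a candidate table given the currently selected tables."""
    # Coarse relevance gain: table-level relevance of the candidate.
    rel_gain = coarse[candidate]
    # Coverage gain: improvement over the current per-sub-query coverage.
    current = fine[selected].max(axis=0) if selected else np.zeros(fine.shape[1])
    cov_gain = np.maximum(fine[candidate] - current, 0.0).sum()
    score = w_rel * rel_gain + w_cov * cov_gain
    # Join gain is undefined for the first pick, so it is skipped there.
    if selected:
        score += w_join * join[candidate, selected].mean()
    return float(score)

# Toy scores (hypothetical): 3 tables, 2 sub-queries.
coarse = np.array([0.9, 0.6, 0.7])
fine = np.array([[0.8, 0.1], [0.2, 0.7], [0.3, 0.7]])
join = np.array([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0]])

first = max(range(3), key=lambda t: utility(t, [], coarse, fine, join))
second = max([t for t in range(3) if t != first],
             key=lambda t: utility(t, [first], coarse, fine, join))
```

With these numbers, table 1 wins the second pick because it joins with table 0. Setting `w_join=0.0` would instead favor table 2, which has slightly higher relevance but no join path to the first pick.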
4. Complexity and Empirical Speedup
Complexity per query is polynomial in the number of tables to select, the number of query sub-concepts, and the candidate pool size, where the first two are typically much smaller than the third. In practice, the cost is almost linear in the candidate pool size for modest selection budgets and sub-concept counts.
By contrast, the MIP-based join-aware re-ranker solves an NP-hard problem, with exponential worst-case cost and wall-clock runtimes ranging from minutes to hours per query (up to 11,044 s on the hardest cases). The greedy method completes retrieval in 10–100 s per query across five NL2SQL benchmarks, a speedup of at least $4\times$, depending on scenario and search space (Boutaleb et al., 17 Nov 2025).
| Method | Complexity | Typical Runtime (per query) |
|---|---|---|
| Greedy JAR | polynomial; near-linear in candidate pool size in practice | $10$–$100$ s |
| MIP-based JAR | exponential in the worst case (NP-hard) | $100$–$11,000$ s |
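To give a rough sense of the gap in search effort, the sketch below counts how many candidate evaluations a greedy loop performs and compares that with the number of distinct subsets of the same size. This is only an illustration: a MIP solver does not enumerate subsets exhaustively, and each greedy evaluation additionally costs work that grows with the number of sub-concepts. The function names and the pool and budget sizes are hypothetical.

```python
from math import comb

def greedy_utility_evaluations(pool_size: int, budget: int) -> int:
    """Candidates scored by a greedy loop: one pass over the remaining pool per pick."""
    return sum(pool_size - t for t in range(min(budget, pool_size)))

def candidate_subsets(pool_size: int, budget: int) -> int:
    """Number of distinct table subsets of exactly `budget` tables."""
    return comb(pool_size, budget)

greedy_cost = greedy_utility_evaluations(100, 5)   # 100 + 99 + 98 + 97 + 96
subset_count = candidate_subsets(100, 5)
```

For a pool of 100 tables and a budget of 5, the greedy loop scores 490 candidates, while there are 75,287,520 possible 5-table subsets.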
5. Empirical Evaluation
The algorithm was evaluated on five multi-table NL2SQL benchmarks: SPIDER, BIRD, FIBEN, BEAVER-DW, and BEAVER-NW. Metrics include Recall@$k$ (the fraction of gold tables in the top $k$) and Complete Recall@$k$ (whether all gold tables are in the top $k$). Baselines comprise Contriever (dense retrieval), JAR (MIP), and CRUSH4SQL (coverage-only greedy).
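The two metrics can be computed per query as in the sketch below, following the definitions given above. The function names and the example tables are hypothetical; benchmark-level scores would be averages of these per-query values.

```python
from typing import List, Set

def recall_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Fraction of gold tables that appear among the top-k retrieved tables."""
    return len(set(ranked[:k]) & gold) / len(gold)

def complete_recall_at_k(ranked: List[str], gold: Set[str], k: int) -> bool:
    """True only if every gold table appears among the top-k retrieved tables."""
    return gold <= set(ranked[:k])

# Hypothetical query whose gold tables are retrieved at ranks 1 and 3.
ranked = ["friend", "likes", "highschooler"]
gold = {"friend", "highschooler"}

r2 = recall_at_k(ranked, gold, 2)
cr2 = complete_recall_at_k(ranked, gold, 2)
r3 = recall_at_k(ranked, gold, 3)
cr3 = complete_recall_at_k(ranked, gold, 3)
```

Complete Recall is the stricter metric: here half of the gold tables are found at $k=2$, yet the query only counts as completely recalled at $k=3$.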
Key empirical results:
- On BIRD at $k=3$, greedy achieves CR@3 = (vs. JAR ) with speedup.
- On BEAVER-NW at $k=5$, greedy achieves CR@5 = (vs. JAR ) with speedup.
- Across all five benchmarks and all evaluated cutoffs $k$, retrieval performance is competitive with or superior to MIP-based JAR, with substantially reduced runtime.
Ablation studies:
- Omitting coarse relevance drops SPIDER R@2 from to .
- Omitting coverage reduces performance on coverage-heavy benchmarks.
- Omitting join gain adversely affects enterprise BI benchmarks, confirming the value of structural coherence in real-world schemas.
6. Qualitative Behavior and Interpretability
A case study queries for “high-school friendship pairs.” The algorithm first selects network_1.friend (maximal relevance and coverage gain), then network_1.highschooler (maximal join gain and remaining coverage), thereby reconstructing the correct join path. In contrast, coverage-only greedy methods select structurally incoherent tables when fine segment scores are noisy, failing to form the correct join graph.
A plausible implication is that the join-aware utility prioritization robustly mitigates retrieval fragmentation even in the presence of noisy or inconsistent fine-grained semantic signals, especially in schema-rich, enterprise environments.
7. Limitations and Future Directions
Two primary limitations emerge:
- The approach is sensitive to initial seed selection; a poor first pick can cascade to suboptimal set composition.
- The process is purely greedy, with no refinement or backtracking, so it can miss globally optimal retrieval sets.
Proposed research extensions include incorporating backtracking or beam search, dynamically adapting operator priorities (reaching beyond JOIN to UNION and aggregation), hybrid schemes in which greedy retrieval is complemented by a MIP fallback on difficult queries, and broad-scope evaluation over open-domain corpora with LLM-guided retrieval operator learning.
Future work explores generalized, operator-agnostic iterative retrieval frameworks intersecting with dynamic symbolic composition and learning-augmented heuristic selection (Boutaleb et al., 17 Nov 2025).