Greedy Join-Aware Retrieval Algorithm

Updated 24 November 2025

The paper presents a greedy algorithm that iteratively selects tables by balancing semantic relevance, comprehensive query coverage, and structural joinability.
It employs an iterative search framework using coarse and fine-grained relevance signals derived from cosine similarities across table and column embeddings.
Empirical evaluations show that the approach achieves competitive recall metrics with significant speedups compared to NP-hard MIP-based retrieval methods.

The Greedy Join-Aware Retrieval algorithm addresses the challenge of multi-table retrieval in open-domain question answering over data lakes. The algorithm seeks to select a coherent subset of tables that (a) are semantically relevant to a user's natural-language query, (b) collectively cover all sub-query requirements, and (c) are structurally joinable, enabling valid PK–FK joins. Prior exact approaches such as Mixed-Integer Programming (MIP) provide optimality guarantees but are computationally prohibitive, while simpler greedy heuristics fail to ensure the required structural composability. The Greedy Join-Aware Retrieval approach formalizes this selection as an iterative, utility-driven search that balances multiple soft signals for scalable and composition-aware retrieval (Boutaleb et al., 17 Nov 2025).

1. Problem Formalization and Objectives

Let $\mathcal{T} = \{T_1, \dots, T_N\}$ denote the corpus of candidate tables. Given a natural-language query $Q$ decomposed into $M$ sub-queries $\{q_j\}_{j=1}^M$ (using an LLM), the goal is to select a subset $S \subseteq \mathcal{T}$ with $|S|=K$ that maximizes (a) table-level semantic relevance, (b) coverage of all fine-grained query concepts $q_j$, and (c) the propensity of $S$ to form a valid join graph as measured by precomputed pairwise joinability.

The dual-layer relevance structure comprises:

Coarse-grained relevance: $r_i = \cos(\mathrm{emb}(Q),\,\mathrm{emb}(T_i)) \in [-1, 1]$.
Fine-grained relevance: $F_{ji} = \max_{c\in\mathrm{cols}(T_i)} \cos(\mathrm{emb}(q_j),\,\mathrm{emb}(c)) \in [-1, 1]$.
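The two relevance signals can be sketched directly from their definitions. The snippet below is a minimal illustration in plain Python, not the authors' implementation: the function names and the toy 2-dimensional embeddings are assumptions made for this example, and it assumes every table has at least one column.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def coarse_relevance(query_emb, table_embs):
    """r[i] = cos(emb(Q), emb(T_i)) for every candidate table."""
    return [cosine(query_emb, t) for t in table_embs]

def fine_relevance(subquery_embs, column_embs_per_table):
    """F[j][i] = max over columns c of T_i of cos(emb(q_j), emb(c))."""
    return [
        [max(cosine(q, c) for c in cols) for cols in column_embs_per_table]
        for q in subquery_embs
    ]

# Toy embeddings, invented for illustration only.
query_emb = [1.0, 0.0]
table_embs = [[1.0, 0.0], [0.0, 1.0]]
subquery_embs = [[1.0, 0.0], [0.0, 1.0]]
column_embs_per_table = [
    [[1.0, 0.0], [0.0, 1.0]],  # T_1 has two columns
    [[0.0, 2.0]],              # T_2 has one column
]

r = coarse_relevance(query_emb, table_embs)              # [1.0, 0.0]
F = fine_relevance(subquery_embs, column_embs_per_table)  # [[1.0, 0.0], [1.0, 1.0]]
```

In a real pipeline the embeddings would come from a sentence or table encoder; the max over columns means a single well-matching column is enough for a table to score highly on a sub-query.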

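Query coverage is formalized as a vector with one entry per sub-query $q_j$, $j = 1, \dots, M$, recording how well the current selection covers $q_j$ through the fine-grained scores $F_{ji}$.

Structural coherence is encoded in a join-compatibility matrix over pairs of tables, whose entry for $(T_i, T_k)$ is the precomputed pairwise joinability of the two tables.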
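As a rough sketch of these two structures, the snippet below uses one plausible form for each. Both are assumptions for illustration: coverage of $q_j$ is taken to be the best fine-grained score among the selected tables (0.0 for an empty selection), and joinability is a symmetric matrix `J` of scores in $[0, 1]$ with a hypothetical threshold deciding which pairs count as joinable.

```python
from itertools import combinations

def coverage_vector(F, selected):
    """Assumed form: c[j] = max of F[j][i] over selected tables i, else 0.0."""
    return [max((row[i] for i in selected), default=0.0) for row in F]

def joinable_pairs(J, selected, threshold=0.5):
    """Pairs of selected tables whose joinability score reaches the threshold."""
    return [(a, b) for a, b in combinations(selected, 2) if J[a][b] >= threshold]

# Toy scores for M = 2 sub-queries and N = 3 tables (invented values).
F = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]
J = [
    [0.0, 0.7, 0.1],
    [0.7, 0.0, 0.6],
    [0.1, 0.6, 0.0],
]

print(coverage_vector(F, []))         # [0.0, 0.0]
print(coverage_vector(F, [0]))        # [0.9, 0.3]
print(coverage_vector(F, [0, 1]))     # [0.9, 0.8]
print(joinable_pairs(J, [0, 1, 2]))   # [(0, 1), (1, 2)]
```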

2. Iterative-Search Framework

The selection process is orchestrated by an iterative greedy search, where each round updates a context state and augments the candidate subset.

At each step $t$, the search maintains:

The selected graph $G_t = (S_t, E_t)$, with $S_t$ the current selection and $E_t$ representing pairwise joinable edges.
The coverage vector at step $t$, with one entry per sub-query $q_j$, $j = 1, \dots, M$.

Selection and update proceed until $|S_t| = K$ or the coverage threshold is met:

Selection: Pick

$T^{*} = \arg\max_{T_i \in \mathcal{T} \setminus S_t} U(T_i)$

where $U$ scores the global utility of adding $T_i$ to the current context.

Update: Augment $S_t$, $E_t$, and the coverage vector accordingly, reflecting new entries and improved coverage.

3. Utility Function and Algorithmic Instantiation

The Greedy Join-Aware Retrieval algorithm specifies a utility function $U$ designed to holistically balance relevance, query coverage, and structural joinability.

Given the current context (the selected graph $G_t = (S_t, E_t)$ and the current coverage vector) and a candidate $T_i \notin S_t$, compute:

Coarse relevance gain: the table-level score $r_i$.
Coverage gain: the improvement in sub-query coverage that $T_i$ would bring, measured through its fine-grained scores $F_{ji}$ against the current coverage of each $q_j$.
Join gain (for $t > 0$): the joinability of $T_i$ with the tables already in $S_t$, read from the precomputed join-compatibility matrix.

The overall utility (for $t > 0$) combines the three gains. For the initial step ($t = 0$), the join gain is undefined because no table has been selected yet, so the utility reduces to the relevance and coverage terms.

Balancing signals:

The relevance term prioritizes semantically relevant tables.
The coverage term encourages coverage of unmet sub-queries.
The join term promotes addition of joinable tables to preserve composability.
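To make the interplay of the three signals concrete, here is a minimal sketch under explicitly assumed forms, since the exact formulas are not reproduced here: coverage gain is the mean positive improvement over the current coverage, join gain is the mean joinability with already selected tables, and the utility is a weighted sum with hypothetical weights `w_rel`, `w_cov`, and `w_join`. At the initial step the join term is simply skipped.

```python
def coverage_gain(i, F, coverage):
    """Assumed form: mean positive improvement of table i over current coverage."""
    return sum(max(0.0, row[i] - c) for row, c in zip(F, coverage)) / len(F)

def join_gain(i, J, selected):
    """Assumed form: mean joinability of table i with the selected tables."""
    return sum(J[i][k] for k in selected) / len(selected)

def utility(i, r, F, J, selected, coverage, w_rel=1.0, w_cov=1.0, w_join=1.0):
    """Weighted sum of the gains; the join term is omitted when nothing is selected."""
    score = w_rel * r[i] + w_cov * coverage_gain(i, F, coverage)
    if selected:
        score += w_join * join_gain(i, J, selected)
    return score

# Toy inputs (invented). Tables 1 and 2 tie on relevance and coverage,
# but only table 1 joins well with table 0.
r = [0.9, 0.5, 0.5]
F = [[0.8, 0.2, 0.2],
     [0.1, 0.6, 0.6]]
J = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.0],
     [0.1, 0.0, 0.0]]

u0 = utility(0, r, F, J, selected=[], coverage=[0.0, 0.0])   # about 1.35
u1 = utility(1, r, F, J, selected=[0], coverage=[0.8, 0.1])  # about 1.65
u2 = utility(2, r, F, J, selected=[0], coverage=[0.8, 0.1])  # about 0.85
```

In this toy setup the join term alone breaks the tie between tables 1 and 2, which is the behavior the join-aware utility is meant to encourage.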

Ablation shows coarse relevance is the dominant signal, with coverage and joinability successively less impactful.

4. Complexity and Empirical Speedup

Complexity per query is polynomial in $K$, $M$, and $N$, where $K$ (number of tables to select) and $M$ (number of query sub-concepts) are typically much smaller than $N$ (candidate pool size). In practice, the cost is almost linear in $N$ for modest $K$ and $M$.
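A back-of-the-envelope operation count illustrates the near-linear growth in the candidate pool size. The cost model below is an assumption for illustration, not the paper's stated bound: in each of the $K$ rounds every remaining candidate is scored once, and one scoring costs one relevance lookup, $M$ coverage terms, and one join term per already selected table.

```python
def scoring_operations(N, M, K):
    """Count elementary score terms evaluated by a greedy run (assumes K <= N)."""
    ops = 0
    for t in range(K):                # t tables are already selected
        remaining = N - t
        ops += remaining * (1 + M + t)
    return ops

small = scoring_operations(N=1000, M=4, K=3)
large = scoring_operations(N=2000, M=4, K=3)
print(small, large, round(large / small, 3))   # 17980 35980 2.001
```

Doubling the pool roughly doubles the work when $K$ and $M$ stay small, which matches the almost-linear behavior described above.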

By contrast, the MIP-based join-aware re-ranker solves an NP-hard problem with worst-case exponential cost, and its wall-clock runtimes range from minutes to hours per query (up to 11,044 s on the hardest cases). The greedy method completes retrieval in 10–100 s per query across five NL2SQL benchmarks; the resulting speedup depends on the scenario and search space (Boutaleb et al., 17 Nov 2025).

Method	Complexity	Typical Runtime (per query)
Greedy JAR	polynomial in $N$, $K$, $M$	10–100 s
MIP-based JAR	exponential (worst case)	minutes to hours (up to 11,044 s)

5. Empirical Evaluation

The algorithm was evaluated on five multi-table NL2SQL benchmarks: SPIDER, BIRD, FIBEN, BEAVER-DW, and BEAVER-NW. Metrics include Recall@$K$ (fraction of gold tables in the top-$K$) and Complete Recall@$K$ (all gold tables in the top-$K$). Baselines comprise Contriever (dense retrieval), JAR (MIP), and CRUSH⁴SQL (coverage-only greedy).
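The two metrics can be computed per query as sketched below; benchmark-level scores would then be averages over queries. The function names are illustrative, and the example table names are only loosely modeled on the friendship case study (the `likes` table is invented for the example).

```python
def recall_at_k(ranked, gold, k):
    """Fraction of gold tables that appear among the top-k retrieved tables."""
    top_k = set(ranked[:k])
    return len(top_k & set(gold)) / len(gold)

def complete_recall_at_k(ranked, gold, k):
    """1.0 if every gold table appears in the top-k, else 0.0."""
    return 1.0 if set(gold) <= set(ranked[:k]) else 0.0

ranked = ["friend", "likes", "highschooler"]
gold = {"friend", "highschooler"}

print(recall_at_k(ranked, gold, 2))            # 0.5
print(complete_recall_at_k(ranked, gold, 2))   # 0.0
print(recall_at_k(ranked, gold, 3))            # 1.0
print(complete_recall_at_k(ranked, gold, 3))   # 1.0
```

Complete Recall is the stricter of the two: missing a single gold table yields zero credit for the query, which matters for multi-table SQL because an incomplete table set cannot produce the correct join.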

Key empirical results:

On BIRD at $K = 3$, greedy is compared with JAR on CR@3 and runs faster than JAR.
On BEAVER-NW at $K = 5$, greedy is compared with JAR on CR@5 and runs faster than JAR.
Across all five benchmarks and the evaluated values of $K$, retrieval performance is competitive with or superior to MIP-based JAR, with substantially reduced runtime.

Ablation studies:

Omitting coarse relevance lowers R@2 on SPIDER.
Omitting coverage reduces performance on coverage-heavy benchmarks.
Omitting join gain adversely affects enterprise BI benchmarks, confirming the value of structural coherence in real-world schemas.

6. Qualitative Behavior and Interpretability

A case study queries for “high-school friendship pairs.” The algorithm first selects network₁.friend (maximal relevance and coverage), then network₁.highschooler (maximal join gain and remaining coverage), thereby reconstructing the correct join path. In contrast, coverage-only greedy methods select structurally incoherent tables when fine-grained scores are noisy, failing to form the correct join graph.

A plausible implication is that the join-aware utility prioritization robustly mitigates retrieval fragmentation even in the presence of noisy or inconsistent fine-grained semantic signals, especially in schema-rich, enterprise environments.

7. Limitations and Future Directions

Two primary limitations emerge:

The approach is sensitive to initial seed selection; a poor first pick can cascade to suboptimal set composition.
The process is purely greedy, precluding overt refinement or backtracking, thus potentially excluding globally optimal retrieval sets.
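Both limitations can be reproduced on a toy instance. The sketch below is not drawn from the paper: it assumes a simple set objective (sum of relevance scores plus pairwise joinability among the selected tables) and compares a greedy pass with exhaustive search over all size-$K$ subsets.

```python
from itertools import combinations

def set_value(S, r, J):
    """Assumed objective: total relevance plus pairwise joinability within S."""
    return sum(r[i] for i in S) + sum(J[a][b] for a, b in combinations(S, 2))

def greedy(r, J, K):
    """Add, one at a time, the table that most increases the objective."""
    S = []
    while len(S) < K:
        rest = [i for i in range(len(r)) if i not in S]
        S.append(max(rest, key=lambda i: set_value(S + [i], r, J)))
    return S

def exhaustive(r, J, K):
    """Best size-K subset by brute force (only feasible for tiny inputs)."""
    return list(max(combinations(range(len(r)), K),
                    key=lambda S: set_value(S, r, J)))

# Table 0 is the most relevant seed, but tables 1 and 2 join perfectly.
r = [1.0, 0.9, 0.9]
J = [[0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]

g = greedy(r, J, 2)       # [0, 1], value about 1.9
o = exhaustive(r, J, 2)   # [1, 2], value about 2.8
```

The greedy pass commits to table 0 as its seed and can never recover the better pair, which is the kind of case that backtracking or beam search would address.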

Proposed research extensions include incorporating backtracking or beam search, dynamically adapting operator priorities (reaching beyond JOIN to UNION and aggregation), hybrid schemes in which greedy search is complemented by an MIP fallback on difficult queries, and broad-scope evaluation over open-domain corpora with LLM-guided retrieval operator learning.

Future work explores generalized, operator-agnostic iterative retrieval frameworks intersecting with dynamic symbolic composition and learning-augmented heuristic selection (Boutaleb et al., 17 Nov 2025).

References (1)

Exploring Multi-Table Retrieval Through Iterative Search (2025)
