
Greedy Join-Aware Retrieval Algorithm

Updated 24 November 2025
  • The paper presents a greedy algorithm that iteratively selects tables by balancing semantic relevance, comprehensive query coverage, and structural joinability.
  • It employs an iterative search framework using coarse and fine-grained relevance signals derived from cosine similarities across table and column embeddings.
  • Empirical evaluations show that the approach achieves competitive recall metrics with significant speedups compared to NP-hard MIP-based retrieval methods.

The Greedy Join-Aware Retrieval algorithm addresses the challenge of multi-table retrieval in open-domain question answering over datalakes. The algorithm seeks to select a coherent subset of tables that are (a) semantically relevant to a user's natural-language query, (b) collectively cover all sub-query requirements, and (c) exhibit structural joinability, enabling valid PK–FK joins. Prior exact approaches such as Mixed-Integer Programming (MIP) provide optimality guarantees but are computationally prohibitive; more naive greedy heuristics fail to ensure the required structural composability. The Greedy Join-Aware Retrieval approach formalizes this selection as an iterative, utility-driven search that balances multiple soft signals for scalable and composition-aware retrieval (Boutaleb et al., 17 Nov 2025).

1. Problem Formalization and Objectives

Let $\mathcal{T} = \{T_1, \dots, T_N\}$ denote the corpus of candidate tables. Given a natural-language query $Q$ decomposed by an LLM into $M$ sub-queries $\{q_j\}_{j=1}^M$, the goal is to select a subset $S \subseteq \mathcal{T}$ with $|S| = K$ that maximizes (a) table-level semantic relevance, (b) coverage of all fine-grained query concepts $q_j$, and (c) the propensity of $S$ to form a valid join graph, as measured by precomputed pairwise joinability.

The dual-layer relevance structure comprises:

  • Coarse-grained relevance: $r_i = \cos(\mathrm{emb}(Q),\,\mathrm{emb}(T_i)) \in [-1, 1]$.
  • Fine-grained relevance: $F_{ji} = \max_{c \in \mathrm{cols}(T_i)} \cos(\mathrm{emb}(q_j),\,\mathrm{emb}(c)) \in [-1, 1]$.
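
The two relevance signals above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy 2-dimensional embeddings, and the assumption that every table has at least one column are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def coarse_relevance(query_emb, table_embs):
    """r[i] = cos(emb(Q), emb(T_i)) for every table i."""
    return [cosine(query_emb, t) for t in table_embs]

def fine_relevance(subquery_embs, column_embs_per_table):
    """F[j][i] = max over columns c of T_i of cos(emb(q_j), emb(c)).

    Assumes every table has at least one column embedding.
    """
    return [
        [max(cosine(q, c) for c in cols) for cols in column_embs_per_table]
        for q in subquery_embs
    ]

# Toy embeddings, made up purely for illustration.
query_emb = [1.0, 0.0]
table_embs = [[1.0, 0.0], [0.0, 1.0]]
subquery_embs = [[1.0, 0.0], [0.0, 1.0]]
column_embs_per_table = [
    [[1.0, 0.0], [1.0, 1.0]],  # columns of table 0
    [[0.0, 2.0]],              # columns of table 1
]

r = coarse_relevance(query_emb, table_embs)
F = fine_relevance(subquery_embs, column_embs_per_table)
```

Here F is indexed as F[j][i] (sub-query first, table second), mirroring the subscript order of $F_{ji}$.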

Query coverage is formalized as a vector $\mathbf{q} \in \mathbb{R}^M$, where for a selection $S$: $q_j(S) = \max_{T_i \in S} F_{ji}$.

Structural coherence is encoded in a join-compatibility matrix $\omega \in [0, 1]^{N \times N}$, with

$$\omega_{il} \approx P(T_i \Join T_l \text{ is a valid PK--FK join}).$$

2. Iterative-Search Framework

The selection process is orchestrated by an iterative greedy search, where each round updates a context state and augments the candidate subset.

At each step $k$, the search maintains:

  • The selected graph $G_k = (S_k, E_k)$, with $S_k$ the current selection and $E_k = \{(i, l) \mid T_i, T_l \in S_k,\ \omega_{il} > 0\}$ the set of pairwise joinable edges.
  • The coverage vector $\mathbf{q}_k$, where $(\mathbf{q}_k)_j = \max_{T_i \in S_k} F_{ji}$.
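
As a rough sketch, the state at step k can be derived from the selection alone, given F and ω. The helper names and toy values below are hypothetical. Two conventions are assumptions of this sketch rather than something the source states: edges are stored as unordered pairs (i, l) with i < l, which presumes ω is symmetric, and coverage is 0.0 when nothing has been selected yet.

```python
def coverage_vector(F, selected):
    """(q_k)[j] = max over selected tables i of F[j][i].

    Returns 0.0 for each sub-query when the selection is empty (an assumption).
    """
    return [max((row[i] for i in selected), default=0.0) for row in F]

def joinable_edges(omega, selected):
    """E_k as unordered pairs (i, l), i < l, of selected tables with omega[i][l] > 0."""
    sel = sorted(selected)
    return {
        (i, l)
        for a, i in enumerate(sel)
        for l in sel[a + 1:]
        if omega[i][l] > 0
    }

# Toy inputs (made up): 2 sub-queries, 3 tables.
F = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]
omega = [
    [0.0, 0.7, 0.0],
    [0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

q_joined = coverage_vector(F, {0, 1})
edges_joined = joinable_edges(omega, {0, 1})
q_split = coverage_vector(F, {0, 2})
edges_split = joinable_edges(omega, {0, 2})
```

In this toy setup, selecting tables 0 and 1 yields one joinable edge, while selecting tables 0 and 2 yields none and leaves the second sub-query with weaker coverage.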

Selection and update proceed until $|S_k| = K$ or the coverage threshold $\theta$ is met:

  • Selection: Pick

$$T_{k+1} = \arg\max_{T \in \mathcal{T} \setminus S_k} U(T, (G_k, \mathbf{q}_k))$$

where $U$ scores global utility.

  • Update: Augment $S$, $E$, and $\mathbf{q}$ accordingly, reflecting new entries and improved coverage.
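
The select-then-update loop can be sketched generically, with the utility passed in as a function. This is an illustrative skeleton under stated assumptions, not the paper's code: the source does not say how the coverage threshold θ is tested, so the sketch assumes it is met when every sub-query's coverage is at least θ; the function names, the placeholder coverage-only utility, and the toy numbers are hypothetical. The edge set is not stored because it is determined by the selection and ω.

```python
def greedy_search(F, K, utility, theta=None):
    """Greedy select/update loop.

    F[j][i]  : fine-grained relevance of table i for sub-query j (M >= 1 rows).
    utility  : callable (i, selected, q) -> float, scoring candidate table i.
    theta    : optional coverage threshold; assumed met when min(q) >= theta.
    """
    M, N = len(F), len(F[0])
    selected = []        # S_k, in selection order
    q = [0.0] * M        # coverage vector; 0.0 before any pick (an assumption)
    while len(selected) < min(K, N):
        if theta is not None and selected and min(q) >= theta:
            break
        candidates = [i for i in range(N) if i not in selected]
        # Selection: argmax of the utility; ties go to the lowest index.
        best = max(candidates, key=lambda i: utility(i, selected, q))
        # Update: extend the selection and raise coverage where it improved.
        selected.append(best)
        q = [max(q[j], F[j][best]) for j in range(M)]
    return selected, q

# Toy fine-grained scores (made up): 2 sub-queries, 3 tables.
F = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]

def coverage_only(i, selected, q):
    """Placeholder utility: marginal coverage gain only."""
    return sum(max(0.0, F[j][i] - q[j]) for j in range(len(q)))

picked, coverage = greedy_search(F, K=2, utility=coverage_only)
early_picked, early_coverage = greedy_search(F, K=3, utility=coverage_only, theta=0.3)
```

With these toy values the loop picks table 0 and then table 1; with theta=0.3 it stops after the first pick because both sub-queries already reach the threshold.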

3. Utility Function and Algorithmic Instantiation

The Greedy Join-Aware Retrieval algorithm specifies a utility function $U$ designed to holistically balance relevance, query coverage, and structural joinability.

Given current context $(S_{k-1}, \mathbf{q}_{k-1})$ and a candidate $T_i \notin S_{k-1}$, compute:

  • Coarse relevance gain: $G_{\mathrm{coarse}}(T_i) = r_i$
  • Coverage gain:

$$G_{\mathrm{cov}}(T_i \mid \mathbf{q}_{k-1}) = \sum_{j=1}^M \max\left(0, F_{ji} - (\mathbf{q}_{k-1})_j\right)$$

  • Join gain (for $k > 1$):

$$G_{\mathrm{join}}(T_i \mid S_{k-1}) = \sum_{T_l \in S_{k-1}} \omega_{il}$$

The overall utility function (for $k > 1$) is

$$U\left(T_i, (G_{k-1}, \mathbf{q}_{k-1})\right) = \lambda_{\text{coarse}}\, r_i + \lambda_{\text{cov}}\, G_{\mathrm{cov}}(T_i \mid \mathbf{q}_{k-1}) + \lambda_{\text{join}}\, G_{\mathrm{join}}(T_i \mid S_{k-1})$$

For the initial step ($k = 1$), $G_{\mathrm{join}}$ is undefined, and the utility reduces to

$$U(T_i) = \lambda_{\text{coarse}}\, r_i + \lambda_{\text{cov}} \sum_{j=1}^M F_{ji}$$

Balancing signals:

  • $\lambda_{\text{coarse}}$ prioritizes semantically relevant tables.
  • $\lambda_{\text{cov}}$ encourages coverage of unmet sub-queries.
  • $\lambda_{\text{join}}$ promotes addition of joinable tables to preserve composability.
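
The three gains and their weighted sum translate fairly directly into code. The sketch below follows the formulas of this section, but the function names, the default weights of 1.0, and the toy values of r, F, and ω are hypothetical; the source does not report the λ values used.

```python
def coverage_gain(F, i, q):
    """G_cov: sum over sub-queries of max(0, F[j][i] - q[j])."""
    return sum(max(0.0, F[j][i] - q[j]) for j in range(len(q)))

def join_gain(omega, i, selected):
    """G_join: sum of omega[i][l] over already selected tables l."""
    return sum(omega[i][l] for l in selected)

def utility(i, selected, q, r, F, omega,
            lam_coarse=1.0, lam_cov=1.0, lam_join=1.0):
    """Utility of candidate i; the lambda defaults are placeholders, not tuned values."""
    if not selected:
        # First step (k = 1): no join term, coverage term is the raw sum of F[j][i].
        return lam_coarse * r[i] + lam_cov * sum(row[i] for row in F)
    return (lam_coarse * r[i]
            + lam_cov * coverage_gain(F, i, q)
            + lam_join * join_gain(omega, i, selected))

# Toy inputs (made up): 3 tables, 2 sub-queries.
r = [0.8, 0.5, 0.6]
F = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]
omega = [
    [0.0, 0.7, 0.0],
    [0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

first_step = [utility(i, [], [0.0, 0.0], r, F, omega) for i in range(3)]
# After picking table 0, coverage is [0.9, 0.3].
second_step = {i: utility(i, [0], [0.9, 0.3], r, F, omega) for i in (1, 2)}
```

In this toy example table 0 wins the first step, and table 1 wins the second step largely because of its join gain with table 0.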

Ablation shows coarse relevance is the dominant signal, with coverage and joinability successively less impactful.

4. Complexity and Empirical Speedup

Complexity per query is $O(K \cdot N \cdot (M + K))$, where $K$ (number of tables to select) and $M$ (number of query sub-concepts) are typically much smaller than $N$ (candidate pool size). This is polynomial and, in practice, almost linear in $N$ for modest $K$ and $M$.
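
One way to read this bound: there are at most $K$ rounds, each round scores at most $N$ candidates, and each score sums $M$ coverage terms plus at most $K$ join terms. The short sketch below counts those terms under that accounting model, which is an assumption of this example rather than a statement from the source, and compares the count with $K \cdot N \cdot (M + K)$ for hypothetical sizes.

```python
def scoring_terms(N, M, K):
    """Count score terms: round k scores N-(k-1) candidates,
    each with M coverage terms and (k-1) join terms."""
    return sum((N - (k - 1)) * (M + (k - 1)) for k in range(1, K + 1))

def bound(N, M, K):
    """The upper bound K * N * (M + K)."""
    return K * N * (M + K)

# Hypothetical sizes: 1000 candidate tables, 4 sub-queries, 5 tables to select.
terms_1k = scoring_terms(1000, 4, 5)
terms_2k = scoring_terms(2000, 4, 5)
growth = terms_2k / terms_1k  # close to 2 when N doubles with K and M fixed
```

For fixed small K and M, doubling N roughly doubles the count, which is the near-linear behavior in N described above.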

By contrast, the MIP-based join-aware re-ranker solves an NP-hard problem with worst-case runtime exponential in $N$, which manifests in wall-clock runtimes ranging from minutes to hours per query (up to 11,044 s on the hardest cases). The greedy method completes retrieval in 10–100 s per query across five NL2SQL benchmarks, amounting to a $4$–$400\times$ speedup, depending on scenario and search space (Boutaleb et al., 17 Nov 2025).

| Method | Complexity | Typical runtime (per query) |
|---|---|---|
| Greedy JAR | $O(K N (M + K))$ | $10$–$100$ s |
| MIP-based JAR | exponential in $N$ | $100$–$11,000$ s |

5. Empirical Evaluation

The algorithm was evaluated on five multi-table NL2SQL benchmarks: SPIDER, BIRD, FIBEN, BEAVER-DW, and BEAVER-NW. Metrics include Recall@$K$ (fraction of gold tables in top-$K$) and Complete Recall@$K$ (all gold tables in top-$K$). Baselines comprise Contriever (dense retrieval), JAR (MIP), and CRUSH4SQL (coverage-only greedy).
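
The two metrics can be computed per query as sketched below, following the parenthetical definitions above. The function names and the example table names are hypothetical, and the sketch assumes each query has at least one gold table; benchmark-level numbers would presumably be averages of these per-query values.

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of gold tables that appear among the top-k retrieved tables."""
    top = set(retrieved[:k])
    return len(top & set(gold)) / len(gold)

def complete_recall_at_k(retrieved, gold, k):
    """1.0 if every gold table appears in the top-k, else 0.0."""
    return 1.0 if set(gold) <= set(retrieved[:k]) else 0.0

# Hypothetical ranking and gold set for one query.
retrieved = ["friend", "likes", "highschooler"]
gold = {"friend", "highschooler"}

r_at_2 = recall_at_k(retrieved, gold, 2)
cr_at_2 = complete_recall_at_k(retrieved, gold, 2)
r_at_3 = recall_at_k(retrieved, gold, 3)
cr_at_3 = complete_recall_at_k(retrieved, gold, 3)
```

Complete Recall is the stricter metric: a query with one of two gold tables in the top-K scores 0.5 on Recall but 0 on Complete Recall.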

Key empirical results:

  • On BIRD at $K = 3$, greedy achieves CR@3 = $70.5\%$ (vs. JAR $71.9\%$) with $> 135\times$ speedup.
  • On BEAVER-NW at $K = 5$, greedy achieves CR@5 = $10.5\%$ (vs. JAR $0\%$) with $> 240\times$ speedup.
  • Across all five benchmarks and $K \in \{2, 3, 5, 10\}$, retrieval performance is competitive or superior to MIP-based JAR, with substantially reduced runtime.

Ablation studies:

  • Omitting coarse relevance drops SPIDER R@2 from $85.5\%$ to $71.6\%$.
  • Omitting coverage reduces performance on coverage-heavy benchmarks.
  • Omitting join gain adversely affects enterprise BI benchmarks, confirming the value of structural coherence in real-world schemas.

6. Qualitative Behavior and Interpretability

A case study queries for “high-school friendship pairs.” The algorithm first selects network_1.friend (maximal $r_i$/$F_{ji}$), then network_1.highschooler (maximal join $\omega$ and remaining coverage), thereby reconstructing the correct join path. In contrast, coverage-only greedy methods select structurally incoherent tables when fine segment scores are noisy, failing to form the correct join graph.
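
The effect described in this case study can be reproduced on a toy instance. Everything numeric below is invented for illustration and is not taken from the paper: the scores, the ω values, the unit weights, and the table named "distractor" are hypothetical. The second run sets the join weight to zero as a stand-in for a coverage-driven variant; it is not an implementation of any specific baseline.

```python
def select(names, r, F, omega, K, lam_coarse=1.0, lam_cov=1.0, lam_join=1.0):
    """Greedy join-aware selection on small in-memory inputs (illustrative only)."""
    N, M = len(names), len(F)
    selected, q = [], [0.0] * M

    def score(i):
        if not selected:
            # First step: raw fine-grained sum, no join term.
            cov, join = sum(F[j][i] for j in range(M)), 0.0
        else:
            cov = sum(max(0.0, F[j][i] - q[j]) for j in range(M))
            join = sum(omega[i][l] for l in selected)
        return lam_coarse * r[i] + lam_cov * cov + lam_join * join

    while len(selected) < min(K, N):
        best = max((i for i in range(N) if i not in selected), key=score)
        selected.append(best)
        q = [max(q[j], F[j][best]) for j in range(M)]
    return [names[i] for i in selected]

names = ["network_1.friend", "network_1.highschooler", "distractor"]
r = [0.8, 0.6, 0.6]
# Noisy fine-grained scores: the distractor beats highschooler on sub-query 2.
F = [
    [0.9, 0.3, 0.4],
    [0.3, 0.5, 0.6],
]
omega = [
    [0.0, 0.9, 0.0],
    [0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

join_aware = select(names, r, F, omega, K=2)
no_join_term = select(names, r, F, omega, K=2, lam_join=0.0)
```

With the join term active, the second pick is network_1.highschooler; with it disabled, the slightly higher noisy coverage score of the distractor wins and the selected pair is no longer joinable.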

A plausible implication is that the join-aware utility prioritization robustly mitigates retrieval fragmentation even in the presence of noisy or inconsistent fine-grained semantic signals, especially in schema-rich, enterprise environments.

7. Limitations and Future Directions

Two primary limitations emerge:

  • The approach is sensitive to initial seed selection; a poor first pick can cascade to suboptimal set composition.
  • The process is purely greedy, with no refinement or backtracking step, so it may miss globally optimal retrieval sets.

Proposed research extensions include incorporating backtracking or beam search, dynamically adapting operator priorities (extending beyond JOIN to UNION and aggregation), hybrid schemes in which greedy selection is complemented by a MIP fallback on difficult queries, and broader evaluation over open-domain corpora with LLM-guided learning of retrieval operators.

Future work explores generalized, operator-agnostic iterative retrieval frameworks intersecting with dynamic symbolic composition and learning-augmented heuristic selection (Boutaleb et al., 17 Nov 2025).
