FollowTable: A Benchmark for Instruction-Following Table Retrieval

Published 1 May 2026 in cs.IR and cs.CL | (2605.00400v1)

Abstract: Table Retrieval (TR) has traditionally been formulated as an ad-hoc retrieval problem, where relevance is primarily determined by topical semantic similarity. With the growing adoption of LLM-based agentic systems, access to structured data is increasingly instruction-driven, where relevance is conditional on explicit content and schema constraints rather than topical similarity alone. We therefore formalize Instruction-Following Table Retrieval (IFTR), a new task that requires models to jointly satisfy topical relevance and fine-grained instruction constraints. We identify two core challenges in IFTR: (i) sensitivity to content scope, such as inclusion and exclusion constraints, and (ii) awareness of schema-grounded requirements, including column semantics and representation granularity--capabilities largely absent in existing retrievers. To support systematic evaluation, we introduce FollowTable, the first large-scale benchmark for IFTR, constructed via a taxonomy-driven annotation pipeline. We further propose a new metric, termed the Instruction Responsiveness Score, to evaluate whether retrieval rankings consistently adapt to user instructions relative to a topic-only baseline. Our results indicate that existing retrieval models struggle to follow fine-grained instructions over tabular data. In particular, they exhibit systematic biases toward surface-level semantic cues and remain limited in handling schema-grounded constraints, highlighting substantial room for future improvements.

Authors (10)

Summary

The paper introduces a novel IFTR task with a detailed instruction taxonomy to benchmark table retrieval systems on instruction compliance.
The paper presents an innovative automated pipeline that leverages LLMs for instruction generation and multi-stage quality review.
The paper demonstrates that instruction-following retrievers achieve significantly higher IRS than general-purpose and table-specialized retrievers, and that re-rankers close more of the remaining gap at increased computational cost.

Introduction and Motivation

Table retrieval (TR) has historically focused on ad-hoc scenarios, where the objective is to find tables topically relevant to a user's natural language query based on global semantic similarity. However, the rise of LLM-backed agentic systems has shifted practical requirements toward more nuanced retrieval. Users now expect systems to follow detailed instructions that specify both semantic and structural conditions, ranging from explicit inclusion or exclusion of content to fine-grained schema-related constraints. This trend calls for a redefinition of relevance in tabular retrieval and exposes the limitations of existing models, which do not capture instruction-conditioned relevance with sufficient granularity.

The FollowTable benchmark directly addresses this gap by introducing the Instruction-Following Table Retrieval (IFTR) task. IFTR demands that models not only consider semantic query–table overlap but also jointly reason over instruction-imposed constraints pertaining to table content and schema. The benchmark is underpinned by a taxonomy of instruction types explicitly dissecting the constraint space and is supported with large-scale, systematically annotated data and a novel metric for measuring instruction compliance.

Figure 1: Ad-hoc retrieval relies solely on query–table topical similarity, while instruction-following retrieval demotes tables violating explicit user-defined constraints.

Task and Taxonomy Formulation

From Ad-hoc TR to IFTR

In contrast to standard TR, where relevance is query-centric, IFTR introduces an instruction variable. This variable refines and possibly overrides the relevance signal based on specific requirements: e.g., inclusion (must contain X), exclusion (must not mention Y), exclusivity (focus solely on Z), column (schema) requirements, or aggregation granularity constraints. Only tables that both match the query and strictly satisfy the instruction are considered relevant.
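The conjunctive relevance rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the `Table` class, the predicate helpers, and the idea of modeling each instruction as a list of boolean checks are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Table:
    # Hypothetical minimal table representation for illustration only.
    title: str
    columns: List[str]
    rows: List[List[str]]


# Assumption: an instruction is modeled as a list of boolean checks on a table.
Constraint = Callable[[Table], bool]


def must_contain(term: str) -> Constraint:
    """Inclusion constraint: at least one cell mentions `term`."""
    def check(table: Table) -> bool:
        return any(term.lower() in cell.lower() for row in table.rows for cell in row)
    return check


def must_not_contain(term: str) -> Constraint:
    """Exclusion constraint: no cell mentions `term`."""
    contains = must_contain(term)
    return lambda table: not contains(table)


def requires_column(name: str) -> Constraint:
    """Schema constraint: a column with this name must exist."""
    return lambda table: name.lower() in [c.lower() for c in table.columns]


def is_relevant(topical_match: bool, table: Table, constraints: List[Constraint]) -> bool:
    """A table is relevant only if it matches the query AND satisfies every constraint."""
    return topical_match and all(check(table) for check in constraints)


gdp = Table(
    title="GDP by country",
    columns=["Country", "Year", "GDP"],
    rows=[["France", "2020", "2.6"], ["Germany", "2020", "3.8"]],
)
ok = is_relevant(True, gdp, [must_contain("France"), requires_column("Year")])
blocked = is_relevant(True, gdp, [must_not_contain("Germany")])
```

Here `ok` is true because both checks pass, while `blocked` is false: the table is on topic but violates the exclusion, so it is not relevant under the instruction.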

Fine-Grained Constraint Taxonomy

The instruction space in IFTR is formalized into two high-level categories:

Content-scope Constraints: Narrow or filter the allowed content, subdivided into semantic boundary (C1) and exclusive topic (C2) constraints.
Schema-grounded Structural Constraints: Instruct on table organization, with subtypes for attribute-centric (S1), entity-centric (S2), and granularity-centric (S3) requirements.

Figure 2: Taxonomy of IFTR instructions: content-scope constraints (purple) and schema-grounded structural constraints (blue), illustrated with examples and edge cases.

This taxonomy enables systematic evaluation of model capabilities across the multifaceted instruction space, and its design allows the annotation pipeline to automatically produce both natural-language instructions and ground-truth instruction-compliant/violating tables.
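One way to represent the two-level taxonomy in code is an enumeration whose members carry their label and parent category. The class and helper names below are hypothetical; only the codes (C1, C2, S1, S2, S3) and their descriptions come from the taxonomy above.

```python
from enum import Enum
from typing import List


class Category(Enum):
    CONTENT_SCOPE = "content-scope"
    SCHEMA_GROUNDED = "schema-grounded structural"


class InstructionType(Enum):
    # Each member stores (label, parent category); codes follow the taxonomy.
    C1 = ("semantic boundary", Category.CONTENT_SCOPE)
    C2 = ("exclusive topic", Category.CONTENT_SCOPE)
    S1 = ("attribute-centric", Category.SCHEMA_GROUNDED)
    S2 = ("entity-centric", Category.SCHEMA_GROUNDED)
    S3 = ("granularity-centric", Category.SCHEMA_GROUNDED)

    def __init__(self, label: str, category: Category) -> None:
        self.label = label
        self.category = category


def types_in(category: Category) -> List[InstructionType]:
    """Return all instruction subtypes under one high-level category."""
    return [t for t in InstructionType if t.category is category]


schema_types = [t.name for t in types_in(Category.SCHEMA_GROUNDED)]
```

Tagging every instruction with one of these codes makes it straightforward to report per-type results.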

Construction of the FollowTable Benchmark

Data Collection and Preparation

To support IFTR, FollowTable unifies and extends multiple open-domain and domain-specific TR datasets, and introduces a new Industrial Standard Manuals set, covering complex structured data from technical documentation. The preparation pipeline, leveraging LLMs for parsing, generalization, translation, and validation, ensures that each query is associated with multiple topically relevant tables, creating a scenario amenable to instruction-based refinement.

Figure 3: Data preprocessing: transforming raw datasets and technical manuals into standardized query–table pairs with multiple relevant tables.

Automated Instruction Generation and Annotation

FollowTable's automated instruction generation pipeline uses LLMs for semantic subtopic mining, categorical instruction synthesis (across all taxonomy subtypes), detailed re-labeling, synthetic balancing (to ensure minimum positive/negative coverage), and iterative LLM-as-a-Critic review. This critic evaluates instruction clarity, query dependency, logical consistency, category focus, and abstraction appropriateness.
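The iterative critic review can be pictured as a bounded accept-or-revise loop. The sketch below is an assumption about the control flow only: the criterion keys mirror the five aspects listed above, while `toy_critic` and `toy_revise` are toy stand-ins for LLM calls and do not reflect the prompts or models used for the benchmark.

```python
from typing import Callable, Dict, Optional

# The five aspects the critic is described as checking (key names are illustrative).
CRITERIA = (
    "clarity",
    "query_dependency",
    "logical_consistency",
    "category_focus",
    "abstraction",
)

Critic = Callable[[str], Dict[str, bool]]
Reviser = Callable[[str, Dict[str, bool]], str]


def review(instruction: str, critic: Critic, revise: Reviser, max_rounds: int = 3) -> Optional[str]:
    """Return the first version that passes every criterion, or None if the budget runs out."""
    for _ in range(max_rounds):
        verdict = critic(instruction)
        if all(verdict.get(name, False) for name in CRITERIA):
            return instruction
        # Revise using the verdict; the revision is checked in the next round.
        instruction = revise(instruction, verdict)
    return None


def toy_critic(instruction: str) -> Dict[str, bool]:
    # Toy rule: an instruction is "clear" only if it states an obligation.
    verdict = {name: True for name in CRITERIA}
    verdict["clarity"] = "must" in instruction
    return verdict


def toy_revise(instruction: str, verdict: Dict[str, bool]) -> str:
    return "The table must " + instruction


accepted = review("cover only Europe", toy_critic, toy_revise)
rejected = review("cover only Europe", toy_critic, toy_revise, max_rounds=1)
```

With the default budget the instruction is revised once and then accepted; with a single round it is rejected, because the revision is never re-checked.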

Figure 4: Automated pipeline for instruction generation and multi-stage LLM-driven quality review, ensuring taxonomy coverage and logical validity.

After LLM-based curation, human annotators further validate instruction–table alignment; inter-annotator agreement is high ($\kappa \approx 0.73$).
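For readers unfamiliar with the statistic, the sketch below computes Cohen's kappa for two annotators. Which kappa variant the benchmark uses, and how many annotators were involved, is not stated here, so the two-annotator setting is an assumption, and the labels are made-up toy data rather than benchmark annotations.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators' labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels: 1 = table complies with the instruction, 0 = table violates it.
annotator_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
annotator_2 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.5, giving a kappa of about 0.6.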

Evaluation Metrics for IFTR

Existing metrics such as nDCG and p-MRR inadequately capture instruction-following ability over tables, because they do not consistently reflect instruction-induced ranking changes. FollowTable introduces the Instruction Responsiveness Score (IRS), which quantifies the extent to which a model moves instruction-compliant tables upward and instruction-violating tables downward relative to a topic-only baseline. IRS has three desirable properties: sensitivity, monotonicity, and robustness to edge cases (such as when the topic-only baseline is already optimal).
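The exact IRS formula is not reproduced in this summary, so the following is explicitly not IRS. It is a simplified rank-shift proxy, under assumed names, that only illustrates the underlying idea of comparing an instruction-conditioned ranking against a topic-only baseline.

```python
from typing import Dict, List, Set


def rank_positions(ranking: List[str]) -> Dict[str, int]:
    """Map each table id to its 1-based rank."""
    return {table_id: i + 1 for i, table_id in enumerate(ranking)}


def rank_shift_proxy(
    baseline: List[str],
    instructed: List[str],
    compliant: Set[str],
    violating: Set[str],
) -> float:
    """Illustrative proxy (NOT the paper's IRS), normalized to [-1, 1].

    Positive when compliant tables move up and violating tables move down
    relative to the topic-only baseline. Both rankings must contain the same ids.
    """
    n = len(baseline)
    if n < 2:
        return 0.0
    base, inst = rank_positions(baseline), rank_positions(instructed)
    shifts = [base[t] - inst[t] for t in compliant]   # moving up counts as positive
    shifts += [inst[t] - base[t] for t in violating]  # moving down counts as positive
    if not shifts:
        return 0.0
    return sum(shifts) / (len(shifts) * (n - 1))


baseline = ["t1", "t2", "t3", "t4"]      # topic-only ranking
instructed = ["t3", "t2", "t4", "t1"]    # ranking after adding the instruction
score = rank_shift_proxy(baseline, instructed, compliant={"t3"}, violating={"t1"})
unchanged = rank_shift_proxy(baseline, baseline, compliant={"t3"}, violating={"t1"})
```

The proxy returns 0 whenever the ranking does not change, even if the baseline was already ideal. That is the kind of edge case IRS is described as handling, and one reason a naive rank-shift average is not a substitute for it.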

Figure 5: IRS demonstrates strict monotonicity under controlled degradation of instruction compliance, unlike nDCG and p-MRR, which fail to consistently reflect instruction-induced ranking changes.

Experimental Analysis

Model Evaluation

FollowTable benchmarks three model categories:

General-purpose retrievers (e.g., BGE, E5, OpenAI)
Table-specialized retrievers (e.g., Tapas-DTR, Table-GTR, Birdie)
Instruction-following retrievers (e.g., INSTRUCTOR, GritLM, Promptriever, Qwen3-Emb)

Key findings:

Table-specialized retrievers only outperform others on in-domain data, lacking cross-domain robustness and instruction-following capability.
Instruction-following models substantially improve IRS (e.g., Promptriever yields IRS ~26.2 vs. <5 for most non-instructional models), but performance remains far from optimal.
Re-rankers and list-wise LLM-based models (e.g., Gemini-3-Pro) can close a significant fraction of the gap, but at high computational cost.

Performance Disparities by Constraint Type

Entity-centric structure constraints (S2) are consistently the most challenging, while semantic boundary (C1) exhibits the highest instruction compliance for all architectures.

Figure 6: nDCG@10 and IRS breakdown by instruction type, showing increasing model difficulty from content-scope to schema-grounded constraints.

Negation Sensitivity and Positive Attention Bias

Dense retrievers, and even several instruction-aware ones, show high Negation Failure Rates (NFR > 0.6). This indicates a systematic "positive attention bias": models are drawn toward tables that contain the excluded entities, which undermines their handling of logical negation.
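The precise definition of NFR is not given in this summary. As a hedged illustration, the sketch below assumes a failure is any case where a table containing an excluded entity ranks strictly higher with the exclusion instruction than without it. The function name and the rank pairs are hypothetical toy data.

```python
from typing import List, Tuple


def negation_failure_rate(cases: List[Tuple[int, int]]) -> float:
    """Fraction of violating tables that are promoted by an exclusion instruction.

    Each case is (baseline_rank, instructed_rank) for one table that contains
    an excluded entity; rank 1 is the top of the list.
    Assumed definition for illustration, not necessarily the paper's.
    """
    if not cases:
        raise ValueError("at least one case is required")
    failures = sum(1 for base_rank, inst_rank in cases if inst_rank < base_rank)
    return failures / len(cases)


# Toy data: three of five violating tables move up after "exclude X" is added.
toy_cases = [(5, 2), (3, 3), (4, 1), (2, 6), (7, 3)]
nfr = negation_failure_rate(toy_cases)
```

Under this assumed definition the toy value is 0.6, meaning the instruction pulled most excluded-entity tables toward the top instead of pushing them down.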

Figure 7: Rank shift analysis under attribute-centric structural instructions; even SOTA retriever models often promote instruction-violating tables (red/orange) due to insufficient constraint understanding.

Re-rankers vs. Retrievers

While state-of-the-art re-rankers (e.g., Rank1, FollowIR, Gemini-3-Pro) achieve significant IRS gains (up to +58.1 compared to bi-encoders), they incur 1–2 orders of magnitude higher latency. This gap illustrates the cost–accuracy tradeoff inherent in modeling complex instruction adherence.
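A common way to manage this tradeoff is a two-stage pipeline: a cheap scorer ranks the whole corpus, and the expensive, instruction-aware scorer is applied only to the top k candidates. The sketch below uses toy scorers (token overlap, plus a rule that understands only "exclude <term>") as stand-ins for a bi-encoder and a re-ranker; none of these functions come from the benchmark.

```python
from typing import Callable, List

CheapScorer = Callable[[str, str], float]
CostlyScorer = Callable[[str, str, str], float]


def overlap(query: str, table_text: str) -> float:
    """Toy first-stage score: number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(table_text.lower().split())))


def toy_rerank_score(query: str, instruction: str, table_text: str) -> float:
    """Toy stand-in for a re-ranker; it understands only 'exclude <term>'."""
    excluded = instruction.lower().split("exclude ", 1)[-1].strip()
    penalty = 10.0 if excluded in table_text.lower() else 0.0
    return overlap(query, table_text) - penalty


def retrieve_then_rerank(
    query: str,
    instruction: str,
    corpus: List[str],
    cheap: CheapScorer,
    costly: CostlyScorer,
    k: int,
) -> List[str]:
    """Rank everything cheaply, then re-score only the top k with the costly scorer."""
    first_stage = sorted(corpus, key=lambda t: cheap(query, t), reverse=True)
    head, tail = first_stage[:k], first_stage[k:]
    reranked = sorted(head, key=lambda t: costly(query, instruction, t), reverse=True)
    return reranked + tail


corpus = [
    "gdp by country europe germany",
    "gdp by country europe france",
    "rainfall by city",
    "gdp europe",
]
result = retrieve_then_rerank(
    "gdp by country europe", "exclude germany", corpus, overlap, toy_rerank_score, k=2
)
```

The costly scorer runs on only two tables here, yet that is enough to demote the table mentioning the excluded entity. The choice of k controls how much latency is spent for how much instruction sensitivity.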

Implications and Future Directions

The FollowTable benchmark surfaces fundamental limitations in current table retrievers:

Structural encoding alone is not enough—models require explicit instruction-based alignment during training to reason over schema-grounded constraints.
General-purpose retrievers, despite strong ad-hoc performance and robustness, lack mechanisms for fine-grained instruction-following unless instruction tuning is employed.
Performance on instruction-following cannot be evaluated by traditional IR metrics, necessitating tailored metrics like IRS.
Current retrievers and cross-encoders remain susceptible to logical inversion (negation) failures, warranting research on reasoning-augmented and logic-aware architectures.

Practically, these results bear directly on the design and deployment of agentic retrieval-augmented generation (RAG) systems: user-facing or multi-modal assistants need accurate instruction-following, not just topical relevance.

Theoretically, the IFTR paradigm formalized and benchmarked in FollowTable constitutes a new compositional reasoning challenge for information retrieval, requiring advances in multi-view encoding, natural-language reasoning over schema, and instruction-driven representation learning.

Conclusion

FollowTable establishes the IFTR task and enables systematic measurement, in both breadth and depth, of instruction-following in table retrieval. The empirical evidence points to the need both for architectures explicitly sensitive to instruction constraints and for instruction-aligned pre-training objectives. As IFTR reflects emerging user interaction paradigms in structured data retrieval, further research is needed on efficient, logic-sensitive, and instruction-conditioned models, potentially integrating symbolic reasoning or advanced prompt-tuning, to build robust instruction-following retrieval systems for next-generation AI agents.