Terminology Concordant Queries (TCQ)
- A TCQ is a rigorously defined information retrieval query that includes exact domain-specific technical terms, ensuring answerability from a specified passage.
- The TCQ generation pipeline leverages document layout detection, passage chunking, and iterative LLM-guided densification to inject precise terminology.
- TCQs enable controlled benchmarking of IR models by contrasting lexical-matching capability with terminology-agnostic queries, highlighting practical retrieval strengths and limitations.
A Terminology Concordant Query (TCQ) is a rigorously constructed information retrieval (IR) query that, by definition, contains one or more domain-specific technical terms verbatim and is answerable from a specified source passage. TCQs were introduced in the context of the STELLA framework for aerospace IR benchmarking, which provides controlled, paired query sets to disentangle and analyze lexical and semantic matching in embedding and retrieval models (Kim, 7 Jan 2026). Each TCQ is designed to stress-test a system’s ability to perform exact lexical matching against domain terminology, in contrast to Terminology Agnostic Queries (TAQs), which omit surface-form technical terms and instead employ descriptive paraphrases.
1. Formal Definition and Construction
Given a passage set P (e.g., from NASA Technical Reports) and a domain-specific terminology dictionary D constructed from P, a TCQ is defined as follows:

q = g(p, i) such that ∃ t ∈ T_p with t appearing verbatim in q and q answerable from p, where p ∈ P, i ∈ I, T_p ≠ ∅.

Here, T_p = {t ∈ D : t occurs in p} denotes the set of dictionary terms present in passage p, I is a finite set of information-seeking objectives (including definitions, numeric queries, procedural/operational, component, and anomaly categories), and g is a query generation function producing exactly one TCQ per (p, i) pair. Every TCQ includes at least one member of T_p verbatim (preserving original spelling/case/hyphenation) and is guaranteed to be answerable from p. In contrast, a TAQ constructed for the same passage and intent excludes all t ∈ T_p from its surface form (Kim, 7 Jan 2026).
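The verbatim-inclusion constraint is mechanically checkable. A minimal sketch in Python (the function name and plain substring matching are illustrative assumptions; answerability itself requires LLM or human judgment and is not checked here):

```python
def is_valid_tcq(query: str, passage: str, dictionary: set[str]) -> bool:
    """Check the mechanically testable TCQ constraint: at least one
    dictionary term occurring in the passage must appear verbatim
    (case- and hyphenation-preserving) in the query."""
    t_p = {t for t in dictionary if t in passage}  # terms present in passage
    return any(t in query for t in t_p)

d = {"Navier-Stokes", "CFD"}
p = "The Navier-Stokes solver uses CFD meshes."
print(is_valid_tcq("How does the Navier-Stokes solver converge?", p, d))  # True
print(is_valid_tcq("How does the fluid solver converge?", p, d))          # False
```

Because matching is verbatim, a query using "Navier Stokes" without the hyphen would fail the check, mirroring the spelling/case/hyphenation requirement above.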
2. Generation Pipeline in the STELLA Framework
The process for constructing TCQs (as implemented in STELLA) consists of the following stages:
- Document Layout Detection: NASA Technical Report PDFs are processed using DocLayout-YOLO, detecting and ordering text regions sequentially while filtering low-confidence non-text blocks (confidence < 0.25).
- Passage Chunking: The Recursive-Token-Chunker splits text into overlapping 100-token chunks, yielding the passage set P (approximately 2.4 million passages).
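The chunking step can be sketched as a sliding window. The 100-token size is from the pipeline description; the 20-token overlap and the pre-tokenized input are placeholder assumptions for this sketch:

```python
def chunk_tokens(tokens: list[str], size: int = 100, overlap: int = 20) -> list[list[str]]:
    """Split a token sequence into fixed-size windows that overlap,
    so passage boundaries do not cut context off mid-sentence."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(tokens[start:start + size])
    return chunks

tokens = [f"tok{i}" for i in range(250)]
chunks = chunk_tokens(tokens)
print(len(chunks), len(chunks[0]))  # 3 100
```

With 250 tokens this yields three windows (0-99, 80-179, 160-249); every token is covered and each boundary region appears in two passages.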
- Terminology Dictionary Construction:
- Regex-based extraction of acronyms (e.g., CFD), hyphenated compounds (e.g., Navier-Stokes), technical notation (e.g., 3-sigma, HO).
- Filtering applies a document-frequency threshold, restricts part of speech to (proper) nouns, and enforces a specificity threshold via wordfreq, retaining only rare or technical vocabulary.
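A rough sketch of the regex-based extraction stage. The patterns below are illustrative approximations of the categories named above, not STELLA's exact expressions, and the document-frequency, POS, and wordfreq filters would run downstream:

```python
import re

ACRONYM = re.compile(r"\b[A-Z]{2,6}\b")                        # e.g. CFD, LOX
HYPHEN_COMPOUND = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")  # e.g. Navier-Stokes
NOTATION = re.compile(r"\b\d+-[a-z]+\b")                       # e.g. 3-sigma

def extract_term_candidates(text: str) -> set[str]:
    """Collect raw terminology candidates from one passage; frequency,
    part-of-speech, and rarity filtering are applied separately."""
    terms: set[str] = set()
    for pattern in (ACRONYM, HYPHEN_COMPOUND, NOTATION):
        terms.update(pattern.findall(text))
    return terms

text = "CFD solutions of the Navier-Stokes equations use 3-sigma bounds."
print(sorted(extract_term_candidates(text)))  # ['3-sigma', 'CFD', 'Navier-Stokes']
```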
- Candidate Passage Selection: Passages containing dictionary terms (non-empty T_p) are intent-classified via prompt-based GPT-5 and clustered using EmbeddingGemma-300m embeddings with k-medoids; 100 passages per intent are selected (500 in total).
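For illustration, a plain alternating k-medoids routine (assignment step, then medoid update) over an arbitrary distance function; this is a generic stand-in, not STELLA's clustering configuration:

```python
import random

def k_medoids(points, k, dist, iters=50, seed=0):
    """Basic k-medoids: assign each point to its nearest medoid, then
    replace each medoid with the cluster member minimizing total
    within-cluster distance, until medoids stop changing."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for i in range(len(points)):
            nearest = min(medoids, key=lambda mm: dist(points[i], points[mm]))
            clusters[nearest].append(i)
        new = [min(members, key=lambda c: sum(dist(points[c], points[j]) for j in members))
               for members in clusters.values()]
        if sorted(new) == sorted(medoids):
            break
        medoids = new
    return medoids

points = [0, 1, 2, 10, 11, 12]
meds = k_medoids(points, 2, lambda a, b: abs(a - b))
print(sorted(points[m] for m in meds))  # [1, 11]
```

On the toy 1-D data the two cluster representatives converge to the points 1 and 11, the medoids of {0, 1, 2} and {10, 11, 12}.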
- Dual-type Query Generation (TCQ focus):
- For passage , intent , query generation proceeds in three iterative LLM-guided steps:
- Seed: Generate an intent-compliant question omitting all terms in T_p.
- First Densification: One term is injected verbatim via Chain-of-Density (CoD) and validated through self-reflection for answerability, format, length, and intent compliance.
- Second Densification: A second distinct term from T_p is injected and validated in the same way.
- At each step, recognized and added entities are tracked, and the process halts if format/length/intent criteria are not met. The final query is returned as the TCQ.
Pseudocode:

```
seed ← LLM.query_seed(p, i, ban_terms=T_p)
t_a, t_b ← sample_two_distinct_terms(T_p)
q₁ ← seed
q₂ ← LLM.coDense(q₁, add_term=t_a, self_reflect=True)
q₃ ← LLM.coDense(q₂, add_term=t_b, self_reflect=True)
return q₃
```
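The loop can be sketched in runnable Python with a stubbed LLM interface; the `query_seed`/`codense`/`reflect` methods and the word-count reflection check are assumptions of this sketch, not STELLA's actual implementation:

```python
import random

def generate_tcq(passage, intent, terms, llm, max_terms=2, seed=0):
    """Chain-of-Density style generation: start with a term-free seed
    query, then inject dictionary terms one at a time, halting if the
    self-reflection check rejects a candidate."""
    rng = random.Random(seed)
    query = llm.query_seed(passage, intent, ban_terms=terms)
    for term in rng.sample(sorted(terms), min(max_terms, len(terms))):
        candidate = llm.codense(query, add_term=term)
        if not llm.reflect(candidate, passage, intent):
            break  # halt if format/length/intent criteria fail
        query = candidate
    return query

class StubLLM:
    """Deterministic stand-in for the prompt-based LLM calls."""
    def query_seed(self, passage, intent, ban_terms):
        return "What maintains stable operation in this stage?"
    def codense(self, query, add_term):
        return query.rstrip("?") + f" given the {add_term}?"
    def reflect(self, candidate, passage, intent):
        return len(candidate.split()) <= 40  # crude length check

q = generate_tcq("...", "procedure", {"propellant", "staged combustion cycle"}, StubLLM())
print(q)
```

With the stub, both sampled terms pass reflection and appear verbatim in the returned query, matching the two-densification structure of the pseudocode above.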
3. Evaluation Methods and Lexical-Matching Metrics
TCQs are used to measure the lexical-matching capability of IR models, for which classical probabilistic ranking functions such as BM25 are utilized. Given a query q and passage p:

BM25(q, p) = Σ_{t ∈ q} IDF(t) · f(t, p) · (k₁ + 1) / ( f(t, p) + k₁ · (1 − b + b · |p| / avgpl) )

where f(t, p) is the term frequency of t in p, |p| is the passage length, avgpl is the average passage length over the collection, k₁ and b are the standard BM25 tuning parameters, and IDF(t) is the inverse document frequency. Retrieval is evaluated via normalized discounted cumulative gain at top-k (nDCG@k):

nDCG@k = DCG@k / IDCG@k,  with DCG@k = Σ_{j=1}^{k} (2^{rel_j} − 1) / log₂(j + 1)

where rel_j is the graded relevance of the passage at rank j and IDCG@k is the DCG of the ideal ranking.
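Both metrics are straightforward to implement. The sketch below uses conventional defaults k₁ = 1.2 and b = 0.75 (standard values, not ones confirmed for STELLA) and represents passages as token lists:

```python
import math

def bm25_score(query_terms, passage, corpus, k1=1.2, b=0.75):
    """BM25 over token-list passages with the usual smoothed IDF."""
    n = len(corpus)
    avg_len = sum(len(p) for p in corpus) / n
    score = 0.0
    for t in query_terms:
        df = sum(1 for p in corpus if t in p)  # document frequency
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
        tf = passage.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(passage) / avg_len))
    return score

def ndcg_at_k(relevances, k=10):
    """relevances: graded relevance of retrieved passages in rank order."""
    dcg = sum((2**r - 1) / math.log2(j + 2) for j, r in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum((2**r - 1) / math.log2(j + 2) for j, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

corpus = [["propellant", "flow", "stable"], ["thrust", "vector", "control"]]
print(round(bm25_score(["propellant"], corpus[0], corpus), 3))  # 0.693 (= ln 2 here)
```

A perfectly ordered ranking gives nDCG@k = 1.0; any demotion of a relevant passage lowers it, which is what the TCQ/TAQ gap below exploits.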
The lexical-dependency gap for a model m and intent i is defined as the nDCG@10 difference between TCQs and TAQs:

Δ(m, i) = nDCG@10_TCQ(m, i) − nDCG@10_TAQ(m, i)

Large values of Δ indicate models reliant on exact lexical overlap, while small values indicate stronger semantic matching (Kim, 7 Jan 2026).
4. Illustrative Example
Consider the following passage and terminology:
- Passage: "The LOX/hydrocarbon propellant combination in this stage features a staged combustion cycle. Chamber pressure is maintained at 10 MPa, and injector design mitigates combustion instability."
- Terminology: T_p = { "propellant", "staged combustion cycle", "combustion instability", ... }
- Intent: Procedure/Operation.
Step-wise TCQ Generation:
- Seed (no terms): "What pressure and design features control stable operation in this engine stage?"
- Densify (+“propellant”): "What pressure and propellant flow arrangements ensure stable operation in this engine stage?"
- Densify again (+“staged combustion cycle”): "What propellant flow and staged combustion cycle parameters maintain stable operation at 10 MPa?"
- Final TCQ: "What propellant flow and staged combustion cycle parameters maintain stable operation at 10 MPa?"
This process strictly enforces the inclusion of at least one technical term verbatim from the source and yields queries with high terminological fidelity suitable for lexical-matching metrics (Kim, 7 Jan 2026).
5. Comparative Role versus Terminology Agnostic Queries (TAQs)
Each (TCQ, passage) pair is complemented with a TAQ, generated by paraphrasing technical terms via context-derived explanations (e.g., substituting "propellant" with "chemical substance burned to generate thrust"). The use of these dual query types permits a principled, quantitative disentanglement of lexical and semantic retrieval capacity for IR models. BM25 ranking assesses pure lexical overlap, while dense-embedding retrieval (e.g., Llama-Embed-Nemotron) evaluates semantic matching. The paired query construction allows the controlled measurement of how much retrieval performance is attributable to surface-form term recognition versus semantic generalization.
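The intended effect of the pairing can be illustrated with a deliberately simple lexical scorer; Jaccard token overlap stands in for BM25 here to keep the sketch self-contained, and the example queries are hypothetical:

```python
def lexical_overlap(query: str, passage: str) -> float:
    """Jaccard overlap of lowercased whitespace tokens: a minimal
    surface-form scorer for contrasting TCQ and TAQ behavior."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p)

passage = "the propellant flow and staged combustion cycle maintain stable operation"
tcq = "what propellant flow and staged combustion cycle parameters maintain stable operation"
taq = "what fuel delivery and two-stage burning arrangement keep the engine running steadily"

print(lexical_overlap(tcq, passage) > lexical_overlap(taq, passage))  # True
```

The TCQ scores highly on any lexical scorer because it shares surface forms with the passage, while the TAQ's paraphrases can only be bridged by a semantic matcher; the gap between the two is exactly what the paired design measures.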
6. Strengths and Limitations
Benefits:
- TCQs afford precise measurement of lexical retrieval, since each query contains explicit technical terms, thereby isolating the ability of models to match on surface form.
- TCQs, when paired with TAQs, enable controlled analysis of the lexical-semantic matching spectrum and quantify the retrieval dependence on direct terminology overlap.
- The construction of TCQs reflects real-world domain practices: engineers and practitioners often use exact part names, acronyms, and domain-specific jargon in practical search scenarios, enhancing ecological validity.
Limitations:
- TCQs are machine-generated and can exhibit a uniform or over-structured style compared to genuine user queries.
- They emphasize lexical matching exclusively and thus do not test semantic retrieval for synonymy or innovative paraphrasing of terms.
- TCQs are restricted to queries answerable from a single passage, excluding broader multi-hop reasoning or negative (unanswerable) query types.
- Cross-lingual TCQs retain English technical terms, which may not always mirror actual user translation/reformulation practices in non-English contexts (Kim, 7 Jan 2026).
A plausible implication is that while TCQs provide a robust framework for benchmarking lexical retrieval, a holistic evaluation of IR systems also requires complementary query formats such as TAQs and realistic user logs.
7. Applications and Impact in Domain-Specific IR Benchmarking
TCQs form a core component of the STELLA aerospace IR benchmark, which enables reproducible and interpretable evaluation of lexical and semantic search models in technical document collections. Their design allows benchmarking classical methods (e.g., BM25) directly against deep embedding models, with evidence showing that lexical methods remain competitive in technical domains where exact term matching is essential. The approach pioneered by TCQs advances domain-specific IR benchmarking by ensuring terminological rigor and systematic measurement of retrieval models’ handling of surface-form terminology, which is crucial in high-stakes engineering and scientific information systems (Kim, 7 Jan 2026).