Terminology Concordant Queries (TCQ)
- A TCQ is a rigorously defined information retrieval query that includes exact domain-specific technical terms, ensuring answerability from a specified passage.
- The TCQ generation pipeline leverages document layout detection, passage chunking, and iterative LLM-guided densification to inject precise terminology.
- TCQs enable controlled benchmarking of IR models by contrasting lexical-matching capability with terminology-agnostic queries, highlighting practical retrieval strengths and limitations.
A Terminology Concordant Query (TCQ) is a rigorously constructed information retrieval (IR) query that, by definition, contains one or more domain-specific technical terms verbatim and is answerable from a specified source passage. TCQs were introduced in the context of the STELLA framework for aerospace IR benchmarking, which provides controlled, paired query sets to disentangle and analyze lexical and semantic matching in embedding and retrieval models (Kim, 7 Jan 2026). Each TCQ is designed to stress-test a system’s ability to perform exact lexical matching against domain terminology, in contrast to Terminology Agnostic Queries (TAQs), which omit surface-form technical terms and instead employ descriptive paraphrases.
1. Formal Definition and Construction
Given a passage set P (e.g., from NASA Technical Reports) and a domain-specific terminology dictionary D constructed from P, a TCQ is defined as follows:

q = g(p, i) such that ∃ t ∈ T_p with t appearing verbatim in q and q answerable from p, where p ∈ P, i ∈ I, T_p ≠ ∅.

Here, T_p = {t ∈ D : t occurs in p} denotes the set of dictionary terms present in passage p, I is a finite set of information-seeking objectives (including definitions, numeric queries, procedural/operational, component, and anomaly categories), and g is a query generation function producing exactly one TCQ per (p, i) pair. Every TCQ includes at least one member of T_p verbatim (preserving original spelling/case/hyphenation) and is guaranteed to be answerable from p. In contrast, a TAQ constructed for the same passage and intent excludes all t ∈ T_p from its surface form (Kim, 7 Jan 2026).
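The verbatim-inclusion constraint is mechanically checkable. A minimal sketch in Python (the function name and plain substring matching are illustrative assumptions; answerability itself requires LLM or human judgment and is not checked here):

```python
def is_valid_tcq(query: str, passage: str, dictionary: set[str]) -> bool:
    """Check the mechanically testable TCQ constraint: at least one
    dictionary term occurring in the passage must appear verbatim
    (case- and hyphenation-preserving) in the query."""
    t_p = {t for t in dictionary if t in passage}  # terms present in passage
    return any(t in query for t in t_p)

d = {"Navier-Stokes", "CFD"}
p = "The Navier-Stokes solver uses CFD meshes."
print(is_valid_tcq("How does the Navier-Stokes solver converge?", p, d))  # True
print(is_valid_tcq("How does the fluid solver converge?", p, d))          # False
```

Because matching is verbatim, a query using "Navier Stokes" without the hyphen would fail the check, mirroring the spelling/case/hyphenation requirement above.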
2. Generation Pipeline in the STELLA Framework
The process for constructing TCQs (as implemented in STELLA) consists of the following stages:
- Document Layout Detection: NASA Technical Report PDFs are processed using DocLayout-YOLO, detecting and ordering text regions sequentially while filtering low-confidence non-text blocks (confidence < 0.25).
- Passage Chunking: The Recursive-Token-Chunker splits text into overlapping 100-token chunks, yielding the passage set P (approximately 2.4 million passages).
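The chunking step can be sketched as a sliding window. The 100-token size is from the pipeline description; the 20-token overlap and the pre-tokenized input are placeholder assumptions for this sketch:

```python
def chunk_tokens(tokens: list[str], size: int = 100, overlap: int = 20) -> list[list[str]]:
    """Split a token sequence into fixed-size windows that overlap,
    so passage boundaries do not cut context off mid-sentence."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(tokens[start:start + size])
    return chunks

tokens = [f"tok{i}" for i in range(250)]
chunks = chunk_tokens(tokens)
print(len(chunks), len(chunks[0]))  # 3 100
```

With 250 tokens this yields three windows (0-99, 80-179, 160-249); every token is covered and each boundary region appears in two passages.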
- Terminology Dictionary Construction:
- Regex-based extraction of acronyms (e.g., CFD), hyphenated compounds (e.g., Navier-Stokes), technical notation (e.g., 3-sigma, HO).
- Filtering applies a document-frequency threshold, restricts part of speech to (proper) nouns, and enforces a specificity threshold via wordfreq, retaining only rare or technical vocabulary.
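A rough sketch of the regex-based extraction stage. The patterns below are illustrative approximations of the categories named above, not STELLA's exact expressions, and the document-frequency, POS, and wordfreq filters would run downstream:

```python
import re

ACRONYM = re.compile(r"\b[A-Z]{2,6}\b")                        # e.g. CFD, LOX
HYPHEN_COMPOUND = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")  # e.g. Navier-Stokes
NOTATION = re.compile(r"\b\d+-[a-z]+\b")                       # e.g. 3-sigma

def extract_term_candidates(text: str) -> set[str]:
    """Collect raw terminology candidates from one passage; frequency,
    part-of-speech, and rarity filtering are applied separately."""
    terms: set[str] = set()
    for pattern in (ACRONYM, HYPHEN_COMPOUND, NOTATION):
        terms.update(pattern.findall(text))
    return terms

text = "CFD solutions of the Navier-Stokes equations use 3-sigma bounds."
print(sorted(extract_term_candidates(text)))  # ['3-sigma', 'CFD', 'Navier-Stokes']
```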
- Candidate Passage Selection: Passages containing dictionary terms (non-empty T_p) are intent-classified via prompt-based GPT-5 and clustered using EmbeddingGemma-300m embeddings with k-medoids; 100 passages per intent are selected (500 in total).
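For illustration, a plain alternating k-medoids routine (assignment step, then medoid update) over an arbitrary distance function; this is a generic stand-in, not STELLA's clustering configuration:

```python
import random

def k_medoids(points, k, dist, iters=50, seed=0):
    """Basic k-medoids: assign each point to its nearest medoid, then
    replace each medoid with the cluster member minimizing total
    within-cluster distance, until medoids stop changing."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for i in range(len(points)):
            nearest = min(medoids, key=lambda mm: dist(points[i], points[mm]))
            clusters[nearest].append(i)
        new = [min(members, key=lambda c: sum(dist(points[c], points[j]) for j in members))
               for members in clusters.values()]
        if sorted(new) == sorted(medoids):
            break
        medoids = new
    return medoids

points = [0, 1, 2, 10, 11, 12]
meds = k_medoids(points, 2, lambda a, b: abs(a - b))
print(sorted(points[m] for m in meds))  # [1, 11]
```

On the toy 1-D data the two cluster representatives converge to the points 1 and 11, the medoids of {0, 1, 2} and {10, 11, 12}.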
- Dual-type Query Generation (TCQ focus):
- For passage , intent , query generation proceeds in three iterative LLM-guided steps:
- Seed: Generate an intent-compliant question omitting all terms in T_p.
- First Densification: One term is injected verbatim via Chain-of-Density (CoD) and validated through self-reflection for answerability, format, length, and intent compliance.
- Second Densification: A second distinct term from T_p is injected and validated in the same way.
- At each step, recognized and added entities are tracked, and the process halts if format/length/intent criteria are not met. The final query is returned as the TCQ.
Pseudocode:

```
seed ← LLM.query_seed(p, i, ban_terms=T_p)
t_a, t_b ← sample_two_distinct_terms(T_p)
q₁ ← seed
q₂ ← LLM.coDense(q₁, add_term=t_a, self_reflect=True)
q₃ ← LLM.coDense(q₂, add_term=t_b, self_reflect=True)
return q₃
```
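The loop can be sketched in runnable Python with a stubbed LLM interface; the `query_seed`/`codense`/`reflect` methods and the word-count reflection check are assumptions of this sketch, not STELLA's actual implementation:

```python
import random

def generate_tcq(passage, intent, terms, llm, max_terms=2, seed=0):
    """Chain-of-Density style generation: start with a term-free seed
    query, then inject dictionary terms one at a time, halting if the
    self-reflection check rejects a candidate."""
    rng = random.Random(seed)
    query = llm.query_seed(passage, intent, ban_terms=terms)
    for term in rng.sample(sorted(terms), min(max_terms, len(terms))):
        candidate = llm.codense(query, add_term=term)
        if not llm.reflect(candidate, passage, intent):
            break  # halt if format/length/intent criteria fail
        query = candidate
    return query

class StubLLM:
    """Deterministic stand-in for the prompt-based LLM calls."""
    def query_seed(self, passage, intent, ban_terms):
        return "What maintains stable operation in this stage?"
    def codense(self, query, add_term):
        return query.rstrip("?") + f" given the {add_term}?"
    def reflect(self, candidate, passage, intent):
        return len(candidate.split()) <= 40  # crude length check

q = generate_tcq("...", "procedure", {"propellant", "staged combustion cycle"}, StubLLM())
print(q)
```

With the stub, both sampled terms pass reflection and appear verbatim in the returned query, matching the two-densification structure of the pseudocode above.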
3. Evaluation Methods and Lexical-Matching Metrics
TCQs are used to measure the lexical-matching capability of IR models, for which classical probabilistic ranking functions such as BM25 are utilized. Given a query q and passage p:

BM25(q, p) = Σ_{t ∈ q} IDF(t) · f(t, p) · (k₁ + 1) / ( f(t, p) + k₁ · (1 − b + b · |p| / avgpl) )

where f(t, p) is the term frequency of t in p, |p| is the passage length, avgpl is the average passage length over the collection, k₁ and b are the standard BM25 tuning parameters, and IDF(t) is the inverse document frequency. Retrieval is evaluated via normalized discounted cumulative gain at top-k (nDCG@k):

nDCG@k = DCG@k / IDCG@k,  with DCG@k = Σ_{j=1}^{k} (2^{rel_j} − 1) / log₂(j + 1)

where rel_j is the graded relevance of the passage at rank j and IDCG@k is the DCG of the ideal ranking.
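Both metrics are straightforward to implement. The sketch below uses conventional defaults k₁ = 1.2 and b = 0.75 (standard values, not ones confirmed for STELLA) and represents passages as token lists:

```python
import math

def bm25_score(query_terms, passage, corpus, k1=1.2, b=0.75):
    """BM25 over token-list passages with the usual smoothed IDF."""
    n = len(corpus)
    avg_len = sum(len(p) for p in corpus) / n
    score = 0.0
    for t in query_terms:
        df = sum(1 for p in corpus if t in p)  # document frequency
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
        tf = passage.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(passage) / avg_len))
    return score

def ndcg_at_k(relevances, k=10):
    """relevances: graded relevance of retrieved passages in rank order."""
    dcg = sum((2**r - 1) / math.log2(j + 2) for j, r in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum((2**r - 1) / math.log2(j + 2) for j, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

corpus = [["propellant", "flow", "stable"], ["thrust", "vector", "control"]]
print(round(bm25_score(["propellant"], corpus[0], corpus), 3))  # 0.693 (= ln 2 here)
```

A perfectly ordered ranking gives nDCG@k = 1.0; any demotion of a relevant passage lowers it, which is what the TCQ/TAQ gap below exploits.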
The lexical-dependency gap for a model m and intent i is defined as the nDCG@10 difference between TCQs and TAQs:

Δ(m, i) = nDCG@10_TCQ(m, i) − nDCG@10_TAQ(m, i)

Large values of Δ indicate models reliant on exact lexical overlap, while small values indicate stronger semantic matching (Kim, 7 Jan 2026).
4. Illustrative Example
Consider the following passage and terminology:
- Passage: "The LOX/hydrocarbon propellant combination in this stage features a staged combustion cycle. Chamber pressure is maintained at 10 MPa, and injector design mitigates combustion instability."
- Terminology: T_p = { "propellant", "staged combustion cycle", "combustion instability", ... }
- Intent: Procedure/Operation.
Step-wise TCQ Generation:
- Seed (no terms): "What pressure and design features control stable operation in this engine stage?"
- Densify (+“propellant”): "What pressure and propellant flow arrangements ensure stable operation in this engine stage?"
- Densify again (+“staged combustion cycle”): "What propellant flow and staged combustion cycle parameters maintain stable operation at 10 MPa?"
- Final TCQ: "What propellant flow and staged combustion cycle parameters maintain stable operation at 10 MPa?"
This process strictly enforces the inclusion of at least one technical term verbatim from the source and yields queries with high terminological fidelity suitable for lexical-matching metrics (Kim, 7 Jan 2026).
5. Comparative Role versus Terminology Agnostic Queries (TAQs)
Each (TCQ, passage) pair is complemented with a TAQ, generated by paraphrasing technical terms via context-derived explanations (e.g., substituting "propellant" with "chemical substance burned to generate thrust"). The use of these dual query types permits a principled, quantitative disentanglement of lexical and semantic retrieval capacity for IR models. BM25 ranking assesses pure lexical overlap, while dense-embedding retrieval (e.g., Llama-Embed-Nemotron) evaluates semantic matching. The paired query construction allows the controlled measurement of how much retrieval performance is attributable to surface-form term recognition versus semantic generalization.
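The intended effect of the pairing can be illustrated with a deliberately simple lexical scorer; Jaccard token overlap stands in for BM25 here to keep the sketch self-contained, and the example queries are hypothetical:

```python
def lexical_overlap(query: str, passage: str) -> float:
    """Jaccard overlap of lowercased whitespace tokens: a minimal
    surface-form scorer for contrasting TCQ and TAQ behavior."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p)

passage = "the propellant flow and staged combustion cycle maintain stable operation"
tcq = "what propellant flow and staged combustion cycle parameters maintain stable operation"
taq = "what fuel delivery and two-stage burning arrangement keep the engine running steadily"

print(lexical_overlap(tcq, passage) > lexical_overlap(taq, passage))  # True
```

The TCQ scores highly on any lexical scorer because it shares surface forms with the passage, while the TAQ's paraphrases can only be bridged by a semantic matcher; the gap between the two is exactly what the paired design measures.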
6. Strengths and Limitations
Benefits:
- TCQs afford precise measurement of lexical retrieval, since each query contains explicit technical terms, thereby isolating the ability of models to match on surface form.
- TCQs, when paired with TAQs, enable controlled analysis of the lexical-semantic matching spectrum and quantify the retrieval dependence on direct terminology overlap.
- The construction of TCQs reflects real-world domain practices: engineers and practitioners often use exact part names, acronyms, and domain-specific jargon in practical search scenarios, enhancing ecological validity.
Limitations:
- TCQs are machine-generated and can exhibit a uniform or over-structured style compared to genuine user queries.
- They emphasize lexical matching exclusively and thus do not test semantic retrieval for synonymy or innovative paraphrasing of terms.
- TCQs are restricted to queries answerable from a single passage, excluding broader multi-hop reasoning or negative (unanswerable) query types.
- Cross-lingual TCQs retain English technical terms, which may not always mirror actual user translation/reformulation practices in non-English contexts (Kim, 7 Jan 2026).
A plausible implication is that while TCQs provide a robust framework for benchmarking lexical retrieval, a holistic evaluation of IR systems also requires complementary query formats such as TAQs and realistic user logs.
7. Applications and Impact in Domain-Specific IR Benchmarking
TCQs form a core component of the STELLA aerospace IR benchmark, which enables reproducible and interpretable evaluation of lexical and semantic search models in technical document collections. Their design allows benchmarking classical methods (e.g., BM25) directly against deep embedding models, with evidence showing that lexical methods remain competitive in technical domains where exact term matching is essential. The approach pioneered by TCQs advances domain-specific IR benchmarking by ensuring terminological rigor and systematic measurement of retrieval models’ handling of surface-form terminology, which is crucial in high-stakes engineering and scientific information systems (Kim, 7 Jan 2026).