
Query-Adaptive Policy Networks

Updated 17 November 2025
  • Query-adaptive policy networks are models whose policy outputs depend on both the state and a query, so that the query guides task-specific decisions.
  • They employ architectures like pseudo-siamese networks and cross-attention modules to effectively fuse query and state information for improved control and inference.
  • Empirical results demonstrate that query-conditioning significantly boosts performance metrics in areas such as temporal prediction, image retrieval, and adaptive experimental design.

A query-adaptive policy network is an architectural and algorithmic pattern in which the policy (i.e., decision-making function) is directly conditioned on a query or task specification, enabling a single model to adapt its behavior or inference to the structure and semantics of the provided query. This paradigm exists at the intersection of reinforcement learning, adaptive representation learning, and modular neural inference, and encompasses applications from temporal knowledge graph prediction and deep hashing to interpretable reinforcement learning and Bayesian optimal experimental design. Recent work establishes the necessity of bespoke architectures, explicit query-conditioned processing, and adaptive gating to realize the full potential of flexible, query-driven policy inference.

1. Foundations and Formal Definitions

Formally, a query-adaptive policy network generalizes the standard policy $\pi_\theta(s)$, which maps states $s$ to actions, into a conditional map $\pi_\theta(s, q)$ where $q$ encodes a query or task. This enables the agent or model to calibrate its inferences or actions given not only the observed environment state but also an explicit specification of what is to be answered or accomplished.

  • In temporal knowledge prediction (Shao et al., 2022), the environment is a partially observed Markov decision process with state $s_k = (e_k, t_k, s_q, r_q, o_q, t_q)$, and the policy is conditioned on the query parameters $(s_q, r_q, t_q)$.
  • In interpretable RL with query-specific modules (Zakershahrak, 11 Nov 2025), the policy (and associated heads) is conditioned on a query $q$ selecting between types: point queries (policy, value, $Q$), set queries (reachability), path queries, and comparative queries.
  • For retrieval tasks (Wang et al., 2019), policies for sampling hash codes are directly adapted to the retrieval query image.
  • In adaptive design (Ivanova et al., 2021), the policy for selecting the experimental design $\xi_t$ depends on the query, defined implicitly by the prior over model parameters and the experimental history.

A query $q$ may encapsulate relational, temporal, comparative, or task-specific instructions, often realized as an embedding or structured token sequence, and modulate the policy via architectural fusion or dynamic gating mechanisms.
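To make the conditional map concrete, the following minimal sketch (a simplification with hypothetical module names; it assumes a discrete action space and a small fixed set of query types) fuses a learned query embedding with state features by simple concatenation before the policy head.

```python
import torch
import torch.nn as nn

class QueryConditionedPolicy(nn.Module):
    """Minimal pi_theta(s, q): the action distribution depends on both the
    observed state s and an explicit query / task specification q."""

    def __init__(self, state_dim, num_queries, query_dim, hidden_dim, num_actions):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, query_dim)         # q -> dense embedding
        self.state_encoder = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.policy_head = nn.Sequential(                                # fused features -> action logits
            nn.Linear(hidden_dim + query_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_actions))

    def forward(self, state, query_id):
        h_s = self.state_encoder(state)               # state features
        h_q = self.query_embed(query_id)              # query embedding
        fused = torch.cat([h_s, h_q], dim=-1)         # simplest possible fusion: concatenation
        return torch.distributions.Categorical(logits=self.policy_head(fused))


# The same network answers different queries over the same states.
policy = QueryConditionedPolicy(state_dim=8, num_queries=4, query_dim=16,
                                hidden_dim=64, num_actions=5)
states = torch.randn(2, 8)
dist = policy(states, torch.tensor([0, 3]))           # two states paired with two query types
actions = dist.sample()                               # query-dependent action choices
```

The architectures surveyed below replace this plain concatenation with richer fusion such as cross-attention or adaptive gating.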

2. Architectural Building Blocks and Adaptivity Mechanisms

Implementations share a common structure: a query representation module, a (possibly modular) policy or inference head, and means for query–state fusion. Specific designs include:

  • Pseudo-Siamese Policy Networks (Shao et al., 2022): Two LSTM-based sub-policies are defined. Policy I captures static entity–relation sequences; Policy II encodes temporal relation–time paths via a learned temporal relation encoder. An adaptive gating mechanism computes a per-step gate $g_k = \sigma(W_g [h_{k-1}^t; r_k^{t_k}; r_q])$ to interpolate between static and temporal reasoning.
  • Query Conditioned Deterministic Inference Networks (QDIN) (Zakershahrak, 11 Nov 2025): A query $q$ is embedded, via separate type and parameter encoders, and fused with state features through cross-attention (a fusion sketch follows this list). The architecture separates inference heads for different query types:
    • Policy head for action selection
    • Reachability head using transpose convolutions with skip-connections
    • Path head (LSTM-pointer net)
    • Comparison head (contrastive MLP towers)
  • Listwise Deep Policy Hashing (Wang et al., 2019): Each query is an image, and the query network generates a vector $s \in [0,1]^K$ parameterizing a Bernoulli policy over code bits, which is sampled to form adaptive hash codes.
  • Implicit Deep Adaptive Design (iDAD) (Ivanova et al., 2021): The policy network $\pi_\phi$ is a function from experimental history to the next design, implemented via MLPs, LSTM, or self-attention aggregation, depending on conditional independence assumptions.
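The cross-attention fusion used in QDIN can be sketched as follows; this is a hypothetical module that assumes the state is exposed as a set of feature tokens, with head counts and dimensions chosen purely for illustration.

```python
import torch
import torch.nn as nn

class QueryStateCrossAttention(nn.Module):
    """Fuses a query embedding with state features via cross-attention:
    the query attends over state tokens (e.g. flattened spatial features)."""

    def __init__(self, query_dim, state_dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=query_dim, num_heads=num_heads,
                                          kdim=state_dim, vdim=state_dim,
                                          batch_first=True)

    def forward(self, query_embedding, state_tokens):
        # query_embedding: (batch, 1, query_dim); state_tokens: (batch, num_tokens, state_dim)
        fused, attn_weights = self.attn(query=query_embedding,
                                        key=state_tokens, value=state_tokens)
        return fused.squeeze(1), attn_weights    # query-conditioned state summary
```

The resulting summary feeds whichever inference head the query type selects (policy, reachability, path, or comparison).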

Adaptivity is typically realized through either (1) explicit selection, interpolation, or composition of policy outputs via gates, or (2) direct parameterization of the policy by the query embedding, such that gradient flow from reward signals adjusts the entire query–policy map.
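A minimal sketch of the first route, gated interpolation, is given below. It loosely follows the per-step gate of the pseudo-siamese design above, with hypothetical feature names and a scalar gate; the published model instead operates on the LSTM hidden states of the two sub-policies.

```python
import torch
import torch.nn as nn

class GatedDualPolicy(nn.Module):
    """Query-dependent interpolation of two sub-policies, loosely following
    g_k = sigma(W_g [h; r_t; r_q]) from the pseudo-siamese design."""

    def __init__(self, feat_dim, num_actions):
        super().__init__()
        self.static_head = nn.Linear(feat_dim, num_actions)     # static entity-relation reasoning
        self.temporal_head = nn.Linear(feat_dim, num_actions)   # temporal relation-time reasoning
        self.gate = nn.Linear(3 * feat_dim, 1)                  # consumes [h_static; h_temporal; r_query]

    def forward(self, h_static, h_temporal, r_query):
        g = torch.sigmoid(self.gate(torch.cat([h_static, h_temporal, r_query], dim=-1)))
        logits = g * self.temporal_head(h_temporal) + (1.0 - g) * self.static_head(h_static)
        return torch.distributions.Categorical(logits=logits), g
```

The second route needs no gate at all: the query embedding is simply appended to the policy input (as in the Section 1 sketch) and the full query–policy map is shaped end-to-end by the reward gradient.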

3. Learning Objectives and Training Dynamics

The training regimes for query-adaptive policy networks are multi-objective and involve explicit reinforcement of query-specific competence:

$\mathcal{L}(\theta) = \alpha_{\mathrm{control}}\,\mathcal{L}_{\mathrm{TD}} + \sum_{q \in \mathcal{Q}} \alpha_{q}\,\widehat{\mathcal{L}}_q + \lambda\,\mathcal{L}_{\mathrm{consistency}}$

Each query head computes its own loss: binary cross-entropy for reachability masks, mean-absolute error and cross-entropy for paths, contrastive losses for comparisons, and policy cross-entropy, all normalized to handle different learning rates and scales.
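A hedged sketch of how such a combined objective could be assembled is shown below; the weight values, dictionary keys, and normalization are illustrative rather than those of the cited work.

```python
import torch

def combined_loss(td_loss, query_losses, consistency_loss,
                  alpha_control=1.0, alpha_query=None, lam=0.1):
    """L(theta) = alpha_control * L_TD + sum_q alpha_q * L_q + lambda * L_consistency.

    query_losses: dict mapping query type -> scalar loss tensor, e.g.
                  {"reachability": bce, "path": mae_plus_ce, "comparison": contrastive}.
    """
    alpha_query = alpha_query or {q: 1.0 for q in query_losses}      # default: equal weights
    query_term = sum(alpha_query[q] * loss for q, loss in query_losses.items())
    return alpha_control * td_loss + query_term + lam * consistency_loss
```

In practice the per-head losses would be normalized to comparable magnitudes before weighting, so that no single query type dominates the shared trunk.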

  • Policy Gradient Methods (Shao et al., 2022, Wang et al., 2019, Ivanova et al., 2021):
    • In temporal KG prediction, REINFORCE is used to maximize expected terminal reward for correct answer prediction, employing query-specific rollouts weighted by adaptive gates.
    • In retrieval, the policy gradient aims to maximize expected listwise retrieval reward, e.g., Average Precision (AP), using a baseline for variance reduction.
    • In adaptive experimental design, variational lower bounds (InfoNCE, NWJ) replace classical BOED objectives, enabling policy gradients via reparameterization for implicit models.

Curriculum sampling strategies (e.g., increasing query complexity over time) further regularize the learning process and ensure robustness to query structure (Zakershahrak, 11 Nov 2025).
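As one concrete instance of the policy-gradient training described above, the sketch below shows a single REINFORCE update for a query-adaptive Bernoulli code policy in the retrieval setting, using a running-mean baseline for variance reduction; `query_net` and `reward_fn` are hypothetical stand-ins, the latter for a listwise metric such as Average Precision.

```python
import torch

def reinforce_step(query_net, images, reward_fn, baseline, optimizer, beta=0.9):
    """One REINFORCE update for a query-adaptive Bernoulli hash-code policy.

    query_net : maps query images to per-bit probabilities s in [0, 1]^K
    reward_fn : scores sampled binary codes (e.g. listwise Average Precision)
    baseline  : running mean of past rewards, used for variance reduction
    """
    probs = query_net(images)                                # (batch, K) Bernoulli parameters
    dist = torch.distributions.Bernoulli(probs=probs)
    codes = dist.sample()                                     # adaptive hash codes in {0, 1}^K
    reward = reward_fn(codes)                                  # (batch,) listwise reward
    advantage = reward - baseline                              # baseline subtraction reduces variance
    log_prob = dist.log_prob(codes).sum(dim=-1)                # log pi(codes | query)
    loss = -(advantage.detach() * log_prob).mean()             # REINFORCE objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return beta * baseline + (1 - beta) * reward.mean().item()  # updated running baseline
```

A curriculum strategy would additionally schedule which queries enter each batch, starting from simple ones and increasing complexity over training.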

4. Empirical Results and Efficacy of Query-Adaptivity

Across domains, query-adaptive policy networks provide notable performance and interpretability benefits:

| Model / Setting | Task | Key Metrics | Result Highlights |
| --- | --- | --- | --- |
| QDIN (Mixed) (Zakershahrak, 11 Nov 2025) | RL (MDP queries) | Reach IoU, Path MAE, Return | 0.97, 2.1, 0.82 under mixed training |
| Pseudo-Siamese PN (Shao et al., 2022) | Temporal KG | ICEWS2014 MRR; WIKI Hits@10 | 0.429; 0.804 (SOTA) |
| Deep Policy Hashing (Wang et al., 2019) | Image retrieval | mAP over full dataset | Outperforms all baselines |
| iDAD (Ivanova et al., 2021) | Experimental design | Expected Information Gain (nats) | ≈7.75 (vs. random ≈4.79) in 20-D location finding; sub-20 ms proposal times |

Ablation studies in (Zakershahrak, 11 Nov 2025) reveal that removing query–state cross-attention drops reachability IoU by 12 points and inflates path MAE by +3.4, while removing the specialized heads reduces inference accuracy by 15–18 points. Removing the gating mechanism in the pseudo-siamese PN yields an ≈3.5% drop in MRR (Shao et al., 2022).

A salient phenomenon reported in (Zakershahrak, 11 Nov 2025) is the decoupling between inference accuracy and control: query-only training yields reachability IoU of 0.99 but return of 0.31, whereas control-only achieves return 0.89 but IoU only 0.72. This suggests fundamentally distinct representational requirements for world-model inference and control performance.

5. Case Studies and Deployment Scenarios

  • Temporal Knowledge Graphs: The adaptive pseudo-siamese policy network enables both efficient prediction for previously seen entities and inductive generalization to unseen entities via semantic edges at step 0, combined with a temporal relation encoder. The adaptive gate dynamically interpolates between static and temporal reasoning, tailored to whether the query subject occurs in the history.
  • Interpretable RL: QDIN provides direct answers to reachability, path, and comparative queries, supporting human-AI collaboration, formal verification, and interpretability.
  • Image Retrieval: Query-adaptive hashing ensures that binary codes for each query maximize global ranking metrics by learning per-query bit distributions, a method only feasible via RL with listwise rewards.
  • Adaptive Experimental Design: iDAD shifts the computational burden offline, enabling millisecond-scale, query-contingent design proposals for complex implicit simulators without tractable likelihoods.

6. Limitations and Prospective Extensions

Key limitations identified in these architectures are:

  • Increased parameterization and complexity, which make offline training (e.g., for iDAD) resource intensive.
  • Dependence on differentiable query embeddings and environments, restricting deployment in discrete or non-differentiable settings (Ivanova et al., 2021).
  • Catastrophic forgetting or suboptimal Pareto trade-offs between objectives unless losses are appropriately normalized and balanced (Zakershahrak, 11 Nov 2025).

Potential research directions include hybrid policies that incorporate stochastic components, extensions to variable or structured query classes, improved regularization across conflicting objectives, and meta-learning for rapid adaptation to new query types or distributions.

A plausible implication is that the explicit separation of inference and control representations, as observed empirically, could lead to the development of RL systems that simultaneously support high-fidelity querying, planning, and robust policy execution, with modular compositionality at the core of future architectures.

7. Impact and Theoretical Significance

The shift to query-adaptive policy networks reframes the role of the agent from merely an action executor to a general inference engine over environment models. Making queries first-class citizens in network and objective design enables new modalities for explainability, verification, human interaction, and sample efficiency.

The demonstrated empirical decoupling between world-model representation and control opens foundational questions about neural decomposition, modularity, and the theoretical underpinnings of reinforcement learning as knowledge acquisition as opposed to pure behavior optimization. This conceptual development establishes a new research agenda where learning, acting, and knowing are not conflated, but instead are enabled to co-exist and be queried systematically within the same architecture.
