Inference-Time Knowledge Composition
- Inference-time knowledge composition is a method for dynamically integrating independently trained knowledge modules at runtime, enabling systems to adapt across diverse modalities and tasks.
- It leverages a weighted sum of outputs from specialized pre-trained policies, as demonstrated in the Modality-Composable Diffusion Policy, to perform efficient inference without retraining.
- This approach enhances robustness by compensating for individual modality failures and supports rapid adaptation, yielding notable performance improvements in robotics benchmarks.
Inference-time knowledge composition refers to methods that integrate or combine knowledge—often represented in distinct models, modules, or subcomponents—dynamically at inference (test) time, rather than statically during training. This approach enables systems to adapt, generalize, or transfer knowledge across modalities, domains, or tasks by leveraging existing specialized components or policies in new compositions. The concept is exemplified in the Modality-Composable Diffusion Policy (MCDP), which demonstrates distribution-level policy composition for robotics, as well as in related frameworks for language, vision, and multi-task learning.
1. Principles of Inference-Time Policy Composition
Inference-time policy composition in MCDP operates by aggregating the outputs (specifically, the diffusion noise estimates) of several pre-trained Diffusion Policies (DPs), each conditioned on a distinct input modality (such as RGB images or point clouds). Rather than retraining or fine-tuning a monolithic model on combined modalities, MCDP uses the weighted sum of the noise estimates from the constituent DPs to construct a composed policy directly during the denoising process of generation.
Let \(N\) be the number of modalities, \(m\) an individual modality (e.g., \(\mathrm{img}\), \(\mathrm{pcd}\)), and \(w_m\) the corresponding manual weight (\(\sum_{m} w_m = 1\)). The noise estimates \(\epsilon_\theta^{m}\) from each \(\mathrm{DP}_m\) are combined as follows:

\[
\epsilon_{\mathrm{comp}}(x_t, t) = \sum_{m=1}^{N} w_m \, \epsilon_\theta^{m}(x_t, t)
\]

This composite score is then used in the denoising update:

\[
x_{t-1} = \alpha_t \left( x_t - \gamma_t \, \epsilon_{\mathrm{comp}}(x_t, t) \right) + \mathcal{N}\!\left(0, \sigma_t^2 I\right)
\]

where \(x_t\) is the latent trajectory at denoising step \(t\).
This method is conceptually motivated by compositional approaches for diffusion models and energy-based models, where distribution-level combinations (rather than latent-level concatenations) yield richer, more generalizable results.
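The weighted-sum composition and the denoising loop it drives can be sketched as follows. This is a minimal toy illustration, not the MCDP reference implementation: the function names, the callable stand-ins for the per-modality DPs, and the scheduler arrays are all assumptions made for the example.

```python
import numpy as np

def compose_noise(noise_estimates, weights):
    """Distribution-level fusion: weighted sum of per-modality noise estimates."""
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "composition weights should sum to 1"
    return sum(w * eps for w, eps in zip(weights, noise_estimates))

def mcdp_denoise(policies, weights, x_T, alphas, gammas, sigmas, seed=0):
    """Toy DDPM-style reverse process driven by the composed score.

    `policies` is a list of callables eps(x, t) standing in for the
    pre-trained per-modality Diffusion Policies; `alphas`, `gammas`,
    `sigmas` are the scheduler coefficients of the denoising update.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x_T, dtype=float)
    for t in reversed(range(len(alphas))):
        # One forward pass per constituent DP, then a single weighted fusion.
        eps = compose_noise([p(x, t) for p in policies], weights)
        z = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = alphas[t] * (x - gammas[t] * eps) + sigmas[t] * z
    return x
```

Note that composition happens once per denoising step, on the noise estimates themselves, rather than by concatenating latent features before a joint network.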
2. Modularity, Adaptability, and Data Efficiency
MCDP is modular by design. Policies are trained independently on each modality using available data rather than a costly joint dataset. These modular policies can then be flexibly recombined during deployment by adjusting the weighting coefficients to suit the specific inference conditions—without any new model retraining.
This approach enhances data efficiency: there is no requirement to collect a massive, co-registered multi-modal training corpus. Each DP leverages its own unimodal data, which is easier to acquire and manage. The composition at inference time thus enables rapid adaptation to new sensing configurations or robot platforms, and the system can easily scale to additional modalities.
MCDP also supports balanced generalization: by tuning the composition weights, practitioners can accentuate or de-emphasize individual modalities depending on their reliability or informativeness for a given task or environment.
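Since the source describes the weights as manually tuned per task, the simplest procedure is a grid search over the two-modality mixture. The sketch below assumes a hypothetical `evaluate` callback that rolls out the composed policy and reports a task success rate; it is an illustration of the tuning loop, not part of MCDP itself.

```python
import numpy as np

def sweep_weights(evaluate, step=0.1):
    """Grid-search the scalar weight w for a two-modality composition,
    assigning w to the first DP and 1 - w to the second.

    `evaluate` is a hypothetical callback mapping a weight pair to a
    task success rate (e.g., measured over benchmark rollouts).
    """
    best_pair, best_score = None, -np.inf
    for w in np.arange(0.0, 1.0 + 1e-9, step):
        score = evaluate((w, 1.0 - w))
        if score > best_score:
            best_pair, best_score = (w, 1.0 - w), score
    return best_pair, best_score
```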
3. Empirical Validation and Performance Characteristics
MCDP was evaluated on the RoboTwin dual-arm manipulation benchmark, which features tasks requiring robust perception and precise action. Pre-trained DPs were used for the RGB (DP_img) and point cloud (DP_pcd) modalities. Key empirical findings include:
- Performance Improvements: When both constituent DPs have moderate efficacy (e.g., >30% task success), composing them at inference consistently yields higher task success rates than either DP alone. For example, on the Empty Cup Place task, success improved from 0.42 (DP_img) and 0.62 (DP_pcd) to 0.86 (MCDP) with optimal weighting.
- Robustness to Modality Failures: Visualizations indicate that the composed policy often succeeds where one modality-specific DP fails, by leveraging corrective information from the other modality.
- Weight Sensitivity: Performance typically peaks when the weights favor the stronger DP, though intermediate mixtures sometimes perform best; the method is thus robust overall but sensitive to weight selection.
- Efficiency: The composition is computationally lightweight at inference, requiring no forward passes beyond those of the constituent DPs (unlike classifier-free guidance, which doubles the per-step cost with an extra unconditional pass).
A limitation is noted when one DP is particularly weak; composing its outputs may degrade performance.
4. Applications and Advantages in Robotics and Policy Learning
Inference-time knowledge composition, as operationalized in MCDP, provides several advantages for robot policy learning and deployment:
- Cross-Modality Integration: Facilitates fusing information from multiple perception modalities (vision, point cloud, tactile) without unified large-scale datasets.
- Cross-Domain and Cross-Embodiment Policy Reuse: DPs trained on different robots, simulation domains, or hardware can be composed, assuming compatibility in the action/trajectory space.
- Rapid Policy Adaptation: When a new sensor or modality becomes available, one can train a new DP and immediately combine it with existing ones, supporting rapid system adaptation and robustness to sensor failures or environmental changes.
- Scalability: The framework supports incremental extension—new DPs can be incorporated without retraining or redesigning the composition method.
The approach is relevant for industrial robotics (factories, warehouses), multi-robot systems, and settings requiring agile deployment across variable sensing platforms.
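The incremental-extension property above admits a very small sketch: when a new sensor's DP becomes available, it can be appended to an existing composition and the weights renormalized, with no retraining of the existing policies. The function name and the renormalization scheme are illustrative assumptions.

```python
def add_policy(policies, weights, new_policy, new_weight):
    """Incrementally extend a composition with a newly trained DP.

    Existing policies are untouched; prior weights are rescaled by
    (1 - new_weight) so the full set still sums to 1. This is one
    simple renormalization choice, not a prescribed MCDP rule.
    """
    if not 0.0 < new_weight < 1.0:
        raise ValueError("new_weight must lie in (0, 1)")
    scaled = [w * (1.0 - new_weight) for w in weights]
    return policies + [new_policy], scaled + [new_weight]
```

For example, a tactile DP could be folded into an existing image/point-cloud composition in one call, after which the usual weight tuning applies.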
5. Challenges and Considerations for Further Research
Several practical and theoretical challenges are identified:
- Weight Selection: Current composition weights are set manually and tuned per task. Suboptimal weights can impair results, particularly when one DP is weak. Automating weight selection (e.g., confidence-driven or adaptive strategies) is an open direction.
- Policy Compatibility: The combined DPs must output compatible score estimates (comparable scale, semantics); disparities can hinder effective composition.
- Negative Transfer Risks: Inclusion of a poorly performing DP may reduce, rather than boost, overall performance. Mechanisms to mitigate or suppress weak modality influence are required.
- Scalability to Many Modalities and Policy Types: While current evaluations cover two modalities, scaling to many (or structurally diverse) DPs may require additional normalization or calibration.
Future research directions include learning weight-selection schemes, supporting broader and hierarchical composition (e.g., at the skill or subtask level), and enabling composition across robot embodiments or control spaces through alignment strategies.
6. Comparison with Prior Methods
Traditional approaches require retraining on joint or multi-modal datasets, leading to substantial computational and data costs. In contrast, MCDP achieves composition at the distributional level—enabling reuse of pre-existing, independently trained policies. Unlike classifier-free guidance, which doubles inference computation by mixing conditional and unconditional denoising, MCDP's weighted score fusion is computationally efficient.
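The contrast with classifier-free guidance can be made concrete at the level of the per-step noise combination. The two helper functions below are generic textbook forms written for illustration, not code from either method's implementation: CFG mixes a conditional and an unconditional estimate (two forward passes of one model), while the MCDP-style fusion is a weighted sum over independently trained DPs (one pass each, no unconditional pass).

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: each denoising step needs both a
    conditional and an unconditional forward pass of the same model."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def mcdp_noise(noise_estimates, weights):
    """MCDP-style fusion: one forward pass per pre-trained DP, then a
    weighted sum of the resulting noise estimates."""
    return sum(w * eps for w, eps in zip(weights, noise_estimates))
```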
A strength is the decoupling of training and deployment; new knowledge sources can be composed at run-time, facilitating continual policy improvement and adaptation as new perception streams or tasks are encountered.
7. Summary Perspective
Inference-time knowledge composition, as realized by MCDP, exemplifies a practical, modular, and scalable avenue for enhancing policy generalization and adaptability. By composing independently learned policy distributions according to the requirements of each deployment, robotics and other sequential decision systems can achieve rapid adaptation to diverse environments and sensing configurations, all with minimal additional computational burden or retraining.
| Property | Traditional Single-Modality DP | Modality-Composable Diffusion Policy (MCDP) |
|---|---|---|
| Training Requirement | Joint training for each modality set | Independent training per modality |
| Inference-time Modality Fusion | Not supported | Distribution-level composition (weighted sum) |
| Adaptation to New Modalities | Retraining needed | Immediate, by adding a new DP and adjusting weights |
| Computational Efficiency | Single model, moderate | Multiple lightweight DPs, no extra cost |
| Performance on Multi-modal Tasks | Limited by modality | Enhanced by complementary information |
| Scalability | Tied to training set size | Modular, extensible |
This approach positions inference-time composition as a foundational design pattern for future modular, robust, and adaptive AI systems in robotics and beyond.