Homologous Latents Fusion in PSSP & Video Restoration
- Homologous Latents Fusion is a methodology for integrating latent representations from models with matching architectures, enabling enhanced prediction accuracy.
- In protein secondary structure prediction, it fuses low-quality evolutionary profiles with BERT-derived pseudo-profiles to reliably predict residue properties in low-homology conditions.
- For zero-shot video restoration, it dynamically blends image and video diffusion model latents using an adaptive Chain-of-Thought strategy to maintain spatial detail and temporal consistency.
Homologous Latents Fusion is a class of methodologies for integrating latent representations, derived from models sharing identical or nearly identical architectures, within a unified latent space. Its two principal variants target protein secondary structure prediction (PSSP) for low-homology proteins and zero-shot video restoration with diffusion models, using data-driven or model-driven fusion of homologous latent vectors. This entry surveys both lines: residue-wise profile fusion for protein sequences (Wang et al., 2021) and framewise latent fusion for temporally consistent video restoration (Cao et al., 29 Jan 2026).
1. Definition and Conceptual Foundation
Homologous Latents Fusion applies when at least two latent representations, produced by distinct but structurally aligned models (e.g., a protein BERT and profiles built from shallow multiple sequence alignments (MSAs), or image and video diffusion models built on a shared VAE), are combined using adaptive or convex weighting. In PSSP, it refers to residue-level fusion of weak evolutionary profiles and external knowledge-derived pseudo-profiles. In vision diffusion, it denotes the linear combination of latents from an image restoration or enhancement (IR/IE) model and a text-to-video (T2V) model that share the same VAE latent space, performed synchronously at each step of the diffusion trajectory.
2. Homologous Latents Fusion in Protein Secondary Structure Prediction
For low-homology proteins, evolutionary profiles constructed from MSAs are often unreliable due to small sample sizes. The homologous latents fusion approach ("Adaptive Residue-wise Profile Fusion") (Wang et al., 2021) addresses this by combining:
- Low-quality profile ($P^{\text{low}}$): Derived from shallow MSAs, with one row $P^{\text{low}}_i$ of amino-acid scores per residue position $i$.
- BERT-derived pseudo-profile ($P^{\text{BERT}}$): Obtained by masking and probing a pretrained protein BERT, producing an implicit residue distribution based on global protein sequence knowledge.
An adaptive fusion is performed at each residue $i$:
$$\tilde{P}_i = \alpha_i\, P^{\text{low}}_i + \beta_i\, P^{\text{BERT}}_i$$
where the weights $\alpha_i$ and $\beta_i$ are inferred from a grading network conditioned on the accuracy of auxiliary PSSP heads for both channels. Supervision is provided by pseudo-labels computed from per-residue cross-entropy errors, penalizing deviation in log space.
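The residue-wise fusion step can be illustrated with a minimal NumPy sketch. Everything named here is hypothetical: `fuse_profiles`, the `(L, 20)` profile shape, and the choice of complementary weights in the usage lines are assumptions for illustration, and the per-residue weights are passed in directly rather than produced by a grading network.

```python
import numpy as np

def fuse_profiles(p_low, p_bert, w_low, w_bert):
    """Residue-wise weighted sum of two (L, A) profiles.

    w_low and w_bert are (L,) per-residue weights. In the method described
    above they would come from the grading network; here they are inputs.
    """
    p_low = np.asarray(p_low, dtype=float)
    p_bert = np.asarray(p_bert, dtype=float)
    # Add a trailing axis so each residue's weight broadcasts over its A channels.
    w_low = np.asarray(w_low, dtype=float)[:, None]
    w_bert = np.asarray(w_bert, dtype=float)[:, None]
    return w_low * p_low + w_bert * p_bert

# Toy data: 5 residues, 20 amino-acid channels, each row a probability vector.
rng = np.random.default_rng(0)
num_residues, num_amino_acids = 5, 20
p_low = rng.dirichlet(np.ones(num_amino_acids), size=num_residues)
p_bert = rng.dirichlet(np.ones(num_amino_acids), size=num_residues)

# Assumed complementary weights (w_bert = 1 - w_low), i.e. a convex combination.
w_low = np.array([0.9, 0.5, 0.1, 0.0, 1.0])
fused = fuse_profiles(p_low, p_bert, w_low, 1.0 - w_low)
```

With complementary weights each fused row stays a probability vector; a weight of 0 falls back entirely on the BERT pseudo-profile for that residue, and a weight of 1 keeps the MSA profile unchanged.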
A feature consistency loss ensures that the fused representation remains semantically aligned with true high-quality profiles by matching BiLSTM features, KL-divergence between predicted softmax distributions, and final PSSP cross-entropy.
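As a rough sketch of how three such terms could be computed, the following NumPy code evaluates a feature-matching term, a KL term, and a cross-entropy term. The function name `consistency_terms`, the use of mean squared error for the feature match, the KL direction (reference relative to fused), and the absence of term weights are all assumptions; the source does not specify them.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def consistency_terms(feat_fused, feat_ref, logits_fused, logits_ref, labels, eps=1e-12):
    """Return (feature term, KL term, cross-entropy term), each averaged over residues.

    feat_*   : (L, D) intermediate features (e.g. BiLSTM outputs)
    logits_* : (L, C) secondary-structure logits
    labels   : (L,) integer class labels
    """
    # Assumption: features are matched with mean squared error.
    feat_term = np.mean((feat_fused - feat_ref) ** 2)
    p_ref = softmax(logits_ref)
    p_fused = softmax(logits_fused)
    # Assumption: KL(reference || fused) between predicted distributions.
    kl_term = np.mean(np.sum(p_ref * (np.log(p_ref + eps) - np.log(p_fused + eps)), axis=-1))
    # Standard cross-entropy of the fused-branch prediction against the labels.
    ce_term = -np.mean(np.log(p_fused[np.arange(len(labels)), labels] + eps))
    return feat_term, kl_term, ce_term

rng = np.random.default_rng(0)
feat_ref = rng.normal(size=(6, 8))
logits_ref = rng.normal(size=(6, 3))
labels = np.array([0, 1, 2, 0, 1, 2])

# Identical branches: the two matching terms vanish, only cross-entropy remains.
same = consistency_terms(feat_ref, feat_ref, logits_ref, logits_ref, labels)
# Perturbed fused branch: both matching terms become positive.
diff = consistency_terms(feat_ref + 0.5, feat_ref, logits_ref[:, ::-1], logits_ref, labels)
```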
This residue-wise mechanism is especially effective for orphan sequences, as the BERT pseudo-profile supplies informative priors while the adaptive fusion weight allocates confidence locally according to evolutionary signal quality.
3. Homologous Latents Fusion in Diffusion-Based Video Restoration
In the context of zero-shot video restoration (Cao et al., 29 Jan 2026), homologous latents fusion capitalizes on the architectural alignment between state-of-the-art image restoration models and video diffusion models such as Zeroscope (SD v1.5-based). Both operate in an identical VAE latent space, enabling direct convex fusion.
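At each reverse-diffusion timestep $t$:
$$z_t^{\text{fuse}} = \lambda_t\, z_t^{\text{IR}} + (1 - \lambda_t)\, z_t^{\text{T2V}}$$
where $z_t^{\text{IR}}$ is the IR model's denoising latent, $z_t^{\text{T2V}}$ is the homologous T2V model's latent, and $\lambda_t \in [0, 1]$ is a dynamically selected fusion ratio.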
Both models advance with the fused latent, ensuring framewise and temporal consistency. The overall pipeline is training-free and agnostic to the specific IR method.
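The control flow of this loop can be sketched as follows. This is a toy illustration only: `ir_step` and `t2v_step` are hypothetical stand-ins for one denoising step of each model, the convention that the ratio weights the IR latent is an assumption, and no real diffusion model or VAE is involved.

```python
import numpy as np

def fused_reverse_diffusion(z_init, ir_step, t2v_step, ratios):
    """Run one fused reverse trajectory.

    ir_step, t2v_step : callables (z, t) -> latent of the same shape as z
                        (hypothetical single-step denoisers).
    ratios            : one fusion ratio in [0, 1] per step, first entry used first.
    """
    z = z_init
    num_steps = len(ratios)
    for step, lam in enumerate(ratios):
        t = num_steps - 1 - step          # count timesteps down to 0
        z_ir = ir_step(z, t)              # spatial-detail branch
        z_t2v = t2v_step(z, t)            # temporal-consistency branch
        # Convex fusion in the shared latent space; both branches
        # receive this fused latent as input at the next step.
        z = lam * z_ir + (1.0 - lam) * z_t2v
    return z

# Toy stand-ins so the sketch runs without any model weights.
def toy_ir_step(z, t):
    return 0.5 * z + 1.0

def toy_t2v_step(z, t):
    return 0.5 * z - 1.0

# Latent shaped (frames, channels, height, width).
z_start = np.full((4, 4, 8, 8), 8.0)
restored = fused_reverse_diffusion(z_start, toy_ir_step, toy_t2v_step, [0.5, 0.5, 0.5])
```

Because the fusion is a plain weighted sum, it adds no decode or re-encode step; this only works when both models operate in the same latent space.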
4. Dynamic Fusion Ratio Selection: Chain-of-Thought (COT) Strategy
A principal challenge in homologous latent fusion is determining the fusion weight $\lambda_t$ that gives the best trade-off between spatial detail (IR) and temporal smoothness (T2V). The adaptive COT-based search operates as follows:
- At each timestep $t$, candidate weights centered on the previous step's selected weight are sampled across a small interval.
- For each candidate, the fused latent is decoded to produce a video segment.
- Perceptual (CLIP-IQA) and temporal (Warp Error, WE) metrics are used to rank all candidates; the candidate with the best summed rank is selected as $\lambda_t$.
- This process maintains the stability of spatial details while substantially suppressing flicker.
This strategy replaces heuristic fusion with metric-driven adaptive mixing and is extensible to any diffusion-based method leveraging a shared latent space.
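The candidate search described above can be sketched as a rank-sum selection. The helper names, the interval half-width, the number of candidates, clipping to `[0, 1]`, and first-wins tie-breaking are all assumptions; `fuse_and_decode`, `quality_fn`, and `warp_error_fn` are hypothetical callables standing in for latent decoding, CLIP-IQA (higher is better), and Warp Error (lower is better).

```python
import numpy as np

def ranks(values, higher_is_better):
    """Rank 0 is the best candidate under the given metric."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(-values if higher_is_better else values, kind="stable")
    out = np.empty(len(values), dtype=int)
    out[order] = np.arange(len(values))
    return out

def select_ratio(prev_ratio, fuse_and_decode, quality_fn, warp_error_fn,
                 half_width=0.1, num_candidates=5):
    # Candidates on a small interval centered on the previous step's ratio.
    candidates = np.clip(
        np.linspace(prev_ratio - half_width, prev_ratio + half_width, num_candidates),
        0.0, 1.0)
    videos = [fuse_and_decode(lam) for lam in candidates]
    quality_rank = ranks([quality_fn(v) for v in videos], higher_is_better=True)
    warp_rank = ranks([warp_error_fn(v) for v in videos], higher_is_better=False)
    # Smallest summed rank wins; ties go to the first such candidate.
    best = int(np.argmin(quality_rank + warp_rank))
    return float(candidates[best])

# Toy stand-ins: the "decoded video" is just the ratio itself, and both
# metrics prefer ratios close to 0.6.
chosen = select_ratio(
    prev_ratio=0.5,
    fuse_and_decode=lambda lam: lam,
    quality_fn=lambda v: -(v - 0.6) ** 2,
    warp_error_fn=lambda v: (v - 0.6) ** 2,
)
```

Summing ranks rather than raw scores avoids having to put a perceptual score and a warp error on a common scale.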
5. Empirical Performance and Ablation Results
Protein PSSP (Wang et al., 2021):
On the extremely low-homology subset of the BC40 set (selected by MSA count):
| Method | PSSP Accuracy (%) |
|---|---|
| Low-quality profile | 68.2 |
| Bagging (SOTA) | 70.8 |
| Fusion + consistency | 75.5 |
The fusion model shows a $4.7$ p.p. gain over Bagging and a $7.3$ p.p. gain over the raw profile. Improvements persist at the other MSA-count thresholds evaluated and on other benchmarks.
Video Restoration (Cao et al., 29 Jan 2026):
Ablation on 4× blind video super-resolution (SR), DAVIS benchmark, DiffBIR backbone (HMLF: homologous latents fusion; COT: dynamic fusion-ratio selection):
| Configuration | HMLF | COT | WE↓ | t-LPIPS↓ | PSNR↑ | SSIM↑ |
|---|---|---|---|---|---|---|
| Baseline | ✗ | ✗ | 0.806 | 3.92 | 26.50 | 0.6869 |
| + HMLF only | ✔ | ✗ | 0.696 | 3.36 | 26.69 | 0.6981 |
| + All modules | ✔ | ✔ | 0.376 | 0.41 | 27.42 | 0.7388 |
On zero-shot 4× SR with the PSLD backbone, HMLF reduces WE from $0.8408$ to $0.65$ and t-LPIPS from $6.28$ to $3.8$. With the full pipeline (including dynamic COT), WE drops further to $0.236$ and t-LPIPS to $0.62$, while PSNR and SSIM are preserved or improved.
6. Comparative Context and Distinction from Heterogeneous Latent Fusion
Homologous latents fusion is fundamentally distinct from heterogeneous latents fusion, which addresses cases where latent spaces cannot be directly blended due to architectural discrepancies (e.g., 3D autoencoders in advanced T2V models versus 2D image VAEs). In such cases, latents must be decoded and re-encoded via a compatible VAE before fusion, at significant computational cost and with additional potential for information loss. In contrast, homologous latents fusion leverages structural concordance for computationally efficient, information-preserving fusion relevant to both protein informatics and video restoration pipelines.
7. Significance and Future Directions
Homologous latents fusion provides a principled, empirically validated approach for integrating complementary modalities or priors when models share an aligned latent space. In protein structure prediction, it enables accurate inference in the low-homology regime by dynamically exploiting global and local sequence information. In diffusion video restoration, it forms the backbone of state-of-the-art, training-free pipelines with substantially reduced temporal flicker and enhanced spatial detail. A plausible implication is that the approach may extend to multimodal and transfer learning settings where model architectures can be aligned at the latent level, potentially yielding new paradigms in knowledge fusion and cross-domain adaptation (Wang et al., 2021, Cao et al., 29 Jan 2026).