OrthoTryOn: Geometric Orthogonalization for Conflict-Free Unified Fashion Generation

Published 26 Jun 2026 in cs.CV | (2606.27880v1)

Abstract: Unified fashion generation integrates tasks like virtual try-on and garment reconstruction into a single model to reduce task-specific adaptation costs. However, naive parameter sharing across semantically distinct tasks induces negative transfer through severe inter-task gradient conflict. We propose OrthoTryOn, a unified framework mitigating this interference within a shared Low-Rank Adaptation (LoRA) module. Its Orthogonal Subspace Projection (OSP) applies task-specific orthogonal rotations to bottleneck features, mapping them into decorrelated coordinate frames. To address residual semantic coupling at inference time, we further propose Fisher-guided Negative Guidance (FNG), a parameter-free strategy that utilizes diagonal Fisher information to quantify inter-task sensitivity overlap and explicitly repels generation trajectories from the most confusable task via Classifier-Free Guidance. Extensive experiments demonstrate that OrthoTryOn avoids the severe performance degradation typical of naive unified training and even surpasses independently trained task-specific models, achieving state-of-the-art results across multiple benchmarks while generalizing robustly across diverse diffusion backbones. Code is available at https://github.com/NJU-PCALab/OrthoTryOn.

Authors (6)

Summary

The paper introduces a unified architecture that uses Orthogonal Subspace Projection within a shared LoRA module to mitigate cross-task gradient conflicts.
It employs Fisher-guided Negative Guidance at inference time to suppress residual semantic interference, improving generative fidelity.
Empirical evaluations on virtual try-on, garment reconstruction, and pose transfer demonstrate state-of-the-art performance on perceptual and fidelity metrics.

Introduction

OrthoTryOn addresses the long-standing challenge of integrating multiple fashion generation tasks—virtual try-on (VTON), garment reconstruction, and pose transfer—within a single unified architecture. Conventional approaches to multi-task learning in fashion either maintain task-specific models, yielding large computational and deployment burdens, or attempt naive parameter sharing, which leads to deteriorated generative fidelity due to severe inter-task gradient conflict. OrthoTryOn introduces a structural decoupling strategy based on Orthogonal Subspace Projection (OSP) within a shared Low-Rank Adaptation (LoRA) module, combined with Fisher-guided Negative Guidance (FNG), resulting in conflict-free joint optimization and state-of-the-art performance across all evaluated tasks.

Figure 1: OrthoTryOn is a unified generalist model capable of handling diverse fashion tasks within a single architecture, including virtual try-on, garment reconstruction, and pose transfer. It naturally supports sequential editing by chaining task-specific conditions.

Unified Multi-Task Fashion Generation: Limitations and Design Motivation

Existing unified paradigms, such as Any2AnyTryon and UniFit, employ a single LoRA module for all tasks for computational efficiency. However, the tasks pursue disparate semantic objectives (e.g., strong spatial alignment for VTON versus object-based structural preservation for garment reconstruction), and their gradients interfere destructively in the shared parameters. Empirical analysis shows that gradient norms under naive LoRA sharing are only a fraction of those observed in task-specific training, which indicates gradient cancellation and convergence to suboptimal solutions for all tasks.
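The cancellation effect described above can be illustrated with a toy diagnostic. This is a minimal sketch, not the paper's measurement protocol: the function name `conflict_stats`, the synthetic gradients, and the choice of averaging the two task gradients are all assumptions made for illustration.

```python
import numpy as np

def conflict_stats(g_a, g_b):
    """Return (cosine similarity, norm ratio) for two flattened task gradients.

    The norm ratio compares the averaged (shared) gradient with the mean
    per-task gradient norm; values far below 1 indicate cancellation.
    """
    cos = float(g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b)))
    shared = 0.5 * (g_a + g_b)  # update seen by the shared parameters (assumed averaging)
    mean_norm = 0.5 * (np.linalg.norm(g_a) + np.linalg.norm(g_b))
    ratio = float(np.linalg.norm(shared) / mean_norm)
    return cos, ratio

# Synthetic, hypothetical gradients: two tasks pulling in nearly opposite directions.
rng = np.random.default_rng(0)
common = rng.standard_normal(1000)
g_vton = common + 0.1 * rng.standard_normal(1000)
g_recon = -common + 0.1 * rng.standard_normal(1000)

cos, ratio = conflict_stats(g_vton, g_recon)
print(f"cosine={cos:.3f}, shared/per-task norm ratio={ratio:.3f}")
```

A strongly negative cosine together with a small norm ratio is the signature of the gradient cancellation the text attributes to naive LoRA sharing.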

OrthoTryOn Framework

Orthogonal Subspace Projection (OSP)

OSP is the linchpin of OrthoTryOn, structurally enabling decorrelated joint optimization within a single LoRA module. For each task, an orthogonal rotation matrix $Q_i$ is injected into the LoRA bottleneck, yielding the projected feature $xAQ_iB$. This forward reparameterization ensures that task-specific weight increments are statistically decorrelated in expectation:

$\mathbb{E}[\langle \Delta W_i, \Delta W_j \rangle_F] = 0, \quad i \neq j,$

where $\Delta W_i = AQ_iB$. Gradient-level interference is suppressed with a provable $\mathcal{O}(1/r)$ bound, where $r$ is the LoRA bottleneck rank. These orthogonal rotations are sampled once per task and frozen throughout training, introducing negligible overhead and ensuring isometric transformations in feature space.
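The OSP forward path can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the helper names (`sample_orthogonal`, `osp_forward`, `frobenius_cosine`), the QR-based sampling of $Q_i$, the dimensions, and the random initialization of $B$ (standard LoRA usually starts $B$ at zero) are all hypothetical choices.

```python
import numpy as np

def sample_orthogonal(r, rng):
    """Draw an r x r orthogonal matrix via QR of a Gaussian matrix (assumed sampler)."""
    q, upper = np.linalg.qr(rng.standard_normal((r, r)))
    return q * np.sign(np.diag(upper))  # column sign fix so the draw is uniform

def osp_forward(x, A, B, Q):
    """LoRA update path with a frozen task rotation in the bottleneck: x A Q B."""
    return x @ A @ Q @ B

def frobenius_cosine(W_i, W_j):
    """Normalized Frobenius inner product between two weight increments."""
    return float(np.sum(W_i * W_j) / (np.linalg.norm(W_i) * np.linalg.norm(W_j)))

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 16  # hypothetical sizes
A = rng.standard_normal((d_in, r)) / np.sqrt(d_in)   # shared down-projection
B = rng.standard_normal((r, d_out)) / np.sqrt(r)     # shared up-projection (random here for illustration)

# One rotation per task, sampled once and then kept frozen.
task_rotations = {name: sample_orthogonal(r, rng) for name in ("vton", "recon", "pose")}

x = rng.standard_normal((4, d_in))
y_vton = osp_forward(x, A, B, task_rotations["vton"])

# Empirical check of decorrelation in expectation over independent rotation pairs.
cosines = [
    frobenius_cosine(A @ sample_orthogonal(r, rng) @ B, A @ sample_orthogonal(r, rng) @ B)
    for _ in range(200)
]
mean_cosine = float(np.mean(cosines))
print(f"mean normalized <dW_i, dW_j>_F over 200 draws: {mean_cosine:+.4f}")
```

Because each $Q_i$ is orthogonal, the rotation preserves the norm of the bottleneck feature $xA$, and averaging the normalized inner product over many independent rotation pairs gives a value close to zero, consistent with the expectation stated above.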

Figure 3: OrthoTryOn overview. Orthogonal Subspace Projection utilizes task-specific orthogonal matrices in the shared LoRA module to rotate bottleneck features into decorrelated frames. Fisher-guided Negative Guidance explicitly suppresses the most interfering task to prevent semantic leakage.

Fisher-guided Negative Guidance (FNG)

Despite OSP, residual semantic coupling persists in low-rank LoRA settings (small $r$), particularly when tasks are visually similar. FNG is a parameter-free inference mechanism: it measures the pairwise similarity of task parameter sensitivities using the diagonal Fisher Information Matrix, accumulated offline. During inference, generation trajectories are repelled from the most confusable alternative task via Classifier-Free Guidance (CFG), with the unconditional null prompt replaced by the interfering task condition. This explicitly mitigates semantic leakage, reducing output ambiguity and artifact frequency.
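The two stages of FNG can be sketched as follows. This is a hedged illustration, not the paper's exact procedure: the diagonal Fisher is approximated here as the mean squared per-sample gradient, the similarity is taken to be a cosine, and the guidance step uses the standard CFG extrapolation with the negative-task prediction in place of the null branch. All function names and the synthetic gradients are hypothetical.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Diagonal Fisher estimate: mean of squared per-sample gradients, per parameter."""
    return np.mean(np.square(per_sample_grads), axis=0)

def fisher_cosine(f_a, f_b):
    """Cosine similarity between two diagonal Fisher vectors (assumed overlap measure)."""
    return float(f_a @ f_b / (np.linalg.norm(f_a) * np.linalg.norm(f_b)))

def most_confusable(task, fisher):
    """Pick the other task whose sensitivity profile overlaps most with `task`."""
    others = [t for t in fisher if t != task]
    return max(others, key=lambda t: fisher_cosine(fisher[task], fisher[t]))

def negative_guidance(eps_target, eps_negative, scale):
    """CFG-style extrapolation away from the interfering-task prediction."""
    return eps_negative + scale * (eps_target - eps_negative)

# Synthetic per-sample gradients over 6 parameters (hypothetical sensitivity profiles).
rng = np.random.default_rng(0)
profiles = {
    "vton":  np.array([3.0, 3.0, 1.0, 0.1, 0.1, 0.1]),
    "recon": np.array([3.0, 2.0, 1.0, 0.1, 0.1, 0.2]),
    "pose":  np.array([0.1, 0.1, 0.1, 3.0, 3.0, 1.0]),
}
fisher = {
    name: diagonal_fisher(rng.standard_normal((500, 6)) * scale)
    for name, scale in profiles.items()
}

negative_task = most_confusable("vton", fisher)  # computed offline, once
eps_target = np.array([1.0, 1.0])    # stand-in for the prediction under the target task
eps_negative = np.array([0.0, 0.0])  # stand-in for the prediction under the negative task
guided = negative_guidance(eps_target, eps_negative, scale=2.0)
print(negative_task, guided)
```

In this toy setup the try-on and reconstruction profiles concentrate on the same parameters, so reconstruction is selected as the negative task for try-on; with a guidance scale of 1 the formula reduces to the plain target-task prediction.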

Experimental Evaluation

OrthoTryOn is evaluated on key fashion datasets (VITON-HD for VTON and garment reconstruction; DeepFashion for pose transfer), using a LongCat-Image-Edit backbone. All components are implemented in PyTorch, and all tasks are trained on the same unified dataset.

State-of-the-Art Comparison

OrthoTryOn consistently outperforms both prior unified models and task-specific experts on perceptual and fidelity metrics (LPIPS, SSIM, FID, KID, CLIP-I, DISTS).

Virtual Try-On: OrthoTryOn surpasses prior unified models and independently-trained specialists in FID (8.312), LPIPS (0.064), and KID (0.532). Visual inspection reveals accurate garment texture preservation and minimal artifacts.
Figure 2: Qualitative comparison of virtual try-on results on VITON-HD. OrthoTryOn preserves garment realism and effectively mitigates artifacts.
Garment Reconstruction: OrthoTryOn achieves lower FID (9.563), LPIPS (0.192), and DISTS (0.191), showing accurate recovery of fine-grained texture and structure.
Figure 4: Qualitative comparison of garment reconstruction on VITON-HD. OrthoTryOn strictly preserves complex garment topology and texture.
Pose Transfer: Despite using only sparse skeletons (versus dense priors used by experts), OrthoTryOn attains FID 6.364, outperforming prior work and avoiding blurring and hallucination.

Ablation Studies

Ablation analysis establishes the necessity of each design component. Naive multi-task LoRA leads to severe texture blur and fidelity collapse; random (non-orthogonal) bottleneck projections alone further degrade performance. OSP robustly reduces gradient conflict, outperforming both naive joint training and task-specific experts. FNG is critical for eliminating residual semantic leakage, as evidenced by artifact removal in qualitative ablations.

Figure 5: Qualitative ablation study on variants. OSP alone improves fidelity, while OSP + FNG eradicates both major artifacts and subtle leakages.

Cross-Backbone Generalizability

OSP and FNG exhibit strong generalization. When ported to diverse diffusion architectures (e.g., FLUX.1, Stable Diffusion 2.1, AnyDoor), consistent improvements are observed across all evaluation tasks and metrics. This highlights the architecture-agnostic nature of the proposed design.

Implications and Future Directions

OrthoTryOn advances unified controllable synthesis by showing that structured low-rank parameter geometry, specifically orthogonalization, can suppress negative transfer while encouraging latent positive transfer in multi-task generation. Because the $\mathcal{O}(1/r)$ bound on gradient interference loosens as the rank $r$ shrinks, OSP is expected to yield diminishing returns in extremely low-rank, many-task settings, but the framework is robust in practical parameter regimes. It also demonstrates that multi-modal generation backbones can serve as universal editors without manual task or parameter allocation overhead.

Theoretically, OSP could inform future multi-task adaptation methods across other modalities and tasks by guiding the architectural design of parameter-efficient decoupling mechanisms. Practically, OrthoTryOn streamlines deployment for real-world digital fashion workflows—reducing model storage, maintenance, and switching costs—while delivering superior output quality.

Conclusion

OrthoTryOn introduces a principled, plug-and-play approach for achieving conflict-free unified fashion generation via Orthogonal Subspace Projection and Fisher-guided Negative Guidance. By structurally decoupling tasks within a shared LoRA module and leveraging parameter sensitivity for inference-time disambiguation, it systematically overcomes traditional multi-task learning bottlenecks. As validated by comprehensive experiments, these innovations elevate both performance ceilings and practical scalability for universal conditional image editors in the fashion domain.