- The paper introduces a unified architecture using orthogonal subspace projection within a shared LoRA module to eliminate cross-task gradient conflicts.
- It employs Fisher-guided negative guidance to suppress residual semantic interference, significantly enhancing generative fidelity.
- Empirical evaluations on virtual try-on, garment reconstruction, and pose transfer demonstrate state-of-the-art performance using key perceptual metrics.
OrthoTryOn: Geometric Orthogonalization for Conflict-Free Unified Fashion Generation
Introduction
OrthoTryOn addresses the long-standing challenge of integrating multiple fashion generation tasks, namely virtual try-on (VTON), garment reconstruction, and pose transfer, within a single unified architecture. Conventional approaches to multi-task learning in fashion either maintain task-specific models, yielding large computational and deployment burdens, or attempt naive parameter sharing, which leads to deteriorated generative fidelity due to severe inter-task gradient conflict. OrthoTryOn introduces a structural decoupling strategy based on Orthogonal Subspace Projection (OSP) within a shared Low-Rank Adaptation (LoRA) module, combined with Fisher-guided Negative Guidance (FNG), resulting in conflict-free joint optimization and state-of-the-art performance across all evaluated tasks.
Figure 1: OrthoTryOn is a unified generalist model capable of handling diverse fashion tasks within a single architecture, including virtual try-on, garment reconstruction, and pose transfer. It naturally supports sequential editing by chaining task-specific conditions.
Unified Multi-Task Fashion Generation: Limitations and Design Motivation
Existing unified paradigms, such as Any2AnyTryon and UniFit, employ a single LoRA module for all tasks for computational efficiency. However, disparate semantic objectives (e.g., strong spatial alignment for VTON versus object-based structural preservation for garment reconstruction) pull the shared parameters in competing directions, causing destructive cross-task gradient interference. Empirical analysis reveals that naive LoRA sharing yields gradient norms during optimization that are only a fraction of those observed in task-specific training, which indicates gradient cancellation and convergence to suboptimal solutions for all tasks.
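The cancellation effect described above can be illustrated with a minimal sketch. The gradient vectors, the task names used as variable suffixes, and the `cancellation_ratio` helper are hypothetical and chosen only for illustration; the source does not specify how gradient conflict was measured.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine similarity; negative values mean the gradients oppose each other."""
    return dot(u, v) / (norm(u) * norm(v))


def cancellation_ratio(grads):
    """||sum_i g_i|| / sum_i ||g_i||.

    1.0 means the task gradients are perfectly aligned; values near 0.0
    mean they largely cancel when summed on shared parameters.
    """
    summed = [sum(column) for column in zip(*grads)]
    return norm(summed) / sum(norm(g) for g in grads)


# Hypothetical flattened gradients of two tasks w.r.t. the same shared LoRA weights.
g_vton = [1.0, 0.5, -0.2]
g_recon = [-0.9, -0.4, 0.3]

print("cosine similarity:", cosine(g_vton, g_recon))
print("cancellation ratio:", cancellation_ratio([g_vton, g_recon]))
```

A strongly negative cosine together with a small ratio is the kind of signature that a shrunken joint gradient norm would correspond to: each task produces a sizeable gradient on its own, but the shared update is close to zero.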
OrthoTryOn Framework
Orthogonal Subspace Projection (OSP)
OSP is the linchpin of OrthoTryOn, structurally enabling decorrelated joint optimization within a single LoRA module. For each task $i$, an orthogonal rotation matrix $Q_i$ is injected into the LoRA bottleneck, yielding the projected feature $x A Q_i B$. This forward reparameterization ensures that task-specific weight increments are statistically decorrelated in expectation:
$$\mathbb{E}\left[\langle \Delta W_i, \Delta W_j \rangle_F\right] = 0, \quad i \neq j,$$
where $\Delta W_i = A Q_i B$. Gradient-level interference is suppressed with a provable $O(1/r)$ bound, where $r$ is the LoRA bottleneck rank. These orthogonal rotations are sampled once per task and frozen throughout training, introducing negligible overhead and ensuring isometric transformations in feature space.
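The forward reparameterization can be sketched in a few lines. This is a dependency-free illustration under stated assumptions, not the authors' implementation: the dimensions, the Gram-Schmidt sampler, the Gaussian initialization of the shared factors, and all function names are hypothetical. A PyTorch version would more likely obtain the rotation from a QR decomposition of a Gaussian matrix.

```python
import math
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def random_orthogonal(r, rng):
    """Sample an r x r orthogonal matrix via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(r)]
        for u in rows:  # remove components along the rows accepted so far
            proj = sum(a * b for a, b in zip(u, v))
            v = [a - proj * b for a, b in zip(v, u)]
        n = norm(v)
        if n > 1e-8:  # skip numerically degenerate draws
            rows.append([a / n for a in v])
    return rows


def gaussian_matrix(n_rows, n_cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)
D_IN, R, D_OUT = 6, 4, 5             # hypothetical sizes; R is the LoRA rank r
A = gaussian_matrix(D_IN, R, rng)    # shared down-projection (trainable in practice)
B = gaussian_matrix(R, D_OUT, rng)   # shared up-projection (trainable in practice)
TASKS = ("vton", "garment_reconstruction", "pose_transfer")
# One rotation per task, sampled once and never updated afterwards.
Q = {task: random_orthogonal(R, rng) for task in TASKS}


def delta_w(task):
    """Task-specific weight increment: Delta W_i = A Q_i B."""
    return matmul(matmul(A, Q[task]), B)


def lora_update(x, task):
    """Low-rank branch output x A Q_i B for a single row vector x."""
    return matmul([x], delta_w(task))[0]


x = [rng.gauss(0.0, 1.0) for _ in range(D_IN)]
h = matmul([x], A)[0]                # bottleneck feature x A
h_rot = matmul([h], Q["vton"])[0]    # rotated bottleneck feature x A Q_i
print("bottleneck norm before and after rotation:", norm(h), norm(h_rot))
```

Because each $Q_i$ is orthogonal, the rotation preserves the norm of the bottleneck feature, which is what makes it an isometry; only the frame in which the shared factor $B$ reads the feature changes per task.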
Figure 3: OrthoTryOn overview. Orthogonal Subspace Projection utilizes task-specific orthogonal matrices in the shared LoRA module to rotate bottleneck features into decorrelated frames. Fisher-guided Negative Guidance explicitly suppresses the most interfering task to prevent semantic leakage.
Fisher-guided Negative Guidance (FNG)
Despite OSP, residual semantic coupling persists in low-rank LoRA settings (small $r$), particularly when the visual semantics of two tasks are similar. FNG is a parameter-free inference mechanism: it computes a pairwise task similarity in parameter sensitivity from the diagonal of the Fisher Information Matrix, accumulated offline. During inference, generation trajectories are repelled from the most confusable alternative task by leveraging Classifier-Free Guidance (CFG), substituting the unconditional null-prompt with the interfering task condition. This process explicitly mitigates semantic leakage, reducing output ambiguity and artifact frequency.
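The two stages of FNG can be sketched as follows. The source states only that a similarity is derived from the diagonal Fisher information and that the null-prompt branch of CFG is replaced; the use of cosine similarity, the toy Fisher vectors, the guidance scale, and all function names here are assumptions made for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_interfering_task(fisher_diag, task):
    """Return the other task whose diagonal Fisher vector is most similar to `task`.

    Cosine similarity is an assumed choice of similarity measure.
    """
    others = [t for t in fisher_diag if t != task]
    return max(others, key=lambda t: cosine(fisher_diag[task], fisher_diag[t]))


def negative_guidance(eps_task, eps_neg, scale):
    """CFG-style combination with the null-prompt prediction replaced by eps_neg.

    Standard CFG:  eps_null + scale * (eps_cond - eps_null)
    Here:          eps_neg  + scale * (eps_task - eps_neg)
    """
    return [n + scale * (c - n) for c, n in zip(eps_task, eps_neg)]


# Hypothetical diagonal Fisher estimates per task (accumulated offline in practice).
fisher_diag = {
    "vton": [0.9, 0.1, 0.4, 0.0],
    "garment_reconstruction": [0.8, 0.2, 0.5, 0.1],
    "pose_transfer": [0.1, 0.9, 0.0, 0.6],
}

negative_task = most_interfering_task(fisher_diag, "vton")
print("negative task for vton:", negative_task)

# Toy noise predictions standing in for two forward passes of the denoiser,
# one conditioned on the target task and one on the interfering task.
eps_task = [1.0, 2.0]
eps_neg = [0.5, 1.0]
print("guided prediction:", negative_guidance(eps_task, eps_neg, scale=2.0))
```

With a scale of 1.0 the combination reduces to the task-conditioned prediction; larger scales push the prediction further away from what the interfering task would have produced, which is the repulsion effect described above.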
Experimental Evaluation
OrthoTryOn is evaluated on key fashion datasets (VITON-HD for VTON and garment reconstruction; DeepFashion for pose transfer), using a LongCat-Image-Edit backbone. All components are implemented in PyTorch, and all tasks are trained on the same unified dataset.
State-of-the-Art Comparison
OrthoTryOn consistently outperforms both unified models and task-specific experts on perceptual and fidelity metrics (LPIPS, SSIM, FID, CLIP-I, DISTS).
- Virtual Try-On: OrthoTryOn surpasses prior unified models and independently-trained specialists in FID (8.312), LPIPS (0.064), and KID (0.532). Visual inspection reveals accurate garment texture preservation and minimal artifacts.
Figure 2: Qualitative comparison of virtual try-on results on VITON-HD. OrthoTryOn preserves garment realism and effectively mitigates artifacts.
- Garment Reconstruction: OrthoTryOn achieves lower FID (9.563), LPIPS (0.192), and DISTS (0.191) than the compared methods, showing accurate recovery of fine-grained texture and structure.
Figure 4: Qualitative comparison of garment reconstruction on VITON-HD. OrthoTryOn strictly preserves complex garment topology and texture.
- Pose Transfer: Despite using only sparse skeletons (versus dense priors used by experts), OrthoTryOn attains FID 6.364, outperforming prior work and avoiding blurring and hallucination.
Ablation Studies
Ablation analysis establishes the necessity of each design component. Naive multi-task LoRA leads to severe texture blur and fidelity collapse; random (non-orthogonal) bottleneck projections alone degrade performance further. OSP robustly reduces gradient conflict, outperforming both naive joint training and task-specific experts. FNG is critical for eliminating residual semantic leakage, as evidenced by artifact removal in qualitative ablations.

Figure 5: Qualitative ablation study on variants. OSP alone improves fidelity, while OSP + FNG eradicates both major artifacts and subtle leakages.
Cross-Backbone Generalizability
OSP and FNG exhibit strong generalization. When ported to diverse diffusion architectures (e.g., FLUX.1, Stable Diffusion 2.1, AnyDoor), consistent improvements are observed across all evaluation tasks and metrics. This highlights the architecture-agnostic nature of the proposed design.
Implications and Future Directions
OrthoTryOn fundamentally advances unified controllable synthesis by establishing that structured low-rank parameter geometry (specifically, orthogonalization) can simultaneously suppress negative transfer and encourage latent positive transfer in multi-task generation. While the $O(1/r)$ gradient interference limit of OSP suggests diminishing returns for extremely low-rank, many-task settings, the framework is robust in practical parameter regimes. It also demonstrates that multi-modal generation backbones can become true universal editors without manual task or parameter allocation overhead.
Theoretically, OSP could inform future multi-task adaptation methods across other modalities and tasks by guiding the architectural design of parameter-efficient decoupling mechanisms. Practically, OrthoTryOn streamlines deployment for real-world digital fashion workflows by reducing model storage, maintenance, and switching costs while delivering superior output quality.
Conclusion
OrthoTryOn introduces a principled, plug-and-play approach for achieving conflict-free unified fashion generation via Orthogonal Subspace Projection and Fisher-guided Negative Guidance. By structurally decoupling tasks within a shared LoRA module and leveraging parameter sensitivity for inference-time disambiguation, it systematically overcomes traditional multi-task learning bottlenecks. As validated by comprehensive experiments, these innovations elevate both performance ceilings and practical scalability for universal conditional image editors in the fashion domain.