TeleStyle: Content-Preserving Style Transfer

Updated 4 February 2026
  • TeleStyle is a content-preserving image and video style transfer model that disentangles style cues to maintain precise content details.
  • It employs lightweight LoRA modules within a DiT backbone and a multi-stage curriculum learning framework to achieve state-of-the-art performance.
  • The model supports both image and video stylization, ensuring high temporal consistency and enhanced aesthetic quality in outputs.

TeleStyle is a content-preserving image and video style transfer model that operates by generating stylized outputs based on paired content and style references. The model addresses the central challenge in Diffusion Transformers (DiTs)—the entanglement of content and style information in their latent representations—by isolating and routing style cues while maintaining precise content fidelity. TeleStyle is implemented as a lightweight extension of Qwen-Image-Edit, leveraging robust content retention and effective style modulation, and is trained on a hybrid dataset of curated and synthetic triplets using a multi-stage curriculum continual learning framework. Both image and video stylization are supported, with a specialized video-to-video module ensuring high temporal consistency. TeleStyle achieves state-of-the-art results across quantitative and qualitative benchmarks for style similarity, content constancy, and perceptual aesthetics (Zhang et al., 28 Jan 2026).

1. Model Architecture and Disentanglement

TeleStyle’s architecture builds upon the Qwen-Image-Edit transformer, which itself employs an MMDiT-based Diffusion Transformer backbone augmented with Multi-Scale Rotary Positional Embeddings (MS-RoPE) to facilitate processing of multiple reference images. The model ingests a content reference image and a style reference image, each processed via distinct “patch embedder” networks (lightweight convolution-projection modules) that encode the input images to token sequences $Z_{\rm content}, Z_{\rm style} \in \mathbb{R}^{N\times d}$. These are concatenated channel-wise with the diffusion latent variable $x_t$ and any accompanying (potentially empty) text tokens and supplied to a stack of $N$ DiT blocks.

A central mechanism in TeleStyle is the use of low-rank adaptation (LoRA) modules (rank = 32) inserted into cross-attention and feed-forward layers. These adapters focus style transfer learning within the velocity prediction head $v_\theta(\cdot)$, leaving the broader content-preservation pathway of the frozen base model undisturbed. This enables effective disentanglement of style and content features, a core limitation in prior DiT-based stylization frameworks.
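To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRA-augmented linear layer: a frozen weight plus a scaled low-rank update. The class name `LoRALinear`, the `alpha` scaling factor, and the initialization scheme are illustrative assumptions following the general LoRA recipe; they are not taken from the TeleStyle code.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update B @ A.

    Hypothetical illustration; names and defaults are assumptions.
    """

    def __init__(self, weight, rank=32, alpha=32.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen base weight, never updated
        self.scale = alpha / rank     # common LoRA scaling convention
        # A gets small random values and B starts at zero,
        # so the adapter contributes nothing before training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * l for b, l in zip(base, low_rank)]

    def trainable_params(self):
        # Only A (rank x in) and B (out x rank) would receive gradients.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For a layer with input and output width $d$, the adapter adds $2rd$ trainable parameters against $d^2$ frozen ones, which is why a rank of 32 stays lightweight for wide Transformer layers.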

For video stylization, TeleStyle extends the Wan2.1-1.3B DiT backbone (as used in FullDiT), adopting a similar dual-patch-embedder interface. The positional encoding scheme assigns temporal index 0 to style anchors (the style reference or stylized key frame) and increments through subsequent source video frames (indices $1,\ldots,T-1$), allowing the Transformer’s learned positional dynamics to propagate style coherently across time.
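The index assignment described above can be sketched in a few lines. The function name `temporal_indices` is hypothetical, and the sketch covers only the temporal index itself, not how the backbone turns it into rotary embeddings.

```python
def temporal_indices(num_source_frames):
    """Return temporal positions: 0 for the style anchor, then 1..n for source frames.

    Hypothetical helper; with T positions in total there are T - 1 source frames.
    """
    if num_source_frames < 0:
        raise ValueError("num_source_frames must be non-negative")
    return [0] + list(range(1, num_source_frames + 1))


# A style anchor followed by four source frames.
print(temporal_indices(4))
```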

2. Dataset Construction and Triplet Synthesis

Training robust style transfer models requires diverse and well-matched triplets of content, style, and result images. TeleStyle’s training data comprises:

  • Clean (Curated) Triplets ($D_{\rm collected}$): drawn from sources including OmniConsistency, GPT-4o-generated examples, and manually vetted LoRA community outputs. Following intensive manual filtering, this dataset yields 300,000 triplets covering 30 distinct artistic styles, such as oil, watercolor, and ukiyo-e.
  • Noisy (Synthetic) Triplets ($D_{\rm synthetic}$): 1 million triplets synthesized via a reverse pipeline starting from an in-the-wild stylized target image $I_{\rm target}$. A photorealistic content reference $I_{\rm content}$ is generated through FLUX, and the corresponding style reference $I_{\rm style}$ is extracted using the CDST method and DINOv2 descriptors. Triplets are completed with randomly sampled textual prompts.

Combined, these datasets encompass thousands of style clusters, spanning classical, modern, and digital genres. During preprocessing, content references are aspect-ratio preserved to a minimum edge of 1024 pixels, while style references are center-cropped to squares.
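The preprocessing geometry can be sketched as plain arithmetic. This assumes that "minimum edge of 1024" means the shorter side is rescaled to exactly 1024 pixels and that sizes are rounded to the nearest integer; both function names are hypothetical.

```python
def resize_to_min_edge(width, height, min_edge=1024):
    """Scale (width, height) so the shorter side equals min_edge, keeping aspect ratio."""
    scale = min_edge / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


# Content reference: 1920x1080 becomes 1820x1024.
print(resize_to_min_edge(1920, 1080))
# Style reference: 1920x1080 is cropped to the central 1080x1080 region.
print(center_crop_box(1920, 1080))
```

The returned crop box uses the (left, top, right, bottom) ordering that most image libraries accept for cropping.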

3. Curriculum Continual Learning Paradigm

TeleStyle leverages a curriculum continual learning framework to maximize both style generalization and content fidelity through three sequential training stages, each producing its own set of LoRA weights:

  1. Stage 1: Capability Activation. LoRA parameters are trained on the full collected dataset $D_{\rm collected}$ to acquire general style transfer ability.
  2. Stage 2: Content Fidelity Refinement. The network, initialized from the Stage 1 weights, is fine-tuned on a reweighted subset favoring high-fidelity content preservation (notably facial character and fine structures) to produce the Stage 2 weights.
  3. Stage 3: Robust Generalization. A mixture dataset, formed by blending one training set with approximately 5% of another, is used to further train from the Stage 2 weights, yielding the final weights with improved cross-domain style generalization.
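The staged schedule can be sketched as a loop in which each stage starts from the weights left by the previous one, plus a helper that blends a small fraction of a second dataset into a first. Everything here (`blend`, `run_curriculum`, the `train_fn` callback, and which dataset plays which role) is a hypothetical illustration; only the sequential initialization and the roughly 5% blending ratio come from the description above.

```python
import random


def blend(primary, secondary, secondary_fraction=0.05, seed=0):
    """Return primary plus a random `secondary_fraction` sample of secondary, shuffled."""
    rng = random.Random(seed)
    k = round(len(secondary) * secondary_fraction)
    mixed = list(primary) + rng.sample(list(secondary), k)
    rng.shuffle(mixed)
    return mixed


def run_curriculum(stages, train_fn, init_weights):
    """Train stage by stage; each stage is initialized from the previous result."""
    weights = init_weights
    history = []
    for name, dataset in stages:
        weights = train_fn(weights, dataset)  # train_fn is supplied by the caller
        history.append((name, weights))
    return history
```

Keeping the stage list as data makes it easy to rerun a single stage or to change the blending ratio without touching the training loop.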

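The learning objective in each phase is a rectified flow-matching loss, which in its standard form reads:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\left\lVert v_\theta(x_t, t, c) - (\epsilon - x_0) \right\rVert_2^2\right], \qquad x_t = (1-t)\,x_0 + t\,\epsilon,$$

where $x_t$ is the noisy latent at time $t$, $x_0$ is the target latent, $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise, $c$ is a fixed prompt embedding describing the style transfer task, and $v_\theta$ is the predicted velocity field. Optimization is performed via AdamW with decoupled weight decay.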
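A single-sample version of a rectified flow-matching loss can be sketched in pure Python. It assumes the common convention in which the noisy latent is the linear interpolation $(1-t)\,x_0 + t\,\epsilon$ and the regression target is $\epsilon - x_0$; the convention actually used by TeleStyle may differ, and `velocity_fn` is a hypothetical stand-in for the network.

```python
import random


def interpolate(x0, eps, t):
    """Linear path between the clean latent x0 (t = 0) and noise eps (t = 1)."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]


def flow_matching_loss(velocity_fn, x0, cond, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    t = rng.random()                              # time sampled uniformly in [0, 1)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]       # Gaussian noise, same shape as x0
    x_t = interpolate(x0, eps, t)
    target = [e - a for a, e in zip(x0, eps)]     # d x_t / d t along the linear path
    pred = velocity_fn(x_t, t, cond)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(x0)


# A predictor that always outputs zero velocity incurs a positive loss.
latent = [0.5, -1.0, 2.0]
print(flow_matching_loss(lambda x_t, t, c: [0.0] * len(x_t), latent, None, random.Random(0)))
```

In training, this quantity would be averaged over a batch and minimized with respect to the adapter parameters only.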

4. Video-to-Video Stylization and Temporal Consistency

The TeleStyle-Video module begins with a stylized key frame alongside the source video frames. Separate patch embedders generate feature tokens for the style and each video frame, which, together with the corresponding noisy latents, are processed by the DiT architecture. Positional encoding assigns index 0 to the style anchor and increasing indices to the temporally ordered frames, anchoring the style at the first frame and guiding propagation.

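Temporal consistency is enforced via a flow-matching loss applied between the clean stylized video $x_0$ and noise $\epsilon$, using the linear interpolation $x_t = (1-t)\,x_0 + t\,\epsilon$. In its standard form (conditioning tokens omitted):

$$\mathcal{L}_{\rm video}(\theta) = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\left\lVert v_\theta(x_t, t) - (\epsilon - x_0) \right\rVert_2^2\right]$$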

This loss ensures smooth transitions between frames without requiring explicit optical flow estimation or test-time fine-tuning, preserving local and global stylistic coherence throughout temporal sequences.

5. Benchmarks, Metrics, and Quantitative Performance

TeleStyle is evaluated across three principal dimensions:

  • Style Similarity: Measured by the CSD Score (higher = better style match).
  • Aesthetic Quality: Assessed with the LAION Aesthetic Predictor (higher = more pleasing).
  • Content Preservation: Quantified via a thresholded CPC Score, reported at threshold 0.5 (CPC@0.5) and over the threshold range 0.5:0.9 (CPC@0.5:0.9).

This metric penalizes degenerate outputs with little or no style transfer.

A summary of benchmark performance is presented below:

| Method | CSD ↑ | CPC@0.5 ↑ | CPC@0.5:0.9 ↑ | Aesthetics ↑ |
|---|---|---|---|---|
| CSGO | 0.535 | 0.379 | 0.224 | 5.969 |
| DreamO | 0.402 | 0.193 | 0.102 | 6.149 |
| TeleStyle | 0.577 | 0.441 | 0.304 | 6.317 |

TeleStyle achieves a relative improvement of roughly 7.9% in style similarity (CSD) over CSGO, with CPC content scores increasing by about 16% at CPC@0.5 and 36% at CPC@0.5:0.9, and demonstrates superior aesthetic ratings compared to previous DiT-based stylizers. Qualitative analysis confirms TeleStyle’s strong retention of edges, object shapes, and intricate textures across diverse, including previously unseen, styles.
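The relative gains over CSGO can be recomputed directly from the table values. The snippet below is only a convenience check; the dictionary layout and the `relative_gain` helper are illustrative.

```python
# Benchmark values copied from the table above.
scores = {
    "CSGO": {"CSD": 0.535, "CPC@0.5": 0.379, "CPC@0.5:0.9": 0.224, "Aesthetics": 5.969},
    "DreamO": {"CSD": 0.402, "CPC@0.5": 0.193, "CPC@0.5:0.9": 0.102, "Aesthetics": 6.149},
    "TeleStyle": {"CSD": 0.577, "CPC@0.5": 0.441, "CPC@0.5:0.9": 0.304, "Aesthetics": 6.317},
}


def relative_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return 100.0 * (new - old) / old


gains = {
    metric: relative_gain(scores["TeleStyle"][metric], scores["CSGO"][metric])
    for metric in scores["CSGO"]
}

for metric, gain in gains.items():
    print(f"{metric}: {gain:+.1f}%")
```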

6. Training Procedures and Implementation Specifics

Key training and implementation parameters are as follows:

  • TeleStyle-Image:
    • LoRA rank: 32
    • Base model: Qwen-Image-Edit-2509
    • Gradient checkpointing enabled
    • Minimum image edge: 1024 pixels
    • Hardware: 4 × NVIDIA H100 GPUs, batch size 1/GPU, learning rate 1e-4, 100,000–200,000 LoRA updates per stage
  • TeleStyle-Video:
    • Backbone: Wan2.1-1.3B
    • Training data: synthetic set plus internal filtered clips (filtered via CLIP-based motion assessment)
    • Hardware: 8 × NVIDIA H100 GPUs, batch size 4/GPU, learning rate 1e-5, ~500,000 steps
  • Data Augmentation: Random prompt sampling, random crop/resize of style references, and standard diffusion time schedules
  • Inference: The content reference’s aspect ratio is preserved, and the style reference is resized to a square. The default prompt configuration yields the best stability.
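The two training setups can be captured in a small configuration object, which also makes the implied global batch size explicit. The `TrainConfig` dataclass is a hypothetical container, and the global batch size assumes plain data parallelism without gradient accumulation, which the description above does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed above."""
    name: str
    backbone: str
    num_gpus: int
    batch_per_gpu: int
    learning_rate: float

    @property
    def global_batch(self):
        # Assumption: data parallelism only, no gradient accumulation.
        return self.num_gpus * self.batch_per_gpu


IMAGE = TrainConfig("TeleStyle-Image", "Qwen-Image-Edit-2509", 4, 1, 1e-4)
VIDEO = TrainConfig("TeleStyle-Video", "Wan2.1-1.3B", 8, 4, 1e-5)

for cfg in (IMAGE, VIDEO):
    print(f"{cfg.name}: global batch {cfg.global_batch}, lr {cfg.learning_rate:g}")
```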

7. Significance and Applications

TeleStyle demonstrates that lightweight, LoRA-driven adaptation atop a robust DiT backbone—combined with curriculum-based exposure to clean and synthetic style-content triplets—produces highly generalizable and efficient style transfer in both images and videos. The ability to preserve content fidelity while enabling strong style generalization, coupled with minimal computational overhead and consistent video stylization, positions TeleStyle as an advance in cross-modal stylization research domains. The availability of the codebase and pre-trained models further facilitates investigation, adoption, and extension in style transfer and creative AI workflows (Zhang et al., 28 Jan 2026).
