Text-to-Duet: Dual-Channel Representations
- Text-to-Duet is a design paradigm that converts text into paired, synchronized representations across dual channels, enabling more nuanced modeling in diverse applications.
- The framework employs complementary representations—such as local/distributed or word/entity pairs—to improve performance metrics in search, recommendation, and multimedia retrieval tasks.
- Dual-space control in Text-to-Duet advances applications in video comprehension, fine-grained image retrieval, motion synthesis, and emotional text-to-speech by aligning and synchronizing multiple information streams.
“Text-to-Duet” denotes a family of formulations in which text is transformed into paired, dual-space, or synchronized representations rather than a single monolithic encoding. In the literature, this label has been used for ad-hoc retrieval, recommendation, video understanding, image retrieval, music and speech generation, emotional text-to-speech, and two-person motion synthesis. The common structural motif is a duet between two complementary channels: exact and distributed text signals in web search, words and entities in retrieval, user and item profiles in recommendation, video and text turns in streaming comprehension, sketch and text in fine-grained image retrieval, music and text in duet dance, or hidden-space and mel-space controls in emotional TTS (Mitra et al., 2016, Xiong et al., 2017, Chen et al., 15 Apr 2026, Wang et al., 2024, Koley et al., 2024, Gupta et al., 23 Aug 2025, Zhang et al., 20 May 2026). Taken together, these works suggest that “Text-to-Duet” is not a single standardized algorithm, but a recurring design principle for converting text into coordinated dual representations.
1. Foundational duet formulations in retrieval
The earliest explicit “duet” formulation in this corpus is the web search model that combines local and distributed representations of text. “Learning to Match Using Local and Distributed Representations of Text for Web Search” defines a ranking model with two separate deep neural networks, one matching the query and document using a local representation and another using learned distributed representations, with the final score given by their sum, $f(Q, D) = f_\ell(Q, D) + f_d(Q, D)$ (Mitra et al., 2016). The local branch emphasizes exact term matches, position, and proximity; the distributed branch addresses vocabulary mismatch through learned semantic representations. The model uses the first 10 query terms and the first 1000 document terms, lowercases text, removes non-alphanumeric characters, applies a binary interaction matrix for exact matching, and employs character $n$-graph hashing for the distributed branch (Mitra et al., 2016).
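The local branch's input can be illustrated with a minimal sketch of the binary exact-match interaction matrix. The helper names `tokenize` and `interaction_matrix` are hypothetical, and three details are assumptions rather than statements from the paper: whitespace tokenization, replacing non-alphanumeric runs with spaces (instead of deleting them outright), and zero-padding to a fixed 10 × 1000 shape.

```python
import re

MAX_QUERY_TERMS = 10    # first 10 query terms, as stated in the text
MAX_DOC_TERMS = 1000    # first 1000 document terms, as stated in the text


def tokenize(text, limit):
    """Lowercase, turn non-alphanumeric runs into spaces (assumption), truncate."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return cleaned.split()[:limit]


def interaction_matrix(query, document):
    """Return a zero-padded 10 x 1000 matrix; entry (i, j) is 1 on an exact term match."""
    q_terms = tokenize(query, MAX_QUERY_TERMS)
    d_terms = tokenize(document, MAX_DOC_TERMS)
    matrix = [[0] * MAX_DOC_TERMS for _ in range(MAX_QUERY_TERMS)]
    for i, q in enumerate(q_terms):
        for j, d in enumerate(d_terms):
            if q == d:
                matrix[i][j] = 1
    return matrix


m = interaction_matrix(
    "Duet model",
    "The duet model pairs a local model with a distributed model.",
)
# Row 0 ("duet") matches document position 1; row 1 ("model") matches 2, 6, and 10.
```

Because each entry records where a query term occurs in the document, the matrix preserves the position and proximity information that the local branch is described as emphasizing.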
“Word-Entity Duet Representations for Document Ranking” extends the duet idea by treating entities as first-class signals alongside words. In this formulation, “Text-to-Duet is the process of turning plain text from a query and candidate documents into a joint word–entity representation and using their four-way interactions to score relevance” (Xiong et al., 2017). Queries and documents are represented in word space, with $Q_w$ and $D_w$, and in entity space, with $Q_e$ and $D_e$, where entities are linked by TagMe (Xiong et al., 2017). This yields four interaction types: $Q_w$–$D_w$, $Q_w$–$D_e$, $Q_e$–$D_w$, and $Q_e$–$D_e$. The ranking model scores a query–document pair from features of all four interactions.
The entity-aware formulation combines standard lexical retrieval features, entity textual attributes such as name and description, and TransE-based entity similarity. For entity-to-entity matching, TransE embeddings are used to compute similarities between query and document entities, followed by histogram pooling over six similarity bins (Xiong et al., 2017). AttR-Duet adds an attention mechanism over query entities, using ambiguity features such as entropy, is-most-popular-candidate, CMNS margin, and cosine similarity between entity and query embeddings, with pairwise hinge loss and Nadam optimization (Xiong et al., 2017). On TREC Web Track ad-hoc tasks, AttR-Duet reports NDCG@20 = 0.3197 and ERR@20 = 0.2026 on ClueWeb09-B, and NDCG@20 = 0.1376 and ERR@20 = 0.1154 on ClueWeb12-B13, significantly outperforming both word-based and entity-based learning-to-rank baselines (Xiong et al., 2017).
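Histogram pooling over entity similarities can be sketched as follows. This is only an illustration: the bin edges in `BIN_EDGES` are hypothetical placeholders (five range bins plus one exact-match bin, giving six in total), the function name `histogram_pool` is invented here, and similarities are assumed to lie in [0, 1]. The paper's actual bin boundaries should be taken from the original source.

```python
# Hypothetical edges: five half-open range bins plus one exact-match bin (six total).
BIN_EDGES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def histogram_pool(similarities, edges=BIN_EDGES):
    """Count how many similarity values fall into each bin.

    Bins 0..len(edges)-2 are half-open ranges [edges[k], edges[k+1]);
    the last bin collects values at or above edges[-1] (exact matches).
    Values below edges[0] are ignored.
    """
    counts = [0] * len(edges)
    for s in similarities:
        if s >= edges[-1]:
            counts[-1] += 1
            continue
        for k in range(len(edges) - 1):
            if edges[k] <= s < edges[k + 1]:
                counts[k] += 1
                break
    return counts


# Similarities between one query entity and five document entities (made-up values).
features = histogram_pool([1.0, 0.85, 0.3, 0.31, 0.05])
```

The resulting fixed-length count vector can be fed to a learning-to-rank model regardless of how many entities the document contains, which is the practical appeal of pooling into bins.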
2. Joint textual profiling in recommendation
In recommendation, “DUET: Joint Exploration of User Item Profiles in Recommendation System” uses “text-to-DUET” to mean the construction of paired user and item textual profiles that are generated jointly rather than independently. The problem definition specifies a user history and item-side evidence, and models paired profile generation through compact user and item cues and a discrete constructed profile prompt (Chen et al., 15 Apr 2026).
The procedure has three stages. First, cue creation distills raw histories and metadata into concise hypotheses such as “enjoys retro puzzle games” or “retro aesthetics.” Second, paired profile prompt expansion introduces a shared constructed prompt that parameterizes profile construction logic for both sides. Third, profile generation is optimized with reinforcement learning using a frozen downstream recommender, a continuous reward, and Group Relative Policy Optimization (GRPO) (Chen et al., 15 Apr 2026). The downstream score is induced by the textual profiles, while post hoc semantic alignment is measured by cosine similarity between Sentence-Transformers embeddings of the two profiles (Chen et al., 15 Apr 2026).
This formulation targets two problems stated explicitly in the paper: manually designed templates are brittle, and independently generated user and item profiles may be semantically inconsistent for a specific user–item pair (Chen et al., 15 Apr 2026). Empirically, on Amazon Music, Amazon Books, and Yelp, DUET consistently surpasses KAR, R4Rec, and LG across rating prediction and ranking. With Qwen3-8B, the reported accuracies are 61.23 on Yelp, 67.96 on Music, and 64.38 on Books; corresponding NDCG@10 values are 0.6008, 0.7025, and 0.6599 (Chen et al., 15 Apr 2026). The paper also reports that DUET without RL drops to 48.53% accuracy on Yelp, compared with 61.23% for full DUET, indicating that reinforcement learning is central to the method rather than a minor refinement (Chen et al., 15 Apr 2026).
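The group-relative step that gives GRPO its name can be sketched in a few lines. This shows the commonly described normalization (reward minus group mean, divided by group standard deviation); whether DUET uses exactly this variant is an assumption, as are the use of the population standard deviation, the `eps` constant, and the example reward values.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical continuous rewards for three profile pairs sampled for the same
# user-item pair and scored by a frozen recommender.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
```

Profile pairs that score above the group average receive positive advantages and are reinforced, while below-average pairs are discouraged, without requiring a separate learned value function.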
3. Streaming interaction and trajectory scheduling
A distinct meaning of “Text-to-Duet” appears in video-language modeling. “VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format” formalizes a duet as an alternation between a continuously playing video stream and inserted text turns from the user or assistant (Wang et al., 2024). The stream is represented as a time-indexed sequence of frames, while each text insertion is recorded as a tuple. The input sequence alternates among system, stream, user, and assistant roles, so the video itself functions as a participant that “has the floor” between text turns (Wang et al., 2024).
MMDuet introduces two frame-level signals during streaming: an informative score and a relevance score. A task-specific gating policy decides when the model should respond. For dense captioning, the model accumulates informative scores until a threshold is passed; for MAGQA, it triggers when a frame-level score exceeds a threshold; for highlight detection and temporal grounding, relevance is used directly and smoothed over a local window (Wang et al., 2024). MMDuetIT, the corresponding training set, contains 109k examples, and MAGQA is defined as a task where the assistant must produce multiple answers at appropriate times during continuous video playback (Wang et al., 2024). Reported results include mAP/HIT@1 = 31.3/49.6 on QVHighlights, R@0.5/R@0.7 = 42.4/18.0 on Charades-STA, and SODAc/CIDEr/F1 = 2.9/8.8/21.7 on YouCook2 with repetition control (Wang et al., 2024).
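The dense-captioning gate described above (accumulate informative scores until a threshold is passed) can be sketched as a simple loop. The function name `response_frames`, the reset-to-zero behavior after each response, the threshold value, and the example scores are all assumptions made for illustration, not details confirmed by the paper.

```python
def response_frames(informative_scores, threshold):
    """Return frame indices at which the accumulated informative score passes the threshold."""
    triggers = []
    accumulated = 0.0
    for t, score in enumerate(informative_scores):
        accumulated += score
        if accumulated > threshold:
            triggers.append(t)   # the assistant would insert a text turn here
            accumulated = 0.0    # assumed: restart accumulation after responding
    return triggers


# Hypothetical per-frame informative scores for a six-frame clip.
frames = response_frames([0.1, 0.2, 0.9, 0.1, 0.3, 0.8], threshold=1.0)
```

With these values the gate fires at frames 2 and 5, so the video keeps "the floor" during low-information stretches and the model speaks only once enough new content has accumulated.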
A related scheduling interpretation appears in “DuET: Dual Expert Trajectories for Diffusion Image Editing,” where a temporary text-to-image sub-trajectory is inserted into an image editing trajectory (Troeshestova et al., 11 Jun 2026). The method alternates between an image-conditioned expert and a text-only expert in three phases: early edit mode, a mid-trajectory T2I window, and late edit mode. The paper describes the schedule in terms of the total number of denoising steps and model-agnostic default windows (Troeshestova et al., 11 Jun 2026). The conditioning is piecewise: the image-conditioned expert is used before and after the window, and the text-only expert is used inside it.
This produces a predictable trade-off between source preservation and edit fidelity. For FLUX2-Klein 4B, DuET improves GEdit overall from 7.11 to 7.54 and ImgEdit from 3.74 to 4.04, while SSIM drops from 0.815 to 0.749 (Troeshestova et al., 11 Jun 2026).
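The three-phase schedule can be sketched as a piecewise selector over denoising steps. The function name `expert_for_step`, the expert labels, and the window bounds used in the example (steps 3 to 4 out of 10) are hypothetical; the paper's default windows are not reproduced here.

```python
def expert_for_step(step, t2i_start, t2i_end):
    """Pick the expert for a 0-indexed denoising step.

    Steps in [t2i_start, t2i_end) use the text-only expert; all other steps
    (early and late edit mode) use the image-conditioned expert.
    """
    if t2i_start <= step < t2i_end:
        return "t2i"
    return "edit"


NUM_STEPS = 10  # hypothetical trajectory length
schedule = [expert_for_step(s, t2i_start=3, t2i_end=5) for s in range(NUM_STEPS)]
```

Widening the window hands more steps to the text-only expert, which is consistent with the reported pattern of higher edit scores at the cost of lower SSIM against the source image.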
4. Sketch–text composition for fine-grained image retrieval
In fine-grained image retrieval, “You’ll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval” defines a duet between sketch and text rather than between two textual profiles or a text-and-stream sequence (Koley et al., 2024). The central design choice is to compose the sketch and text inside CLIP’s text domain. A sketch image $s$ is first encoded by CLIP’s image encoder $E_I$, and then mapped by a three-layer MLP $\phi$ into a single pseudo-word token:
$$ w_s = \phi(E_I(s)) $$
This token is concatenated with three learnable prompt vectors $p_1, p_2, p_3$ and optional user text tokens $t$, and passed through CLIP’s frozen text transformer $E_T$:
$$ q = E_T([p_1; p_2; p_3; w_s; t]) $$
The gallery photo embedding is $E_I(x)$ for a gallery photo $x$, and retrieval is based on cosine similarity after $\ell_2$ normalization (Koley et al., 2024).
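The retrieval step (normalize, then rank by cosine similarity) can be sketched in plain Python. The vectors below are toy two-dimensional stand-ins for CLIP embeddings, the function names are hypothetical, and all embeddings are assumed to be non-zero.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, photo in enumerate(gallery):
        p = l2_normalize(photo)
        # After normalization, the dot product equals the cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, p)), idx))
    return [idx for _, idx in sorted(scored, reverse=True)]


# Toy composed query embedding and three toy gallery photo embeddings.
ranking = rank_gallery([1.0, 0.0], [[0.0, 2.0], [3.0, 0.1], [1.0, 1.0]])
```

Normalizing first makes the ranking depend only on direction, so a photo embedding with a large magnitude gains no advantage over one that is better aligned with the query.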
The training procedure uses a compositionality proxy that avoids collecting fine-grained captions. A difference pseudo-word is formed to stand in for the information the sketch lacks, and a compositionality loss enforces that adding this missing-information token brings the representation closer to the paired photo (Koley et al., 2024). The framework also uses neutral-text regularization, text-to-text prompt regularization, a region-aware triplet loss, and an auxiliary reconstruction loss. The total loss combines the compositionality term with these four terms.
The reported results show large gains over Pic2Word and SEARLE. On QMUL-ShoeV2, the proposed system achieves 47.3 / 79.1 for Acc.@5 / Acc.@10, compared with 34.7 / 58.4 for Pic2Word and 38.4 / 64.8 for SEARLE; on Sketchy, it achieves 30.6 / 64.2 (Koley et al., 2024). The ablations identify the compositionality constraint as the largest contributor, with ShoeV2 Acc.@5 dropping from 47.3 to 32.5 when it is removed (Koley et al., 2024). This directly supports the paper’s claim that sketches capture geometry, pose, and part layout, while text contributes attributes such as color, material, and context.
5. Motion, dance, and dual-conditioned generation
In 3D motion generation, “MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning” uses “Text-to-Duet” to denote text-only inference in a model trained with video regularization (Zhang et al., 22 Nov 2025). MotionDuet represents motion as an array of per-frame features, uses a CLIP text encoder for text prompts and a VideoMAE-based encoder for visual features, and fuses them through the DUET module. The architecture includes an FFT branch, a Dynamic Mask Mechanism, and a DASH loss combining token-level and pairwise structural alignment, which enters the total training objective (Zhang et al., 22 Nov 2025).
At inference in text-only mode, the model runs without video input and relies on auto-guidance derived from a weakened copy of the conditioning features (Zhang et al., 22 Nov 2025). On HumanML3D, the text-only setting reports FID = 0.213, MM Dist = 3.176, DIV = 9.540, and MM = 2.464 (Zhang et al., 22 Nov 2025).
A duet formulation also appears in partnered dance. “MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation” defines Text-to-Duet as generating both leader and follower motion from a text description and music features (Gupta et al., 23 Aug 2025). Each sample is a tuple of these inputs and the two motions, and the learning objective is to generate the two motions conditioned on the text and music (Gupta et al., 23 Aug 2025). MDD provides 10.34 hours of professional motion capture, 620.60 minutes, approximately 4.4 million frames, 15 genres, 30 dancers, and more than 10K fine-grained descriptions, with the paper’s table reporting 10,187 descriptions (Gupta et al., 23 Aug 2025). Music is represented either by 54-dimensional MFCCs or 4800-dimensional Jukebox embeddings.
The baseline models are adapted versions of MDM and InterGen. InterGen with Jukebox embeddings is the strongest baseline in the table, with FID = 0.410, MM Dist = 1.396, Diversity = 1.388, MModality = 1.330, BED = 0.454, and BAS = 0.184 (Gupta et al., 23 Aug 2025). The paper explicitly warns that BAS can be spuriously high for jittery motions and states that BED correlates more faithfully with realism (Gupta et al., 23 Aug 2025). This is an important correction to a common misreading of music-conditioned dance metrics: high beat alignment alone is not treated as sufficient evidence of high-quality duet interaction.
6. Music, dialogue speech, and dual-space speech control
Several recent systems use “Text-to-Duet” for audio co-creation. “Sing-On-Your-Beat: Simple Text-Controllable Accompaniment Generations” defines the task as generating a 10-second accompaniment from a vocal track and a natural-language prompt (Trinh et al., 2024). Llambada factorizes the conditional distribution into a semantic stage followed by an acoustic stage.
Semantic tokens come from MERT at 50 Hz with a 1024-entry $k$-means codebook; acoustic tokens come from Encodec at 75 tokens/s; CLAP text embeddings are quantized by 12 quantizers with codebook size 1024 (Trinh et al., 2024). On in-domain evaluation, Llambada reports FAD_VGGish 3.156, FAD_CLAP-music 0.679, FAD_mean 1.918, and CLAP score 0.244, outperforming SingSong and FastSAG (Trinh et al., 2024).
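The stated token rates imply a simple sequence-length budget for a 10-second clip, worked through below. Treating the 75 tokens/s Encodec figure as a per-codebook frame rate is an assumption, and the number of Encodec codebooks is not stated here, so only per-stream lengths are computed.

```python
CLIP_SECONDS = 10        # accompaniment length stated in the text
SEMANTIC_RATE_HZ = 50    # MERT semantic tokens per second
ACOUSTIC_RATE_HZ = 75    # Encodec tokens per second (assumed per codebook)


def sequence_length(rate_hz, seconds=CLIP_SECONDS):
    """Number of token positions produced at `rate_hz` over `seconds` of audio."""
    return rate_hz * seconds


semantic_tokens = sequence_length(SEMANTIC_RATE_HZ)   # 500 positions
acoustic_steps = sequence_length(ACOUSTIC_RATE_HZ)    # 750 positions
```

The acoustic stream is 1.5 times longer than the semantic one, which is one practical motivation for predicting the shorter semantic sequence first and conditioning the acoustic stage on it.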
“Improving Musical Accompaniment Co-creation via Diffusion Transformers” addresses a related accompaniment problem through latent diffusion, a stereo autoencoder, a Diffusion Transformer, and a cross-modality predictive network CLAPβ (Nistal et al., 2024). The system conditions on a music context and optional CLAP embeddings, and uses consistency training for five-step inference. Under 9 conditioning, the +DiT variant reports KD 0.90, FAD 0.33, Coverage 0.73, Density 5.94, APA 1.00, 0 0.53, and 1 0.18; with CLAPβ, Coverage rises to 0.92 and Density to 11.1 (Nistal et al., 2024). The paper’s practical “Text-to-Duet” usage pattern is sequential: generate a lead with CLAP-only conditioning, then use that lead as context to synthesize the companion part (Nistal et al., 2024).
In dialogue speech, “DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching” operationalizes text-to-duet as dual-track speech synthesis from a dialogue script (Xie et al., 9 Oct 2025). DiaLM generates two synchronized semantic token streams, one per speaker, with a special <SIL> token encoding inactivity and overlap emerging when both streams emit non-silence tokens simultaneously (Xie et al., 9 Oct 2025). The token-to-waveform stage is trained with Chunked Flow Matching (Xie et al., 9 Oct 2025).
The training pipeline aggregates 10,000 hours: 3,000 hours of professionally recorded Chinese dialogues, 5,000 hours of spontaneous Chinese podcasts, and 2,000 hours of English Fisher telephone dialogues (Xie et al., 9 Oct 2025). Subjective evaluation shows higher spontaneity than CosyVoice2 and CoVoMix in English, and higher spontaneity than MoonCast and CosyVoice2 in Chinese, while objective intelligibility remains stronger for CosyVoice2 on several metrics (Xie et al., 9 Oct 2025).
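The dual-track layout that DiaLM is described as producing, with <SIL> marking inactivity, can be illustrated by detecting overlap between two aligned streams. The token names and the helper `overlap_positions` are hypothetical; only the <SIL> convention comes from the description above.

```python
SIL = "<SIL>"  # special token marking that a speaker is inactive at this position


def overlap_positions(stream_a, stream_b):
    """Return positions where both synchronized streams emit non-silence tokens."""
    if len(stream_a) != len(stream_b):
        raise ValueError("synchronized streams must have equal length")
    return [
        i
        for i, (a, b) in enumerate(zip(stream_a, stream_b))
        if a != SIL and b != SIL
    ]


# Hypothetical semantic tokens: speaker 2 starts talking before speaker 1 finishes.
speaker_1 = ["s1", "s2", "s3", SIL, SIL]
speaker_2 = [SIL, SIL, "t1", "t2", "t3"]
overlaps = overlap_positions(speaker_1, speaker_2)
```

Here the streams overlap only at position 2, so interruptions and backchannels need no dedicated markup: they fall out of the two streams being aligned position by position.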
Finally, “DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech” applies the duet principle inside a pretrained iterative TTS sampler rather than between two speakers or two musical parts (Zhang et al., 20 May 2026). The paper states that emotion embedding emerges as a linearly decodable direction of frozen hidden states that is nearly orthogonal to the direction embedding speaker identity, as measured on F5-TTS at the shared probing layer (Zhang et al., 20 May 2026). DUET performs a per-step hidden-space steering update and a mel-space refinement, with a cosine mid-trajectory schedule (Zhang et al., 20 May 2026). Across ESD, CREMA-D, and IEMOCAP, four of five DUET+backbone combinations exceed all ten supervised emotional TTS baselines in average SER accuracy, and the method achieves the highest human-rated emotion appropriateness, with EMOS 3.93 versus 3.75 for Qwen3-TTS and 3.32 for CosyVoice2 (Zhang et al., 20 May 2026).
Across these domains, the recurring claim is not merely that text conditions generation, but that text becomes effective when paired with a second structured channel: entities, item evidence, video time, sketches, music, a second speaker, or hidden-state control. This suggests that “Text-to-Duet” names a broader methodological shift from single-stream prompting toward explicitly coupled representations, synchronized trajectories, and dual-space control.