AutoMV: Dual Automation for MV & Refactoring
- AutoMV refers to two distinct systems that share an acronym: a multi-agent pipeline for generating music videos and an LLM-driven assistant for automated Move Method refactoring.
- Its music video pipeline employs modular agents for music feature extraction, script generation, scene planning, and iterative visual verification to ensure narrative coherence.
- The LLM-driven refactoring tool combines chain-of-thought prompting with semantic ranking to deliver fast, accurate Move Method recommendations and improved code quality.
AutoMV denotes two distinct research lines that share an acronym in the recent literature: (1) a multi-agent system for automatic music video (MV) generation from audio and lyrics, and (2) an LLM-driven assistant for automated Move Method refactoring in codebases. Each instantiation addresses autonomous orchestration of multi-step creative or engineering workflows, but their methodologies and technical domains are orthogonal. Both report substantial gains over prior systems on their respective benchmarks.
1. Multi-Agent System for Music Video Generation
AutoMV for Music-to-Video (M2V) generation is a training-free, modular multi-agent pipeline designed to generate full-length, structurally coherent music videos directly from song input by orchestrating specialized agents and off-the-shelf multimodal APIs. It addresses the principal shortcomings of prior approaches, namely the generation of short, disjointed clips with poor alignment to musical structure, limited lyric-visual correlation, and a lack of temporal and narrative continuity (Tang et al., 13 Dec 2025).
The architecture comprises six agentic modules—Music Preprocessor, Screenwriter Agent, Director Agent, Image Generator, Video Generator, and Verifier Agent—and a shared external character bank. Each agent ingests, transforms, and relays domain-specific representations (music features, scripts, prompts, keyframes, etc.) through defined interaction protocols to ensure data and thematic continuity.
System Workflow
The end-to-end generation pipeline is summarized by the following high-level pseudocode excerpt:
```
function AutoMV_Generate(full_song_audio):
    # Stage 1: Music preprocessing (captioning, structure, stems, lyrics)
    music_features = MusicPreprocessor(full_song_audio)

    # Stage 2: Scripting (shot list, script, shared character bank)
    shots, script, char_bank = ScreenwriterAgent(music_features)

    # Stage 3: Shot planning and keyframe prompts
    keyframe_prompts, shot_prompts = DirectorAgent(script, char_bank)

    # Stage 4: Rendering (one keyframe and one clip per shot)
    clips = []
    for i, shot in enumerate(shots):
        keyframe = ImageGenerator(keyframe_prompts[i])
        if shot.requires_lipsync:
            clip = VideoGenerator_LipSync(keyframe, shot_prompts[i],
                                          music_features.vocals)
        else:
            clip = VideoGenerator_Cinematic(keyframe, shot_prompts[i])
        clips.append(clip)

    # Stage 5: Verification and iteration
    final_clips = []
    for i, clip in enumerate(clips):
        candidate = clip
        attempts = 0
        # regenerate failing clips, up to N attempts per shot
        while not VerifierAgent(candidate, script_segment_for(shots[i])) and attempts < N:
            candidate = regenerate(shots[i])
            attempts += 1
        final_clips.append(candidate)  # best available candidate for this shot

    # Stage 6: Assembly
    full_MV = concatenate(final_clips)
    return full_MV
```
2. Music Feature Extraction and Scene Planning
AutoMV performs hierarchical music feature extraction to inform downstream visual planning. Input audio undergoes:
- High-level captioning (genre, mood, instrumentation, vocalist attributes) via Qwen2.5-Omni.
- Structural segmentation (intro/verse/chorus/bridge) and timestamping via SongFormer.
- Vocal/accompaniment stem separation using htdemucs.
- Lyrics transcription and alignment using Whisper, followed by web-based refinement with Gemini.
This layered context is subsequently used by the Screenwriter Agent (Gemini) to define shot boundaries and generate scene descriptions, emotional tones, and character actions. The Director Agent (Doubao) builds upon these to issue detailed camera, environment, and actor instructions, as well as keyframe image prompts.
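The exact interfaces of this stage are not given in the excerpt above; the following is a minimal structural sketch of the hierarchical preprocessing, assuming simple wrapper functions around the named tools (Qwen2.5-Omni, SongFormer, htdemucs, Whisper). The dataclass fields and function names are illustrative, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    label: str      # e.g. "intro", "verse", "chorus", "bridge"
    start_s: float  # segment start (seconds)
    end_s: float    # segment end (seconds)

@dataclass
class MusicFeatures:
    caption: str                     # genre, mood, instrumentation, vocalist attributes
    segments: list[Segment]          # timestamped song structure
    vocals_path: str                 # separated vocal stem
    lyrics: list[tuple[float, str]]  # (timestamp, lyric line) pairs

def caption_audio(path: str) -> str:
    raise NotImplementedError("call Qwen2.5-Omni here")

def segment_structure(path: str) -> list[Segment]:
    raise NotImplementedError("call SongFormer here")

def separate_vocals(path: str) -> str:
    raise NotImplementedError("call htdemucs here")

def transcribe_lyrics(vocals_path: str) -> list[tuple[float, str]]:
    raise NotImplementedError("call Whisper here, then refine")

def preprocess_music(audio_path: str) -> MusicFeatures:
    """Run the four extraction layers and bundle them for downstream agents."""
    vocals = separate_vocals(audio_path)
    return MusicFeatures(
        caption=caption_audio(audio_path),
        segments=segment_structure(audio_path),
        vocals_path=vocals,
        lyrics=transcribe_lyrics(vocals),
    )
```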
3. Visual and Video Synthesis
AutoMV integrates two categories of video generation backends:
- Image Generation: Doubao image API produces keyframes (~1024×1024, upsampled if needed), enforcing character bank constraints for visual continuity.
- Video Generation:
- Doubao cinematic API stitches together 3–8 s subclips for "story" scenes, merging 1–3 subclips per shot.
- Qwen-Wan-2.2 handles lip-sync segments, combining generated facial motion with separated vocals.
Temporal consistency relies on explicit keyframe reuse and shared character descriptors; no learned diffusion loss is used, and soft constraints are imposed via agent coordination.
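To make the soft-constraint idea concrete, here is a small hypothetical sketch of prompt assembly with a shared character bank; the descriptors and helper names are invented for illustration and are not the paper's prompt format.

```python
# Shared character bank: every keyframe prompt is prefixed with the same
# identity descriptors, so all image/video calls see identical constraints.
CHARACTER_BANK = {
    "lead_singer": "woman in her 20s, short silver hair, red leather jacket",
    "guitarist": "tall man, long black hair, denim vest, tattooed forearms",
}

def build_keyframe_prompt(shot_description: str, cast: list[str]) -> str:
    """Prefix a shot's visual description with its characters' fixed descriptors."""
    descriptors = "; ".join(f"{name}: {CHARACTER_BANK[name]}" for name in cast)
    return f"[characters] {descriptors}\n[shot] {shot_description}"

# The same descriptors appear in every shot featuring the lead singer, which
# is what keeps identity consistent across independently generated clips.
prompt = build_keyframe_prompt(
    "close-up under neon rain, singing directly to camera",
    cast=["lead_singer"],
)
```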
4. Verification and Iterative Collaboration
Verification mechanisms are agentic and multi-level, employing Gemini 2.5 Pro for evaluative checks. Video clips and keyframes are assessed for:
- Physical realism (pose, lighting, artifact absence).
- Script adherence (actions, scene, character match).
- Identity continuity (matching character bank).
- Text-image/audio alignment (action–lyric/beat coherence).
Candidates are scored on a 1–5 scale against a tight threshold (e.g., θ = 3.0), with up to 3 candidate generations and 2 retries. The highest-scoring candidate per shot is accepted; failing shots are regenerated.
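A minimal sketch of this accept-or-regenerate policy, assuming a generator callback standing in for the video backend and a judge callback standing in for the Gemini-based verifier (both names are illustrative):

```python
from typing import Callable

def verify_shot(generate: Callable[[], object],
                judge_score: Callable[[object], float],
                threshold: float = 3.0,
                max_attempts: int = 3) -> object:
    """Accept the first candidate above threshold; otherwise return the
    best-scoring candidate once the retry budget is exhausted."""
    best_clip, best_score = None, float("-inf")
    for _ in range(max_attempts):
        clip = generate()          # render one candidate for the shot
        score = judge_score(clip)  # 1-5 rating from the verifier agent
        if score >= threshold:
            return clip            # accepted immediately
        if score > best_score:
            best_clip, best_score = clip, score
    return best_clip               # fall back to best rejected candidate
```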
5. Evaluation Benchmarks and Results
AutoMV introduced a multifaceted evaluation protocol including four high-level categories (Technical, Post-Production, Content, Art) subdivided into twelve criteria. These are assessed by LLM-based automatic judges (Gemini-2.5-Pro/Flash, Gemini-3-Pro, Qwen-Omni) and expert human raters on a 1–5 scale. Quantitative results on a 30-song benchmark:
| Method | Cost/Run | ImageBind (%) | LLM Total | Human Expert Total |
|---|---|---|---|---|
| Revid.ai-base | $10 | 19.9 | 4.0 | 1.1–1.3 |
| OpenArt-story | $20–40 | 18.5 | 4.2 | 1.1–2.4 |
| AutoMV (full) | $10–20 | 24.4 | 4.5 | 2.0–2.9 |
| Human (expert) | >$10k | 24.1 | 4.6 | 2.2–3.8 |
AutoMV outperforms both commercial baselines in audio-visual semantic consistency (ImageBind) and human expert scoring, narrowing the gap to professional human-produced MVs.
Ablation studies show a significant performance drop when the lyric context, character bank, or verifier agent is removed (expert scores fall to 1.8–2.1), underlining their criticality for musical, thematic, and identity continuity (Tang et al., 13 Dec 2025).
6. Key Insights, Limitations, and Future Directions
Music-aware structural planning and scene synthesis are crucial to bridging the gap to expert MVs. The character bank and keyframe reuse drive identity consistency (the character-consistency score degrades from 3.07 to 1.22 when they are removed), and the verifier agent lifts visual and technical quality by approximately 0.7 points. Limitations remain, including occasional physically implausible motion, inconsistent in-scene text rendering, and imperfect lip-sync modeling.
Promising future directions include explicit physics constraints, dedicated beat-to-motion agents for improved choreography, enhanced multimodal evaluators with longer context windows, and text-aware video generative controls.
7. LLM-Based Automated Move Method Refactoring (Alternative Use of “AutoMV”)
A distinct AutoMV system, also known as M²aide, is a fully automated LLM-powered assistant for Move Method refactoring in codebases (Batole et al., 26 Mar 2025). Designed for software engineering workflows, it proceeds in five stages:
- Sanity Filtering: Exclude non-movable or semantically anchored methods via static analysis in IDEs.
- Semantic Ranking: Compute VoyageAI code embeddings and cosine similarity to identify methods that are poorly cohesive with their enclosing class (see the sketch after this list).
- Refactoring-Aware RAG: Two-stage candidate retrieval for method move targets—mechanical feasibility checks followed by semantic and proximity-based ranking.
- Chain-of-Thought LLM Prompting: LLMs (temperature=0) are prompted to rank, critique, and justify move recommendations.
- Presentation and Execution: Recommendations with rationale are surfaced to the developer, who may apply the refactoring in-IDE via formal APIs.
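As referenced in the Semantic Ranking step above, here is a sketch of embedding-based target ranking. It assumes precomputed code embeddings (the paper uses VoyageAI embeddings; the embedding calls themselves are elided) and hypothetical helper names:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_targets(method_vec: np.ndarray,
                 class_vecs: dict[str, np.ndarray],
                 current_class: str) -> list[tuple[str, float]]:
    """Rank candidate target classes by similarity gain over the current class:
    a method far closer to another class than its own is a move candidate."""
    base = cosine(method_vec, class_vecs[current_class])
    gains = [(name, cosine(method_vec, vec) - base)
             for name, vec in class_vecs.items() if name != current_class]
    return sorted(gains, key=lambda t: t[1], reverse=True)
```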
AutoMV’s hallucination filtering excludes nonexistent targets, infeasible moves, and method types that risk semantic breakage. On both synthetic and real OSS benchmark corpora, AutoMV averages 1.7–2.4× higher Recall@1 and Recall@3 than prior systems, with 82.8% positive ratings in user studies over 350 Java classes. Runtime per class is ≈30 seconds, contrasting with the hours required by legacy tools (Batole et al., 26 Mar 2025).
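A minimal sketch of the hallucination-filtering step just described, under the assumption that LLM-proposed targets are validated against the set of classes actually present in the project plus a mechanical feasibility check (function and parameter names are illustrative):

```python
from typing import Callable

def filter_hallucinated_targets(
    proposed: list[str],
    project_classes: set[str],
    is_feasible: Callable[[str], bool],
) -> list[str]:
    """Drop LLM-proposed targets that do not exist in the project or cannot
    mechanically receive the method (e.g., visibility or dependency issues)."""
    return [cls for cls in proposed
            if cls in project_classes and is_feasible(cls)]
```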
8. Conclusion
AutoMV, in both music video synthesis and automated code refactoring, exemplifies the integration of agent-based planning, multimodal processing, and LLM reasoning in creative and engineering pipelines. Each system demonstrates substantive advances: stronger musical/visual semantic alignment in the former, and fast, reliable Move Method recommendations in the latter. Both frameworks highlight the necessity of contextual filtering, verification, and iterative refinement in achieving automated outputs that closely approach human expert performance.