AutoMV: Dual Automation for MV & Refactoring
- AutoMV refers to two distinct systems that share an acronym: a multi-agent pipeline for generating music videos and an LLM-driven assistant for automated Move Method refactoring.
- Its music video pipeline employs modular agents for music feature extraction, script generation, scene planning, and iterative visual verification to ensure narrative coherence.
- The LLM-driven refactoring tool combines chain-of-thought prompting with semantic ranking to deliver fast, accurate Move Method recommendations and improved code quality.
AutoMV denotes two distinct research lines that share an acronym in the recent literature: (1) a multi-agent system for automatic music video (MV) generation from audio and lyrics, and (2) an LLM-driven assistant for automated Move Method refactoring in codebases. Each instantiation addresses autonomous orchestration of multi-step creative or engineering workflows, but their methodologies and technical domains are orthogonal. Both report substantial gains over prior systems on their respective benchmarks.
1. Multi-Agent System for Music Video Generation
AutoMV for Music-to-Video (M2V) generation is a training-free, modular multi-agent pipeline designed to generate full-length, structurally coherent music videos directly from song input by orchestrating specialized agents and off-the-shelf multimodal APIs. It addresses the principal shortcomings of prior approaches, namely the generation of short, disjointed clips with poor alignment to musical structure, limited lyric-visual correlation, and a lack of temporal and narrative continuity (Tang et al., 13 Dec 2025).
The architecture comprises six agentic modules—Music Preprocessor, Screenwriter Agent, Director Agent, Image Generator, Video Generator, and Verifier Agent—and a shared external character bank. Each agent ingests, transforms, and relays domain-specific representations (music features, scripts, prompts, keyframes, etc.) through defined interaction protocols to ensure data and thematic continuity.
System Workflow
The end-to-end generation pipeline is summarized by the following high-level pseudocode excerpt:
```
function AutoMV_Generate(full_song_audio):
    # Stage 1: Music preprocessing (captioning, structure, stems, lyrics)
    music_features = MusicPreprocessor(full_song_audio)

    # Stage 2: Scripting (shot list, script, shared character bank)
    shots, script, char_bank = ScreenwriterAgent(music_features)

    # Stage 3: Shot planning and keyframe prompts
    keyframe_prompts, shot_prompts = DirectorAgent(script, char_bank)

    # Stage 4: Rendering (one keyframe and one clip per shot)
    clips = []
    for i, shot in enumerate(shots):
        keyframe = ImageGenerator(keyframe_prompts[i])
        if shot.requires_lipsync:
            clip = VideoGenerator_LipSync(keyframe, shot_prompts[i],
                                          music_features.vocals)
        else:
            clip = VideoGenerator_Cinematic(keyframe, shot_prompts[i])
        clips.append(clip)

    # Stage 5: Verification and iteration
    final_clips = []
    for i, clip in enumerate(clips):
        candidate = clip
        attempts = 0
        # regenerate failing clips, up to N attempts per shot
        while not VerifierAgent(candidate, script_segment_for(shots[i])) and attempts < N:
            candidate = regenerate(shots[i])
            attempts += 1
        final_clips.append(candidate)  # best available candidate for this shot

    # Stage 6: Assembly
    full_MV = concatenate(final_clips)
    return full_MV
```
2. Music Feature Extraction and Scene Planning
AutoMV performs hierarchical music feature extraction to inform downstream visual planning. Input audio undergoes:
- High-level captioning (genre, mood, instrumentation, vocalist attributes) via Qwen2.5-Omni.
- Structural segmentation (intro/verse/chorus/bridge) and timestamping via SongFormer.
- Vocal/accompaniment stem separation using htdemucs.
- Lyrics transcription and alignment using Whisper, followed by web-based refinement with Gemini.
This layered context is subsequently used by the Screenwriter Agent (Gemini) to define shot boundaries and generate scene descriptions, emotional tones, and character actions. The Director Agent (Doubao) builds upon these to issue detailed camera, environment, and actor instructions, as well as keyframe image prompts.
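The exact interfaces of this stage are not given in the excerpt above; the following is a minimal structural sketch of the hierarchical preprocessing, assuming simple wrapper functions around the named tools (Qwen2.5-Omni, SongFormer, htdemucs, Whisper). The dataclass fields and function names are illustrative, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    label: str      # e.g. "intro", "verse", "chorus", "bridge"
    start_s: float  # segment start (seconds)
    end_s: float    # segment end (seconds)

@dataclass
class MusicFeatures:
    caption: str                     # genre, mood, instrumentation, vocalist attributes
    segments: list[Segment]          # timestamped song structure
    vocals_path: str                 # separated vocal stem
    lyrics: list[tuple[float, str]]  # (timestamp, lyric line) pairs

def caption_audio(path: str) -> str:
    raise NotImplementedError("call Qwen2.5-Omni here")

def segment_structure(path: str) -> list[Segment]:
    raise NotImplementedError("call SongFormer here")

def separate_vocals(path: str) -> str:
    raise NotImplementedError("call htdemucs here")

def transcribe_lyrics(vocals_path: str) -> list[tuple[float, str]]:
    raise NotImplementedError("call Whisper here, then refine")

def preprocess_music(audio_path: str) -> MusicFeatures:
    """Run the four extraction layers and bundle them for downstream agents."""
    vocals = separate_vocals(audio_path)
    return MusicFeatures(
        caption=caption_audio(audio_path),
        segments=segment_structure(audio_path),
        vocals_path=vocals,
        lyrics=transcribe_lyrics(vocals),
    )
```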
3. Visual and Video Synthesis
AutoMV integrates two categories of video generation backends:
- Image Generation: Doubao image API produces keyframes (~1024×1024, upsampled if needed), enforcing character bank constraints for visual continuity.
- Video Generation:
- Doubao cinematic API stitches together 3–8 s subclips for "story" scenes, merging 1–3 subclips per shot.
- Qwen-Wan-2.2 handles lip-sync segments, combining generated facial motion with separated vocals.
Temporal consistency relies on explicit keyframe reuse and shared character descriptors; no learned diffusion loss is used, and soft constraints are imposed via agent coordination.
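To make the soft-constraint idea concrete, here is a small hypothetical sketch of prompt assembly with a shared character bank; the descriptors and helper names are invented for illustration and are not the paper's prompt format.

```python
# Shared character bank: every keyframe prompt is prefixed with the same
# identity descriptors, so all image/video calls see identical constraints.
CHARACTER_BANK = {
    "lead_singer": "woman in her 20s, short silver hair, red leather jacket",
    "guitarist": "tall man, long black hair, denim vest, tattooed forearms",
}

def build_keyframe_prompt(shot_description: str, cast: list[str]) -> str:
    """Prefix a shot's visual description with its characters' fixed descriptors."""
    descriptors = "; ".join(f"{name}: {CHARACTER_BANK[name]}" for name in cast)
    return f"[characters] {descriptors}\n[shot] {shot_description}"

# The same descriptors appear in every shot featuring the lead singer, which
# is what keeps identity consistent across independently generated clips.
prompt = build_keyframe_prompt(
    "close-up under neon rain, singing directly to camera",
    cast=["lead_singer"],
)
```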
4. Verification and Iterative Collaboration
Verification mechanisms are agentic and multi-level, employing Gemini 2.5 Pro for evaluative checks. Video clips and keyframes are assessed for:
- Physical realism (pose, lighting, artifact absence).
- Script adherence (actions, scene, character match).
- Identity continuity (matching character bank).
- Text-image/audio alignment (action–lyric/beat coherence).
Candidates are scored on a 1–5 scale against a tight threshold (e.g., θ = 3.0), with up to 3 candidate generations and 2 retries. The highest-scoring candidate per shot is accepted; failing shots are regenerated.
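A minimal sketch of this accept-or-regenerate policy, assuming a generator callback standing in for the video backend and a judge callback standing in for the Gemini-based verifier (both names are illustrative):

```python
from typing import Callable

def verify_shot(generate: Callable[[], object],
                judge_score: Callable[[object], float],
                threshold: float = 3.0,
                max_attempts: int = 3) -> object:
    """Accept the first candidate above threshold; otherwise return the
    best-scoring candidate once the retry budget is exhausted."""
    best_clip, best_score = None, float("-inf")
    for _ in range(max_attempts):
        clip = generate()          # render one candidate for the shot
        score = judge_score(clip)  # 1-5 rating from the verifier agent
        if score >= threshold:
            return clip            # accepted immediately
        if score > best_score:
            best_clip, best_score = clip, score
    return best_clip               # fall back to best rejected candidate
```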
5. Evaluation Benchmarks and Results
AutoMV introduced a multifaceted evaluation protocol including four high-level categories (Technical, Post-Production, Content, Art) subdivided into twelve criteria. These are assessed by LLM-based automatic judges (Gemini-2.5-Pro/Flash, Gemini-3-Pro, Qwen-Omni) and expert human raters on a 1–5 scale. Quantitative results on a 30-song benchmark:
| Method | Cost/Run | ImageBind (%) | LLM Total | Human Expert Total |
|---|---|---|---|---|
| Revid.ai-base | $10 | 19.9 | 4.0 | 1.1–1.3 |
| OpenArt-story | $20–40 | 18.5 | 4.2 | 1.1–2.4 |
| AutoMV (full) | $10–20 | 24.4 | 4.5 | 2.0–2.9 |
| Human (expert) | >$10k | 24.1 | 4.6 | 2.2–3.8 |
AutoMV outperforms both commercial baselines in audio-visual semantic consistency (ImageBind) and human expert scoring, narrowing the gap to professional human-produced MVs.
Ablation studies show a significant performance drop when the lyric context, character bank, or verifier agent is removed (expert scores fall to 1.8–2.1), underlining their criticality for musical, thematic, and identity continuity (Tang et al., 13 Dec 2025).
6. Key Insights, Limitations, and Future Directions
Music-aware structural planning and scene synthesis are crucial to bridging the gap to expert MVs. The character bank and keyframe reuse drive identity consistency (the character-consistency score degrades from 3.07 to 1.22 when they are removed), and the verifier agent lifts visual and technical quality by approximately 0.7 points. Limitations remain, including occasional physically implausible motion, inconsistent in-scene text rendering, and imperfect lip-sync modeling.
Promising future directions include explicit physics constraints, dedicated beat-to-motion agents for improved choreography, enhanced multimodal evaluators with longer context windows, and text-aware video generative controls.
7. LLM-Based Automated Move Method Refactoring (Alternative Use of “AutoMV”)
A distinct AutoMV system, also known as M²aide, is a fully automated LLM-powered assistant for Move Method refactoring in codebases (Batole et al., 26 Mar 2025). Designed for software engineering workflows, it proceeds in five stages:
- Sanity Filtering: Exclude non-movable or semantically anchored methods via static analysis in IDEs.
- Semantic Ranking: Compute VoyageAI code embeddings and cosine similarity to identify methods that are poorly cohesive with their enclosing class (see the sketch after this list).
- Refactoring-Aware RAG: Two-stage candidate retrieval for method move targets—mechanical feasibility checks followed by semantic and proximity-based ranking.
- Chain-of-Thought LLM Prompting: LLMs (temperature=0) are prompted to rank, critique, and justify move recommendations.
- Presentation and Execution: Recommendations with rationale are surfaced to the developer, who may apply the refactoring in-IDE via formal APIs.
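As referenced in the Semantic Ranking step above, here is a sketch of embedding-based target ranking. It assumes precomputed code embeddings (the paper uses VoyageAI embeddings; the embedding calls themselves are elided) and hypothetical helper names:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_targets(method_vec: np.ndarray,
                 class_vecs: dict[str, np.ndarray],
                 current_class: str) -> list[tuple[str, float]]:
    """Rank candidate target classes by similarity gain over the current class:
    a method far closer to another class than its own is a move candidate."""
    base = cosine(method_vec, class_vecs[current_class])
    gains = [(name, cosine(method_vec, vec) - base)
             for name, vec in class_vecs.items() if name != current_class]
    return sorted(gains, key=lambda t: t[1], reverse=True)
```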
AutoMV’s hallucination filtering excludes nonexistent targets, infeasible moves, and method types that risk semantic breakage. On both synthetic and real OSS benchmark corpora, AutoMV averages 1.7–2.4× higher Recall@1 and Recall@3 than prior systems, with 82.8% positive ratings in user studies over 350 Java classes. Runtime per class is ≈30 seconds, contrasting with the hours required by legacy tools (Batole et al., 26 Mar 2025).
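A minimal sketch of the hallucination-filtering step just described, under the assumption that LLM-proposed targets are validated against the set of classes actually present in the project plus a mechanical feasibility check (function and parameter names are illustrative):

```python
from typing import Callable

def filter_hallucinated_targets(
    proposed: list[str],
    project_classes: set[str],
    is_feasible: Callable[[str], bool],
) -> list[str]:
    """Drop LLM-proposed targets that do not exist in the project or cannot
    mechanically receive the method (e.g., visibility or dependency issues)."""
    return [cls for cls in proposed
            if cls in project_classes and is_feasible(cls)]
```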
8. Conclusion
AutoMV, in both music video synthesis and automated code refactoring, exemplifies the integration of agent-based planning, multimodal processing, and LLM reasoning in creative and engineering pipelines. Each system demonstrates substantive advances: stronger musical/visual semantic alignment in the former, and fast, reliable Move Method recommendations in the latter. Both frameworks highlight the necessity of contextual filtering, verification, and iterative refinement in achieving automated outputs that closely approach human expert performance.