LPM 1.0: Video-based Character Performance Model
Abstract: Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.
Explain it Like I'm 14
Overview: What this paper is about
This paper introduces LPM 1.0 (Large Performance Model), an AI system that can turn a single picture of a person into a long, realistic video of that person having a conversation—speaking, listening, reacting, and showing emotions—in real time. The goal is to make on-screen characters feel like real, attentive conversational partners, not just moving mouths.
The big questions the researchers asked
- How can we make a character’s performance feel expressive and natural (not just lip sync), while still running fast enough for live conversations?
- How do we keep the character’s identity consistent over long videos, so the face and style don’t drift or change?
- Can one system handle both sides of a conversation—speaking and listening—at the same time, and do it forever without breaking?
The authors call this the “performance trilemma”: balancing expressiveness, real-time speed, and long-term stability all at once.
How they approached the problem
To solve the trilemma, the team designed the whole system end-to-end, not just a single model. Here’s the approach in simple terms:
- Data built for conversation: They carefully collected and filtered video and audio of people talking and listening. They paired “speaking” clips with matching “listening” clips, added descriptions about performance (like emotions and reactions), and gathered multiple reference images of each person to lock in identity. Think of it like giving the AI a scrapbook of a person plus examples of how people behave in real conversation.
- A large “Base” model that learns performance: They trained a very big AI model (about 17 billion parameters) called a Diffusion Transformer.
- Diffusion, simplified: Imagine starting with a noisy, blurry video and gradually “cleaning it up,” guided by clues like voice, text, and identity images. That’s how the model creates realistic motion frame by frame.
- Transformer: A type of AI good at paying attention to many signals at once (voice, text, images) and using context over time.
- Multimodal conditioning: The model takes in different inputs—audio for timing and tone, text prompts for motion control, and reference images for identity—and combines them to generate the performance.
- Teaching a smaller, faster “Online” model: They “distilled” the big model into a lighter, streaming version.
- Distillation: Like a student learning from a teacher—copying the teacher’s behavior to become quicker at test time.
- Causal streaming generator: It makes video one moment at a time in order, so it can run live with low delay, and can keep going as long as needed.
- Real-time, full-duplex conversation: At run time, you give the system:
- A character image and identity references,
- Audio from the user to generate listening reactions,
- Synthesized speech audio to generate speaking behavior,
- Optional text prompts to nudge motions or style (for example, “more energetic gestures”).
- The system then creates smooth, identity-consistent video of the character both listening and speaking, with natural timing and expressions, at real-time speed.
- A new test called LPM-Bench: They built a benchmark to measure how well systems perform in interactive character settings—tracking things like lip sync, expressiveness, reaction timing, and identity stability across long videos.
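The causal, chunk-by-chunk generation described above can be sketched as a toy loop. Everything here is illustrative, not the paper's actual architecture: the chunk sizes, the rolling context window, and the single stand-in "denoising" step are assumptions chosen to show the key property, namely that each video chunk is produced from noise conditioned only on past frames plus the current audio.

```python
import numpy as np

CHUNK_FRAMES = 4      # frames generated per step (illustrative)
CONTEXT_FRAMES = 8    # rolling window of past frames kept as conditioning
FRAME_DIM = 16        # stand-in for a latent frame, not real pixels

rng = np.random.default_rng(0)

def denoise_step(noisy, audio_feat, context):
    """Stand-in for one conditioned denoising pass (not a real diffusion model)."""
    cond = audio_feat.mean() + (context.mean() if context.size else 0.0)
    # Pull the noisy sample toward the conditioning signal.
    return 0.5 * noisy + cond

def stream_generate(audio_chunks):
    """Yield video chunks causally: each chunk sees only past frames + current audio."""
    context = np.empty((0, FRAME_DIM))
    for audio_feat in audio_chunks:
        noisy = rng.standard_normal((CHUNK_FRAMES, FRAME_DIM))
        frames = denoise_step(noisy, audio_feat, context)
        # Keep only a bounded window of history, so generation can run indefinitely.
        context = np.concatenate([context, frames])[-CONTEXT_FRAMES:]
        yield frames

audio = [rng.standard_normal(8) for _ in range(5)]   # 5 incoming audio chunks
video = list(stream_generate(audio))
print(len(video), video[0].shape)  # 5 chunks of shape (4, 16)
```

The bounded context window is what lets a streaming generator run "forever": memory and per-chunk compute stay constant no matter how long the session gets, which is the property the distilled Online model needs for live use.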
What they found and why it matters
- It resolves the “performance trilemma” in practice: By co-designing the data, model, and streaming system together, LPM 1.0 can be expressive, fast, and stable over long periods—at the same time.
- State-of-the-art results: On their new LPM-Bench and other tests, LPM 1.0 outperforms previous systems across key measures, while still running in real time.
- Identity stays consistent: Even during long, continuous generation, the character’s face and overall look don’t drift.
- Natural conversational behavior: The system doesn’t just move lips; it listens, reacts, shows micro-expressions, and times responses in a socially believable way.
This matters because it turns video generation into a reliable “visual engine” for interactive characters—useful for virtual assistants, live stream avatars, and game NPCs that need to feel present and responsive.
What it could lead to next
- Short term: Better virtual presenters, streamers, and game characters that can look and feel engaged, not just animated.
- Longer term: The authors point to three growth areas:
- Time: Remembering what happened earlier so a character’s behavior stays consistent across long conversations and stories.
- Social: Handling multi-person chats—tracking who’s talking, who’s being addressed, and how to share turns smoothly.
- Physical: Making characters aware of 3D spaces, objects, and actions so their behavior fits the environment.
They also note limits: LPM 1.0 currently focuses on a single, camera-facing person and does not yet maintain long-term memory or model complex environments. But it shows that high-quality, real-time, long-running conversational video is possible today, paving the way for richer, more believable interactive characters.
Knowledge Gaps
Below is a single, consolidated list of concrete gaps, limitations, and open questions that remain unresolved in the paper. These are framed to guide future research and experimentation.
- Generalization beyond single, camera-facing characters: how to handle arbitrary viewpoints, moving cameras, occlusions, and full-body actions without losing identity stability or real-time performance.
- Multi-party interaction: methods for addressee detection, gaze allocation, overlap handling, interruption/barge-in management, and group-level turn-taking in full-duplex settings.
- Long-horizon discourse memory: mechanisms for persistent persona, recall of prior events across sessions, and behavioral coherence over hours-long interactions (including drift detection and correction).
- Physical/world grounding: integrating 3D scene understanding, object interactions, contact dynamics, and consistent behavior under environment changes.
- End-to-end actor modeling: feasibility and benefits of replacing the current pipeline (LLM → TTS → video generation → stabilization) with unified models that jointly decide what to say, how to say it, and how to perform.
- Robustness to real-world audio: performance under background noise, reverberation, variable microphones, latency jitter, packet loss, and cross-talk in live scenarios.
- Language and accent coverage: support and evaluation across languages, accents, code-switching, and speech disorders, including prosody, phoneme coverage, and lip–audio synchronization fidelity.
- Listening vs speaking asymmetry: the system generates listening video from user audio but speaking video from synthesized audio—can it generate high-quality speaking video from arbitrary user-recorded speech, and what are the trade-offs?
- Expressive control granularity: limits of text-prompted motion control; how to specify fine-grained timing, intensity, and multimodal constraints when audio and text prompts conflict.
- Emotion and affect modeling: reliability of emotion/micro-expression control, temporal consistency of affect, and disentanglement between identity, style, and emotional state.
- Identity-aware references: required quantity, diversity (poses/expressions), and quality of references; robustness to low-quality or mismatched references; automatic reference selection strategies and their effect on identity stability.
- Identity leakage and memorization: risk of training–test identity overlap, face memorization, and unintended identity reconstruction; protocols to measure and mitigate identity leakage.
- Infinite-length generation claims: quantitative evidence for stability over very long sessions (e.g., multi-hour) and strategies for preventing cumulative drift or artifacts.
- Distillation trade-offs: formal characterization of quality/latency trade-offs between the 17B Base LPM and the Online LPM; boundaries where streaming degradation becomes unacceptable.
- Hardware and scalability: resource requirements (GPU type, memory), throughput/latency scaling under different hardware, energy consumption, and feasibility on edge or mobile devices.
- Failure modes and recovery: detection and recovery from desynchronization, identity drift, facial/pose glitches, and temporal artifacts during live streaming without restarting the session.
- Dataset transparency: detailed composition, licensing, consent, demography, languages, recording conditions, and potential biases; whether the dataset is released and how others can reproduce or audit it.
- Data sourcing and filtering: validation of “strict filtering” and “speaking–listening pairing” procedures; label quality, noise rates, and their impact on downstream performance.
- Cross-cultural nonverbal behavior: coverage and evaluation of culturally specific gestures, gaze norms, and interpersonal distance cues; methods for culture-aware performance.
- LPM-Bench validity: public release, task and metric definitions, inter-rater reliability of human evaluations, correlation with user satisfaction, and robustness to gaming.
- Out-of-distribution generalization: performance on identities, styles, camera setups, and environments not seen in training; stress tests for extreme poses, lighting, occlusions, and attire.
- Safety and misuse prevention: concrete mechanisms for consent verification, anti-impersonation safeguards, watermarking, detection of synthetic content, and abuse mitigation in real-time systems.
- Ethical personalization: controls for persona shaping without stereotyping; user agency and transparency when adapting style/emotion to user preferences or histories.
- Integration with conversational reasoning: aligning performance timing and affect with LLM reasoning latency; strategies for anticipation and backchanneling when language generation is uncertain or delayed.
- Evaluation of social legibility: standardized measures for listening behavior, timing of nods/backchannels, and appropriateness of reactions; cross-comparison with human baselines.
- Learning from interaction: online adaptation or reinforcement learning from user feedback to improve timing, expressivity, and social cues without destabilizing identity or safety.
- Compositional control: combining multiple constraints (text, audio prosody, explicit motion cues) at once; conflict resolution and priority schemes with predictable outcomes.
- Privacy guarantees: formal privacy analyses (e.g., PII leakage, membership inference) for both training data and user-provided media during live sessions.
- Content moderation: real-time detection and handling of unsafe, copyrighted, or disallowed content in audio prompts, generated speech, and video performance.
- Reproducibility and openness: clarity on the availability of code, models, and benchmarks; seeds and protocols for replicable latency and quality measurements across labs.
Practical Applications
Immediate Applications
Below are practical, deployable use cases that can be implemented with LPM 1.0 as described (single, camera-facing character; identity-stable, real-time, full‑duplex audio-visual conversation; multimodal conditioning; Online LPM for streaming).
- Customer support and sales avatars (Industry: software, retail, telecom)
- Use case: Real-time, on-website or in-app customer service reps that speak and react with natural micro-expressions and attentive listening while the user talks.
- Workflow/product: LLM (dialogue) → TTS (agent speech) → Online LPM (speaking video) + user audio → Online LPM (listening video); delivered via WebRTC in contact center platforms (e.g., Genesys, Twilio Flex).
- Dependencies/assumptions: Reliable low-latency TTS/ASR; GPU inference for real-time; licensed identity/likeness; single-person, camera-facing framing.
- Virtual presenters for e-commerce and marketing (Industry: advertising, e-commerce)
- Use case: Brand avatars that deliver product demos, promotions, and FAQs with consistent identity and expressive behavior.
- Workflow/product: CMS + promptable motion control (text prompts) to align gestures with brand guidelines; LPM powers visual layer.
- Dependencies/assumptions: Scripted content pipeline; style-safe motion prompts; brand compliance and watermarking.
- VTubers and live-streaming co-hosts (Industry: media/entertainment)
- Use case: Automated co-hosts or digital doubles that can listen, react, and present content live; creators maintain persona without being on camera.
- Workflow/product: OBS/VTuber suite plugin using Online LPM; integrates chat-driven LLM, TTS, and face-reference conditioning for identity stability.
- Dependencies/assumptions: Adequate compute on creator’s PC or cloud; identity/voice rights; moderation tools.
- Dynamic game NPCs with in-session dialogue (Industry: gaming)
- Use case: In-game characters that react to player speech and deliver lines with synchronized expressions in cutscenes or dialogue hubs.
- Workflow/product: Unreal/Unity plugin; LLM for branching dialogue + TTS → LPM for NPC face/body performance; streamed to players.
- Dependencies/assumptions: Mostly frontal shots or kiosk-like NPCs; latency budgets for cloud or edge GPUs; limited viewpoint changes.
- Education: conversational tutors and language partners (Sector: education)
- Use case: On-demand tutors that maintain identity, provide feedback, and model conversational norms (turn-taking, backchannels) in real time.
- Workflow/product: LMS integration; curriculum-aligned prompts for motion; real-time listening behaviors to encourage learner engagement.
- Dependencies/assumptions: Age-appropriate safety filters; culturally sensitive nonverbal behaviors; institutional approval.
- Dubbing and ADR for postproduction (Industry: film/TV, localization)
- Use case: Generate lip-synchronized, identity-consistent talking footage from dubbed audio for reshoots or multi-language releases.
- Workflow/product: Audio (dubbed) → LPM speaking video; multi-reference identity inputs to match actor; batch or near-real-time render.
- Dependencies/assumptions: Rights from actors/unions; high-quality voice tracks; camera-facing constraints; postproduction QC.
- Corporate training simulations (Industry: HR, L&D)
- Use case: Scenario-based training (e.g., negotiation, customer care) with lifelike interlocutors showing appropriate affect and reactions.
- Workflow/product: LPM-driven avatars integrated in scenario engines; evaluation rubrics track learner interaction.
- Dependencies/assumptions: Curated scripts; safety guardrails; GPU serving costs.
- Telepresence with privacy-preserving avatars (Industry: enterprise software)
- Use case: Replace user’s live video with a consistent avatar that mirrors speech and basic expressions while masking identity.
- Workflow/product: Video conferencing plugin; user audio → LPM speaking/listening videos; optional style identity template.
- Dependencies/assumptions: Consent from participants; bandwidth matching; robustness to varied acoustics.
- Digital concierges and kiosks (Industry: hospitality, retail, public services)
- Use case: On-prem interactive screens with natural attendants that listen and respond with socially legible behaviors.
- Workflow/product: Edge device with GPU; ASR/TTS stack; multilingual prompts; on-device LPM streaming.
- Dependencies/assumptions: Edge compute and thermal budgets; noise-robust ASR; adherence to accessibility standards.
- Human–computer interaction research and benchmarking (Academia/Industry R&D)
- Use case: Evaluate agents’ turn-taking, gaze, and affect using LPM-Bench; study nonverbal behaviors and user perception.
- Workflow/product: Adopt LPM-Bench as a standardized evaluation suite; extend with task-specific metrics.
- Dependencies/assumptions: Reproducible test harness; IRB approvals for user studies where needed.
- Accessibility and social skills rehearsal (Sector: healthcare/education)
- Use case: Role-play partners for practicing interviews, presentations, or social cues with responsive feedback.
- Workflow/product: Therapy or coaching apps integrating LPM avatars; session scripts + motion prompts.
- Dependencies/assumptions: Clinical oversight; content sensitivity; data privacy.
- Developer tools and integrations (Software tooling)
- Use case: SDKs/plugins for Unreal/Unity, OBS, and Web APIs to embed LPM streaming in products.
- Workflow/product: Inference servers with REST/WebSocket; prebuilt workflows for LLM+TTS+LPM; monitoring dashboards.
- Dependencies/assumptions: GPU autoscaling; SLA targets; compliance logging/watermarking.
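The LLM → TTS → Online LPM wiring that recurs in the workflows above can be sketched with stub functions. Every function and interface below is hypothetical, since the paper does not publish an API; the point is only the shape of the full-duplex loop, where listening video is driven by the user's audio and speaking video by the synthesized reply.

```python
def llm_reply(user_text):
    """Stub for the dialogue model; a real deployment calls an LLM API."""
    return f"Echo: {user_text}"

def tts(text):
    """Stub TTS: returns fake 'audio' chunks (here, 4-character slices of text)."""
    return [text[i:i + 4] for i in range(0, len(text), 4)]

def lpm_stream(audio_chunks, mode):
    """Stub video model: returns labeled 'frames' for each incoming audio chunk."""
    return [f"{mode}-frame[{a}]" for a in audio_chunks]

def converse(user_text, user_audio_chunks):
    # While the user talks: generate listening behavior from their audio.
    listening = lpm_stream(user_audio_chunks, mode="listen")
    # Then: the LLM decides the reply, TTS voices it, the video model renders speech.
    reply = llm_reply(user_text)
    speaking = lpm_stream(tts(reply), mode="speak")
    return listening + speaking

frames = converse("hi there", ["uh", "huh"])
print(len(frames))
```

In a real integration the three stubs would be replaced by streaming calls (e.g., over WebSocket or WebRTC), and listening and speaking generation would overlap rather than run sequentially; the sequential version here is only for readability.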
Long-Term Applications
These opportunities require further research, scaling, or system integration beyond LPM 1.0’s current scope (e.g., multi-party interaction, 3D/world grounding, long-horizon memory).
- Unified “actor models” that co-decide content and performance (Industry: software, media)
- Vision: A single model that plans dialogue, voice, and audiovisual performance jointly over time.
- Sector tools/products: End-to-end agent frameworks replacing separate LLM/TTS/LPM modules; story-driven performance engines for games and film.
- Dependencies/assumptions: Large-scale multimodal training; controllable long-horizon narrative memory; safety and interpretability.
- Multi-party, group-aware conversational agents (Industry: collaboration, education)
- Vision: Agents handling addressee tracking, gaze allocation, and group turn-taking in meetings and classrooms.
- Sector tools/products: Meeting assistants that coordinate with multiple participants; group tutoring avatars.
- Dependencies/assumptions: Social dynamics datasets; evaluation frameworks for group interaction; improved real-time gaze/navigation control.
- 3D/scene-grounded embodied characters (Industry: robotics, AR/VR, gaming)
- Vision: Characters that understand and act within 3D environments, with consistent behavior from arbitrary viewpoints and during movement/contact.
- Sector tools/products: AR assistants; robot faces/screens synchronized with physical actions; VR telepresence with world-locked avatars.
- Dependencies/assumptions: Strong 3D priors, multi-view training, body/scene contact modeling; latency-robust sensor fusion.
- Telemedicine and behavioral health avatars (Sector: healthcare)
- Vision: Empathetic clinician avatars with long-term persona persistence and session-spanning memory.
- Sector tools/products: Intake triage, adherence coaching, CBT role-play with regulated oversight.
- Dependencies/assumptions: Clinical validation, regulatory compliance (HIPAA/GDPR), bias mitigation, crisis-handling protocols.
- Personalized, persistent education companions (Sector: education)
- Vision: Lifelong learning avatars that remember learner history and adapt teaching style with consistent persona and nonverbal scaffolding.
- Sector tools/products: District-wide deployments integrated with SIS/LMS; analytics for engagement.
- Dependencies/assumptions: Data privacy and consent; robust long-horizon memory; content alignment with curricula.
- Film/TV virtual actors and scalable localization (Industry: media/entertainment)
- Vision: Digital performers generated from scripts and direction, across languages and markets, with contract-compliant likeness use.
- Sector tools/products: Previsualization to final render pipelines; studio-grade performance control UIs.
- Dependencies/assumptions: Union agreements and rights management; fine-grained creative control; high-fidelity 3D consistency.
- Real-time social robots and kiosks with world awareness (Industry: robotics, public services)
- Vision: On-device performance synchronized with physical gestures and environmental context.
- Sector tools/products: Service robots with expressive faces; smart counters in transport hubs.
- Dependencies/assumptions: Edge accelerators; reliable multimodal perception; safety in public spaces.
- Privacy-first consumer avatars for video calls and social media (Daily life)
- Vision: Consistent personal avatars that replace camera feeds while preserving natural social signaling.
- Sector tools/products: Cross-platform avatar layers for Zoom/Meet/Discord; mobile-optimized streaming.
- Dependencies/assumptions: Efficient on-device inference; standards for disclosure and watermarking; user acceptance.
- Sign language and multimodal accessibility interpreters (Sector: accessibility, public services)
- Vision: High-accuracy sign language generation and expressive interpreting avatars, leveraging full-body performance.
- Sector tools/products: Live interpretation on kiosks and broadcasts; training tools.
- Dependencies/assumptions: Specialized datasets and linguistic modeling; cultural/linguistic validation; full-body capture support.
- Standards, audits, and policy frameworks for AI performance (Policy/Standards)
- Vision: Benchmarks (e.g., LPM-Bench lineage) adopted for procurement, disclosure, and safety audits of interactive character systems.
- Sector tools/products: Certification suites; watermarking/traceability protocols; incident reporting pipelines.
- Dependencies/assumptions: Multi-stakeholder consensus; integration with deepfake policy and digital likeness rights; regulatory updates.
- Edge and mobile deployment at scale (Industry: hardware/software)
- Vision: Low-power, on-device streaming for consumer apps and IoT screens.
- Sector tools/products: Quantized/distilled LPM variants; hardware co-design with NPUs.
- Dependencies/assumptions: Model compression without quality loss; thermal and battery constraints; offline operation modes.
- Cross-cultural nonverbal communication libraries (Academia/Industry)
- Vision: Region-specific gesture/affect models to avoid cultural miscommunication.
- Sector tools/products: Locale-aware motion prompt packs; evaluation datasets per culture.
- Dependencies/assumptions: Diverse data collection with consent; expert annotation; ongoing calibration.
Notes on feasibility and dependencies
- Technical: Online LPM delivers low latency but still requires capable GPUs; pipeline relies on high-quality ASR/TTS and stable network streaming. Current system assumes single, camera-facing characters and lacks robust multi-view/3D grounding.
- Legal/ethical: Strict licenses for identity/voice; watermarking/disclosure; alignment and moderation for safe deployment; compliance with data privacy regulations.
- Operational: Monitoring for drift in long sessions; fallback behaviors for network jitter; QoS management for live deployments.
- Data: Identity-aware multi-reference inputs are crucial for stability; domain adaptation may be needed for accents, cultures, and environments not covered in training.
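A quick worked example of the latency point above: at a given frame rate, each generated chunk has a hard wall-clock budget, and the ASR/TTS and network stages consume part of it before the video model runs. The numbers below are assumptions for illustration, not measurements from the paper.

```python
# Back-of-envelope real-time budget for streaming video generation.
FPS = 25            # assumed output frame rate
CHUNK = 5           # assumed frames generated per chunk

# A chunk of CHUNK frames must be ready in under CHUNK / FPS seconds,
# or playback stalls.
chunk_budget_ms = 1000 * CHUNK / FPS          # 200 ms per 5-frame chunk

network_ms, tts_ms = 40, 60                   # assumed per-chunk overheads
model_budget_ms = chunk_budget_ms - network_ms - tts_ms
print(model_budget_ms)  # 100.0 ms left for video generation per chunk
```

Shrinking the chunk size lowers end-to-end latency but also shrinks the per-chunk budget proportionally, which is why distilled few-step generators (rather than the full Base model) are needed for live deployments.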
Glossary
- actor models: Unified models that jointly decide what to say and how to express it across time and modalities. "may give way to more unified actor models that jointly determine what is said, how it is expressed, and how behavior unfolds over time."
- addressee tracking: The capability to determine which participant a speaker is addressing in multi-party interaction. "such as addressee tracking, gaze allocation, and group-level turn-taking"
- causal streaming generator: A generator that produces outputs online with causal dependence on past inputs, enabling low-latency streaming. "and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction."
- Diffusion Transformer: A generative architecture combining diffusion processes with Transformer networks for high-quality synthesis. "train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning;"
- discourse-level memory: Mechanisms that maintain and use conversational context over extended interactions. "longer interactions will require discourse-level memory, persona persistence, and the ability to make current behavior coherent with prior events."
- full-duplex: The ability to handle simultaneous speaking and listening behaviors in conversation. "focusing on single-person full-duplex audio-visual conversational performance."
- gaze allocation: The regulation of eye gaze among interlocutors or targets in social interaction. "such as addressee tracking, gaze allocation, and group-level turn-taking"
- identity-aware multi-reference extraction: Collecting multiple reference cues tied to a specific identity to condition generation for consistency. "identity-aware multi-reference extraction;"
- identity-aware references: Reference inputs that explicitly encode a character’s identity for conditioning at inference time. "given a character image with identity-aware references,"
- identity-stable: Maintaining consistent character identity across long generated sequences. "all at real-time speed with identity-stable, infinite-length generation."
- long-horizon: Spanning long durations while preserving coherence and stability over time. "long-horizon video generation."
- LPM-Bench: A benchmark designed to evaluate interactive character performance comprehensively. "we propose LPM-Bench, the first benchmark for interactive character performance."
- micro-expressions: Subtle, rapid facial movements conveying fleeting emotions. "speaking, listening, micro-expressions, and natural motion"
- multimodal conditioning: Conditioning a generative model on multiple input modalities (e.g., audio, text, images) to control outputs. "through multimodal conditioning;"
- multi-party coordination: Managing behaviors and interactions among multiple participants simultaneously. "multi-party coordination"
- online stabilization: Real-time techniques to maintain temporal and visual consistency during streaming generation. "language generation, speech synthesis, audiovisual rendering, and online stabilization"
- performance trilemma: The tension among expressive quality, real-time inference, and long-horizon stability in performance modeling. "a tension we call the performance trilemma."
- persona persistence: Preserving a character’s enduring traits and style across extended interactions. "persona persistence"
- pipeline decomposition: Structuring the system as distinct stages (e.g., language, speech, rendering, stabilization) rather than a single monolith. "the current pipeline decomposition---language generation, speech synthesis, audiovisual rendering, and online stabilization---"
- scene geometry: The 3D spatial structure of an environment that informs how characters should behave and move. "must ground their behavior in scene geometry, objects, and contact."
- systems-level co-design: Coordinating design choices across data, models, and infrastructure to meet end-to-end goals. "admits a workable resolution through systems-level co-design."
- turn-taking: The social mechanism by which speakers alternate conversational turns, including group-level dynamics. "group-level turn-taking"
- visual fidelity: The perceived realism and quality of generated visuals. "maintaining visual fidelity across streaming and long-horizon video generation."
- world consistency: Maintaining a coherent and persistent representation of the 3D world across viewpoints and actions. "strong 3D and world consistency under arbitrary viewpoints and actions."