
LPM 1.0: Video-based Character Performance Model

Published 9 Apr 2026 in cs.CV, cs.AI, and cs.MM | (2604.07823v1)

Abstract: Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.

Summary

  • The paper introduces LPM 1.0, a novel system that resolves the performance trilemma by ensuring expressiveness, real-time operation, and long-horizon identity preservation.
  • It leverages a 17B-parameter Diffusion Transformer and a distilled causal streaming generator to synchronize audiovisual cues and maintain behavioral coherence.
  • Extensive benchmarking via LPM-Bench confirms state-of-the-art performance in visual fidelity, latency, and interactive conversational engagement.

LPM 1.0: A Multimodal System for Conversational Character Performance

Introduction

LPM 1.0 introduces a novel approach to audiovisual character performance modeling, targeting high-fidelity, identity-consistent, and temporally stable video generation for conversational agents. The authors identify the "performance trilemma," which characterizes the existing trade-off between expressiveness, real-time operation, and long-horizon identity preservation. Their central premise is that performance realism in conversational avatars must be solved holistically, across data, conditioning, model architecture, streaming, and stabilization, rather than through isolated architectural improvements. LPM 1.0 is designed as a visual engine underpinning virtual agents, live-streaming digital characters, and interactive game NPCs, and achieves state-of-the-art results along the axes of expressiveness, latency, and stability (Figure 1).

Figure 1: LPM 1.0 generates identity-consistent conversational video with synchronized verbal and non-verbal behaviors---speaking, listening, micro-expressions, and natural motion---while maintaining visual fidelity across streaming and long-horizon video generation.

System and Model Architecture

LPM 1.0 consists of two principal models: a 17B-parameter Diffusion Transformer (Base LPM) and a distilled causal streaming generator (Online LPM) for low-latency, infinite-length generation. Multimodal conditioning is a primary focus, leveraging identity-aware references, aligned audio-video pairs, and prompt-based motion control. The pipeline builds on comprehensive curation of human-centric conversational data, with strict filtering to capture naturalistic and expressive behavior. Identity stability and controllable generation are achieved by integrating multi-reference inputs, text prompts, and multimodal fusion into the architecture.

Base LPM is trained for highly controllable, identity-locked video synthesis; it is subsequently distilled into a deployment-ready causal generator that supports streaming, real-time interaction, and unbounded generation horizons without identity drift or loss of persona. The resulting Online LPM is optimized for latency and system-level throughput.
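
The paper does not publish implementation details at this level, but a minimal sketch helps make the conditioning idea concrete. The PyTorch block below shows one common way a Diffusion Transformer layer can inject fused audio, text, and identity-reference tokens through cross-attention; the module names, dimensions, and the concatenation-based fusion are illustrative assumptions rather than LPM 1.0's actual architecture.

```python
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    """Illustrative DiT block: noisy video tokens attend to fused multimodal conditions."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, video_tokens, cond_tokens):
        # Spatio-temporal self-attention over the noisy video tokens.
        h = self.norm1(video_tokens)
        x = video_tokens + self.self_attn(h, h, h)[0]
        # Cross-attention injects the fused audio / text / identity conditions.
        x = x + self.cross_attn(self.norm2(x), cond_tokens, cond_tokens)[0]
        x = x + self.mlp(self.norm3(x))
        return x

def fuse_conditions(audio_emb, text_emb, identity_emb):
    """Concatenate per-modality token sequences into a single conditioning stream."""
    return torch.cat([audio_emb, text_emb, identity_emb], dim=1)

# Toy shapes: batch 1, 16 video tokens, hypothetical embedding width 64.
block = ConditionedDiTBlock(dim=64)
video = torch.randn(1, 16, 64)
cond = fuse_conditions(torch.randn(1, 8, 64),   # audio tokens
                       torch.randn(1, 4, 64),   # text-prompt tokens
                       torch.randn(1, 2, 64))   # identity-reference tokens
denoised_update = block(video, cond)            # shape (1, 16, 64)
```

In this simplified view, distillation into the Online LPM would replace the full bidirectional denoiser with a causal variant that only attends to past video tokens, which is what enables chunk-by-chunk streaming.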

Dataset Design and Benchmarking

A key innovation is the large-scale multimodal dataset construction, tailored for the target use case of conversational character performance. The dataset is curated through a pipeline of audio-video synchrony verification, identity-aware annotation, and speaking-listening data pairing, enabling the model to learn both verbal and non-verbal modes of expression, micro-expressions, and natural head/eye movements. Data curation explicitly targets behavioral legibility in a social context, rather than generic speech-driven animation.
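
A hedged sketch of what such a curation step could look like is shown below; the fields, thresholds, and pairing rule are hypothetical stand-ins for the paper's strict filtering and speaking-listening pairing, not the authors' actual procedure.

```python
from dataclasses import dataclass, field

@dataclass
class Clip:
    clip_id: str
    speaker_id: str
    has_clear_face: bool        # person detected and frontal enough to read expression
    av_sync_error_ms: float     # estimated audio-video synchrony error
    is_speaking: bool           # speaking segment vs. listening segment
    reference_frames: list = field(default_factory=list)  # identity-aware references

def passes_strict_filter(clip: Clip, max_sync_error_ms: float = 40.0) -> bool:
    """Keep only clips with a clearly visible face and acceptable A/V synchrony."""
    return clip.has_clear_face and clip.av_sync_error_ms <= max_sync_error_ms

def pair_speaking_listening(clips):
    """Pair each identity's speaking segments with listening segments of the same identity."""
    by_speaker = {}
    for c in clips:
        if not passes_strict_filter(c):
            continue
        buckets = by_speaker.setdefault(c.speaker_id, {"speak": [], "listen": []})
        buckets["speak" if c.is_speaking else "listen"].append(c)
    pairs = []
    for buckets in by_speaker.values():
        pairs.extend(zip(buckets["speak"], buckets["listen"]))
    return pairs
```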

Evaluation is systematically formalized via LPM-Bench, a bespoke benchmark to assess interactive character performance holistically. LPM-Bench transcends traditional metrics such as lip synchronization or image quality, instead operationalizing the social legibility and engagement of the generated characters, their behavioral coherence, and their ability to sustain believable interaction over extended discourse.
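
The paper summary does not spell out LPM-Bench's metric definitions, but one widely used ingredient of identity-stability evaluation is the cosine similarity between per-frame face embeddings and a reference embedding. The sketch below illustrates that idea only; the embedding source and the scoring rule are assumptions, not LPM-Bench's actual protocol.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_stability(frame_embeddings, reference_embedding):
    """Mean cosine similarity of per-frame face embeddings to the reference identity.
    Identity drift in long generations shows up as the per-frame trace decaying."""
    trace = [cosine(f, reference_embedding) for f in frame_embeddings]
    return float(np.mean(trace)), trace

# Toy example: random vectors stand in for a face encoder's embeddings.
rng = np.random.default_rng(0)
reference = rng.normal(size=512)
frames = [reference + 0.1 * rng.normal(size=512) for _ in range(100)]
score, trace = identity_stability(frames, reference)
print(f"identity stability: {score:.3f}")
```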

Strong Empirical Results

LPM 1.0 is reported to achieve state-of-the-art performance across the axes defined by LPM-Bench: expressiveness, real-time inference, and long-horizon generation, alongside identity preservation and behavioral consistency. The model reaches real-time streaming rates while maintaining high visual fidelity, synchronized verbal and non-verbal expression, and robust identity stability across long interactions. These results indicate that the performance trilemma is effectively resolved within the paper's single-person, camera-facing setting.

Limitations

Despite its practical efficacy, LPM 1.0 has clear scope bounds. It is restricted to single, camera-facing interlocutors and does not attempt environmental grounding or multi-agent coordination. Discourse-level memory, persistent persona, adaptive turn-taking, and continuous world consistency are likewise not addressed. The decomposed pipeline---spanning language generation, speech synthesis, audiovisual rendering, and online stabilization---limits naturalness in scenarios requiring a unified, cross-modal policy or deep environmental and physical grounding.

Future Directions and Implications

Extending LPM 1.0 along three major axes is proposed: temporal (support for long-horizon memory and persona), social (multi-party, addressee-aware, and group interaction), and physical (behavior grounded in dynamic scenes and interaction with objects). The authors anticipate a paradigm shift toward unified actor models capable of end-to-end determination of verbal, paraverbal, and non-verbal behavior, yielding more holistic, context-sensitive, and continuous characters. Methodological implications include a move from modular pipeline composition to joint modeling of language, audio, and video, where streaming and stability are emergent rather than post-hoc effects.

The system's current utility for real-time, high-quality agent visualizations in live-streaming, gaming, and conversational AI underscores its practical impact. Ongoing research must address problems of interactive memory, multi-agent orchestration, and embodied simulation to achieve generalizable, fully-situated interactive character models.

Conclusion

LPM 1.0 represents a scalable, multimodal approach to video-based character performance modeling that operationalizes identity-consistent, expressive, and temporally stable audiovisual character generation. By systemically addressing the performance trilemma, the model demonstrates state-of-the-art character interaction in real-time while motivating future integrated research on long-horizon, socially-intelligent, and physically-embedded AI actors.

Explain it Like I'm 14

Overview: What this paper is about

This paper introduces LPM 1.0 (Large Performance Model), an AI system that can turn a single picture of a person into a long, realistic video of that person having a conversation—speaking, listening, reacting, and showing emotions—in real time. The goal is to make on-screen characters feel like real, attentive conversational partners, not just moving mouths.

The big questions the researchers asked

  • How can we make a character’s performance feel expressive and natural (not just lip sync), while still running fast enough for live conversations?
  • How do we keep the character’s identity consistent over long videos, so the face and style don’t drift or change?
  • Can one system handle both sides of a conversation—speaking and listening—at the same time, and do it forever without breaking?

The authors call this the “performance trilemma”: balancing expressiveness, real-time speed, and long-term stability all at once.

How they approached the problem

To solve the trilemma, the team designed the whole system end-to-end, not just a single model. Here’s the approach in simple terms:

  • Data built for conversation: They carefully collected and filtered video and audio of people talking and listening. They paired “speaking” clips with matching “listening” clips, added descriptions about performance (like emotions and reactions), and gathered multiple reference images of each person to lock in identity. Think of it like giving the AI a scrapbook of a person plus examples of how people behave in real conversation.
  • A large “Base” model that learns performance: They trained a very big AI model (about 17 billion parameters) called a Diffusion Transformer.
    • Diffusion, simplified: Imagine starting with a noisy, blurry video and gradually “cleaning it up,” guided by clues like voice, text, and identity images. That’s how the model creates realistic motion frame by frame.
    • Transformer: A type of AI good at paying attention to many signals at once (voice, text, images) and using context over time.
    • Multimodal conditioning: The model takes in different inputs—audio for timing and tone, text prompts for motion control, and reference images for identity—and combines them to generate the performance.
  • Teaching a smaller, faster “Online” model: They “distilled” the big model into a lighter, streaming version.
    • Distillation: Like a student learning from a teacher—copying the teacher’s behavior to become quicker at test time.
    • Causal streaming generator: It makes video one moment at a time in order, so it can run live with low delay, and can keep going as long as needed.
  • Real-time, full-duplex conversation: At run time, you give the system:
    • A character image and identity references,
    • Audio from the user to generate listening reactions,
    • Synthesized speech audio to generate speaking behavior,
    • Optional text prompts to nudge motions or style (for example, “more energetic gestures”).
    • The system then creates smooth, identity-consistent video of the character both listening and speaking, with natural timing and expressions, at real-time speed (a simplified version of this loop is sketched just after this list).
  • A new test called LPM-Bench: They built a benchmark to measure how well systems perform in interactive character settings—tracking things like lip sync, expressiveness, reaction timing, and identity stability across long videos.
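
If you like code, here is a toy version of that run-time loop. Everything in it is a made-up stand-in (the class, method names, and prompt text are not LPM 1.0's real API), and a real full-duplex system would blend speaking and listening behaviors rather than hard-switching between them.

```python
import time

class OnlineLPMStub:
    """Hypothetical stand-in for the streaming generator; not the paper's API."""
    def __init__(self, character_image, identity_refs):
        self.character_image = character_image
        self.identity_refs = identity_refs

    def next_frames(self, audio_chunk, mode, text_prompt=None):
        # Causally generate the next short span of video conditioned on the
        # audio chunk, the role (speaking vs. listening), and an optional prompt.
        return [f"<frame:{mode}>"] * 4  # placeholder frames

def conversation_loop(lpm, get_user_audio, get_agent_audio, show, steps=10):
    for _ in range(steps):
        user_chunk = get_user_audio()    # microphone input from the user
        agent_chunk = get_agent_audio()  # TTS output for the character (may be empty)
        if agent_chunk:                  # character is talking: speaking video
            frames = lpm.next_frames(agent_chunk, mode="speaking")
        else:                            # character is silent: listening reactions
            frames = lpm.next_frames(user_chunk, mode="listening",
                                     text_prompt="attentive, occasional nods")
        show(frames)
        time.sleep(0.05)                 # placeholder pacing; a real loop tracks frame rate

# Usage with trivial stand-ins:
lpm = OnlineLPMStub(character_image="portrait.png", identity_refs=["ref1.png", "ref2.png"])
conversation_loop(lpm,
                  get_user_audio=lambda: b"user-audio",
                  get_agent_audio=lambda: b"",   # character is listening in this demo
                  show=lambda frames: None,
                  steps=3)
```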

What they found and why it matters

  • It resolves the “performance trilemma” in practice: By co-designing the data, model, and streaming system together, LPM 1.0 can be expressive, fast, and stable over long periods—at the same time.
  • State-of-the-art results: On their new LPM-Bench and other tests, LPM 1.0 outperforms previous systems across key measures, while still running in real time.
  • Identity stays consistent: Even during long, continuous generation, the character’s face and overall look don’t drift.
  • Natural conversational behavior: The system doesn’t just move lips; it listens, reacts, shows micro-expressions, and times responses in a socially believable way.

This matters because it turns video generation into a reliable “visual engine” for interactive characters—useful for virtual assistants, live stream avatars, and game NPCs that need to feel present and responsive.

What it could lead to next

  • Short term: Better virtual presenters, streamers, and game characters that can look and feel engaged, not just animated.
  • Longer term: The authors point to three growth areas:
    • Time: Remembering what happened earlier so a character’s behavior stays consistent across long conversations and stories.
    • Social: Handling multi-person chats—tracking who’s talking, who’s being addressed, and how to share turns smoothly.
    • Physical: Making characters aware of 3D spaces, objects, and actions so their behavior fits the environment.

They also note limits: LPM 1.0 currently focuses on a single, camera-facing person and doesn’t deeply track long-term memories or complex environments yet. But it shows that high-quality, real-time, long-running conversational video is possible today, paving the way for richer, more believable interactive characters.

Knowledge Gaps

Below is a single, consolidated list of concrete gaps, limitations, and open questions that remain unresolved in the paper. These are framed to guide future research and experimentation.

  • Generalization beyond single, camera-facing characters: how to handle arbitrary viewpoints, moving cameras, occlusions, and full-body actions without losing identity stability or real-time performance.
  • Multi-party interaction: methods for addressee detection, gaze allocation, overlap handling, interruption/barge-in management, and group-level turn-taking in full-duplex settings.
  • Long-horizon discourse memory: mechanisms for persistent persona, recall of prior events across sessions, and behavioral coherence over hours-long interactions (including drift detection and correction).
  • Physical/world grounding: integrating 3D scene understanding, object interactions, contact dynamics, and consistent behavior under environment changes.
  • End-to-end actor modeling: feasibility and benefits of replacing the current pipeline (LLM → TTS → video generation → stabilization) with unified models that jointly decide what to say, how to say it, and how to perform.
  • Robustness to real-world audio: performance under background noise, reverberation, variable microphones, latency jitter, packet loss, and cross-talk in live scenarios.
  • Language and accent coverage: support and evaluation across languages, accents, code-switching, and speech disorders, including prosody, phoneme coverage, and lip–audio synchronization fidelity.
  • Listening vs speaking asymmetry: the system generates listening video from user audio but speaking video from synthesized audio—can it generate high-quality speaking video from arbitrary user-recorded speech, and what are the trade-offs?
  • Expressive control granularity: limits of text-prompted motion control; how to specify fine-grained timing, intensity, and multimodal constraints when audio and text prompts conflict.
  • Emotion and affect modeling: reliability of emotion/micro-expression control, temporal consistency of affect, and disentanglement between identity, style, and emotional state.
  • Identity-aware references: required quantity, diversity (poses/expressions), and quality of references; robustness to low-quality or mismatched references; automatic reference selection strategies and their effect on identity stability.
  • Identity leakage and memorization: risk of training–test identity overlap, face memorization, and unintended identity reconstruction; protocols to measure and mitigate identity leakage.
  • Infinite-length generation claims: quantitative evidence for stability over very long sessions (e.g., multi-hour) and strategies for preventing cumulative drift or artifacts.
  • Distillation trade-offs: formal characterization of quality/latency trade-offs between the 17B Base LPM and the Online LPM; boundaries where streaming degradation becomes unacceptable.
  • Hardware and scalability: resource requirements (GPU type, memory), throughput/latency scaling under different hardware, energy consumption, and feasibility on edge or mobile devices.
  • Failure modes and recovery: detection and recovery from desynchronization, identity drift, facial/pose glitches, and temporal artifacts during live streaming without restarting the session.
  • Dataset transparency: detailed composition, licensing, consent, demography, languages, recording conditions, and potential biases; whether the dataset is released and how others can reproduce or audit it.
  • Data sourcing and filtering: validation of “strict filtering” and “speaking–listening pairing” procedures; label quality, noise rates, and their impact on downstream performance.
  • Cross-cultural nonverbal behavior: coverage and evaluation of culturally specific gestures, gaze norms, and interpersonal distance cues; methods for culture-aware performance.
  • LPM-Bench validity: public release, task and metric definitions, inter-rater reliability of human evaluations, correlation with user satisfaction, and robustness to gaming.
  • Out-of-distribution generalization: performance on identities, styles, camera setups, and environments not seen in training; stress tests for extreme poses, lighting, occlusions, and attire.
  • Safety and misuse prevention: concrete mechanisms for consent verification, anti-impersonation safeguards, watermarking, detection of synthetic content, and abuse mitigation in real-time systems.
  • Ethical personalization: controls for persona shaping without stereotyping; user agency and transparency when adapting style/emotion to user preferences or histories.
  • Integration with conversational reasoning: aligning performance timing and affect with LLM reasoning latency; strategies for anticipation and backchanneling when language generation is uncertain or delayed.
  • Evaluation of social legibility: standardized measures for listening behavior, timing of nods/backchannels, and appropriateness of reactions; cross-comparison with human baselines.
  • Learning from interaction: online adaptation or reinforcement learning from user feedback to improve timing, expressivity, and social cues without destabilizing identity or safety.
  • Compositional control: combining multiple constraints (text, audio prosody, explicit motion cues) at once; conflict resolution and priority schemes with predictable outcomes.
  • Privacy guarantees: formal privacy analyses (e.g., PII leakage, membership inference) for both training data and user-provided media during live sessions.
  • Content moderation: real-time detection and handling of unsafe, copyrighted, or disallowed content in audio prompts, generated speech, and video performance.
  • Reproducibility and openness: clarity on the availability of code, models, and benchmarks; seeds and protocols for replicable latency and quality measurements across labs.

Practical Applications

Immediate Applications

Below are practical, deployable use cases that can be implemented with LPM 1.0 as described (single, camera-facing character; identity-stable, real-time, full‑duplex audio-visual conversation; multimodal conditioning; Online LPM for streaming).

  • Customer support and sales avatars (Industry: software, retail, telecom)
    • Use case: Real-time, on-website or in-app customer service reps that speak and react with natural micro-expressions and attentive listening while the user talks.
    • Workflow/product: LLM (dialogue) → TTS (agent speech) → Online LPM (speaking video) + user audio → Online LPM (listening video); delivered via WebRTC in contact center platforms (e.g., Genesys, Twilio Flex).
    • Dependencies/assumptions: Reliable low-latency TTS/ASR; GPU inference for real-time; licensed identity/likeness; single-person, camera-facing framing.
  • Virtual presenters for e-commerce and marketing (Industry: advertising, e-commerce)
    • Use case: Brand avatars that deliver product demos, promotions, and FAQs with consistent identity and expressive behavior.
    • Workflow/product: CMS + promptable motion control (text prompts) to align gestures with brand guidelines; LPM powers visual layer.
    • Dependencies/assumptions: Scripted content pipeline; style-safe motion prompts; brand compliance and watermarking.
  • VTubers and live-streaming co-hosts (Industry: media/entertainment)
    • Use case: Automated co-hosts or digital doubles that can listen, react, and present content live; creators maintain persona without being on camera.
    • Workflow/product: OBS/VTuber suite plugin using Online LPM; integrates chat-driven LLM, TTS, and face-reference conditioning for identity stability.
    • Dependencies/assumptions: Adequate compute on creator’s PC or cloud; identity/voice rights; moderation tools.
  • Dynamic game NPCs with in-session dialogue (Industry: gaming)
    • Use case: In-game characters that react to player speech and deliver lines with synchronized expressions in cutscenes or dialogue hubs.
    • Workflow/product: Unreal/Unity plugin; LLM for branching dialogue + TTS → LPM for NPC face/body performance; streamed to players.
    • Dependencies/assumptions: Mostly frontal shots or kiosk-like NPCs; latency budgets for cloud or edge GPUs; limited viewpoint changes.
  • Education: conversational tutors and language partners (Sector: education)
    • Use case: On-demand tutors that maintain identity, provide feedback, and model conversational norms (turn-taking, backchannels) in real time.
    • Workflow/product: LMS integration; curriculum-aligned prompts for motion; real-time listening behaviors to encourage learner engagement.
    • Dependencies/assumptions: Age-appropriate safety filters; culturally sensitive nonverbal behaviors; institutional approval.
  • Dubbing and ADR for postproduction (Industry: film/TV, localization)
    • Use case: Generate lip-synchronized, identity-consistent talking footage from dubbed audio for reshoots or multi-language releases.
    • Workflow/product: Audio (dubbed) → LPM speaking video; multi-reference identity inputs to match actor; batch or near-real-time render.
    • Dependencies/assumptions: Rights from actors/unions; high-quality voice tracks; camera-facing constraints; postproduction QC.
  • Corporate training simulations (Industry: HR, L&D)
    • Use case: Scenario-based training (e.g., negotiation, customer care) with lifelike interlocutors showing appropriate affect and reactions.
    • Workflow/product: LPM-driven avatars integrated in scenario engines; evaluation rubrics track learner interaction.
    • Dependencies/assumptions: Curated scripts; safety guardrails; GPU serving costs.
  • Telepresence with privacy-preserving avatars (Industry: enterprise software)
    • Use case: Replace user’s live video with a consistent avatar that mirrors speech and basic expressions while masking identity.
    • Workflow/product: Video conferencing plugin; user audio → LPM speaking/listening videos; optional style identity template.
    • Dependencies/assumptions: Consent from participants; bandwidth matching; robustness to varied acoustics.
  • Digital concierges and kiosks (Industry: hospitality, retail, public services)
    • Use case: On-prem interactive screens with natural attendants that listen and respond with socially legible behaviors.
    • Workflow/product: Edge device with GPU; ASR/TTS stack; multilingual prompts; on-device LPM streaming.
    • Dependencies/assumptions: Edge compute and thermal budgets; noise-robust ASR; adherence to accessibility standards.
  • Human–computer interaction research and benchmarking (Academia/Industry R&D)
    • Use case: Evaluate agents’ turn-taking, gaze, and affect using LPM-Bench; study nonverbal behaviors and user perception.
    • Workflow/product: Adopt LPM-Bench as a standardized evaluation suite; extend with task-specific metrics.
    • Dependencies/assumptions: Reproducible test harness; IRB approvals for user studies where needed.
  • Accessibility and social skills rehearsal (Sector: healthcare/education)
    • Use case: Role-play partners for practicing interviews, presentations, or social cues with responsive feedback.
    • Workflow/product: Therapy or coaching apps integrating LPM avatars; session scripts + motion prompts.
    • Dependencies/assumptions: Clinical oversight; content sensitivity; data privacy.
  • Developer tools and integrations (Software tooling)
    • Use case: SDKs/plugins for Unreal/Unity, OBS, and Web APIs to embed LPM streaming in products.
    • Workflow/product: Inference servers with REST/WebSocket; prebuilt workflows for LLM+TTS+LPM; monitoring dashboards.
    • Dependencies/assumptions: GPU autoscaling; SLA targets; compliance logging/watermarking.

Long-Term Applications

These opportunities require further research, scaling, or system integration beyond LPM 1.0’s current scope (e.g., multi-party interaction, 3D/world grounding, long-horizon memory).

  • Unified “actor models” that co-decide content and performance (Industry: software, media)
    • Vision: A single model that plans dialogue, voice, and audiovisual performance jointly over time.
    • Sector tools/products: End-to-end agent frameworks replacing separate LLM/TTS/LPM modules; story-driven performance engines for games and film.
    • Dependencies/assumptions: Large-scale multimodal training; controllable long-horizon narrative memory; safety and interpretability.
  • Multi-party, group-aware conversational agents (Industry: collaboration, education)
    • Vision: Agents handling addressee tracking, gaze allocation, and group turn-taking in meetings and classrooms.
    • Sector tools/products: Meeting assistants that coordinate with multiple participants; group tutoring avatars.
    • Dependencies/assumptions: Social dynamics datasets; evaluation frameworks for group interaction; improved real-time gaze/navigation control.
  • 3D/scene-grounded embodied characters (Industry: robotics, AR/VR, gaming)
    • Vision: Characters that understand and act within 3D environments, with consistent behavior from arbitrary viewpoints and during movement/contact.
    • Sector tools/products: AR assistants; robot faces/screens synchronized with physical actions; VR telepresence with world-locked avatars.
    • Dependencies/assumptions: Strong 3D priors, multi-view training, body/scene contact modeling; latency-robust sensor fusion.
  • Telemedicine and behavioral health avatars (Sector: healthcare)
    • Vision: Empathetic clinician avatars with long-term persona persistence and session-spanning memory.
    • Sector tools/products: Intake triage, adherence coaching, CBT role-play with regulated oversight.
    • Dependencies/assumptions: Clinical validation, regulatory compliance (HIPAA/GDPR), bias mitigation, crisis-handling protocols.
  • Personalized, persistent education companions (Sector: education)
    • Vision: Lifelong learning avatars that remember learner history and adapt teaching style with consistent persona and nonverbal scaffolding.
    • Sector tools/products: District-wide deployments integrated with SIS/LMS; analytics for engagement.
    • Dependencies/assumptions: Data privacy and consent; robust long-horizon memory; content alignment with curricula.
  • Film/TV virtual actors and scalable localization (Industry: media/entertainment)
    • Vision: Digital performers generated from scripts and direction, across languages and markets, with contract-compliant likeness use.
    • Sector tools/products: Previsualization to final render pipelines; studio-grade performance control UIs.
    • Dependencies/assumptions: Union agreements and rights management; fine-grained creative control; high-fidelity 3D consistency.
  • Real-time social robots and kiosks with world awareness (Industry: robotics, public services)
    • Vision: On-device performance synchronized with physical gestures and environmental context.
    • Sector tools/products: Service robots with expressive faces; smart counters in transport hubs.
    • Dependencies/assumptions: Edge accelerators; reliable multimodal perception; safety in public spaces.
  • Privacy-first consumer avatars for video calls and social media (Daily life)
    • Vision: Consistent personal avatars that replace camera feeds while preserving natural social signaling.
    • Sector tools/products: Cross-platform avatar layers for Zoom/Meet/Discord; mobile-optimized streaming.
    • Dependencies/assumptions: Efficient on-device inference; standards for disclosure and watermarking; user acceptance.
  • Sign language and multimodal accessibility interpreters (Sector: accessibility, public services)
    • Vision: High-accuracy sign language generation and expressive translating avatars, leveraging full-body performance.
    • Sector tools/products: Live interpretation on kiosks and broadcasts; training tools.
    • Dependencies/assumptions: Specialized datasets and linguistic modeling; cultural/linguistic validation; full-body capture support.
  • Standards, audits, and policy frameworks for AI performance (Policy/Standards)
    • Vision: Benchmarks (e.g., LPM-Bench lineage) adopted for procurement, disclosure, and safety audits of interactive character systems.
    • Sector tools/products: Certification suites; watermarking/traceability protocols; incident reporting pipelines.
    • Dependencies/assumptions: Multi-stakeholder consensus; integration with deepfake policy and digital likeness rights; regulatory updates.
  • Edge and mobile deployment at scale (Industry: hardware/software)
    • Vision: Low-power, on-device streaming for consumer apps and IoT screens.
    • Sector tools/products: Quantized/distilled LPM variants; hardware co-design with NPUs.
    • Dependencies/assumptions: Model compression without quality loss; thermal and battery constraints; offline operation modes.
  • Cross-cultural nonverbal communication libraries (Academia/Industry)
    • Vision: Region-specific gesture/affect models to avoid cultural miscommunication.
    • Sector tools/products: Locale-aware motion prompt packs; evaluation datasets per culture.
    • Dependencies/assumptions: Diverse data collection with consent; expert annotation; ongoing calibration.

Notes on feasibility and dependencies

  • Technical: Online LPM delivers low latency but still requires capable GPUs; pipeline relies on high-quality ASR/TTS and stable network streaming. Current system assumes single, camera-facing characters and lacks robust multi-view/3D grounding.
  • Legal/ethical: Strict licenses for identity/voice; watermarking/disclosure; alignment and moderation for safe deployment; compliance with data privacy regulations.
  • Operational: Monitoring for drift in long sessions; fallback behaviors for network jitter; QoS management for live deployments.
  • Data: Identity-aware multi-reference inputs are crucial for stability; domain adaptation may be needed for accents, cultures, and environments not covered in training.

Glossary

  • actor models: Unified models that jointly decide what to say and how to express it across time and modalities. "may give way to more unified actor models that jointly determine what is said, how it is expressed, and how behavior unfolds over time."
  • addressee tracking: The capability to determine which participant a speaker is addressing in multi-party interaction. "such as addressee tracking, gaze allocation, and group-level turn-taking"
  • causal streaming generator: A generator that produces outputs online with causal dependence on past inputs, enabling low-latency streaming. "and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction."
  • Diffusion Transformer: A generative architecture combining diffusion processes with Transformer networks for high-quality synthesis. "train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning;"
  • discourse-level memory: Mechanisms that maintain and use conversational context over extended interactions. "longer interactions will require discourse-level memory, persona persistence, and the ability to make current behavior coherent with prior events."
  • full-duplex: The ability to handle simultaneous speaking and listening behaviors in conversation. "focusing on single-person full-duplex audio-visual conversational performance."
  • gaze allocation: The regulation of eye gaze among interlocutors or targets in social interaction. "such as addressee tracking, gaze allocation, and group-level turn-taking"
  • identity-aware multi-reference extraction: Collecting multiple reference cues tied to a specific identity to condition generation for consistency. "identity-aware multi-reference extraction;"
  • identity-aware references: Reference inputs that explicitly encode a character’s identity for conditioning at inference time. "given a character image with identity-aware references,"
  • identity-stable: Maintaining consistent character identity across long generated sequences. "all at real-time speed with identity-stable, infinite-length generation."
  • long-horizon: Spanning long durations while preserving coherence and stability over time. "long-horizon video generation."
  • LPM-Bench: A benchmark designed to evaluate interactive character performance comprehensively. "we propose LPM-Bench, the first benchmark for interactive character performance."
  • micro-expressions: Subtle, rapid facial movements conveying fleeting emotions. "speaking, listening, micro-expressions, and natural motion"
  • multimodal conditioning: Conditioning a generative model on multiple input modalities (e.g., audio, text, images) to control outputs. "through multimodal conditioning;"
  • multi-party coordination: Managing behaviors and interactions among multiple participants simultaneously. "multi-party coordination"
  • online stabilization: Real-time techniques to maintain temporal and visual consistency during streaming generation. "language generation, speech synthesis, audiovisual rendering, and online stabilization"
  • performance trilemma: The tension among expressive quality, real-time inference, and long-horizon stability in performance modeling. "a tension we call the performance trilemma."
  • persona persistence: Preserving a character’s enduring traits and style across extended interactions. "persona persistence"
  • pipeline decomposition: Structuring the system as distinct stages (e.g., language, speech, rendering, stabilization) rather than a single monolith. "the current pipeline decomposition---language generation, speech synthesis, audiovisual rendering, and online stabilization---"
  • scene geometry: The 3D spatial structure of an environment that informs how characters should behave and move. "must ground their behavior in scene geometry, objects, and contact."
  • systems-level co-design: Coordinating design choices across data, models, and infrastructure to meet end-to-end goals. "admits a workable resolution through systems-level co-design."
  • turn-taking: The social mechanism by which speakers alternate conversational turns, including group-level dynamics. "group-level turn-taking"
  • visual fidelity: The perceived realism and quality of generated visuals. "maintaining visual fidelity across streaming and long-horizon video generation."
  • world consistency: Maintaining a coherent and persistent representation of the 3D world across viewpoints and actions. "strong 3D and world consistency under arbitrary viewpoints and actions."
