EgoActing: First-Person Humanoid Control
- EgoActing is a first-person embodied control challenge that predicts concrete spatial actions from natural language instructions, egocentric observations, and low-level policies.
- The system employs Structured Language Actions (SLAs) for precise movement and Natural Language Actions (NLAs) for interaction, enabling coordinated locomotion and manipulation in complex settings.
- EgoActor architecture builds on Qwen3-VL with LoRA fine-tuning and leverages diverse data sources, including real-world demos, simulations, and spatial-reasoning tasks.
EgoActing denotes a first-person embodied-control problem in which a humanoid robot receives an actionable natural-language instruction and must predict the next concrete action from the instruction, egocentric observation history, past action history, and the set of available low-level policies, thereby grounding high-level instructions into “various, precise, spatially aware humanoid actions” (Bai et al., 4 Feb 2026). In adjacent literature, the term also appears more loosely in discussions of first-person action understanding, reaction generation, memory-grounded egocentric reasoning, and psychologically structured agent behavior, so the research area spans humanoid control, egocentric video modeling, synthetic-data generation, and multi-agent dialogue architectures (Chen et al., 9 Feb 2025).
1. Task formulation and action space
In its most explicit technical definition, EgoActing is formalized as a mapping of the form

$$a_t = f\left(I,\; O_{1:t},\; A_{1:t-1},\; \mathcal{P}\right), \qquad a_t \in \mathcal{A},$$

where $I$ is the natural-language task instruction, $O_{1:t}$ is the history of egocentric observations, $A_{1:t-1}$ is the past action history, $\mathcal{P}$ is the set of available low-level whole-body and manipulation policies, and $\mathcal{A}$ is the action space (Bai et al., 4 Feb 2026). The formulation emphasizes partial observability, RGB-only egocentric perception, and the need to coordinate locomotion, active perception, manipulation, and human-robot interaction in dynamic and cluttered environments.
A distinctive feature of EgoActing is its hybrid action space. EgoActor represents movement and active perception through Structured Language Actions (SLAs) such as Turn left 30.5 degrees, Look up 10.2 degrees, Move forward 0.26 meters, Left sidewalk 0.40 meters, Rise up 0.12 meters, and Lower down 0.08 meters. It represents manipulation and social behavior through Natural Language Actions (NLAs) such as Pick up the water bottle, Place the plate on the desk, Open the door, Ask "Where is the bathroom?", Say hi to the boy, and Stop and no action (Bai et al., 4 Feb 2026). This division is technically important: SLAs encode spatially parameterized egocentric control, whereas NLAs preserve flexibility for downstream manipulation and interaction modules.
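To make the SLA/NLA split concrete, the following is a minimal sketch of how an action string could be routed to one of the two families. The regular expression, the `Action` dataclass, and `parse_action` are illustrative assumptions based only on the example strings quoted above; they are not EgoActor's actual grammar or parser.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern inferred from the SLA examples in the text:
# a verb phrase, a decimal magnitude, and a unit. The real grammar may differ.
SLA_PATTERN = re.compile(
    r"^(?P<verb>[A-Za-z ]+?)\s+(?P<value>\d+(?:\.\d+)?)\s+(?P<unit>degrees|meters)$"
)


@dataclass(frozen=True)
class Action:
    kind: str                      # "SLA" or "NLA"
    text: str                      # original action string
    verb: Optional[str] = None     # e.g. "turn left" (SLA only)
    value: Optional[float] = None  # magnitude (SLA only)
    unit: Optional[str] = None     # "degrees" or "meters" (SLA only)


def parse_action(text: str) -> Action:
    """Classify an action string as SLA (parameterized) or NLA (free-form)."""
    text = text.strip()
    match = SLA_PATTERN.match(text)
    if match is None:
        # Anything without a trailing "<number> <unit>" is treated as an NLA.
        return Action(kind="NLA", text=text)
    return Action(
        kind="SLA",
        text=text,
        verb=match["verb"].lower(),
        value=float(match["value"]),
        unit=match["unit"],
    )


for example in ["Turn left 30.5 degrees", "Move forward 0.26 meters",
                "Pick up the water bottle", "Stop and no action"]:
    print(parse_action(example))
```

Under this sketch, an SLA could be forwarded to a locomotion or head controller together with its numeric magnitude, while an NLA would be passed unchanged to a downstream manipulation or interaction module.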
The task therefore differs from conventional visual navigation or manipulation benchmarks. It does not stop at symbolic subgoals, nor does it assume that navigation and interaction can be solved independently. Instead, it requires immediate action serialization in language form from first-person RGB streams, with explicit distances, turning angles, head movements, and task-conditioned interaction outputs (Bai et al., 4 Feb 2026).
2. EgoActor architecture and supervision
EgoActor instantiates EgoActing as a unified vision-language model built on Qwen3-VL and fine-tuned with LoRA applied to all linear layers (Bai et al., 4 Feb 2026). The model consumes a prompt containing an instruction, 10 sampled historical egocentric observations, and 3 recent observation-action pairs, with the third recent observation left without its action so that the model predicts the next action sequence. Recent observations are provided at 480p, historical observations at 240p, and inference uses stochastic sampling with temperature 0.2 (Bai et al., 4 Feb 2026).
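The context layout described above can be sketched as a small data-assembly routine. The counts (10 historical frames, 3 recent frames) and resolutions (240p, 480p) come from the text; the `Frame` type, the evenly spaced sampling rule, and `build_context` are assumptions made for illustration, since the text does not say how historical frames are sampled.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_HISTORY = 10  # sampled historical observations (from the text)
NUM_RECENT = 3    # recent observations; the last one has no action yet


@dataclass
class Frame:
    index: int             # position in the episode
    resolution: str        # "240p" for history, "480p" for recent frames
    action: Optional[str]  # None for history frames and for the frame to predict


def sample_indices(n_available: int, n_samples: int) -> List[int]:
    """Evenly spaced indices over range(n_available) (assumed sampling rule)."""
    if n_available <= n_samples:
        return list(range(n_available))
    step = (n_available - 1) / (n_samples - 1)
    return [round(i * step) for i in range(n_samples)]


def build_context(actions: List[str], n_frames: int) -> List[Frame]:
    """actions[i] is the action taken after frame i; the newest frame has none."""
    if len(actions) != n_frames - 1:
        raise ValueError("expected exactly one action per frame except the newest")
    recent_start = max(0, n_frames - NUM_RECENT)
    history = [Frame(i, "240p", None)
               for i in sample_indices(recent_start, NUM_HISTORY)]
    recent = [Frame(i, "480p", actions[i] if i < n_frames - 1 else None)
              for i in range(recent_start, n_frames)]
    return history + recent


context = build_context([f"action_{i}" for i in range(49)], n_frames=50)
print(len(context), context[-1])
```

The final frame carries `action=None`, which marks the slot the model is asked to fill with the next action sequence.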
The supervision recipe is deliberately broad. It combines egocentric RGB-only demonstrations, spatial reasoning question-answering, virtual-environment trajectories, planning corpora, general visual-language data, and on-policy experience. The largest EgoActing-specific source is EgoTaskQA, processed into 160,000 EgoActing training samples, supplemented by 130 additional internet-collected egocentric videos yielding 7,111 additional samples. The local real-world dataset contains 398 egocentric videos and yields 150,214 EgoActing training samples. The simulated EgoActing dataset contributes 76,821 EgoActing samples from 714 manually collected EgoActing-style trajectories in Habitat-Sim, split into 509 training trajectories and 205 validation trajectories from unseen environments. Additional sources include 44,160 MindCube spatial-reasoning samples, 300,000 GQA instances, 35,652 GPT-4o-annotated description samples from the local environment, 241,603 samples from RoboVQA, EgoPlan, and ALFRED, 10,575 unsupervised movement-prediction samples, and 3,629 EgoActing training samples from 70 successful traces of DAgger experience (Bai et al., 4 Feb 2026).
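As a quick arithmetic check on the scale of this mixture, the sketch below tallies the sample counts listed above and prints each source's share. It assumes the listed sources are disjoint and that every count is used as a training sample; the text does not state the actual sampling weights, so the shares are raw proportions only, and the dictionary keys are shorthand labels chosen here.

```python
# Sample counts copied from the text; the labels are shorthand, not official names.
SAMPLE_COUNTS = {
    "EgoTaskQA (EgoActing)": 160_000,
    "internet egocentric videos": 7_111,
    "local real-world videos": 150_214,
    "Habitat-Sim trajectories": 76_821,
    "MindCube": 44_160,
    "GQA": 300_000,
    "local GPT-4o descriptions": 35_652,
    "RoboVQA + EgoPlan + ALFRED": 241_603,
    "unsupervised movement prediction": 10_575,
    "DAgger traces": 3_629,
}


def shares(counts):
    """Return each source's fraction of the total sample count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


TOTAL = sum(SAMPLE_COUNTS.values())
print(f"total samples: {TOTAL:,}")
for name, share in sorted(shares(SAMPLE_COUNTS).items(), key=lambda kv: -kv[1]):
    print(f"{name:34s} {share:6.1%}")
```

Under these assumptions the listed sources add up to roughly one million samples, with the general visual-language and planning corpora accounting for a large part of the raw count.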
This supervision mix is designed to supply complementary capabilities rather than a single narrow policy prior. The egocentric demonstrations provide human-like movement patterns and interleaved structured and natural-language action sequences. The simulated trajectories provide precise instruction-observation-action alignment. MindCube strengthens explicit spatial reasoning. GQA and local descriptions preserve general visual-language understanding. RoboVQA, EgoPlan, and ALFRED add high-level planning structure. DAgger adapts the model to deployment-time states (Bai et al., 4 Feb 2026). A plausible implication is that EgoActing is being treated as a data-integration problem as much as a modeling problem.
3. Empirical performance and benchmark behavior
EgoActor is evaluated in both real-world and virtual environments against navigation-oriented VLM baselines, principally NaVid-7B, Uni-NaVid-7B, and the VLM component of NaVILA-7B (Bai et al., 4 Feb 2026). The central empirical pattern is that EgoActor improves most strongly on precision-critical embodied behavior: stopping at the right place, traversing narrow spaces, aligning for manipulation, and selecting the correct human for interaction.
On single-person human-robot interaction, EgoActor-4B achieved 12/12 on approach, 12/12 on say hi, 12/12 on ask for location, and 11/12 on request items; EgoActor-8B achieved 12/12 on all four, while NaVILA-7B achieved 2/12 on approach and NaVid-7B and Uni-NaVid-7B each achieved 8/12 on approach (Bai et al., 4 Feb 2026). On the multi-person OOD “Say Hi” benchmark, EgoActor-8B obtained 11/12 for clothing, 10/12 for accessories, 10/12 for posture, 12/12 for direction, and 11/12 for gender, exceeding EgoActor-4B’s 8/12, 7/12, 8/12, 11/12, and 10/12 respectively (Bai et al., 4 Feb 2026).
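For a rough sense of the gap between the two model sizes on the multi-person benchmark, the per-attribute counts can be pooled into a single rate. This pooling is an illustrative aggregation, not a figure reported in the source, and it assumes 12 trials per attribute as the fractions above indicate.

```python
from fractions import Fraction

# Successes out of 12 trials per attribute, copied from the text above.
SAY_HI_OOD = {
    "EgoActor-8B": {"clothing": 11, "accessories": 10, "posture": 10,
                    "direction": 12, "gender": 11},
    "EgoActor-4B": {"clothing": 8, "accessories": 7, "posture": 8,
                    "direction": 11, "gender": 10},
}
TRIALS_PER_ATTRIBUTE = 12


def pooled_success(successes):
    """Pool per-attribute successes into one exact fraction of all trials."""
    total_trials = TRIALS_PER_ATTRIBUTE * len(successes)
    return Fraction(sum(successes.values()), total_trials)


for model, successes in SAY_HI_OOD.items():
    rate = pooled_success(successes)
    print(f"{model}: {sum(successes.values())}/60 = {float(rate):.1%}")
```

With only 12 trials per cell, differences of one or two successes within a single attribute should be read cautiously; the pooled rate is somewhat more stable but hides which attributes are hardest.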
On mobile manipulation in unseen room layouts, EgoActor-8B achieved 5/6 and 6/6 on seen-object pick tasks, 6/6 and 6/6 on seen-object place tasks, and 5/6, 6/6, 4/6, and 5/6 on unseen-object pick/place tasks for the pen holder and pink cup. EgoActor-4B was consistently weaker, with a reported tendency to trigger manipulation too early, before being close enough to the object (Bai et al., 4 Feb 2026).
The most striking results are on traversability through narrow doorways. In seen environments, EgoActor-4B scored 11/12, 11/12, 12/12, and 10/12 on enter-left, enter-right, leave-left, and leave-right; EgoActor-8B scored 11/12, 12/12, 10/12, and 10/12. In unseen environments, EgoActor-4B scored 7/8 across all four entry/exit settings, while EgoActor-8B scored 7/8, 7/8, 8/8, and 7/8. By contrast, NaVILA-7B ranged from 1/8 to 3/8 in unseen environments, and NaVid-7B and Uni-NaVid-7B remained substantially less reliable for entering narrow spaces (Bai et al., 4 Feb 2026).
In the virtual benchmark built from 205 unseen-environment EgoActing samples, EgoActor-4B achieved 50.7% success within 0.5 m and 87.8% within 3.0 m, with NLA F1 0.60 and final view similarity 0.41; EgoActor-8B achieved 51.4% within 0.5 m and 89.9% within 3.0 m, with NLA F1 0.62 and the same final view similarity 0.41. Baselines remained near 6.3%–8.8% at the 0.5 m threshold and 51.7%–60.0% at 3.0 m, which indicates that standard VLN-style success thresholds obscure the precision demanded by EgoActing (Bai et al., 4 Feb 2026).
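The threshold-based success metric used above can be sketched as follows. This is a minimal illustration that assumes success means the final distance to the goal is at most the threshold; whether the benchmark uses a strict or inclusive comparison, and how it measures distance, is not stated in the text. The function name and the example distances are hypothetical.

```python
from typing import Dict, Sequence


def success_at_thresholds(
    final_distances_m: Sequence[float],
    thresholds_m: Sequence[float] = (0.5, 3.0),
) -> Dict[float, float]:
    """Fraction of episodes whose final distance to the goal is within each threshold."""
    if not final_distances_m:
        raise ValueError("need at least one episode")
    n = len(final_distances_m)
    return {
        t: sum(d <= t for d in final_distances_m) / n  # inclusive bound (assumed)
        for t in thresholds_m
    }


# Made-up final distances in meters, for illustration only.
example = [0.2, 0.6, 2.9, 3.5]
print(success_at_thresholds(example))
```

An episode that ends 0.6 m from the goal counts as a success at 3.0 m but a failure at 0.5 m, which is why a method can look acceptable under the looser threshold while still being too imprecise to hand over to a manipulation policy.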
4. Relation to egocentric action and reaction modeling
Although EgoActing is most explicitly defined as a humanoid-robot task, it sits inside a broader methodological neighborhood concerned with first-person action prediction, action-object structure, viewpoint transfer, and causal reaction generation. EgoAgent proposes a “joint predictive agent model” that simultaneously learns to represent the world, predict future states, and take reasonable actions within a single transformer, using interleaved sequential modeling of states and actions with the causal attention mechanism and a joint embedding-action-prediction architecture featuring temporal asymmetric predictor-observer branches (Chen et al., 9 Feb 2025). This line is predictive rather than directly executable, but it supplies a state-action representation model that is highly relevant to EgoActing.
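The idea of interleaving states and actions under causal attention can be illustrated with a toy sketch. It shows only the generic pattern (alternating state and action tokens, each attending to itself and earlier positions); it is not EgoAgent's tokenization, architecture, or predictor-observer design, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def interleave(states: Sequence[str], actions: Sequence[str]) -> List[Tuple[str, str]]:
    """Build the sequence s_0, a_0, s_1, a_1, ... from paired states and actions."""
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")
    sequence = []
    for state, action in zip(states, actions):
        sequence.append(("state", state))
        sequence.append(("action", action))
    return sequence


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


tokens = interleave(["s0", "s1"], ["a0", "a1"])
mask = causal_mask(len(tokens))
for (kind, name), row in zip(tokens, mask):
    print(f"{kind:6s} {name}: {''.join('x' if allowed else '.' for allowed in row)}")
```

In this layout the action at step t can condition on the state at step t and on everything before it, while no token sees future states or actions.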
EgoReAct addresses a related yet distinct problem: generating 3D human reaction motion from egocentric video. It introduces the Human Reaction Dataset (HRD) and presents “the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video streams in real-time,” combining a Vector Quantised-Variational AutoEncoder, a Generative Pre-trained Transformer, and explicit incorporation of metric depth and head dynamics to enhance spatial grounding (Zhang et al., 28 Dec 2025). This work suggests a reaction-generation counterpart to EgoActing’s action-grounding agenda.
On the understanding side, EgoPrompt treats egocentric action recognition as a coupled verb-noun problem through a Unified Prompt Pool and Diverse Pool Criteria, rather than as two independent classifiers (Lyu et al., 5 Aug 2025). EgoACO decomposes egocentric action clips into object, context, and action descriptors through Class Activation Pooling (CAP) and Long Short-Term Attention (LSTA) (Sudhakaran et al., 2021). EgoZAR introduces activity-centric zones as a domain-agnostic prior for cross-environment egocentric action recognition (Peirone et al., 2024). Earlier, “Actor and Observer: Joint Modeling of First and Third-Person Videos” introduced Charades-Ego, with 4,000 paired videos involving 112 people, and learned a weakly supervised joint representation of first- and third-person video (Sigurdsson et al., 2018). Taken together, these works indicate that EgoActing inherits technical ingredients from action forecasting, reaction generation, structured HOI modeling, domain generalization, and actor-observer correspondence learning.
5. Synthetic data, memory, and long-horizon egocentric reasoning
A second major branch of the literature supplies infrastructure that can support EgoActing without defining the task directly. EgoGen is a synthetic data generator for egocentric perception in which a virtual human “directly leverages egocentric visual inputs of a virtual human to sense the 3D environment,” uses collision-avoiding motion primitives and a two-stage reinforcement learning approach, and eliminates the need for a pre-defined global path while remaining directly applicable to dynamic environments (Li et al., 2024). Its closed-loop coupling of body motion and head-mounted perception is a foundational capability for any first-person action-grounding system.
EgoInteract extends this synthetic-data direction toward object-centric interaction. It is a controllable simulator for egocentric video generation with 10,534 generated episodes, approximately 1.9 million frames, and dense annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection (Leonardi et al., 18 May 2026). The simulator uses a full-body SMPL-X avatar, collision-aware grasp generation, trajectory validation, and releases data in both Aria and GoPro formats. This suggests a direct route toward large-scale EgoActing pretraining under embodied interaction supervision.
Long-horizon memory and behavior-aware reasoning appear in AMEGO and EgoEverything. AMEGO constructs a semantic-free active memory from long egocentric videos by storing hand-object interaction tracklets and activity-centric locations, and evaluates it on the Active Memories Benchmark (AMB) with 20.5K multiple-choice QA pairs (Goletto et al., 2024). EgoEverything, by contrast, is an AR-oriented benchmark with over 5,000 multiple-choice question-answer pairs spanning more than 100 hours of video, and uses gaze-informed question generation through a Perception Sampler to reflect what the user actually attended to (Tang et al., 9 Apr 2026). A behaviorally richer multimodal variant appears in EgoBrain, which introduces 61 hours of synchronized 32-channel EEG and first-person video from 40 participants, and reports 66.70% Top-1 action accuracy on the 29-way task in the cross-subject + cross-scene setting (Lin et al., 2 Jun 2025). These resources collectively move EgoActing toward memory, attention, and latent-intention modeling rather than immediate reactive control alone.
6. Alternative usage: ego-state acting in language-agent systems
A distinct research line uses “EgoActing” in a psychological rather than egocentric-perceptual sense. In “On the Role of Contextual Information and Ego States in LLM Agent Behavior for Transactional Analysis Dialogues,” each dialogue agent is decomposed into Parent, Adult, and Child sub-agents, each with its own prompt and memory bank. The final utterance is selected by a life-script-guided decision function, and retrieval is ego-state-specific, with each sub-agent drawing on its own memory bank. The paper reports that, under Memory ON, John’s Child selections increased from 10 to 15 turns and Taylor’s Parent selections increased from 8 to 18, suggesting that retrieval shifted the dominant internal mode of action (Zamojska et al., 18 Dec 2025).
A closely related architecture appears in Trans-ACT, which models each ego state as a ReAct agent within a LangGraph framework, gives each state a dedicated FAISS-backed memory store, and uses a decision-making agent to choose among Parent, Adult, and Child outputs according to relevance, progress toward resolution, social appropriateness, and life-script alignment (Zamojska et al., 28 Jul 2025). The Drama Machine proposes a different decomposition, using an outward-facing Ego and a hidden Superego that can rewrite the Ego model’s system prompt, rewrite the User’s queries, or review the Ego’s responses (Magee et al., 2024). These works do not address embodied egocentric control, but they establish a second, conceptually distinct meaning of EgoActing: behavior generated through internal ego-state competition and revision rather than through first-person visuomotor grounding.
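The arbitration step among ego states can be illustrated with a toy scoring rule. The four criteria are the ones named above; everything else is an assumption made for illustration, including the weighted sum, the numeric scores, and the function name. In the cited systems the choice is made by an LLM-based decision agent rather than by a fixed formula.

```python
from typing import Dict

# Criteria named in the text; the weighted-sum rule below is a hypothetical stand-in.
CRITERIA = ("relevance", "progress", "social_appropriateness", "life_script_alignment")


def select_ego_state(
    scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> str:
    """Return the ego state whose candidate utterance has the highest weighted score."""
    def total(state: str) -> float:
        return sum(weights[c] * scores[state][c] for c in CRITERIA)
    return max(scores, key=total)


# Made-up scores in [0, 1] for one dialogue turn, for illustration only.
turn_scores = {
    "Parent": {"relevance": 0.6, "progress": 0.4,
               "social_appropriateness": 0.7, "life_script_alignment": 0.9},
    "Adult":  {"relevance": 0.9, "progress": 0.8,
               "social_appropriateness": 0.8, "life_script_alignment": 0.3},
    "Child":  {"relevance": 0.5, "progress": 0.2,
               "social_appropriateness": 0.4, "life_script_alignment": 0.6},
}
equal_weights = {c: 1.0 for c in CRITERIA}
script_heavy = {**equal_weights, "life_script_alignment": 5.0}

print(select_ego_state(turn_scores, equal_weights))
print(select_ego_state(turn_scores, script_heavy))
```

Raising the weight on life-script alignment shifts the selection from Adult to Parent in this toy example, which mirrors, in a very simplified way, how a stronger life-script prior could change the dominant ego state.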
7. Limitations and open problems
Across the robotics, video, and dialogue-agent lines, the central open problem is not definition but integration. In the humanoid setting, EgoActor still depends on external components such as low-level locomotion policies, pre-trained VLA manipulation models, and interaction execution modules, and the authors note that the model may “occasionally fall into locally optimal but incorrect decision patterns when navigating extended or multi-stage tasks” (Bai et al., 4 Feb 2026). In predictive egocentric modeling, EgoAgent explicitly lacks finger movements and long-term memory, which constrains direct use for detailed object manipulation or prolonged task execution (Chen et al., 9 Feb 2025).
Simulation-based approaches remain incomplete. EgoGen focuses on locomotion, obstacle avoidance, and coarse attention rather than hand manipulation, sitting, lying, detailed object interaction, or social behavior (Li et al., 2024). EgoInteract demonstrates broad cross-task transfer, but the paper reports limited formal ablations, leaving the separate contributions of environment diversity, grasp validation, IK fidelity, and annotation design less precisely quantified (Leonardi et al., 18 May 2026). In the psychological-agent line, evaluation is still mainly qualitative, arbitration remains opaque, and exact prompts, top-k retrieval settings, embedding models, and selection rubrics are not fully specified (Zamojska et al., 18 Dec 2025).
These limitations suggest that EgoActing is best viewed not as a solved task but as a convergence point. One branch seeks precise spatial grounding from egocentric RGB into executable humanoid actions; another develops predictive state-action models for first-person worlds; a third builds synthetic interaction data and long-context memory benchmarks; and a fourth explores internal ego-state architectures for psychologically grounded behavior. A plausible implication is that future EgoActing systems will need all four: egocentric perception, causal action prediction, scalable embodied data, and structured internal decision processes.