Agent-Omni: Unified Multimodal Agency Research

Updated 3 July 2026

Agent-Omni is a research paradigm for unified multimodal agency characterized by selective perception, structured memory, and explicit action coordination.
Key architectural patterns include modular decomposition, explicit state structuring, and asymmetric parameter sharing to manage diverse modalities.
Empirical evaluations highlight trade-offs in latency, accuracy, and resource efficiency, underscoring challenges in real-world multimodal integration.

Agent-Omni is a research label used in recent arXiv literature for systems that seek unified multimodal agency across perception, reasoning, memory, retrieval, and action. In the available corpus, the term does not denote a single canonical architecture. Instead, it appears across several related but distinct formulations: an end-to-end multimodal conversational pipeline in OpenOmni (Sun et al., 2024), a master-agent framework for test-time coordination of specialized foundation models (Lin et al., 4 Nov 2025), a generalist policy spanning GUI and embodied control in OmniActor (Yang et al., 2 Sep 2025), a hierarchical memory system for personalized long-horizon interaction in O-Mem (Wang et al., 17 Nov 2025), a mobile perception–memory–action loop in X-OmniClaw (Ren et al., 7 May 2026), and multiple active-perception frameworks for audio-video reasoning and tool use (Zhu et al., 3 Feb 2026, Tao et al., 29 Dec 2025, Xu et al., 27 May 2026, Li et al., 26 Feb 2026, Xing et al., 17 Jun 2026). This suggests that “Agent-Omni” functions less as a standardized product name than as a design paradigm for agents intended to operate across heterogeneous modalities, environments, and temporal horizons.

1. Terminological scope and conceptual identity

The literature uses the Agent-Omni label for systems that differ in substrate and objective. Some are primarily conversational and benchmark-oriented, some are test-time orchestration frameworks, some are memory systems, and some are action agents operating in mobile, GUI, robotic, or long-video settings.

System	Primary focus	Distinctive mechanism
OpenOmni	Multimodal conversational pipeline	Five-module end-to-end framework
OmniActor	GUI and embodied control	Layer-heterogeneity MoE
O-Mem	Personalized long-horizon memory	Active user profiling + hierarchical retrieval
X-OmniClaw	Mobile-native interaction	Perception–Memory–Action on-device loop
Agent-Omni	Test-time multimodal reasoning	Master-agent coordination without retraining
OmniRAG-Agent / OmniAgent / AOP-Agent / OmniAtlas	Audio-video active perception and tool use	Planning loops, retrieval, reflection, or active sensing

OpenOmni explicitly presents “Agent-Omni” as the OpenOmni framework for building and benchmarking multimodal conversational agents, organized into Client, API, Storage, Agent, and User Interface modules (Sun et al., 2024). By contrast, the paper titled “Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything” defines Agent-Omni as a master-agent system that decomposes a multimodal request into subtasks, assigns them to modality-specific agents, and integrates their outputs into a final textual response (Lin et al., 4 Nov 2025). Other works use the label more broadly to describe a unified agentic paradigm spanning perception, memory, and action on smartphones, or active omni-modal perception over long audio-video inputs (Ren et al., 7 May 2026, Xu et al., 27 May 2026).

A plausible implication is that the term has become a shorthand for a family of omni-modal agent designs rather than a single lineage. The common denominator is not a fixed implementation, but an attempt to combine modality integration with explicit agency: planning, tool invocation, memory updates, or environment interaction.

2. Recurrent architectural patterns

Despite the heterogeneity of implementations, several architectural motifs recur. One is modular decomposition. OpenOmni exposes Speech-to-Text, Emotion Detection, Retrieval-Augmented Generation, LLM integration, and Text-to-Speech as interchangeable components behind standard interfaces, with local and cloud deployment options and benchmarking over latency, accuracy, cost, and privacy (Sun et al., 2024). Agent-Omni similarly decomposes the system into a User Intent Interpreter, Subtask Scheduler, Modality Adapters, Agent Pool, and Response Assembler, with the master agent coordinating specialized models rather than retraining a monolithic omni-model (Lin et al., 4 Nov 2025).
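The modular-decomposition motif can be illustrated with a minimal sketch, assuming each component is simply a callable behind a common interface. The stage names and stub components below are hypothetical stand-ins, not OpenOmni's actual API; the point is only that stages can be swapped independently while latency is recorded per stage.

```python
import time
from typing import Callable, Dict, List, Tuple

# A stage is any callable from payload to payload (an assumption of this sketch).
Stage = Callable[[dict], dict]


def run_pipeline(stages: List[Tuple[str, Stage]], payload: dict) -> Tuple[dict, Dict[str, float]]:
    """Run stages in order and record wall-clock latency per stage in seconds."""
    latencies: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        latencies[name] = time.perf_counter() - start
    return payload, latencies


# Hypothetical stubs standing in for real STT, RAG, LLM, and TTS backends.
def stub_stt(p: dict) -> dict:
    return {**p, "text": p["audio"].decode("utf-8")}


def stub_rag(p: dict) -> dict:
    return {**p, "context": ["(retrieved passage placeholder)"]}


def stub_llm(p: dict) -> dict:
    return {**p, "reply": f"Echo: {p['text']}"}


def stub_tts(p: dict) -> dict:
    return {**p, "speech": p["reply"].encode("utf-8")}


stages = [("stt", stub_stt), ("rag", stub_rag), ("llm", stub_llm), ("tts", stub_tts)]
result, latencies = run_pipeline(stages, {"audio": b"where is the exit"})
print(result["reply"])  # Echo: where is the exit
print(sorted(latencies))  # stage names that were timed
```

Replacing any entry in `stages` with a different backend leaves the rest of the pipeline and the timing logic unchanged, which is the property that makes configuration-level benchmarking possible.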

A second motif is explicit state structuring. X-OmniClaw is organized around Omni Perception, Omni Memory, and Omni Action, with Working Memory for task continuity and Long-Term Memory distilled from local data (Ren et al., 7 May 2026). O-Mem separates Persona Memory, Working Memory, and Episodic Memory, each with different update and retrieval rules (Wang et al., 17 Nov 2025). AOP-Agent distinguishes a global summary, mid-level segments, fine-grained clips, working memory, and evidence memory in a hierarchical omni-modal memory stack (Xu et al., 27 May 2026).

A third motif is asymmetric parameter sharing. OmniActor does not treat all modalities as uniformly compatible. It reports that GUI and embodied data show synergy in shallow layers and conflict in deep layers, motivating a Layer-heterogeneity MoE in which shallow layers are shared and deep layers are separated into GUI and robot experts (Yang et al., 2 Sep 2025). This differs from master-agent systems that keep models separate and coordinate them externally, but both approaches attempt to preserve modality-specific competence while exploiting shared structure.
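A toy sketch of the shared-shallow, separated-deep idea follows. It assumes hard routing by a known domain label and uses trivial scalar "layers" in place of transformer blocks; the class and function names are hypothetical and the sketch does not reproduce OmniActor's actual architecture.

```python
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def scale_layer(factor: float) -> Layer:
    """Toy layer: multiply every feature by a constant."""
    return lambda x: [factor * v for v in x]


def shift_layer(offset: float) -> Layer:
    """Toy layer: add a constant to every feature."""
    return lambda x: [v + offset for v in x]


class LayerHeterogeneousModel:
    """Shallow layers are shared; deep layers are chosen per domain."""

    def __init__(self, shared: List[Layer], experts: Dict[str, List[Layer]]):
        self.shared = shared
        self.experts = experts

    def forward(self, x: Vector, domain: str) -> Vector:
        for layer in self.shared:  # same parameters for every domain
            x = layer(x)
        for layer in self.experts[domain]:  # domain-specific deep layers
            x = layer(x)
        return x


model = LayerHeterogeneousModel(
    shared=[scale_layer(2.0), shift_layer(1.0)],
    experts={"gui": [scale_layer(10.0)], "robot": [shift_layer(-3.0)]},
)
print(model.forward([1.0, 2.0], "gui"))    # [30.0, 50.0]
print(model.forward([1.0, 2.0], "robot"))  # [0.0, 2.0]
```

Both domains pass through identical shallow computation, so any gradient signal there would be pooled, while the deep experts never see each other's data.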

Formally, these systems often introduce an explicit decomposition between understanding and execution. OmniActor seeks a single policy $\pi$ that maps an image $I$, task instruction $T$, and optional textual history $H$ to action tokens $A$, with each action token tagged as either GUI or robot (Yang et al., 2 Sep 2025). Agent-Omni defines a decomposition $t_i \in T = f_{\mathrm{decompose}}(U)$, assignment $M_{j(i)} = g_{\mathrm{assign}}(t_i)$, execution $o_i = M_{j(i)}(t_i)$, and final integration $O = f_{\mathrm{integrate}}(\{o_i\})$ (Lin et al., 4 Nov 2025). Native OmniAgent further casts long-form omni-modal understanding as a POMDP with sensing and answer actions under a cost budget (Xing et al., 17 Jun 2026).
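The decompose, assign, execute, integrate chain can be sketched directly from those four definitions. The decomposition rule, the stub agents, and the string-joining integrator below are assumptions made for illustration; a real master agent would use an LLM for each of these steps.

```python
from typing import Callable, Dict, List, Tuple

Subtask = Tuple[str, str]          # (modality, instruction)
Agent = Callable[[str], str]       # a modality-specific model, stubbed here


def f_decompose(request: Dict[str, str]) -> List[Subtask]:
    """Toy rule: emit one subtask per modality present in the request U."""
    question = request["question"]
    modalities = ("text", "image", "audio", "video")
    return [(m, f"{question} [using {m}]") for m in modalities if m in request]


def g_assign(subtask: Subtask, pool: Dict[str, Agent]) -> Agent:
    """Pick the agent M_j(i) responsible for subtask t_i."""
    return pool[subtask[0]]


def f_integrate(outputs: List[str]) -> str:
    """Combine per-agent outputs o_i into the final textual response O."""
    return " | ".join(outputs)


def master_agent(request: Dict[str, str], pool: Dict[str, Agent]) -> str:
    subtasks = f_decompose(request)
    outputs = [g_assign(t, pool)(t[1]) for t in subtasks]  # o_i = M_j(i)(t_i)
    return f_integrate(outputs)


# Hypothetical agent pool: each stub just reports which agent ran.
pool: Dict[str, Agent] = {
    m: (lambda instruction, m=m: f"{m}-agent answered")
    for m in ("text", "image", "audio", "video")
}
request = {"question": "What is happening?", "image": "img.png", "audio": "clip.wav"}
answer = master_agent(request, pool)
print(answer)  # image-agent answered | audio-agent answered
```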

3. Perception as planning, tool use, and active inquiry

A major theme in Agent-Omni research is the rejection of purely passive “watch-it-all” or “ingest-it-all” processing. In the test-time coordination framework, the master agent interprets user intent, emits subtasks, calls off-the-shelf text, image, audio, or video agents, and may iterate for multiple rounds until a stopping condition is reached (Lin et al., 4 Nov 2025). This is a planning layer over fixed experts rather than joint omni-scale fine-tuning.

In long audio-video reasoning, the same logic becomes more explicit. OmniRAG-Agent defines a tool set $U = \{\mathrm{RETRIEVEIMG}(\mathrm{query}), \mathrm{RETRIEVEAUD}(\mathrm{query})\}$ over separate frame and ASR-derived audio banks, then runs a multi-turn loop in which the OmniLLM emits a short plan, a retrieval query, and a stop/continue flag (Zhu et al., 3 Feb 2026). The policy is optimized with group relative policy optimization, so retrieval decisions, stop control, and final answer quality are tied together by a trajectory reward rather than trained as isolated modules (Zhu et al., 3 Feb 2026).
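The multi-turn retrieval loop might look roughly like the sketch below. The word-overlap retriever, the scripted policy, and the choice to stop before retrieving are all assumptions; in the paper the policy is a trained OmniLLM, and the banks hold frames and ASR output rather than short captions.

```python
from typing import Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    plan: str    # short plan emitted by the policy
    tool: str    # "RETRIEVEIMG" or "RETRIEVEAUD"
    query: str   # retrieval query
    stop: bool   # stop/continue flag


def retrieve(bank: List[str], query: str, k: int = 2) -> List[str]:
    """Toy retriever: rank entries by shared lowercase words, drop non-matches."""
    q = set(query.lower().split())
    ranked = sorted(bank, key=lambda e: len(q & set(e.lower().split())), reverse=True)
    return [e for e in ranked[:k] if q & set(e.lower().split())]


def run_agent(policy: Callable[[int, List[str]], Step],
              tools: Dict[str, Callable[[str], List[str]]],
              max_turns: int = 4) -> List[str]:
    """Call the policy each turn; retrieve with the chosen tool until it stops."""
    evidence: List[str] = []
    for turn in range(max_turns):
        step = policy(turn, evidence)
        if step.stop:
            break
        evidence.extend(tools[step.tool](step.query))
    return evidence


# Hypothetical banks: frame captions and ASR transcript snippets.
frame_bank = ["a dog runs across the park", "a man opens a red door", "a cat sleeps on the sofa"]
audio_bank = ["speaker says the red door is locked", "background music plays", "speaker laughs"]
tools = {
    "RETRIEVEIMG": lambda q: retrieve(frame_bank, q),
    "RETRIEVEAUD": lambda q: retrieve(audio_bank, q),
}


def scripted_policy(turn: int, evidence: List[str]) -> Step:
    """Stand-in for the learned policy: audio first, then frames, then stop."""
    if turn == 0:
        return Step("find speech about the door", "RETRIEVEAUD", "red door", False)
    if turn == 1:
        return Step("check what is shown", "RETRIEVEIMG", "red door", False)
    return Step("enough evidence", "", "", True)


evidence = run_agent(scripted_policy, tools)
print(evidence)
# ['speaker says the red door is locked', 'a man opens a red door']
```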

OmniAgent for audio-guided active perception uses a “Think–Act–Observe–Reflect” loop and a toolset partitioned into Video Tools, Audio Tools, and Event Tools. Its coarse-to-fine strategy first uses audio to localize relevant intervals and then invokes higher-resolution visual inspection on candidate segments (Tao et al., 29 Dec 2025). AOP-Agent adopts a related but differently factorized loop: Planner, Reflector, and Reasoner collaborate through repeated observe–reflect–replan cycles over a hierarchical omni-modal memory (Xu et al., 27 May 2026). Native OmniAgent pushes this further by treating active perception itself as reasoning: the agent takes symbolic actions such as selecting frames, audio windows, or clips, distills the raw percept into compact textual memory, and discards the raw media so that the internal state grows with reasoning steps rather than video duration (Xing et al., 17 Jun 2026).
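The coarse-to-fine strategy can be sketched as two steps: localize candidate intervals from a cheap per-second audio relevance score, then sample frames densely only inside those intervals. The scores, the threshold, and the sampling rate below are hypothetical values chosen for illustration.

```python
from typing import List, Tuple


def localize_by_audio(scores: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return [start, end) intervals in seconds where audio relevance >= threshold."""
    intervals: List[Tuple[int, int]] = []
    start = None
    for t, score in enumerate(scores):
        if score >= threshold and start is None:
            start = t                      # interval opens
        elif score < threshold and start is not None:
            intervals.append((start, t))   # interval closes
            start = None
    if start is not None:
        intervals.append((start, len(scores)))
    return intervals


def frames_to_inspect(intervals: List[Tuple[int, int]], fps: int) -> List[float]:
    """Timestamps to sample at high rate, restricted to candidate intervals."""
    return [start + i / fps
            for start, end in intervals
            for i in range((end - start) * fps)]


audio_scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.7]  # one score per second of media
candidates = localize_by_audio(audio_scores, threshold=0.5)
timestamps = frames_to_inspect(candidates, fps=2)
print(candidates)  # [(2, 4), (5, 6)]
print(timestamps)  # [2.0, 2.5, 3.0, 3.5, 5.0, 5.5]
```

In this toy case the visual tool inspects 6 timestamps instead of the 12 that uniform sampling at the same rate would require over the full 6 seconds.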

This line of work directly challenges a common misconception that omni-modal competence requires a single end-to-end model that jointly ingests every modality at uniform granularity. Several papers instead argue that selective observation, tool calling, or external model coordination is necessary for fine-grained cross-modal grounding under realistic resource constraints (Tao et al., 29 Dec 2025, Zhu et al., 3 Feb 2026, Xing et al., 17 Jun 2026).

4. Memory, personalization, and persistent state

Another central dimension of Agent-Omni systems is memory. O-Mem is the clearest dedicated formulation. For each new interaction $I$ 3, an LLM extractor $I$ 4 produces a triple $I$ 5 consisting of topic, user attribute, and factual event. The memory store then updates Persona Memory, Working Memory, and Episodic Memory: $I$ 6 stores abstracted persona attributes, $I$ 7 stores factual events, $I$ 8 maps topic to interaction index, and $I$ 9 maps clue-word to interaction index. Retrieval runs in parallel across persona, working, and episodic tiers, and cosine similarity is used for ranking (Wang et al., 17 Nov 2025). O-Mem reports $T$ 0 average F1 on LoCoMo and $T$ 1 accuracy on PERSONAMEM, while also reducing token cost, latency, and peak GPU memory relative to prior memory frameworks (Wang et al., 17 Nov 2025).
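A minimal sketch of tiered retrieval with cosine ranking is given below. It assumes a bag-of-words embedding and a generic three-tier store; the class name, the example entries, and the sequential (rather than parallel) lookup are simplifications, not O-Mem's implementation.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TieredMemory:
    """Three independent stores, each queried with the same cosine ranking."""

    def __init__(self) -> None:
        self.tiers: Dict[str, List[str]] = {"persona": [], "working": [], "episodic": []}

    def add(self, tier: str, text: str) -> None:
        self.tiers[tier].append(text)

    def retrieve(self, query: str, k: int = 1) -> Dict[str, List[str]]:
        q = embed(query)
        return {
            tier: sorted(items, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]
            for tier, items in self.tiers.items()
        }


memory = TieredMemory()
memory.add("persona", "user lives in berlin")
memory.add("persona", "user is vegetarian")
memory.add("working", "currently planning a dinner party")
memory.add("episodic", "bought a bike in march")
memory.add("episodic", "booked a vegetarian restaurant last friday")

hits = memory.retrieve("vegetarian dinner ideas")
print(hits["persona"])   # ['user is vegetarian']
print(hits["episodic"])  # ['booked a vegetarian restaurant last friday']
```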

X-OmniClaw extends memory beyond dialogue personalization into mobile interaction. Its Omni Memory combines runtime Working Memory, which stores task state, recent screenshots, and semantic embeddings, with Long-Term Memory derived from distilled local data such as user profile and gallery summaries (Ren et al., 7 May 2026). The technical report introduces a joint objective balancing working-memory alignment, long-term memory updates, and memory distillation to mimic higher-capacity offline memory artifacts on-device (Ren et al., 7 May 2026).

Memory also appears in active-perception systems, but there it serves evidence accumulation rather than personalization. AOP-Agent defines a working memory for past planner decisions, future plans, and reflections, and an evidence memory for observed segment indices (Xu et al., 27 May 2026). Native OmniAgent defines its hidden state in terms of a persistent textual memory and the latest raw percept (Xing et al., 17 Jun 2026). OmniAtlas generates trajectories in which thoughts, tool calls, and tool outputs are interleaved, and active perception functions such as read_video(video_id,t_start,t_end) allow selective access to long media (Li et al., 26 Feb 2026).
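The evidential role of memory can be sketched as a state that keeps distilled text and only the most recent raw percept, with sensing charged against a budget. The cost model (one unit per second of media read), the stub reader, and the stub summarizer are assumptions of this sketch, not details taken from any of the cited systems.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class AgentState:
    memory: List[str] = field(default_factory=list)  # persistent textual notes
    percept: Optional[bytes] = None                   # latest raw media only


def sense_and_distill(state: AgentState,
                      read_clip: Callable[[int, int], bytes],
                      summarize: Callable[[bytes], str],
                      t_start: int, t_end: int) -> None:
    """Read a window, write a textual note, then discard the raw media."""
    state.percept = read_clip(t_start, t_end)
    state.memory.append(f"[{t_start}-{t_end}s] {summarize(state.percept)}")
    state.percept = None


def run(windows: List[Tuple[int, int]], budget: float,
        read_clip: Callable[[int, int], bytes],
        summarize: Callable[[bytes], str]) -> Tuple[AgentState, float]:
    """Sense windows in order until the next one would exceed the budget."""
    state, spent = AgentState(), 0.0
    for t_start, t_end in windows:
        cost = float(t_end - t_start)  # assumed cost: seconds of media read
        if spent + cost > budget:
            break
        sense_and_distill(state, read_clip, summarize, t_start, t_end)
        spent += cost
    return state, spent


# Hypothetical stubs: one byte per second of media, summary reports its size.
read_clip = lambda a, b: bytes(b - a)
summarize = lambda raw: f"{len(raw)} bytes observed"

state, spent = run([(0, 10), (120, 130), (3000, 3020)], budget=25.0,
                   read_clip=read_clip, summarize=summarize)
print(state.memory)  # ['[0-10s] 10 bytes observed', '[120-130s] 10 bytes observed']
print(spent)         # 20.0
```

The memory holds two short notes regardless of how long the underlying media is, which is the sense in which state grows with reasoning steps rather than with video duration.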

These memory designs indicate two distinct but related roles for persistent state. One is user modeling and personalization, where memory encodes long-term attributes, events, and topic structure. The other is evidential scaffolding, where memory stores observations and reasoning artifacts so that multi-step perception can accumulate toward an answer.

5. Action, grounding, and environment interaction

The action side of Agent-Omni research spans conversational response generation, UI control, robot control, and mobile skill execution. OmniActor provides the most explicit unified action formalism. The model takes either a 2D GUI screenshot or a 3D scene rendering and emits action tokens tagged as either GUI or robot, so the same policy can produce either GUI commands such as click or swipe with 2D coordinates, or discretized 6-DoF plus gripper commands for robots (Yang et al., 2 Sep 2025). The paper argues that naive pooling of GUI and embodied data causes catastrophic interference because deep parameters must decode different output distributions. Its solution is to share the shallow layers and separate the deep layers into GUI and robot experts, with two final linear heads and a shared vocabulary after embodied-action discretization (Yang et al., 2 Sep 2025).
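How continuous robot commands and GUI commands could end up in one token vocabulary is sketched below. The bin count of 256, the normalized range of [-1, 1], and every token name are hypothetical; the source states only that embodied actions are discretized and that each action is tagged by domain.

```python
from typing import List


def discretize(value: float, low: float, high: float, bins: int) -> int:
    """Map a continuous value in [low, high] to an integer bin in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    return min(int((clipped - low) / (high - low) * bins), bins - 1)


def robot_action_tokens(pose_delta: List[float], gripper_open: bool,
                        bins: int = 256) -> List[str]:
    """Tag plus six binned pose components plus one gripper token."""
    if len(pose_delta) != 6:
        raise ValueError("expected 6-DoF pose delta")
    tokens = ["<robot>"]
    tokens += [f"<bin_{discretize(v, -1.0, 1.0, bins)}>" for v in pose_delta]
    tokens.append("<grip_open>" if gripper_open else "<grip_close>")
    return tokens


def gui_action_tokens(kind: str, x: int, y: int) -> List[str]:
    """Tag plus command name plus 2D screen coordinates."""
    return ["<gui>", kind, str(x), str(y)]


robot = robot_action_tokens([0.0, 0.5, -0.5, 1.0, -1.0, 0.0], gripper_open=True)
gui = gui_action_tokens("click", 320, 48)
print(robot)
# ['<robot>', '<bin_128>', '<bin_192>', '<bin_64>', '<bin_255>', '<bin_0>', '<bin_128>', '<grip_open>']
print(gui)  # ['<gui>', 'click', '320', '48']
```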

The training setup is also explicitly unified. OmniActor uses approximately $H$ 1 million GUI grounding examples, approximately $H$ 2 million trajectory examples, a GUI:embodied ratio of approximately $H$ 3 after resampling, supervised fine-tuning with cross-entropy loss, AdamW with learning rate $H$ 4, cosine decay with $H$ 5 warmup, DeepSpeed ZERO3, and image resolution $H$ 6 (Yang et al., 2 Sep 2025). It reports $H$ 7 success on LIBERO-90 versus $H$ 8 for OmniActor-EA, and GUI scores of $H$ 9 versus $A$ 0 for OmniActor-GUI, with particularly strong gains on AndroidControl-High and GUI-Odyssey (Yang et al., 2 Sep 2025).

X-OmniClaw addresses mobile interaction through hybrid grounding. Omni Action fuses structural XML metadata and visual features from screenshot crops with a gating mechanism, then uses Behavior Cloning to extract skill cards from recorded trajectories and Trajectory Replay to restore activities via deeplinks, task-stack restoration, or progressively stripped extras (Ren et al., 7 May 2026). On $A$ 1 mobile tasks, it reports Success Rate $A$ 2, Average Steps per Task $A$ 3, Latency per Task $A$ 4 s, and Reliability Drop Rate $A$ 5, compared with lower Success Rate and higher latency or RDR for VisClick and OpenClaw (Ren et al., 7 May 2026).
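One common form of gated fusion is sketched below, assuming a single scalar gate computed from the concatenated features. The source says only that a gating mechanism is used, so the scalar gate, the sigmoid, and the convex combination here are assumptions rather than X-OmniClaw's documented design.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(structural: List[float], visual: List[float],
                 gate_weights: List[float], gate_bias: float = 0.0) -> Tuple[List[float], float]:
    """fused = g * structural + (1 - g) * visual, with g = sigmoid(w . [s; v] + b)."""
    if len(structural) != len(visual):
        raise ValueError("feature vectors must have equal length")
    concat = structural + visual
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [g * s + (1.0 - g) * v for s, v in zip(structural, visual)]
    return fused, g


xml_features = [1.0, 0.0]     # hypothetical features from XML metadata
crop_features = [0.0, 1.0]    # hypothetical features from a screenshot crop

# Zero weights give g = 0.5, i.e. a plain average of the two sources.
fused, g = gated_fusion(xml_features, crop_features, gate_weights=[0.0] * 4)
print(g, fused)  # 0.5 [0.5, 0.5]

# A large positive bias pushes the gate toward the structural features.
fused_xml, g_xml = gated_fusion(xml_features, crop_features, [0.0] * 4, gate_bias=10.0)
print(round(g_xml, 3))
```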

Even systems that are not embodied rely on action abstractions. OpenOmni’s action surface is pipeline execution from STT through TTS, with RESTful orchestration and optional RAG (Sun et al., 2024). Agent-Omni’s actions are model calls delegated by the master agent (Lin et al., 4 Nov 2025). OmniAtlas emits tool-call or final-answer actions inside a trajectory model (Li et al., 26 Feb 2026). Across these settings, “action” ranges from physical control to tool invocation, but in each case it is explicitly modeled rather than hidden inside a monolithic forward pass.

6. Evaluation, limitations, and open directions

Agent-Omni systems are evaluated on markedly different benchmarks, reflecting the breadth of the paradigm. OpenOmni measures latency, human-annotation accuracy on a $A$ 7– $A$ 8 scale, cost, and privacy, comparing configurations such as GPT4O_ETE, GPT35_ETE, HF_ETE, and QuantizLLM_ETE; for indoor assistance with visually impaired participants, GPT4O_ETE yields latency approximately $A$ 9 s per turn and accuracy approximately $t_i \in T = f_{\mathrm{decompose}}(U)$ 0 over six sample queries (Sun et al., 2024). The master-agent Agent-Omni reports category averages over text, image, video, audio, and omni benchmarks, with $t_i \in T = f_{\mathrm{decompose}}(U)$ 1, $t_i \in T = f_{\mathrm{decompose}}(U)$ 2, $t_i \in T = f_{\mathrm{decompose}}(U)$ 3, $t_i \in T = f_{\mathrm{decompose}}(U)$ 4, and $t_i \in T = f_{\mathrm{decompose}}(U)$ 5 respectively, alongside a latency trade-off of approximately $t_i \in T = f_{\mathrm{decompose}}(U)$ 6– $t_i \in T = f_{\mathrm{decompose}}(U)$ 7 s versus $t_i \in T = f_{\mathrm{decompose}}(U)$ 8– $t_i \in T = f_{\mathrm{decompose}}(U)$ 9 s for single models (Lin et al., 4 Nov 2025).

In low-resource long audio-video QA, OmniRAG-Agent shows incremental gains from Base to +RAG to +Agent to +RL on multiple backbones, including $M_{j(i)} = g_{\mathrm{assign}}(t_i)$ 0 for Qwen2.5-Omni-7B and $M_{j(i)} = g_{\mathrm{assign}}(t_i)$ 1 for Gemini 2.5-Flash (Zhu et al., 3 Feb 2026). OmniAgent for audio-guided active perception reports $M_{j(i)} = g_{\mathrm{assign}}(t_i)$ 2 on Daily-Omni, $M_{j(i)} = g_{\mathrm{assign}}(t_i)$ 3 on OmniVideoBench, and $M_{j(i)} = g_{\mathrm{assign}}(t_i)$ 4 on WorldSense, with absolute gains of $M_{j(i)} = g_{\mathrm{assign}}(t_i)$ 5– $M_{j(i)} = g_{\mathrm{assign}}(t_i)$ 6 over leading baselines on those suites (Tao et al., 29 Dec 2025). AOP-Agent reports $M_{j(i)} = g_{\mathrm{assign}}(t_i)$ 7 overall on MOV-Bench and $M_{j(i)} = g_{\mathrm{assign}}(t_i)$ 8 on OmniVideoBench, improving over direct inference particularly on long videos and reasoning-intensive subsets (Xu et al., 27 May 2026). OmniAtlas improves Pass@1 on OmniGAIA from $M_{j(i)} = g_{\mathrm{assign}}(t_i)$ 9 for Qwen3-Omni (30B) to $o_i = M_{j(i)}(t_i)$ 0 after SFT + OmniDPO, while the benchmark itself exposes a large open-versus-proprietary gap, with Gemini-3-Pro at $o_i = M_{j(i)}(t_i)$ 1 (Li et al., 26 Feb 2026). Native OmniAgent 7B reports $o_i = M_{j(i)}(t_i)$ 2 on LVBench, exceeding Qwen2.5-VL-72B’s $o_i = M_{j(i)}(t_i)$ 3, and shows positive test-time scaling on VideoMME-Long from $o_i = M_{j(i)}(t_i)$ 4 to $o_i = M_{j(i)}(t_i)$ 5 as the maximum turn limit increases from $o_i = M_{j(i)}(t_i)$ 6 to $o_i = M_{j(i)}(t_i)$ 7 (Xing et al., 17 Jun 2026).

The limitations are equally consistent. OpenOmni notes end-to-end latency in the tens of seconds, lack of streaming support, dependence on cloud proprietary models for response quality, and coarse subjective quality metrics (Sun et al., 2024). O-Mem identifies no explicit time-based forgetting mechanism for persona events, reliance on the LLM extractor, and unbounded growth for extremely chatty users (Wang et al., 17 Nov 2025). OmniActor states that current embodied data is dominated by LIBERO and that online reinforcement learning on the unified agent remains unexplored (Yang et al., 2 Sep 2025). Agent-Omni and OmniAgent variants repeatedly report increased latency due to sequential or multi-step model calls, reliance on external APIs, prompt-history or memory bottlenecks, and the need for stronger retrieval, verification, or non-textual output capabilities (Lin et al., 4 Nov 2025, Tao et al., 29 Dec 2025, Xu et al., 27 May 2026, Xing et al., 17 Jun 2026).

Taken together, these works define Agent-Omni as a research program centered on omni-modal agency rather than simple multimodal ingestion. Its characteristic commitments are selective perception, structured memory, explicit tool or action spaces, and architectures that preserve specialization while attempting cross-modal integration. Whether implemented as a benchmarkable pipeline, a master-agent coordinator, a generalist control policy, a personalized memory system, or a native active-perception loop, the unifying objective is the same: to move from multimodal models that merely process inputs toward agents that can decide what to observe, what to remember, what to call, and how to act across heterogeneous worlds (Sun et al., 2024, Lin et al., 4 Nov 2025, Yang et al., 2 Sep 2025, Wang et al., 17 Nov 2025, Ren et al., 7 May 2026, Zhu et al., 3 Feb 2026, Tao et al., 29 Dec 2025, Xu et al., 27 May 2026, Li et al., 26 Feb 2026, Xing et al., 17 Jun 2026).