
Agent-Omni: Unified Multimodal Agency Research

Updated 3 July 2026
  • Agent-Omni is a research paradigm for unified multimodal agency characterized by selective perception, structured memory, and explicit action coordination.
  • Key architectural patterns include modular decomposition, explicit state structuring, and asymmetric parameter sharing to manage diverse modalities.
  • Empirical evaluations highlight trade-offs in latency, accuracy, and resource efficiency, underscoring challenges in real-world multimodal integration.

Agent-Omni is a research label used in recent arXiv literature for systems that seek unified multimodal agency across perception, reasoning, memory, retrieval, and action. In the available corpus, the term does not denote a single canonical architecture. Instead, it appears across several related but distinct formulations: an end-to-end multimodal conversational pipeline in OpenOmni (Sun et al., 2024), a master-agent framework for test-time coordination of specialized foundation models (Lin et al., 4 Nov 2025), a generalist policy spanning GUI and embodied control in OmniActor (Yang et al., 2 Sep 2025), a hierarchical memory system for personalized long-horizon interaction in O-Mem (Wang et al., 17 Nov 2025), a mobile perception–memory–action loop in X-OmniClaw (Ren et al., 7 May 2026), and multiple active-perception frameworks for audio-video reasoning and tool use (Zhu et al., 3 Feb 2026, Tao et al., 29 Dec 2025, Xu et al., 27 May 2026, Li et al., 26 Feb 2026, Xing et al., 17 Jun 2026). This suggests that “Agent-Omni” functions less as a standardized product name than as a design paradigm for agents intended to operate across heterogeneous modalities, environments, and temporal horizons.

1. Terminological scope and conceptual identity

The literature uses the Agent-Omni label for systems that differ in substrate and objective. Some are primarily conversational and benchmark-oriented, some are test-time orchestration frameworks, some are memory systems, and some are action agents operating in mobile, GUI, robotic, or long-video settings.

| System | Primary focus | Distinctive mechanism |
| --- | --- | --- |
| OpenOmni | Multimodal conversational pipeline | Five-module end-to-end framework |
| OmniActor | GUI and embodied control | Layer-heterogeneity MoE |
| O-Mem | Personalized long-horizon memory | Active user profiling + hierarchical retrieval |
| X-OmniClaw | Mobile-native interaction | Perception–Memory–Action on-device loop |
| Agent-Omni | Test-time multimodal reasoning | Master-agent coordination without retraining |
| OmniRAG-Agent / OmniAgent / AOP-Agent / OmniAtlas | Audio-video active perception and tool use | Planning loops, retrieval, reflection, or active sensing |

OpenOmni explicitly presents “Agent-Omni” as the OpenOmni framework for building and benchmarking multimodal conversational agents, organized into Client, API, Storage, Agent, and User Interface modules (Sun et al., 2024). By contrast, the paper titled “Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything” defines Agent-Omni as a master-agent system that decomposes a multimodal request into subtasks, assigns them to modality-specific agents, and integrates their outputs into a final textual response (Lin et al., 4 Nov 2025). Other works use the label more broadly to describe a unified agentic paradigm spanning perception, memory, and action on smartphones, or active omni-modal perception over long audio-video inputs (Ren et al., 7 May 2026, Xu et al., 27 May 2026).

A plausible implication is that the term has become a shorthand for a family of omni-modal agent designs rather than a single lineage. The common denominator is not a fixed implementation, but an attempt to combine modality integration with explicit agency: planning, tool invocation, memory updates, or environment interaction.

2. Recurrent architectural patterns

Despite the heterogeneity of implementations, several architectural motifs recur. One is modular decomposition. OpenOmni exposes Speech-to-Text, Emotion Detection, Retrieval-Augmented Generation, LLM integration, and Text-to-Speech as interchangeable components behind standard interfaces, with local and cloud deployment options and benchmarking over latency, accuracy, cost, and privacy (Sun et al., 2024). Agent-Omni similarly decomposes the system into a User Intent Interpreter, Subtask Scheduler, Modality Adapters, Agent Pool, and Response Assembler, with the master agent coordinating specialized models rather than retraining a monolithic omni-model (Lin et al., 4 Nov 2025).

A second motif is explicit state structuring. X-OmniClaw is organized around Omni Perception, Omni Memory, and Omni Action, with Working Memory for task continuity and Long-Term Memory distilled from local data (Ren et al., 7 May 2026). O-Mem separates Persona Memory, Working Memory, and Episodic Memory, each with different update and retrieval rules (Wang et al., 17 Nov 2025). AOP-Agent distinguishes a global summary, mid-level segments, fine-grained clips, working memory, and evidence memory in a hierarchical omni-modal memory stack (Xu et al., 27 May 2026).

A third motif is asymmetric parameter sharing. OmniActor does not treat all modalities as uniformly compatible. It reports that GUI and embodied data show synergy in shallow layers and conflict in deep layers, motivating a Layer-heterogeneity MoE in which shallow layers are shared and deep layers are separated into GUI and robot experts (Yang et al., 2 Sep 2025). This differs from master-agent systems that keep models separate and coordinate them externally, but both approaches attempt to preserve modality-specific competence while exploiting shared structure.

Formally, these systems often introduce an explicit decomposition between understanding and execution. OmniActor seeks a single policy $\pi$ that maps an image $I$, task instruction $T$, and optional textual history $H$ to action tokens $A$, with each action token tagged as either GUI or robot (Yang et al., 2 Sep 2025). Agent-Omni defines a decomposition $t_i \in T = f_{\mathrm{decompose}}(U)$, assignment $M_{j(i)} = g_{\mathrm{assign}}(t_i)$, execution $o_i = M_{j(i)}(t_i)$, and final integration $O = f_{\mathrm{integrate}}(\{o_i\})$ (Lin et al., 4 Nov 2025). Native OmniAgent further casts long-form omni-modal understanding as a POMDP with sensing and answer actions under a cost budget (Xing et al., 17 Jun 2026).
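The Agent-Omni decomposition above can be illustrated with a minimal Python sketch. The function names mirror the symbols in the text, but their bodies, the `pool` dictionary, and the lambda "agents" are hypothetical stand-ins rather than the paper's implementation; a real system would call foundation models and could iterate for several rounds.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a modality-specific agent M_j (a real one wraps a model).
Agent = Callable[[str], str]
Subtask = Tuple[str, str]  # (modality, prompt)

def decompose(request: Dict[str, str]) -> List[Subtask]:
    """f_decompose: turn a request U into subtasks t_i, one per modality present."""
    question = request["question"]
    return [(modality, f"{question} [using {modality}: {payload}]")
            for modality, payload in request.items() if modality != "question"]

def assign(subtask: Subtask, pool: Dict[str, Agent]) -> Agent:
    """g_assign: pick the agent M_{j(i)} whose modality matches the subtask."""
    modality, _ = subtask
    return pool[modality]

def integrate(outputs: List[str]) -> str:
    """f_integrate: merge the per-subtask outputs o_i into one textual response O."""
    return " | ".join(outputs)

def run_master_agent(request: Dict[str, str], pool: Dict[str, Agent]) -> str:
    subtasks = decompose(request)
    outputs = [assign(t, pool)(t[1]) for t in subtasks]  # o_i = M_{j(i)}(t_i)
    return integrate(outputs)

pool = {
    "image": lambda prompt: "image-agent: " + prompt,
    "audio": lambda prompt: "audio-agent: " + prompt,
}
request = {"question": "What is happening?", "image": "frame.png", "audio": "clip.wav"}
answer = run_master_agent(request, pool)
```

The sketch keeps the experts fixed and places all coordination logic in the master loop, which is the property the text attributes to test-time coordination.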

3. Perception as planning, tool use, and active inquiry

A major theme in Agent-Omni research is the rejection of purely passive “watch-it-all” or “ingest-it-all” processing. In the test-time coordination framework, the master agent interprets user intent, emits subtasks, calls off-the-shelf text, image, audio, or video agents, and may iterate for multiple rounds until a stopping condition is reached (Lin et al., 4 Nov 2025). This is a planning layer over fixed experts rather than joint omni-scale fine-tuning.

In long audio-video reasoning, the same logic becomes more explicit. OmniRAG-Agent defines a tool set $U = \{\mathrm{RETRIEVEIMG}(\mathrm{query}), \mathrm{RETRIEVEAUD}(\mathrm{query})\}$ over separate frame and ASR-derived audio banks, then runs a multi-turn loop in which the OmniLLM emits, at each turn, a short plan, a retrieval query, and a stop/continue flag (Zhu et al., 3 Feb 2026). The policy is optimized with group relative policy optimization, so retrieval decisions, stop control, and final answer quality are tied together by a trajectory reward rather than trained as isolated modules (Zhu et al., 3 Feb 2026).
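A minimal sketch of such a multi-turn retrieval loop follows. Only the two tool names come from the text; the keyword-overlap retriever, the scripted policy, and the toy banks are assumptions for illustration, and the reinforcement-learning stage is not modeled.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str, bool]  # (plan, tool name, query, stop flag)

def retrieve(bank: Dict[str, str], query: str, k: int = 1) -> List[str]:
    """Toy retrieval: rank bank entries by the number of words shared with the query."""
    words = set(query.lower().split())
    ranked = sorted(bank.values(),
                    key=lambda text: len(words & set(text.lower().split())),
                    reverse=True)
    return ranked[:k]

def scripted_policy(question: str, evidence: List[str]) -> Step:
    """Hypothetical policy: query audio first, then frames, then stop."""
    if not evidence:
        return ("listen first", "RETRIEVEAUD", question, False)
    if len(evidence) == 1:
        return ("check the frames", "RETRIEVEIMG", question, False)
    return ("enough evidence", "", "", True)

def run_loop(question: str,
             policy: Callable[[str, List[str]], Step],
             tools: Dict[str, Callable[[str], List[str]]],
             max_turns: int = 4) -> List[str]:
    evidence: List[str] = []
    for _ in range(max_turns):
        plan, tool, query, stop = policy(question, evidence)
        if stop:  # the stop/continue flag ends the loop early
            break
        evidence.extend(tools[tool](query))
    return evidence

frame_bank = {"f1": "a dog runs on grass", "f2": "a red car parked"}
audio_bank = {"a1": "someone says the dog is barking", "a2": "engine noise"}
tools = {
    "RETRIEVEIMG": lambda q: retrieve(frame_bank, q),
    "RETRIEVEAUD": lambda q: retrieve(audio_bank, q),
}
evidence = run_loop("what is the dog doing", scripted_policy, tools)
```

In a trained system the scripted policy would be replaced by the OmniLLM itself, so that the plan, the query, and the stop decision are all produced by one model and rewarded jointly.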

OmniAgent for audio-guided active perception uses a “Think–Act–Observe–Reflect” loop and a toolset partitioned into Video Tools, Audio Tools, and Event Tools. Its coarse-to-fine strategy first uses audio to localize relevant intervals and then invokes higher-resolution visual inspection on candidate segments (Tao et al., 29 Dec 2025). AOP-Agent adopts a related but differently factorized loop: Planner, Reflector, and Reasoner collaborate through repeated observe–reflect–replan cycles over a hierarchical omni-modal memory (Xu et al., 27 May 2026). Native OmniAgent pushes this further by treating active perception itself as reasoning: the agent takes symbolic actions such as selecting frames, audio windows, or clips, distills the raw percept into compact textual memory, and discards the raw media so that the internal state grows with reasoning steps rather than video duration (Xing et al., 17 Jun 2026).
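The coarse-to-fine idea, in which audio localizes candidate intervals before visual inspection, can be sketched as follows. The transcript format, the padding parameter, and both helper functions are hypothetical; the sketch only shows how a cheap audio pass can shrink the set of frames that need expensive inspection.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, transcript text)
Frame = Tuple[float, str]           # (timestamp seconds, frame description)
Window = Tuple[float, float]

def localize_with_audio(transcript: List[Segment], keyword: str, pad: float = 1.0) -> List[Window]:
    """Coarse pass: padded time windows around segments that mention the keyword."""
    return [(max(0.0, start - pad), end + pad)
            for start, end, text in transcript
            if keyword in text.lower()]

def inspect_frames(frames: List[Frame], windows: List[Window]) -> List[Frame]:
    """Fine pass: keep only frames whose timestamp falls inside a candidate window."""
    return [(t, desc) for t, desc in frames
            if any(start <= t <= end for start, end in windows)]

transcript = [
    (0.0, 4.0, "Welcome to the show"),
    (30.0, 33.0, "Listen to that alarm"),
    (60.0, 62.0, "Goodbye"),
]
frames = [(float(t), f"frame@{t}s") for t in range(0, 70, 5)]  # one frame every 5 s

windows = localize_with_audio(transcript, "alarm")
selected = inspect_frames(frames, windows)
```

Here 14 sampled frames reduce to the single frame inside the audio-derived window, which is the kind of saving that motivates selective observation over uniform ingestion.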

This line of work directly challenges a common misconception that omni-modal competence requires a single end-to-end model that jointly ingests every modality at uniform granularity. Several papers instead argue that selective observation, tool calling, or external model coordination is necessary for fine-grained cross-modal grounding under realistic resource constraints (Tao et al., 29 Dec 2025, Zhu et al., 3 Feb 2026, Xing et al., 17 Jun 2026).

4. Memory, personalization, and persistent state

Another central dimension of Agent-Omni systems is memory. O-Mem is the clearest dedicated formulation. For each new interaction, an LLM extractor produces a triple consisting of topic, user attribute, and factual event. The memory store then updates Persona Memory, Working Memory, and Episodic Memory: one store holds abstracted persona attributes, a second holds factual events, a topic index maps each topic to interaction indices, and a clue-word index maps each clue word to interaction indices. Retrieval runs in parallel across persona, working, and episodic tiers, and cosine similarity is used for ranking (Wang et al., 17 Nov 2025). O-Mem is evaluated by average F1 on LoCoMo and accuracy on PERSONAMEM, and it reports reduced token cost, latency, and peak GPU memory relative to prior memory frameworks (Wang et al., 17 Nov 2025).
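A toy version of this tiered store is sketched below. The assignment of stores to tiers (attributes and events under persona, the topic index under working memory, the clue-word index under episodic memory), the bag-of-words embedding, and the rule that every word of an interaction counts as a clue word are all assumptions made for illustration; the tiers are also queried sequentially here rather than in parallel.

```python
import math
from collections import Counter
from typing import Dict, List

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class TieredMemory:
    def __init__(self) -> None:
        self.interactions: List[str] = []
        self.attributes: List[str] = []              # persona: abstracted attributes
        self.events: List[str] = []                  # persona: factual events
        self.topic_index: Dict[str, List[int]] = {}  # working: topic -> interaction indices
        self.clue_index: Dict[str, List[int]] = {}   # episodic: clue word -> interaction indices

    def update(self, text: str, topic: str, attribute: str, event: str) -> None:
        """Store one interaction together with its extracted (topic, attribute, event) triple."""
        idx = len(self.interactions)
        self.interactions.append(text)
        self.attributes.append(attribute)
        self.events.append(event)
        self.topic_index.setdefault(topic, []).append(idx)
        for word in set(text.lower().split()):
            self.clue_index.setdefault(word, []).append(idx)

    def retrieve(self, query: str, k: int = 1) -> Dict[str, List[str]]:
        q = embed(query)
        words = query.lower().split()

        def top(candidates: List[str]) -> List[str]:
            unique = list(dict.fromkeys(candidates))  # de-duplicate, keep order
            return sorted(unique, key=lambda s: cosine(q, embed(s)), reverse=True)[:k]

        by_topic = [self.interactions[i]
                    for topic, ids in self.topic_index.items() if topic in words for i in ids]
        by_clue = [self.interactions[i] for w in words for i in self.clue_index.get(w, [])]
        return {"persona": top(self.attributes + self.events),
                "working": top(by_topic),
                "episodic": top(by_clue)}

mem = TieredMemory()
mem.update("I adopted a puppy named Bo last week", "pets", "likes dogs", "adopted puppy Bo")
mem.update("My flight to Oslo got delayed", "travel", "travels often", "flight to Oslo delayed")
result = mem.retrieve("tell me about the puppy and pets")
```

Each tier returns its own candidates ranked by cosine similarity, so a downstream prompt builder can combine persona facts with the specific interactions that support them.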

X-OmniClaw extends memory beyond dialogue personalization into mobile interaction. Its Omni Memory combines runtime Working Memory, which stores task state, recent screenshots, and semantic embeddings, with Long-Term Memory derived from distilled local data such as user profile and gallery summaries (Ren et al., 7 May 2026). The technical report introduces a joint objective balancing working-memory alignment, long-term memory updates, and memory distillation to mimic higher-capacity offline memory artifacts on-device (Ren et al., 7 May 2026).

Memory also appears in active-perception systems, but there it serves evidence accumulation rather than personalization. AOP-Agent defines a working memory for past planner decisions, future plans, and reflections, and an evidence memory for observed segment indices (Xu et al., 27 May 2026). Native OmniAgent defines a hidden state composed of persistent textual memory and the latest raw percept (Xing et al., 17 Jun 2026). OmniAtlas generates trajectories in which thoughts, tool calls, and tool outputs are interleaved, and active perception functions such as read_video(video_id,t_start,t_end) allow selective access to long media (Li et al., 26 Feb 2026).

These memory designs indicate two distinct but related roles for persistent state. One is user modeling and personalization, where memory encodes long-term attributes, events, and topic structure. The other is evidential scaffolding, where memory stores observations and reasoning artifacts so that multi-step perception can accumulate toward an answer.

5. Action, grounding, and environment interaction

The action side of Agent-Omni research spans conversational response generation, UI control, robot control, and mobile skill execution. OmniActor provides the most explicit unified action formalism. The model takes either a 2D GUI screenshot or a 3D scene rendering and emits action tokens tagged as either GUI or robot, so the same policy can produce either GUI commands such as click or swipe with 2D coordinates, or discretized 6-DoF plus gripper commands for robots (Yang et al., 2 Sep 2025). The paper argues that naive pooling of GUI and embodied data causes catastrophic interference because deep parameters must decode different output distributions. Its solution is to share the shallow layers and separate the deep layers, with two final linear heads and a shared vocabulary after embodied-action discretization (Yang et al., 2 Sep 2025).
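The shared-shallow, separated-deep layout can be sketched in plain Python. Everything below is a toy stand-in: the `scale` layers replace real network layers, the domain tag is passed in explicitly (how routing is decided in OmniActor is not specified here), and the two heads simply format strings instead of decoding action tokens.

```python
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]
Head = Callable[[Vector], str]

def scale(factor: float) -> Layer:
    """Toy stand-in for a network layer: multiply every feature by a constant."""
    return lambda x: [factor * v for v in x]

class LayerHeterogeneousPolicy:
    """Shared shallow layers followed by domain-specific deep layers and output heads."""

    def __init__(self, shared: List[Layer],
                 experts: Dict[str, List[Layer]],
                 heads: Dict[str, Head]) -> None:
        self.shared = shared
        self.experts = experts
        self.heads = heads

    def forward(self, features: Vector, domain: str) -> str:
        hidden = features
        for layer in self.shared:           # parameters used by both domains
            hidden = layer(hidden)
        for layer in self.experts[domain]:  # parameters used by one domain only
            hidden = layer(hidden)
        return self.heads[domain](hidden)

policy = LayerHeterogeneousPolicy(
    shared=[scale(2.0)],
    experts={"gui": [scale(0.5)], "robot": [scale(10.0)]},
    heads={
        "gui": lambda h: f"click({h[0]:.1f}, {h[1]:.1f})",
        "robot": lambda h: "move(bins=" + str([int(v) for v in h]) + ")",
    },
)
gui_action = policy.forward([3.0, 4.0], "gui")
robot_action = policy.forward([3.0, 4.0], "robot")
```

Both calls pass through the same shallow layer, so updates from either domain would shape it, while the deep experts and heads never see the other domain's outputs.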

The training setup is also explicitly unified. OmniActor combines million-scale GUI grounding examples with million-scale trajectory examples, resamples them to a fixed GUI:embodied ratio, and trains with supervised fine-tuning under cross-entropy loss, using AdamW, cosine learning-rate decay with warmup, DeepSpeed ZeRO-3, and a fixed image resolution (Yang et al., 2 Sep 2025). It reports LIBERO-90 success against the OmniActor-EA variant and GUI scores against the OmniActor-GUI variant, with particularly strong gains on AndroidControl-High and GUI-Odyssey (Yang et al., 2 Sep 2025).

X-OmniClaw addresses mobile interaction through hybrid grounding. Omni Action fuses structural XML metadata and visual features from screenshot crops with a gating mechanism, then uses Behavior Cloning to extract skill cards from recorded trajectories and Trajectory Replay to restore activities via deeplinks, task-stack restoration, or progressively stripped extras (Ren et al., 7 May 2026). On a suite of mobile tasks, it reports Success Rate, Average Steps per Task, Latency per Task, and Reliability Drop Rate (RDR), with VisClick and OpenClaw showing lower Success Rate and higher latency or RDR (Ren et al., 7 May 2026).
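One common way to realize a gating mechanism of this kind is a learned scalar gate that blends two feature vectors. The sketch below assumes that form; the sigmoid gate, the weight vectors, and the convex combination are illustrative choices, not the formulation given in the X-OmniClaw report.

```python
import math
from typing import List, Tuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(xml_feat: List[float], vis_feat: List[float],
                 w_xml: List[float], w_vis: List[float],
                 bias: float = 0.0) -> Tuple[float, List[float]]:
    """Blend structural and visual features with a scalar gate in (0, 1).

    gate near 1 trusts the XML metadata; gate near 0 trusts the screenshot crop.
    """
    score = (sum(w * x for w, x in zip(w_xml, xml_feat))
             + sum(w * v for w, v in zip(w_vis, vis_feat))
             + bias)
    gate = sigmoid(score)
    fused = [gate * x + (1.0 - gate) * v for x, v in zip(xml_feat, vis_feat)]
    return gate, fused

xml_feat = [1.0, 0.0]
vis_feat = [0.0, 1.0]
zero = [0.0, 0.0]

# With zero weights and zero bias the gate is exactly 0.5: an even blend.
gate_even, fused_even = gated_fusion(xml_feat, vis_feat, zero, zero)

# A large positive bias pushes the gate toward 1: the XML features dominate.
gate_xml, fused_xml = gated_fusion(xml_feat, vis_feat, zero, zero, bias=50.0)
```

A gate of this shape lets the agent fall back on visual features when the XML hierarchy is missing or unreliable, and on structure when the screenshot is ambiguous.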

Even systems that are not embodied rely on action abstractions. OpenOmni’s action surface is pipeline execution from STT through TTS, with RESTful orchestration and optional RAG (Sun et al., 2024). Agent-Omni’s actions are model calls delegated by the master agent (Lin et al., 4 Nov 2025). OmniAtlas emits tool-call or final-answer actions inside a trajectory model (Li et al., 26 Feb 2026). Across these settings, “action” ranges from physical control to tool invocation, but in each case it is explicitly modeled rather than hidden inside a monolithic forward pass.

6. Evaluation, limitations, and open directions

Agent-Omni systems are evaluated on markedly different benchmarks, reflecting the breadth of the paradigm. OpenOmni measures latency, human-annotated accuracy on a numeric rating scale, cost, and privacy, comparing configurations such as GPT4O_ETE, GPT35_ETE, HF_ETE, and QuantizLLM_ETE; for indoor assistance with visually impaired participants, it reports per-turn latency and accuracy for GPT4O_ETE over six sample queries (Sun et al., 2024). The master-agent Agent-Omni reports category averages over text, image, video, audio, and omni benchmarks, alongside a latency trade-off in which coordinated multi-step inference takes longer per query than single models (Lin et al., 4 Nov 2025).

In low-resource long audio-video QA, OmniRAG-Agent shows incremental gains from Base to +RAG to +Agent to +RL on multiple backbones, including Qwen2.5-Omni-7B and Gemini 2.5-Flash (Zhu et al., 3 Feb 2026). OmniAgent for audio-guided active perception reports results on Daily-Omni, OmniVideoBench, and WorldSense, with absolute gains over leading baselines on those suites (Tao et al., 29 Dec 2025). AOP-Agent reports overall results on MOV-Bench and OmniVideoBench, improving over direct inference particularly on long videos and reasoning-intensive subsets (Xu et al., 27 May 2026). OmniAtlas improves Pass@1 on OmniGAIA for Qwen3-Omni (30B) after SFT + OmniDPO, while the benchmark itself exposes a large open-versus-proprietary gap, with Gemini-3-Pro as the proprietary reference point (Li et al., 26 Feb 2026). Native OmniAgent 7B reports an LVBench score exceeding that of Qwen2.5-VL-72B, and shows positive test-time scaling on VideoMME-Long as the maximum turn limit increases (Xing et al., 17 Jun 2026).

The limitations are equally consistent. OpenOmni notes end-to-end latency in the tens of seconds, lack of streaming support, dependence on cloud proprietary models for response quality, and coarse subjective quality metrics (Sun et al., 2024). O-Mem identifies the absence of an explicit time-based forgetting mechanism for persona events, reliance on the LLM extractor, and unbounded memory growth for extremely chatty users (Wang et al., 17 Nov 2025). OmniActor states that current embodied data is dominated by LIBERO and that online reinforcement learning on the unified agent remains unexplored (Yang et al., 2 Sep 2025). Agent-Omni and OmniAgent variants repeatedly report increased latency due to sequential or multi-step model calls, reliance on external APIs, prompt-history or memory bottlenecks, and the need for stronger retrieval, verification, or non-textual output capabilities (Lin et al., 4 Nov 2025, Tao et al., 29 Dec 2025, Xu et al., 27 May 2026, Xing et al., 17 Jun 2026).

Taken together, these works define Agent-Omni as a research program centered on omni-modal agency rather than simple multimodal ingestion. Its characteristic commitments are selective perception, structured memory, explicit tool or action spaces, and architectures that preserve specialization while attempting cross-modal integration. Whether implemented as a benchmarkable pipeline, a master-agent coordinator, a generalist control policy, a personalized memory system, or a native active-perception loop, the unifying objective is the same: to move from multimodal models that merely process inputs toward agents that can decide what to observe, what to remember, what to call, and how to act across heterogeneous worlds (Sun et al., 2024, Lin et al., 4 Nov 2025, Yang et al., 2 Sep 2025, Wang et al., 17 Nov 2025, Ren et al., 7 May 2026, Zhu et al., 3 Feb 2026, Tao et al., 29 Dec 2025, Xu et al., 27 May 2026, Li et al., 26 Feb 2026, Xing et al., 17 Jun 2026).
