OmniPerson: Unified Identity & Person AI

Updated 9 December 2025
  • OmniPerson is a unified framework integrating identity-preserving pedestrian generation, omni-domain re-identification, and self-evolving memory architectures for personalized agents.
  • It leverages techniques like Latent Diffusion Models and Multi-Refer Fuser modules to fuse multi-view identity, pose, text, and temporal cues for enhanced performance.
  • The system demonstrates state-of-the-art results across generation, detection, and re-identification tasks using diverse datasets and rigorous benchmark metrics.

OmniPerson is a unified designation used in the literature for both (a) next-generation, identity-preserving pedestrian generation systems and (b) omni-domain, omni-modal person-centric AI frameworks. It encompasses three primary axes: (1) identity-preserving generative modeling, (2) omni-domain and multi-modal person re-identification (ReID), and (3) universal, persistent, self-evolving memory architectures for personalized agents. This article presents core formulations, architectures, algorithms, datasets, performance metrics, and system-level extensions as defined and implemented in state-of-the-art studies.

1. Identity-Preserving Pedestrian Generation

OmniPerson refers to a unified pedestrian generation pipeline capable of high-fidelity, identity-consistent synthesis for visible and infrared image/video ReID tasks (Ma et al., 2 Dec 2025). The architecture is based on a Latent Diffusion Model (LDM) with a U-Net denoiser $\epsilon_\theta$ acting in latent space. The system is designed for fine-grained and holistic control of key pedestrian attributes, including appearance, pose, background, and modality, while supporting RGB/IR image or video generation, RGB-to-IR transfer, and super-resolution.

Multi-Refer Fuser

A core module, the Multi-Refer Fuser, distills a unified identity embedding from any number of reference images. Features from multiple encoding stages are fused using channel attention (for low-level features) and multi-head self-attention (for high-level features). The fused identity embeddings are injected into the U-Net denoiser at each Transformer block through augmented self-attention mechanisms.
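
To make the fusion concrete, the following is a minimal PyTorch sketch of a multi-reference fuser that combines low-level features with channel attention and high-level features with multi-head self-attention. The dimensions, layer choices, and single-stage fusion are illustrative assumptions, not the published module.

```python
import torch
import torch.nn as nn

class MultiReferFuser(nn.Module):
    """Sketch: fuse features from N reference images into one identity embedding.

    Low-level features are gated with channel attention; high-level features are
    mixed with multi-head self-attention across references.  Dimensions and layer
    choices are assumptions, not the published design.
    """

    def __init__(self, low_dim: int = 256, high_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Channel attention over low-level features (per-channel gating).
        self.channel_gate = nn.Sequential(
            nn.Linear(low_dim, low_dim // 4), nn.ReLU(),
            nn.Linear(low_dim // 4, low_dim), nn.Sigmoid(),
        )
        # Self-attention over the set of high-level reference tokens.
        self.self_attn = nn.MultiheadAttention(high_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(low_dim + high_dim, high_dim)

    def forward(self, low_feats: torch.Tensor, high_feats: torch.Tensor) -> torch.Tensor:
        # low_feats: (B, N_ref, low_dim), high_feats: (B, N_ref, high_dim)
        gate = self.channel_gate(low_feats.mean(dim=1))           # (B, low_dim)
        low_fused = (low_feats * gate.unsqueeze(1)).mean(dim=1)   # (B, low_dim)
        attn_out, _ = self.self_attn(high_feats, high_feats, high_feats)
        high_fused = attn_out.mean(dim=1)                         # (B, high_dim)
        return self.proj(torch.cat([low_fused, high_fused], dim=-1))  # unified c_id

# Example: fuse 4 reference images for a batch of 2 identities.
fuser = MultiReferFuser()
c_id = fuser(torch.randn(2, 4, 256), torch.randn(2, 4, 768))
print(c_id.shape)  # torch.Size([2, 768])
```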

Conditioning and Guidance

Control vectors/embeddings encode:

  • Multi-view identity ($c_{id}$, via the Multi-Refer Fuser)
  • Pose and background ($E_{spatial}$, fused outputs from dedicated CNN encoders)
  • Text and modality ($c_{text}$, from a frozen CLIP text encoder)
  • Temporal context for video ($c_{tem}$, via temporal-attention modules)

At each denoising timestep, the loss minimized is:

$$L = \mathbb{E}_{z_0, t, \epsilon}\left\Vert \epsilon - \epsilon_\theta(z_t, t, c) \right\Vert_2^2$$

where $c = \{c_{id}, E_{spatial}, c_{text}, c_{tem}\}$, and classifier-free guidance is applied by randomly dropping conditioning branches during training.
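
A minimal sketch of this training objective, assuming a generic noise scheduler exposing `num_timesteps`/`add_noise` and a per-branch dropout rate for classifier-free guidance (both interfaces and the dropout value are assumptions):

```python
import torch
import torch.nn.functional as F

def omniperson_denoising_loss(eps_theta, z0, conditions, scheduler, p_drop=0.1):
    """Sketch of the denoising objective with classifier-free guidance dropout.

    `eps_theta` is the conditional U-Net denoiser; `conditions` holds tensors such
    as {'c_id': ..., 'E_spatial': ..., 'c_text': ..., 'c_tem': ...}.  The scheduler
    interface and the dropout rate are assumptions for illustration.
    """
    b = z0.shape[0]
    t = torch.randint(0, scheduler.num_timesteps, (b,), device=z0.device)
    noise = torch.randn_like(z0)
    z_t = scheduler.add_noise(z0, noise, t)            # forward diffusion to step t

    # Classifier-free guidance: independently drop each conditioning branch.
    c = {}
    for name, value in conditions.items():
        keep = (torch.rand(b, device=z0.device) > p_drop).float()
        c[name] = value * keep.view(b, *([1] * (value.dim() - 1)))

    return F.mse_loss(eps_theta(z_t, t, c), noise)     # || eps - eps_theta(z_t, t, c) ||^2
```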

PersonSyn Dataset and Curation

PersonSyn is the first large-scale dataset for multi-reference controllable pedestrian generation, constructed by transforming ID-only ReID benchmarks with dense, multi-modal annotations. The pipeline extracts pose (2D keypoint, 3D SMPL-X mesh), view orientation, text attributes, and backgrounds, with stringent data cleaning and reference selection protocols.
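
The following hypothetical record illustrates the kind of per-sample annotation such a pipeline could produce. The field names and types are assumptions chosen to mirror the annotation categories described above, not the actual PersonSyn schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PersonSynSample:
    """Hypothetical per-sample annotation record for a PersonSyn-style dataset.

    Field names and types are illustrative assumptions matching the annotation
    categories described in the text (pose, view, text attributes, background).
    """
    identity_id: str
    image_path: str
    modality: str                      # e.g. "rgb" or "ir"
    keypoints_2d: List[List[float]]    # (num_joints, 3): x, y, confidence
    smplx_params: List[float]          # flattened 3D body-mesh parameters
    view_orientation: str              # e.g. "front", "back", "left", "right"
    text_attributes: List[str]         # e.g. ["long coat", "backpack"]
    background_path: str
    reference_ids: List[str] = field(default_factory=list)  # selected reference images
```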

2. Omni-Domain and Omni-Modal Person Re-Identification

OmniPerson systems extend beyond single-domain or single-modality scenarios, targeting robust generalization and retrieval under arbitrary deployment conditions. This includes:

Omni-Domain Generalized ReID

The Aligned Divergent Pathways (ADP) architecture (Ang et al., 11 Oct 2024) realizes omni-domain generalization (ODG-ReID) by duplicating the terminal layers of standard backbones (e.g., ViT-B/16) into multiple parallel branches, each with unique normalization (DyMAIN), branch-specific learning-rate schedules (PMoC), and feature realignment via a Dimensional Consistency Metric Loss (DCML). At inference, features from all branches are fused via mean aggregation for robust, domain-invariant representations.
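
A simplified sketch of this branching-and-fusion pattern is shown below; the alternating normalization layers stand in for DyMAIN, and PMoC/DCML are omitted, so this illustrates the structure rather than the authors' implementation.

```python
import copy
import torch
import torch.nn as nn

class AlignedDivergentPathways(nn.Module):
    """Sketch of ADP-style branching: the backbone's terminal block is duplicated
    into parallel branches with differing normalization, and branch features are
    mean-aggregated at inference.  A simplification, not the published components.
    """

    def __init__(self, backbone: nn.Module, terminal_block: nn.Module,
                 feat_dim: int = 768, num_branches: int = 3):
        super().__init__()
        self.backbone = backbone          # shared trunk, assumed to output (B, feat_dim)
        self.branches = nn.ModuleList()
        for i in range(num_branches):
            block = copy.deepcopy(terminal_block)
            # Branch-specific normalization (stand-in for DyMAIN's mixing).
            norm = nn.LayerNorm(feat_dim) if i % 2 == 0 else nn.BatchNorm1d(feat_dim)
            self.branches.append(nn.Sequential(block, norm))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shared = self.backbone(x)                              # (B, feat_dim)
        feats = [branch(shared) for branch in self.branches]   # divergent pathways
        return torch.stack(feats, dim=0).mean(dim=0)           # mean fusion at inference

# Example with placeholder modules:
model = AlignedDivergentPathways(nn.Identity(), nn.Linear(768, 768))
out = model(torch.randn(4, 768))   # (4, 768) fused representation
```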

A distinct approach, Diverse Deep Feature Ensemble Learning (D²FEL) (Ang et al., 11 Oct 2024), spawns multiple sub-heads from the backbone tail, each incorporating a distinct pattern of instance normalization within the final blocks. The concatenation of these ensemble heads is dimension-reduced (e.g., via PCA or random projection), yielding highly diverse yet compact encodings that drive strong generalization on domain-transfer and supervised benchmarks.
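
The aggregation step can be sketched as follows, using low-rank PCA as a stand-in for the dimension-reduction choices mentioned above; shapes and the reduction routine are assumptions.

```python
import torch

def d2fel_style_embedding(head_features, out_dim=256):
    """Sketch of D2FEL-style aggregation: concatenate features from sub-heads that
    differ in their instance-normalization pattern, then reduce the concatenation
    to a compact embedding (torch.pca_lowrank used as a stand-in for PCA/RP).

    `head_features` is a list of (B, D_i) tensors; `out_dim` must not exceed
    min(B, sum_i D_i) for the low-rank PCA used here.
    """
    concat = torch.cat(head_features, dim=-1)             # (B, sum_i D_i)
    centered = concat - concat.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(centered, q=out_dim)      # top principal directions
    return centered @ v                                    # (B, out_dim)

# Example: three sub-heads of 768-dim features for a batch of 512 images.
emb = d2fel_style_embedding([torch.randn(512, 768) for _ in range(3)], out_dim=256)
```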

Omni Multi-Modal ReID

ReID5o (Zuo et al., 11 Jun 2025) addresses omni multi-modal person ReID (OM-ReID) by introducing:

  • ORBench, a five-modal (RGB, infrared, color-pencil drawing, sketch, text) ReID dataset with 1,000 identities.
  • Multi-modal Tokenizing Assembler (MTA): independent modality-specific tokenizers.
  • Unified encoder (CLIP-B/16) extended with Multi-Expert Router (MER), which injects low-rank, modality-specific parameterizations into each layer conditioned by the active modalities.
  • Flexible Feature Mixture (FM) block, allowing arbitrary combinations of input modalities.
  • Learning objectives include distance-metric alignment to RGB and identity classification.

ReID5o achieves substantial improvements in mean average precision (mAP) as modalities are fused, from 58.09% (single-modal) to 86.35% (quad-modal) on ORBench.
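
A minimal sketch of the MER idea, modeled here as LoRA-style low-rank adapters selected per modality on top of a shared projection; the modality names, rank, and dimensions are illustrative assumptions rather than the published parameterization.

```python
import torch
import torch.nn as nn

class MultiExpertRouter(nn.Module):
    """Sketch of an MER-style layer: a shared linear projection plus low-rank,
    modality-specific adapters selected by the active modality.
    """

    MODALITIES = ["rgb", "infrared", "color_pencil", "sketch", "text"]

    def __init__(self, dim: int = 512, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        # One low-rank expert (up(down(x))) per modality, injected additively.
        self.down = nn.ModuleDict({m: nn.Linear(dim, rank, bias=False) for m in self.MODALITIES})
        self.up = nn.ModuleDict({m: nn.Linear(rank, dim, bias=False) for m in self.MODALITIES})

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # tokens: (B, L, dim); route through the expert for the active modality.
        return self.shared(tokens) + self.up[modality](self.down[modality](tokens))

# Example: encode infrared tokens with the infrared expert active.
layer = MultiExpertRouter()
out = layer(torch.randn(2, 16, 512), modality="infrared")
```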

3. Person Detection in Omnidirectional Imagery

OmniPerson also refers to the application of one-step CNN-based person detectors on top-view omnidirectional scenes (Yu et al., 2022). These detectors (omni-SSD variants with MobileNet/ResNet backbones) require no undistortion and operate directly on fisheye images, leveraging extensive data augmentation including random rotations and vertical flips. Training involves stepwise fine-tuning from COCO-pretrained models through perspective and omni-domain datasets, with peak mAPs of 86.3% on evaluation sets at real-time inference speeds.
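
The augmentation strategy can be sketched as below, assuming axis-aligned boxes that are re-fitted after rotation; this is a simplified illustration, not the paper's exact pipeline.

```python
import random
import numpy as np
import cv2

def augment_omni(image: np.ndarray, boxes: np.ndarray):
    """Sketch of augmentation for top-view fisheye detection: random rotation about
    the image centre plus random vertical flip, applied to image and boxes alike.
    Box handling (axis-aligned re-fitting) is an illustrative simplification.
    """
    h, w = image.shape[:2]
    angle = random.uniform(0.0, 360.0)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    image = cv2.warpAffine(image, m, (w, h))

    # Rotate box corners and re-fit axis-aligned boxes (x1, y1, x2, y2).
    new_boxes = []
    for x1, y1, x2, y2 in boxes:
        corners = np.array([[x1, y1, 1], [x2, y1, 1], [x2, y2, 1], [x1, y2, 1]], dtype=np.float32)
        rotated = corners @ m.T
        new_boxes.append([rotated[:, 0].min(), rotated[:, 1].min(),
                          rotated[:, 0].max(), rotated[:, 1].max()])
    boxes = np.array(new_boxes, dtype=np.float32)

    if random.random() < 0.5:                     # vertical flip
        image = image[::-1].copy()
        boxes[:, [1, 3]] = h - boxes[:, [3, 1]]
    return image, boxes
```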

4. Universal, Personalized Long-Horizon Agents

A distinct axis is rooted in the O-Mem architecture, which supports persistent, adaptive personalization for AI assistants (Wang et al., 17 Nov 2025). The framework maintains three stores (Persona, Working, and Episodic Memory), continuously updated through an active profiling pipeline. User turns are parsed to extract topic, attribute, and factual event tuples, which are triaged into dedicated memory structures by semantic and graph-based clustering. At inference, retrieval combines hierarchical top-$k$ similarity over semantic, topical, and episodic indices, with the unioned results supplied to the LLM for generation. Benchmarks indicate substantial gains in F1, accuracy, alignment, and efficiency compared to prior memory-centric systems.
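
A minimal sketch of the retrieval step, assuming cosine similarity over NumPy embedding matrices per index and a simple union of per-index top-k hits; the data layout and scoring are assumptions for illustration.

```python
import numpy as np

def retrieve_memories(query_vec, indices, k=5):
    """Sketch of O-Mem-style retrieval: take top-k cosine hits from each memory
    index (e.g. semantic, topical, episodic) and union the results for the LLM.

    `indices` maps index name -> (embedding matrix, list of memory texts).
    """
    selected = {}
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    for name, (embeddings, texts) in indices.items():
        e = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
        scores = e @ q
        for i in np.argsort(-scores)[:k]:
            selected[texts[i]] = max(selected.get(texts[i], -1.0), float(scores[i]))
    # Union of per-index hits, ordered by best score, to be placed in the prompt.
    return [t for t, _ in sorted(selected.items(), key=lambda kv: -kv[1])]
```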

An OmniPerson assistant can extend O-Mem to support multi-modal attributes (e.g., adding vision/audio), temporal decay (Ebbinghaus-like forgetting curves), dialogue state tracking, high-speed vectorized episodic search, privacy filtering, and meta-controllers that mediate memory-component token allocation as a function of query complexity.
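
As one example of the temporal-decay extension, an Ebbinghaus-like weighting of retrieval scores could look like the following sketch; the exponential form and the half-life value are assumptions.

```python
import math

def decayed_relevance(similarity: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Sketch of an Ebbinghaus-like forgetting weight applied to a retrieval score:
    older, rarely reinforced memories contribute less to retrieval ranking.
    """
    retention = math.exp(-math.log(2.0) * age_days / half_life_days)
    return similarity * retention
```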

5. Empirical Results and Benchmarks

Key quantitative outcomes across the OmniPerson landscape include:

| System / Task | Key Metric(s) | Experimental Setting |
|---|---|---|
| OmniPerson (generation) | LPIPS = 0.202, SSIM = 0.467, PSNR = 17.16 | PersonSyn, ablations (Ma et al., 2 Dec 2025) |
| ADP (ODG-ReID) | mAP = 62.4% (DG), 62.0% (supervised) | Market-1501, MSMT17 (Ang et al., 11 Oct 2024) |
| D²FEL (ODG-ReID) | mAP = 45.2% (DG), 87.6% (supervised) | Multiple datasets (Ang et al., 11 Oct 2024) |
| ReID5o (OM-ReID) | mAP = 58.09% (MM-1), 86.35% (MM-4) | ORBench, all modalities (Zuo et al., 11 Jun 2025) |
| OmniPD (omnidirectional detection) | AP = 86.3%, 38 ms/image | moSSD/resSSD (Yu et al., 2022) |
| O-Mem (agent memory) | F1 = 51.67%, accuracy = 62.99%, 2.4 s latency | LoCoMo, PERSONAMEM (Wang et al., 17 Nov 2025) |

These results establish state-of-the-art performance in identity preservation, domain/multi-modal generalization, and efficient long-horizon memory handling.

6. Extensions, Limitations, and Future Directions

  • Generative: Extension of identity loss (contrastive/triplet) further strengthens ID-consistency; Multi-Refer Fuser is agnostic to the vision backbone; spatial/temporal conditioning mechanisms are easily extensible.
  • ReID: Modality expansion (e.g., to thermal, radar, depth), robustness to real-world data imperfections, and optimization for low-latency and privacy remain open.
  • Agent/memory: Incorporation of privacy-compliant controls (GDPR/PIPL), continual self-supervision, and dynamic memory allocation via meta-controllers are actionable enhancements.
  • Detection: Adoption of synthetic fisheye data, advanced geometric augmentation, and rotation/spherical equivariance may further increase boundary robustness.

Across the OmniPerson theme, a unifying property is system flexibility—conditional control, fusion architectures, and memory schemas that support robust and extensible deployment in diverse, open-world scenarios. Each core component is rigorously benchmarked, modular, and open-source, providing a comprehensive foundation for further research and applications (Ma et al., 2 Dec 2025, Ang et al., 11 Oct 2024, Ang et al., 11 Oct 2024, Zuo et al., 11 Jun 2025, Yu et al., 2022, Wang et al., 17 Nov 2025).
