Compute-Optimal Demo Selection in Diffusion Models
- Compute-Optimal Demonstration Selection is a strategy that uses mixed memory and autoregressive diffusion to select key motion segments for coherent human interactions.
- It integrates parallel transformer streams and specialized prediction heads to ensure synchronized body, hand, and trajectory dynamics.
- The approach yields smoother transitions and lower memory overhead, and improves FID, R-Precision, and temporal-smoothness metrics over prior work.
Interact2Ar is a text-conditioned autoregressive diffusion model for generating full-body human-human interactions, introduced as the first end-to-end model of this kind with detailed hand kinematics (Ruiz-Ponce et al., 22 Dec 2025). It addresses dyadic motion generation in settings where realism depends not only on plausible individual body motion but also on coherent coordination between interactants over time, including contacts, proximity, and fine-grained hand articulation. The method combines cooperative denoisers, body-part-specialized prediction heads, and an autoregressive pipeline with a Mixed Memory mechanism so that generation can remain reactive and adaptive instead of denoising an entire motion sequence in a single pass (Ruiz-Ponce et al., 22 Dec 2025).
1. Problem setting and scope
Interact2Ar is formulated around the generation of realistic, text-conditioned, full-body human-human interaction motion. In the paper’s formulation, an interaction is a dyad comprising two people whose motions must be individually plausible and jointly coherent over time. The motivation for modeling hands explicitly is central: hands are described as critical to non-verbal communication and physical interaction, while omitting them removes a large channel of expressivity and undermines realism (Ruiz-Ponce et al., 22 Dec 2025).
The method is positioned against three limitations in prior work. First, many existing diffusion- and transformer-based human motion generators either ignore hands entirely or model them separately with insufficient body context. Second, most diffusion pipelines denoise an entire motion at once, which weakens the ability to react to a partner’s evolving motion and can produce repetitive artifacts on long horizons. Third, reactive and adaptive interaction generation is intrinsically difficult because each agent changes in response to subtle cues from the other person (Ruiz-Ponce et al., 22 Dec 2025).
A common misunderstanding is to read the suffix “Ar” as a reference to augmented reality. In this paper, however, “AR” denotes the autoregressive variant reported in the experiments, and the contribution is a motion-generation model rather than an augmented-reality interaction system. This distinction matters because the central technical problem is temporal interaction synthesis from text, not scene registration, tracking, or in-situ visualization.
2. Motion representation and cooperative denoiser architecture
Interact2Ar uses a per-person SMPL-X representation that is explicitly non-redundant. For each individual, a frame is represented by the root translation, the root rotation, the body joint rotations, and the hand joint rotations. A dyadic sequence then collects these per-frame parameters for both people over all frames. The paper states that body shapes are normalized to a neutral template because of limited shape diversity in the Inter-X dataset (Ruiz-Ponce et al., 22 Dec 2025).
The model architecture is built around cooperative denoisers with specialized heads. Two parallel transformer-encoder streams, one for each person, share weights. Cross-attention between these streams allows each person’s prediction to be conditioned on the other’s motion context, which the paper presents as a way to preserve interpersonal dependencies while reducing parameter count. A shared motion encoder consumes the noised motion chunk, the text condition, and the diffusion timestep, and its latent representation is routed to three specialized decoders that predict global trajectory, body pose, and hand pose in parallel (Ruiz-Ponce et al., 22 Dec 2025).
| Component | Role | Configuration |
|---|---|---|
| Trajectory head | Predicts global root translation/orientation trajectory | 4 transformer blocks, 4 heads, hidden dim 256, FFN dim 512 |
| Body head | Predicts body joint rotations | 8 transformer blocks, 8 heads, hidden dim 512, FFN dim 1024 |
| Hands head | Predicts hand joint rotations | Same depth and dimensions as body head |
The architectural division into trajectory, body, and hands is not merely organizational. The paper states that the heads run in parallel while remaining conditioned on the same encoded latent, with the explicit goal of preserving body-hand-trajectory coherence while letting each head specialize to different kinematic scales and dimensionalities. This suggests a design choice aimed at keeping full-body generation tractable without separating hand synthesis into an externally staged module.
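The head configurations in the table can be written down as a small configuration sketch. The class and field names below (`HeadConfig`, `dim_per_attention_head`, `HEADS`) are illustrative assumptions, not identifiers from the paper's code; only the numeric values come from the table.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class HeadConfig:
    """Transformer settings for one prediction head (hypothetical container)."""
    name: str
    num_blocks: int
    num_heads: int
    hidden_dim: int
    ffn_dim: int

    @property
    def dim_per_attention_head(self) -> int:
        # Multi-head attention splits the hidden dimension evenly across heads.
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError(f"{self.name}: hidden_dim not divisible by num_heads")
        return self.hidden_dim // self.num_heads


# Numeric values taken from the table; the hands head mirrors the body head.
HEADS = (
    HeadConfig("trajectory", num_blocks=4, num_heads=4, hidden_dim=256, ffn_dim=512),
    HeadConfig("body", num_blocks=8, num_heads=8, hidden_dim=512, ffn_dim=1024),
    HeadConfig("hands", num_blocks=8, num_heads=8, hidden_dim=512, ffn_dim=1024),
)


def per_head_dims(heads: Iterable[HeadConfig]) -> Dict[str, int]:
    """Map each head name to its per-attention-head width."""
    return {h.name: h.dim_per_attention_head for h in heads}


if __name__ == "__main__":
    print(per_head_dims(HEADS))
```

Under these values every head works with 64 dimensions per attention head and an FFN twice as wide as its hidden size, so the trajectory head is a shallower, narrower instance of the same block design rather than a different architecture.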
3. Autoregressive diffusion and Mixed Memory
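The defining procedural element of Interact2Ar is its autoregressive factorization over time. A sequence of length $N$ is decomposed into $K$ non-overlapping sub-motions of length $L$:

$$x^{1:N} = \left[\, x^{(1)}, x^{(2)}, \dots, x^{(K)} \,\right], \qquad K = N / L$$

Instead of generating the full interaction in one diffusion run, the model generates chunk $x^{(k)}$ conditioned on a memory of previously generated frames. This is the mechanism through which the paper claims to recover reactivity, state awareness, and long-horizon coherence (Ruiz-Ponce et al., 22 Dec 2025).

The short-term memory at generation step $k$ holds the most recent $M_s$ generated frames at full rate, with $n_k = (k-1)L$ frames generated so far:

$$\mathcal{M}^{(k)}_{\text{short}} = x^{\,n_k - M_s + 1 \,:\, n_k}$$

To avoid repetition and limited context, the method adds a long-term memory over a larger window of $W$ past frames, denoted $\mathcal{W}_k$, but downsampled with stride $s$:

$$\mathcal{M}^{(k)}_{\text{long}} = \operatorname{downsample}_{s}\!\left( x^{\,\mathcal{W}_k} \right)$$

The full memory is the concatenation

$$\mathcal{M}^{(k)} = \left[\, \mathcal{M}^{(k)}_{\text{long}},\ \mathcal{M}^{(k)}_{\text{short}} \,\right]$$

and the denoiser predicts the clean sub-motion from a noised chunk via

$$\hat{x}^{(k)}_0 = \mathcal{D}_\theta\!\left( x^{(k)}_t,\ t,\ c,\ \mathcal{M}^{(k)} \right)$$

where $t$ is the diffusion timestep and $c$ is the text condition.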
This Mixed Memory mechanism is presented as the paper’s main device for combining recent full-rate detail with compact long-range context. With the short-term length, long-term window, and stride selected in the ablations, the memory covers a 60-frame context while storing only 24 frames, that is, 40% of what a full-rate memory with the same coverage would hold. The paper reports the resulting memory reduction at equal context coverage (Ruiz-Ponce et al., 22 Dec 2025).
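A minimal sketch of how such a memory could be indexed and used in a chunk-by-chunk generation loop is given below. The three constants are hypothetical values chosen only because they reproduce the 60-frame coverage and 24 stored frames quoted above; they are not taken from the paper. The sketch also assumes that the long-term window sits directly before the short-term window, and `toy_denoiser` is a stand-in for the real diffusion sampler.

```python
from typing import Callable, List

Frame = List[float]

# Hypothetical hyperparameters (assumptions, not the paper's values).
SHORT_LEN = 12    # most recent frames, kept at full rate
LONG_WINDOW = 48  # older frames covered by the long-term memory
STRIDE = 4        # keep every STRIDE-th frame of the long window


def mixed_memory_indices(num_generated: int,
                         short_len: int = SHORT_LEN,
                         long_window: int = LONG_WINDOW,
                         stride: int = STRIDE) -> List[int]:
    """Indices of already generated frames kept in memory, oldest first."""
    short_start = max(0, num_generated - short_len)
    long_start = max(0, short_start - long_window)
    long_part = list(range(long_start, short_start, stride))  # downsampled
    short_part = list(range(short_start, num_generated))      # full rate
    return long_part + short_part


def generate(num_chunks: int,
             chunk_len: int,
             denoise_chunk: Callable[[List[Frame], int], List[Frame]]) -> List[Frame]:
    """Autoregressive loop: each chunk is sampled conditioned on the memory."""
    motion: List[Frame] = []
    for _ in range(num_chunks):
        memory = [motion[i] for i in mixed_memory_indices(len(motion))]
        motion.extend(denoise_chunk(memory, chunk_len))
    return motion


def toy_denoiser(memory: List[Frame], chunk_len: int) -> List[Frame]:
    """Placeholder sampler: continues a 1-D counter from the last memory frame."""
    start = memory[-1][0] + 1.0 if memory else 0.0
    return [[start + j] for j in range(chunk_len)]


if __name__ == "__main__":
    idx = mixed_memory_indices(100)
    print(len(idx), idx[-1] - idx[0] + 1)  # stored frames, frames spanned
    print(len(generate(5, 10, toy_denoiser)))
```

The point of the design is visible in the index list: the stored frame count stays fixed once enough history exists, while the span it covers is several times larger than a full-rate buffer of the same size.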
The diffusion model follows a standard DDPM/DDIM-style forward process with Gaussian noise, but the denoiser predicts the clean sample $x_0$ directly at each step rather than the added noise or another derived target. The paper links this choice to direct supervision in both representation space and FK-derived kinematic spaces. Sampling differs by regime: the non-autoregressive model uses DDIM-50 at test time, whereas the autoregressive model achieves its best empirical results with 10 denoising steps per chunk (Ruiz-Ponce et al., 22 Dec 2025).
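To make the clean-sample parameterization concrete, the following is a minimal sketch of deterministic DDIM sampling (zero added noise) driven by a predictor of the clean sample. The cosine-style `alpha_bar` schedule, the function names, and the oracle predictor in the usage lines are assumptions for illustration; the paper's schedule and network are not reproduced here.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def alpha_bar(t: int, num_train_steps: int) -> float:
    """Assumed cosine-style cumulative signal level; equals 1.0 at t = 0."""
    return math.cos(0.5 * math.pi * t / (num_train_steps + 1)) ** 2


def ddim_sample_x0(x_T: Sequence[float],
                   predict_x0: Callable[[Vector, int], Vector],
                   num_train_steps: int = 1000,
                   num_sample_steps: int = 10) -> Vector:
    """Deterministic DDIM sampling with a clean-sample predictor.

    Assumes num_sample_steps <= num_train_steps so timesteps are distinct.
    """
    # Evenly spaced timesteps from num_train_steps down to 0.
    steps = [round(num_train_steps * i / num_sample_steps)
             for i in range(num_sample_steps, -1, -1)]
    x = list(x_T)
    for t, t_prev in zip(steps[:-1], steps[1:]):
        a_t = alpha_bar(t, num_train_steps)
        a_prev = alpha_bar(t_prev, num_train_steps)
        x0_hat = predict_x0(x, t)
        # Noise implied by the current sample and the predicted clean sample.
        eps_hat = [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t)
                   for xi, x0i in zip(x, x0_hat)]
        # Move to the previous noise level along the deterministic DDIM path.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
             for x0i, ei in zip(x0_hat, eps_hat)]
    return x


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]
    # Oracle predictor: always returns the true clean sample.
    print(ddim_sample_x0([3.0, 3.0, 3.0], lambda x, t: target))
```

Because the network output is already an estimate of the clean motion, losses defined on joint positions or velocities can be applied to it directly at every training step, which is the connection the paragraph above draws.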
4. Training objective, optimization, and evaluation protocol
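Training combines parameter-space reconstruction with kinematic supervision. The total loss is a weighted sum of the individual terms,

$$\mathcal{L} = \sum_{i} \lambda_i \, \mathcal{L}_i$$

with $i$ ranging over the representation term and the kinematic terms described next.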
The representation term penalizes the difference between predicted and ground-truth motion parameters. Ground-truth and predicted global joint positions are obtained with forward kinematics, and velocities are finite differences of those positions. The foot-contact term penalizes foot velocities during contact, while the distance-map term enforces inter-person proximity through masked pairwise distance maps between all joints of one person and all joints of the other (Ruiz-Ponce et al., 22 Dec 2025).
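Two of these kinematic terms can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the paper's implementation: it uses squared velocity for the foot-contact penalty, an absolute error for the distance maps, and a mask that keeps only joint pairs whose ground-truth distance is below a threshold. All function names and the `threshold` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Pose = List[Point]    # joint positions of one person at one frame
Motion = List[Pose]   # one pose per frame


def velocities(motion: Motion) -> List[List[Point]]:
    """Frame-to-frame finite differences of joint positions."""
    return [[tuple(b - a for a, b in zip(p0, p1)) for p0, p1 in zip(f0, f1)]
            for f0, f1 in zip(motion, motion[1:])]


def foot_contact_loss(pred: Motion,
                      foot_joints: Sequence[int],
                      contact: Sequence[Sequence[bool]]) -> float:
    """Mean squared foot velocity over steps flagged as in contact.

    contact[f][k] refers to foot_joints[k] between frames f and f + 1.
    """
    total, count = 0.0, 0
    for f, frame_vel in enumerate(velocities(pred)):
        for k, joint in enumerate(foot_joints):
            if contact[f][k]:
                total += sum(c * c for c in frame_vel[joint])
                count += 1
    return total / count if count else 0.0


def distance_map(pose_a: Pose, pose_b: Pose) -> List[List[float]]:
    """Pairwise distances between every joint of A and every joint of B."""
    return [[math.dist(pa, pb) for pb in pose_b] for pa in pose_a]


def masked_distance_loss(pred_a: Pose, pred_b: Pose,
                         gt_a: Pose, gt_b: Pose,
                         threshold: float) -> float:
    """Mean absolute distance-map error over pairs that are close in the ground truth."""
    pred_map, gt_map = distance_map(pred_a, pred_b), distance_map(gt_a, gt_b)
    errors = [abs(p - g)
              for pred_row, gt_row in zip(pred_map, gt_map)
              for p, g in zip(pred_row, gt_row)
              if g < threshold]  # assumed mask: only nearby joint pairs
    return sum(errors) / len(errors) if errors else 0.0


if __name__ == "__main__":
    sliding = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
    print(foot_contact_loss(sliding, [0], [[True], [True]]))
```

Masking by ground-truth proximity concentrates the penalty on joint pairs that are actually interacting, so distant limbs do not dominate the term.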
The reported training setup uses Inter-X, described as containing 11K full-body dyadic interactions, 40 actions, and rich text annotations. Optimization uses AdamW with batch size 128, 5000 epochs, and EMA; the learning rate, weight decay, and per-term loss weights are specified in the paper and its supplementary material (Ruiz-Ponce et al., 22 Dec 2025).
The evaluation protocol is unusually prominent in the contribution. In addition to standard metrics such as R-Precision, FID, Multimodal Distance, Diversity, and MultiModality, the paper retrains evaluators to use only global joint positions rather than rotations and to support body-part-specific assessment through full-body, body-only, and hand-only heads. These evaluators are trained with contrastive motion-text encoders following T2M for 300 epochs with feature dimension 512 (the learning rate is given in the paper). For long-horizon adaptivity, the paper adds Peak Jerk and Area Under the Jerk on long sequences formed by concatenating eight motions; lower values indicate smoother transitions (Ruiz-Ponce et al., 22 Dec 2025).
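The two smoothness metrics can be illustrated with a simplified single-joint version. Jerk is the third time derivative of position, approximated here with a third-order finite difference; the peak is the maximum jerk magnitude and the area is a rectangle-rule integral over time. The paper's exact definitions (aggregation over joints, normalization, choice of transition window) may differ, so treat the functions below as an assumption-laden sketch.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def jerk_magnitudes(positions: Sequence[Point], fps: float) -> List[float]:
    """Magnitude of the third-order finite difference of one joint trajectory."""
    dt = 1.0 / fps
    out: List[float] = []
    for p0, p1, p2, p3 in zip(positions, positions[1:], positions[2:], positions[3:]):
        jerk = [(d - 3.0 * c + 3.0 * b - a) / dt ** 3
                for a, b, c, d in zip(p0, p1, p2, p3)]
        out.append(math.sqrt(sum(v * v for v in jerk)))
    return out


def peak_jerk(positions: Sequence[Point], fps: float) -> float:
    """Largest jerk magnitude along the trajectory (0.0 if too short)."""
    mags = jerk_magnitudes(positions, fps)
    return max(mags) if mags else 0.0


def area_under_jerk(positions: Sequence[Point], fps: float) -> float:
    """Rectangle-rule integral of jerk magnitude over time."""
    return sum(jerk_magnitudes(positions, fps)) / fps


if __name__ == "__main__":
    # A cubic trajectory x(t) = t^3 has constant jerk 6 at one frame per second.
    cubic = [(float(t) ** 3, 0.0, 0.0) for t in range(6)]
    print(peak_jerk(cubic, fps=1.0), area_under_jerk(cubic, fps=1.0))
```

An abrupt seam between two concatenated motions shows up as a spike in the jerk curve, which is why both the peak and the area drop when transitions are smooth.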
The evaluator robustness study is also integral to the methodology. According to the paper, the original Inter-X evaluator is relatively insensitive to severe trajectory degradations, whereas the retrained global-position evaluators strongly penalize them. This is used to justify the claim that the new evaluation suite is better aligned with inter-person spatial coherence and global realism.
5. Quantitative results and reported capabilities
The main reported quantitative gains are against InterMask and are broken out by full-body, body-only, and hands-only evaluators. The numbers below are given exactly as reported for InterMask and the autoregressive Interact2Ar variant.
| Evaluator | Metric | InterMask → Interact2Ar AR |
|---|---|---|
| Full-body | R-Prec Top-1 | 0.415 → 0.453 |
| Full-body | FID | 0.671 → 0.277 |
| Full-body | MM Dist | 3.487 → 3.095 |
| Body-only | R-Prec Top-1 | 0.386 → 0.469 |
| Body-only | FID | 6.720 → 0.352 |
| Body-only | MM Dist | 4.616 → 3.173 |
| Hands-only | R-Prec Top-1 | 0.360 → 0.422 |
| Hands-only | FID | 1.960 → 0.257 |
| Hands-only | MM Dist | 3.794 → 3.111 |
The paper states that autoregression consistently outperforms the non-autoregressive variant across full, body, and hand metrics. Transition smoothness is also reported to improve markedly: Peak Jerk decreases from 2.328 for InterMask to 0.136 for Interact2Ar AR, and Area Under the Jerk decreases from 61.74 to 8.84. The qualitative interpretation supplied by the paper is that Interact2Ar yields smoother temporal composition and adaptation under prompt switches and disturbances (Ruiz-Ponce et al., 22 Dec 2025).
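The size of these gaps is easier to compare as relative reductions. The short calculation below uses only the reported InterMask and Interact2Ar AR values for the lower-is-better metrics; the dictionary layout and function name are illustrative.

```python
from typing import Dict, Tuple

# (InterMask, Interact2Ar AR) values as reported; lower is better for all of these.
REPORTED: Dict[str, Tuple[float, float]] = {
    "full-body FID": (0.671, 0.277),
    "body-only FID": (6.720, 0.352),
    "hands-only FID": (1.960, 0.257),
    "Peak Jerk": (2.328, 0.136),
    "Area Under the Jerk": (61.74, 8.84),
}


def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


REDUCTIONS = {name: relative_reduction(*pair) for name, pair in REPORTED.items()}

if __name__ == "__main__":
    for name, value in REDUCTIONS.items():
        print(f"{name}: {100.0 * value:.1f}% lower")
```

On these figures the FID reductions are roughly 59% (full-body), 95% (body-only), and 87% (hands-only), and the two jerk metrics fall by about 94% and 86%.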
A user study with 35 participants reports that Interact2Ar is preferred over InterMask and InterGen for both overall quality and text alignment, as well as for hand realism, and is described as approaching ground-truth quality. The paper further identifies three downstream applications enabled by the autoregressive-plus-memory design: temporal motion composition, real-time adaptation to disturbances, and sequential multi-person interactions. In the last of these, one person can finish an interaction, receive a new partner and prompt, and continue generation smoothly without retraining (Ruiz-Ponce et al., 22 Dec 2025).
The main limitation explicitly stated is that neutral shape normalization in Inter-X constrains accurate modeling of shape-dependent contacts, especially precise hand-hand and hand-body contacts. A plausible implication is that future gains in contact fidelity may depend as much on dataset geometry and shape diversity as on generative modeling alone.
6. Adjacent uses of “Interact2Ar” in AR interaction literature
In adjacent AR literature, several papers position their systems relative to an “Interact2Ar” vision or an “Interact2Ar-style system,” but these works address augmented-reality interaction rather than full-body human motion generation. “Sketched Reality” describes bi-directional coupling between AR sketches and actuated tangible user interfaces, using Sony Toio robots, an iPad-based AR sketching tool, Matter.js physics, and WebSocket-mediated synchronization so that virtual walls, springs, pendulums, and linkages can affect physical robots and vice versa (Kaimoto et al., 2022). This establishes one line of association in which “Interact2Ar” denotes a broader AR-mediated coupling paradigm rather than the diffusion model.
A second line concerns safe and collaborative AR interfaces in robotics. “AR-based interaction for safe human-robot collaborative manufacturing” uses a depth-sensor-based workspace model with explicit robot, danger, and human zone masks, together with projector-based AR or HoloLens-based AR, to visualize the minimum protective distance and gate robot motion through ENABLE plus GO or CONFIRM controls; the paper reports 21–24% reduction in total task time and 57–64% reduction in robot idle time relative to a baseline (Hietanen et al., 2019). “Avatar-centred AR Collaborative Mobile Interaction” instead studies co-located multi-user mobile AR around a shared physical marker, with Unity, Vuforia, Photon PUN, and a synchronized avatar-centric interaction model; it reports an overall SUS score of 85.87 and emphasizes competitive and cooperative shared-state interaction (Marques et al., 2023).
More papers extend this AR-oriented framing toward accessibility and semantic scene understanding. “Accessible Gesture-Driven Augmented Reality Interaction System” proposes a multimodal stack with ViT, TCN, and GAT encoders, federated learning, and reinforcement-learning-based interface adaptation, reporting recognition F1 of 0.94, task success of 92%, and accessibility score 0.88 in its experimental setting, while also noting that the abstract’s efficiency and satisfaction gains are not fully supported by the body text (Wang, 18 Jun 2025). “Semantic Reality” shifts the emphasis to a scene-anchored semantic graph of objects and relations, using staged multimodal reasoning, persistent anchoring, and typed inter-object connectivity overlays; in an exploratory study, participants reported clearer inter-object understanding and higher engagement and satisfaction without increased workload (Liu et al., 6 Apr 2026).
Taken together, these papers show that the name “Interact2Ar” circulates across two distinct technical neighborhoods. In (Ruiz-Ponce et al., 22 Dec 2025), it names an autoregressive diffusion architecture for text-conditioned dyadic motion generation. In the AR papers, it appears as a reference point for systems concerned with bi-directional virtual-physical coupling, safety-gated collaboration, marker-centered multi-user interaction, accessibility-aware input, or relation-centric visualization. The overlap is therefore nominal and conceptual rather than methodological.