EngageSync: Adaptive Engagement in VR

Updated 15 June 2026

EngageSync is a context-adaptive platform offering synchronous transcript delivery and engagement analysis for immersive VR meetings.
The system employs a low-latency client-server architecture with real-time ASR, LLM summarization, and avatar-anchored interfaces to optimize user interactions.
Empirical evaluations show enhanced social presence, faster re-engagement, and improved information recall in VR settings.

EngageSync is a context-adaptive, synchronous engagement and re-engagement platform originally introduced and evaluated for immersive virtual reality (VR) meetings. It integrates real-time context-aware transcript delivery, avatar-anchored interface elements, physiological and behavioral engagement analysis, and relevance-driven data synchronization structures. EngageSync advances the state of the art in collaborative systems, grounding its design in empirical results from VR studies, computational engagement modeling, synchronization research in both social and computational domains, and modern client-server synchronization architectures (Lee et al., 20 Mar 2025, Vedernikov et al., 2024, Sanlaville et al., 2015, Kožusznik, 2018, Saavedra et al., 2011).

1. System Architecture and Context Awareness

EngageSync employs a distributed client-server architecture, optimized for VR settings but extensible to other synchronous, real-time group contexts (Lee et al., 20 Mar 2025). Each client (e.g., running on a Meta Quest Pro HMD) streams microphone audio, eye-gaze position, and manual interaction events (such as pinch gestures) over a low-latency network stack (Photon Fusion + Photon Voice). Audio is transcribed in real time by a cloud ASR system (Google Speech-to-Text), then relayed to a "Text Server" for summarization and context routing.

The Text Server orchestrates both direct transcript streaming and engagement-based adaptations. It receives utterances tagged with speaker identity, batches them, and prompts an LLM (OpenAI GPT-4-Turbo) for concise summaries. It then redistributes both live and summarized text to all clients, tagging each utterance with the avatar's network ID and a context (live, last-utterance, or re-engagement summary).

Clients render panels that are spatially anchored above avatars. State transitions (see Sections 2 and 3) determine which interface element is shown. Additional client logic handles summary expiration, attention and focus management (a panel auto-fades if it is not gazed at for 2 s), and interaction via gaze and gesture.

2. Engagement Detection and Synchronization

Continuous engagement detection leverages both behavioral and physiological signals, as shown in VR and online meeting deployments (Lee et al., 20 Mar 2025, Vedernikov et al., 2024). EngageSync distinguishes three principal modes using gaze and activity streams:

Focused on Speaker: Gaze vectors (using Meta Movement SDK) intersect a speaking avatar; the corresponding avatar's panel displays a live, ASR-generated transcript.
Focused on Listener: Gaze intersects a non-speaking avatar; the panel shows a concise, LLM-generated summary of the avatar's last utterance.
Disengaged/Re-engaging: Gaze is off all avatars/objects for >2 s, indicating dropout. On gaze return, the Re-engagement Mode is activated, attaching summaries of all missed utterances to the corresponding avatars.
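
The three modes above can be read as a small state machine driven by gaze samples. The following is a minimal sketch under stated assumptions, not the published implementation: the names `Mode`, `GazeTracker`, and `update` are hypothetical, gaze on non-avatar objects is ignored, and the re-engaging state is reported only for the single sample on which gaze returns.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Mode(Enum):
    FOCUSED_ON_SPEAKER = "live transcript"
    FOCUSED_ON_LISTENER = "last-utterance summary"
    DISENGAGED = "disengaged"
    RE_ENGAGING = "re-engagement summaries"


DISENGAGE_AFTER_S = 2.0  # gaze off all avatars for longer than this counts as dropout


@dataclass
class GazeTracker:
    """Tracks one user's mode from timestamped gaze samples (hypothetical helper)."""
    last_on_avatar_t: float = 0.0
    mode: Mode = Mode.FOCUSED_ON_LISTENER

    def update(self, t: float, gazed_avatar: Optional[str], speaking: Set[str]) -> Mode:
        if gazed_avatar is None:
            # Gaze is off every avatar: switch to DISENGAGED only past the threshold.
            if t - self.last_on_avatar_t > DISENGAGE_AFTER_S:
                self.mode = Mode.DISENGAGED
            return self.mode
        returning = self.mode is Mode.DISENGAGED
        self.last_on_avatar_t = t
        if returning:
            # Gaze has come back after a dropout: caller attaches missed-utterance summaries.
            self.mode = Mode.RE_ENGAGING
        elif gazed_avatar in speaking:
            self.mode = Mode.FOCUSED_ON_SPEAKER
        else:
            self.mode = Mode.FOCUSED_ON_LISTENER
        return self.mode
```

A caller would feed one sample per tracking frame, passing `None` when the gaze ray hits no avatar, and switch the panel content whenever the returned mode changes.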

Speech activity is detected by envelope thresholding. A pinch gesture combined with gaze summons a panel; a summary is treated as read once gaze lingers on it for more than 1.5 s.
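
Envelope thresholding can be illustrated with a short frame-based RMS check. This is a sketch only: the frame length and threshold below are illustrative assumptions, not values reported for EngageSync, and `speech_frames` is a hypothetical helper name.

```python
def speech_frames(samples, frame_len=160, threshold=0.05):
    """Return one boolean per full frame: True where the RMS envelope exceeds threshold.

    `samples` is a sequence of floats in [-1, 1]; frame_len and threshold are
    assumed values chosen for illustration.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = (sum(x * x for x in frame) / frame_len) ** 0.5
        flags.append(rms > threshold)
    return flags
```

In practice a detector of this kind is usually combined with a short hold time so that brief pauses inside an utterance do not toggle the speaking state.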

For engagement modeling in online meetings, video-based remote photoplethysmography (rPPG) with unsupervised contrastive learning extracts heart rate variability (HRV) features, with a reported accuracy of 94%. Behavioral features (facial action units, gaze, head pose) are extracted and fused with the HRV features, giving a reported engagement-classification accuracy of 96%. Predictions are updated every few seconds, which makes the approach suitable for group-level synchronization monitoring (Vedernikov et al., 2024).

3. Adaptive Transcript and Summarization Mechanism

EngageSync’s core UI innovation is the context-driven, avatar-anchored transcript panel (Lee et al., 20 Mar 2025). Unlike conventional table-fixed interfaces that disrupt gaze and social presence, EngageSync adapts its transcript content and presentation:

Engagement Mode: When actively participating (gaze is on speaker or listener avatars), users receive either live transcriptions or a summary of the last utterance, attached directly to the relevant avatar.
Re-engagement Mode: Upon return from a disruption, EngageSync automatically generates and displays per-avatar summaries (15-word limit) of all missed content, which auto-expire after being read.

Summarization is performed by an LLM with empirically optimized prompt templates ("Summarize the following utterance in no more than 10 words", or 15 words for re-engagement), yielding 89–90% summary accuracy against manual reference (Lee et al., 20 Mar 2025). End-to-end latency is typically about 1.3 s, made up of a median ASR delay of 528 ms and an LLM delay of 803 ms. The reported error rate for summary compression is ≈2%.
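
The prompt template and the latency budget can be made concrete with a few lines of code. This is an illustrative sketch: the quoted instruction and the 10/15-word limits come from the text above, while the function names, the mode keys, and the way the utterance is appended to the prompt are assumptions. No LLM call is made here.

```python
WORD_LIMITS = {"engagement": 10, "re-engagement": 15}  # word limits stated in the text


def build_prompt(utterance: str, mode: str) -> str:
    """Build the summarization prompt; appending the utterance after a colon is an assumption."""
    limit = WORD_LIMITS[mode]
    return f"Summarize the following utterance in no more than {limit} words:\n{utterance}"


def within_limit(summary: str, mode: str) -> bool:
    """Check a returned summary against the word limit for its mode."""
    return len(summary.split()) <= WORD_LIMITS[mode]


ASR_MEDIAN_MS = 528
LLM_MS = 803
total_ms = ASR_MEDIAN_MS + LLM_MS  # 1331 ms, i.e. about 1.3 s end to end
```

Checking the word count on the client side is a cheap guard against summaries that overflow the panel, since an LLM does not always honor a length instruction.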

Color-coded badges signal the transcript type: live (no badge), engagement summary (green), and re-engagement summary (orange). All panels are spatialized above avatars for efficient access and minimal distraction.
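
One way to picture the tagged messages that the Text Server sends to clients is a small record type that carries the avatar network ID, the context tag, and the text, with the badge color derived from the context. The type and field names below are hypothetical; only the three contexts and their badge colors are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Context(Enum):
    LIVE = "live"
    LAST_UTTERANCE = "last-utterance"   # engagement summary
    RE_ENGAGEMENT = "re-engagement"     # summary of a missed utterance


# Badge colors as described in the text; live transcripts carry no badge.
BADGE_COLOR = {
    Context.LIVE: None,
    Context.LAST_UTTERANCE: "green",
    Context.RE_ENGAGEMENT: "orange",
}


@dataclass(frozen=True)
class TranscriptMessage:
    avatar_network_id: int  # which avatar's panel the text is anchored to
    context: Context
    text: str

    @property
    def badge(self) -> Optional[str]:
        return BADGE_COLOR[self.context]
```

Deriving the badge from the context tag, rather than sending a color, keeps the wire format independent of presentation choices.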

4. Principles of Synchrony and Real-Time Correlation

EngageSync draws directly on empirical and theoretical results showing the centrality of synchronicity and mutual adaptation in effective distributed teamwork (Saavedra et al., 2011, Sanlaville et al., 2015). In financial markets, spontaneous, leaderless synchronization, measured by a per-person z-score $s_{ij}$ reflecting second-by-second co-occurrence with peer actions, was shown to predict individual success and was tightly coupled to communication behaviors (instant messaging patterns). The implication for collaborative systems is that synchrony in collective action, not just in communication or attention, is associated with performance gains.

By extension, EngageSync can incorporate real-time group synchronicity metrics:

For each user and second, compute activity alignment scores relative to peers.
Dynamically overlay heat-maps, crowd-ticker signals, or decision overlays when group alignment exceeds chance—enabling real-time feedback on team-level synchrony.
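
A per-user alignment score "relative to chance" can be sketched as a z-score against a shuffled baseline. The example below is an assumption-laden illustration, not the metric from the cited study: it counts seconds in which a user and a peer are both active, then compares that count with the counts obtained after randomly permuting the user's activity series. The function names and the permutation null model are hypothetical choices.

```python
import random
from statistics import mean, pstdev


def co_occurrence(a, b):
    """Number of seconds in which both binary activity series are active."""
    return sum(1 for x, y in zip(a, b) if x and y)


def sync_z(user, peers, n_shuffles=500, seed=0):
    """Z-score of observed co-activity with peers against a permutation baseline."""
    rng = random.Random(seed)
    observed = sum(co_occurrence(user, p) for p in peers)
    null = []
    for _ in range(n_shuffles):
        shuffled = list(user)
        rng.shuffle(shuffled)  # keeps the user's activity level, destroys its timing
        null.append(sum(co_occurrence(shuffled, p) for p in peers))
    sd = pstdev(null)
    return 0.0 if sd == 0 else (observed - mean(null)) / sd
```

A display layer could then show a synchrony overlay only when the score exceeds a chosen cutoff; the cutoff itself would need to be tuned empirically.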

In adaptive HCI systems, synchrony and rapport emerge from reciprocal, continuous adaptation of low-level behavioral rhythms (e.g., turn-taking), operationalized via finite-state machines (FSMs) and hidden Markov models with social attitude variables (liking, dominance) (Sanlaville et al., 2015).

5. Data Synchronization and Relevance Filtering

The client-server transactional model underpinning EngageSync leverages advanced relevance-driven data synchronization frameworks (Kožusznik, 2018). Rather than full-dataset syncs, EngageSync can maintain for each user a minimal, structurally constrained, per-role data slice defined by a grammar of path expressions. The server, upon a sync request including a user identifier and last-sync timestamp, computes which relevant objects/links have changed since the last update and returns only those deltas. Clients then:

Apply deletions, then additions, then updates to their local view.
Queue offline modifications for later upload with locally unique transaction stamps.
Optionally, use set-reconciliation techniques (e.g., MinHash-based difference estimation, characteristic-polynomial reconciliation) to reduce bandwidth.
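
The client-side steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Delta` shape, the `apply_delta` function, and the `device_id:counter` stamp format are hypothetical and are not taken from the cited framework. The sketch only shows the ordering (deletions, then additions, then updates) and the queuing of offline changes.

```python
from dataclasses import dataclass, field


@dataclass
class Delta:
    """Changes to a user's relevant data slice since the last sync (assumed shape)."""
    deleted: list = field(default_factory=list)   # object ids to remove
    added: dict = field(default_factory=dict)     # object id -> full new object
    updated: dict = field(default_factory=dict)   # object id -> changed fields only
    server_time: int = 0


def apply_delta(view: dict, delta: Delta) -> int:
    """Apply deletions, then additions, then updates; return the new last-sync timestamp."""
    for obj_id in delta.deleted:
        view.pop(obj_id, None)
    for obj_id, obj in delta.added.items():
        view[obj_id] = dict(obj)
    for obj_id, changed in delta.updated.items():
        if obj_id in view:  # ignore updates to objects outside the local slice
            view[obj_id].update(changed)
    return delta.server_time


class OfflineQueue:
    """Queues local modifications with locally unique stamps for later upload."""

    def __init__(self, device_id: str):
        self.device_id = device_id
        self.counter = 0
        self.pending = []

    def record(self, change: dict) -> str:
        self.counter += 1
        stamp = f"{self.device_id}:{self.counter}"
        self.pending.append((stamp, change))
        return stamp
```

The client would send the returned timestamp with its next sync request so that the server can compute the following delta.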

This architecture keeps per-user storage and network usage minimal. It also inherently supports multi-device, multi-tenant deployments and provides a substrate for fine-grained event and content synchronization (e.g., missed utterances, engagement state changes).

6. Evaluation, Empirical Findings, and Design Recommendations

In controlled studies of VR meetings, EngageSync was compared with table-fixed and always-on avatar-fixed alternatives (Lee et al., 20 Mar 2025). It significantly increased social presence (assessed with the Networked Minds inventory: co-presence and attentional allocation), improved information recall, accelerated re-engagement after disruption, and increased the proportion of gaze time spent on peer avatars rather than on the UI. Key statistical findings include:

Social presence: EngageSync outperformed baselines with χ² = 14.13, p < .001 (W = .471) in combined groups.
Re-engagement time: EngageSync fastest across conditions (χ² = 22.20, p < .001, W = .740).
Information recall: higher in mid-sized groups using EngageSync (χ² = 6.88, p = .028, W = .229; recall of less talkative speakers near ceiling in EngageSync, significantly lower in alternatives).
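
The reported W values are consistent with Kendall's W computed from a Friedman χ² statistic, W = χ² / (n(k − 1)). The check below rests on two assumptions that the text does not state: that the tests are Friedman tests over the k = 3 interface conditions, and that n = 15 participants entered each comparison. The value n = 15 is inferred only because it reproduces the reported figures, so it should be treated as a consistency check rather than a reported sample size.

```python
def kendalls_w(chi_sq: float, n_participants: int, k_conditions: int) -> float:
    """Kendall's W from a Friedman chi-square: W = chi2 / (n * (k - 1))."""
    return chi_sq / (n_participants * (k_conditions - 1))


# (chi-square, reported W) pairs taken from the findings listed above.
reported = {
    "social presence": (14.13, 0.471),
    "re-engagement time": (22.20, 0.740),
    "information recall": (6.88, 0.229),
}

N_ASSUMED = 15  # inferred, not stated in the text
K_ASSUMED = 3   # EngageSync, table-fixed, always-on avatar-fixed

for name, (chi_sq, w_reported) in reported.items():
    w = kendalls_w(chi_sq, N_ASSUMED, K_ASSUMED)
    print(f"{name}: computed W = {w:.3f}, reported W = {w_reported:.3f}")
```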

Formative and summative studies support three primary design recommendations:

Anchor transcripts to avatars to minimize gaze disruption and preserve group social context.
Adapt the transcript content and delivery mode based on detected user context (live participation vs. re-engaging after disruption).
Rely on on-demand (gaze+gesture) interface element presentation rather than clutter-inducing always-on overlays.

7. Limitations and Future Directions

Several limitations have been noted in current implementations. Fixed panel placement may not be optimal for all users or room configurations. ASR and LLM summarization incur small but nonzero error rates. Only a single dropout duration (4 min) has been systematically studied, so real-world session dynamics may necessitate policy adaptation. Summarization prompt refinement and more sophisticated context-aware or ranked retrieval may further enhance catch-up efficiency (Lee et al., 20 Mar 2025). In synchronous engagement modeling, rPPG-based HRV is limited by lighting and motion constraints, and behavioral features degrade under occlusions (Vedernikov et al., 2024).

Potential future work includes integration of end-to-end deep learning pipelines for engagement estimation, dynamic summary ordering and multimodal fusion, group-level synchrony visualization, and extension to diverse platforms and domains (beyond VR and online meetings) (Vedernikov et al., 2024, Kožusznik, 2018).

References:

Saavedra et al., 2011
Sanlaville et al., 2015
Kožusznik, 2018
Vedernikov et al., 2024
Lee et al., 20 Mar 2025