VibeServe: Tele-Immersive & LLM Serving

Updated 4 July 2026
  • VibeServe is a name shared by two unrelated systems: a tele-immersive system that streams real-time 3D point clouds with haptic feedback, and an AI system that generates specialized LLM serving runtimes.
  • The tele-immersive design integrates synchronized sensors, edge processing, and rendering to keep end-to-end latency under 100 ms.
  • The LLM-serving approach employs a multi-agent loop to tailor deployment pipelines to specific model-hardware-workload combinations, optimizing throughput and latency.

VibeServe is a name used in the supplied arXiv literature for two distinct technical systems. In one usage, it denotes a detailed design for real-time streaming of 3D point clouds and vibrotactile feedback, described as a full “VibeServe” reference design and explicitly characterized as “heavily inspired by Matsumoto et al.’s tele-immersive system” in "The Stage Comes to You: A Real-Time Tele-Immersive System with 3D Point Clouds and Vibrotactile Feedback" (Matsumoto et al., 8 Oct 2025). In the other usage, VibeServe is "the first system that uses a multi-agent loop to generate an end-to-end, deployment-specialized LLM serving runtime automatically," as introduced in "VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?" (Kamahori et al., 7 May 2026). The shared label therefore spans both telepresence-oriented multimodal streaming and agentic synthesis of bespoke inference infrastructure.

1. Scope and disambiguation

The two uses of the name occupy different research domains, expose different system boundaries, and optimize different objective functions. The tele-immersive usage concerns end-to-end delivery of sight and haptics under a sub-100 ms design goal, whereas the LLM-serving usage concerns generation-time specialization of serving stacks for a particular model-hardware-workload triple.

Usage of “VibeServe” | Domain | Core object
--- | --- | ---
Tele-immersive VibeServe | 3D point-cloud and vibrotactile telepresence | Reference design and client stack
LLM-serving VibeServe | LLM systems and agentic software synthesis | Multi-agent loop for bespoke runtimes

This naming overlap matters because the term does not identify a single lineage of methods. A plausible implication is that the arXiv identifier and local context are both needed for precise disambiguation whenever “VibeServe” appears in citation graphs, implementation notes, or systems comparisons.

2. Tele-immersive VibeServe as a multimodal streaming system

In the tele-immersive usage, the underlying system is a low-latency tele-immersive entertainment system that streams 3D point clouds and performers’ footstep vibrations, creating the sense that the stage is present. Moving performers and their surroundings are captured as dynamic point clouds under rapidly changing lighting, then processed, transmitted, and rendered within a total latency of less than 100 ms (Matsumoto et al., 8 Oct 2025). Under high ambient noise, footstep vibrations are sensed by wearable accelerometers, and real-time visual and haptic streams are delivered to a remote venue, where a large 3D LED wall and a vibration-efficient haptic floor envelop dozens of spectators. A public trial at Expo 2025 linked sites 20 km apart: visitors watched a live dance show and conversed with performers without noticeable delay.

The detailed VibeServe design organizes the system into capture, edge processing, network, and rendering subsystems. On the capture side, it specifies 3× 128-channel 360° LiDARs (10 Hz, ±3 cm accuracy), 1× 1080p global-shutter RGB camera ($100$0 fps), $100$1 wearable shoe-mount module (3-axis accelerometer + preamp + BLE/Wi‑Fi radio), and a PTP-synchronized hardware trigger for camera/LiDAR alignment. The edge processing node is a GPU-equipped host running C++/CUDA for point-cloud fusion on real-time Linux (PREEMPT_RT), with JPEG encoding for RGB, QOI lossless encoding for $100$2-bit depth, a gait-periodicity gate and EQ filter for accelerometer signals, and RTP/UDP packetization for visual and haptic streams.

The network model is equally explicit: dedicated fiber at $100$3 Gbps, UDP+RTP transport, DSCP priority for haptics, and PTP clock distribution end-to-end. The venue-side rendering system, named the “VibeServe” client in the design, comprises a 3D LED wall with stereo $100$4 per eye, a raised-access floor with $100$5 electrodynamic actuators, a Unity-based controller mapping dancer position to actuator patterns, and a point-cloud renderer based on OpenGL/Vulkan with JPEG/QOI decoding. Taken together, these elements define a reference architecture rather than a single algorithmic novelty; its significance lies in the integration of sensing, compression, transport, synchronization, and embodied rendering into a single latency budget.
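The transport choices named above (RTP over UDP, with DSCP priority for haptics) can be illustrated with a minimal sketch. It assumes the fixed 12-byte RTP header of RFC 3550 and the Expedited Forwarding code point (DSCP 46); the function names, the dynamic payload type 96, and the socket setup are illustrative assumptions, not details taken from the design.

```python
import socket
import struct

RTP_VERSION = 2
DSCP_EF = 46             # Expedited Forwarding code point
TOS_EF = DSCP_EF << 2    # DSCP occupies the upper six bits of the IP TOS byte


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = 96, marker: bool = False) -> bytes:
    """Prepend a fixed 12-byte RTP header (no CSRC list, no extension).

    payload_type=96 is an assumed dynamic payload type, not one from the source.
    """
    byte0 = RTP_VERSION << 6                         # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF,               # 16-bit sequence number
                         timestamp & 0xFFFFFFFF,     # 32-bit media timestamp
                         ssrc & 0xFFFFFFFF)          # 32-bit stream identifier
    return header + payload


def open_haptic_socket() -> socket.socket:
    """Open a UDP socket whose outgoing packets are marked DSCP EF.

    Hypothetical helper: IP_TOS is honoured on Linux; other platforms may ignore it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_EF)
    return sock
```

In a deployment following this sketch, the timestamp field would carry the PTP-derived capture time so that the client can align visual and haptic frames.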

3. Point-cloud reconstruction, compression, and vibrotactile playback

The 3D capture pipeline uses staggered LiDAR sweep phases, with a revisit rate of $100$6 ms yielding an effective $100$7 fps when aligned with the hardware-triggered camera. Per-frame densification proceeds in four steps: project each LiDAR to RGB to form a depth image, perform inter-frame color differencing to obtain a motion mask, apply morphological closing, and fuse static pixels from the previous two frames. The design also applies a view-dependent depth bias,

$100$8

to manage seams in the rendered point cloud.
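The four densification steps described above can be sketched on small list-based images. This is a toy illustration under stated assumptions: a pinhole camera model for the projection, a 3×3 structuring element for the closing, a value of 0 meaning "no LiDAR return", and hypothetical function names throughout. The real pipeline runs in C++/CUDA and its exact parameters are not given here.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]
Mask = List[List[bool]]


def project_to_depth(points: Sequence[Tuple[float, float, float]],
                     fx: float, fy: float, cx: float, cy: float,
                     width: int, height: int) -> Image:
    """Step 1: pinhole-project (x, y, z) points into a sparse depth image."""
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue                      # behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z           # keep the nearest return
    return depth


def motion_mask(gray_now: Image, gray_prev: Image, threshold: float) -> Mask:
    """Step 2: mark pixels whose intensity changed by more than threshold."""
    return [[abs(a - b) > threshold for a, b in zip(row_now, row_prev)]
            for row_now, row_prev in zip(gray_now, gray_prev)]


def _morph(mask: Mask, combine) -> Mask:
    """Apply `combine` (any = dilate, all = erode) over clipped 3x3 windows."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [mask[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = combine(window)
    return out


def close_mask(mask: Mask) -> Mask:
    """Step 3: morphological closing = dilation followed by erosion."""
    return _morph(_morph(mask, any), all)


def fuse_static(depth_now: Image, depth_prev1: Image, depth_prev2: Image,
                moving: Mask) -> Image:
    """Step 4: fill empty static pixels from the previous two frames."""
    out = [row[:] for row in depth_now]
    for r, row in enumerate(out):
        for c, d in enumerate(row):
            if d == 0.0 and not moving[r][c]:
                row[c] = depth_prev1[r][c] or depth_prev2[r][c]
    return out
```

Pixels inside the closed motion mask are left untouched, so stale depth from earlier frames is never pasted onto a moving performer.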

Compression is asymmetric across modalities. RGB is encoded as JPEG with quality approximately $100$9, while depth is encoded as QOI at $20$0 bit/pixel. The stated data-rate model is

$20$1

where $20$2 points, $20$3 bits after compression, and $20$4 Hz, giving $20$5 Gbit/s per unit before multiplexing. Aggregate throughput across $20$6 units is approximately $20$7 Gbit/s. This places the visual stream in a regime where transport engineering is inseparable from rendering design.
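The arithmetic behind such a data-rate model can be sketched as points per frame times bits per point times frame rate. The product form and the function names are assumptions consistent with the quantities listed above, and the numbers in the usage lines are purely hypothetical placeholders rather than the design's values.

```python
def stream_rate_gbps(points_per_frame: int, bits_per_point: float,
                     frames_per_second: float) -> float:
    """Per-unit bit rate in Gbit/s, assuming rate = points * bits * frames/s."""
    return points_per_frame * bits_per_point * frames_per_second / 1e9


def aggregate_rate_gbps(per_unit_gbps: float, units: int) -> float:
    """Total rate across identical capture units, before any multiplexing gain."""
    return per_unit_gbps * units


# Hypothetical example values (NOT from the source design):
per_unit = stream_rate_gbps(points_per_frame=1_000_000,
                            bits_per_point=10,
                            frames_per_second=30)
total = aggregate_rate_gbps(per_unit, units=3)
```

Comparing the total against the link capacity gives the headroom left for the haptic stream and retransmission-free UDP bursts.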

The vibrotactile pipeline is specified with comparable precision. Stage-side sensing uses a 3-axis MEMS accelerometer with $20$8 range and noise $20$9, sampled at

3×3\times0

A gait-periodicity gate detects major envelope peaks above threshold. Spectral equalization uses

3×3\times1

described as a first-order high-pass compensating spectral tilt. Output amplitude is mapped as

3×3\times2

with 3×3\times3 tuned per dancer and typically 3×3\times4–3×3\times5.
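The three signal-processing stages named above (gait gate, first-order high-pass equalization, gain mapping) can be sketched as follows. The concrete forms are assumptions: a peak-hold envelope follower for the gate, the textbook discrete first-order high-pass y[n] = α(y[n−1] + x[n] − x[n−1]) for the equalizer, and a linear gain with clipping for the amplitude map. The design's own coefficients and thresholds are not reproduced here.

```python
from typing import List, Sequence


def find_footstep_peaks(samples: Sequence[float], threshold: float,
                        decay: float = 0.5) -> List[int]:
    """Return indices where a peak-hold envelope has a local maximum above threshold.

    `decay` is a hypothetical per-sample envelope decay factor.
    """
    envelope: List[float] = []
    level = 0.0
    for x in samples:
        level = max(abs(x), decay * level)
        envelope.append(level)
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i] > threshold
            and envelope[i] > envelope[i - 1]
            and envelope[i] >= envelope[i + 1]]


def high_pass(samples: Sequence[float], alpha: float) -> List[float]:
    """First-order high-pass: y[n] = alpha * (y[n-1] + x[n] - x[n-1])."""
    out: List[float] = []
    for i, x in enumerate(samples):
        if i == 0:
            out.append(x)
        else:
            out.append(alpha * (out[-1] + x - samples[i - 1]))
    return out


def map_amplitude(sample: float, gain: float) -> float:
    """Scale by a per-dancer gain and clip to the actuator range [-1, 1]."""
    return max(-1.0, min(1.0, gain * sample))
```

A constant (DC) input decays toward zero through `high_pass`, which is the behaviour expected of a filter that emphasizes the higher-frequency content of a footstep.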

Transport packetizes 3×3\times6 kHz samples into 3×3\times7 ms RTP frames over UDP with DSCP EF, and the total footstep stream is approximately 3×3\times8 Mbit/s. Venue-side playback uses 3×3\times9 off-the-shelf voice-coil actuators under modular panels and a two-layer plywood propagation floor for lateral coupling, with no more than $128$0 dB attenuation at the farthest tile. The Unity controller applies either localized or global excitation patterns. The design guidelines recommend wide-area plywood coupling layers to trade actuator count against uniformity, and equalizing the accelerometer spectrum above $128$1 Hz to match footstep timbre. These details indicate that the haptic subsystem is not treated as an auxiliary channel but as a first-class, latency-constrained rendering pathway.

4. Latency accounting, synchronization, and real-time constraints

The tele-immersive design formalizes total end-to-end delay as

$128$2

Measured averages are $128$3 ms, $128$4 ms, $128$5 ms, $128$6 ms, $128$7 ms, and $128$8 ms, yielding $128$9 ms and thus meeting the stated design goal of less than 360360^\circ0 ms (Matsumoto et al., 8 Oct 2025).
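A latency budget of this kind is a sum of per-stage delays checked against the design goal. The sketch below assumes an additive model; the stage names and millisecond values are hypothetical placeholders, not the measured averages reported above.

```python
from typing import Dict


def total_latency_ms(stage_delays_ms: Dict[str, float]) -> float:
    """End-to-end delay under an additive per-stage model."""
    return sum(stage_delays_ms.values())


def meets_budget(stage_delays_ms: Dict[str, float],
                 budget_ms: float = 100.0) -> bool:
    """True if the summed delay is strictly below the budget."""
    return total_latency_ms(stage_delays_ms) < budget_ms


# Hypothetical stage names and values, for illustration only:
example = {
    "capture": 20.0,
    "processing": 15.0,
    "encode": 5.0,
    "network": 2.0,
    "buffer": 20.0,
    "render": 16.0,
}
```

Keeping the stages in a dictionary makes it easy to see which one to attack first when the total drifts toward the 100 ms goal.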

Synchronization relies on PTP across capture units and client renders, with sub-microsecond sync and frame-level timestamps inserted in RTP headers. Client-side jitter buffers are asymmetric: visual uses 360360^\circ1 frames, approximately 360360^\circ2 ms, while haptic uses 360360^\circ3 packets, 360360^\circ4 ms. Dynamic buffer-size adaptation is driven by observed jitter. The real-time requirements listed for the implementation include kernel latency budget under 360360^\circ5 ms, end-to-end goal under 360360^\circ6 ms for sight and haptics, and footstep jitter under 360360^\circ7 ms.
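Jitter-driven buffer adaptation can be sketched with the interarrival jitter estimator from RFC 3550, J ← J + (|D| − J)/16, feeding a clamped buffer depth. The estimator is standard; the mapping from jitter to depth, the safety factor, and the class and function names are assumptions made for illustration.

```python
import math
from typing import Optional


class JitterEstimator:
    """Running interarrival jitter estimate in the style of RFC 3550."""

    def __init__(self) -> None:
        self.jitter = 0.0
        self._prev_transit: Optional[float] = None

    def update(self, send_ts_ms: float, recv_ts_ms: float) -> float:
        """Feed one packet's sender and receiver timestamps (same clock units)."""
        transit = recv_ts_ms - send_ts_ms
        if self._prev_transit is not None:
            d = abs(transit - self._prev_transit)
            self.jitter += (d - self.jitter) / 16.0
        self._prev_transit = transit
        return self.jitter


def buffer_depth(jitter_ms: float, frame_interval_ms: float,
                 max_frames: int, safety: float = 2.0) -> int:
    """Frames to buffer: enough to cover `safety` x jitter, clamped to [1, max_frames].

    `safety` is a hypothetical tuning factor.
    """
    needed = math.ceil(safety * jitter_ms / frame_interval_ms)
    return max(1, min(max_frames, needed))
```

Running separate estimators for the visual and haptic streams lets each buffer shrink or grow independently, which matches the asymmetric buffering described above.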

The practical design guidance is consistent with these constraints. LiDAR sweep phases should be staggered by 360360^\circ8 to maximize revisit rate; the view-dependent bias should be tuned in the interval 360360^\circ9 per installation to hide seams; haptic packets should be prioritized in network QoS using DSCP EF to keep $10$0 stable under $10$1 ms; and the visual jitter buffer should be kept to at most $10$2 frames to avoid perceptual lag. The system thereby frames real-time tele-immersion as a cross-layer optimization problem rather than a rendering-only problem.

5. VibeServe as generation-time specialization for LLM serving

In the LLM-serving usage, VibeServe is defined as the first system that uses a multi-agent loop to generate an end-to-end, deployment-specialized LLM serving runtime automatically, trading off the one-size-fits-all generality of vLLM, SGLang, and TensorRT-LLM for generation-time specialization tailored to a particular model-hardware-workload triple (Kamahori et al., 7 May 2026). The motivating claim is that traditional serving stacks are general-purpose systems built over many engineer-years around decoder-only Transformers, NVIDIA GPUs, and generic chat workloads, whereas new model families, new hardware, and non-standard workloads impose a portability tax through suboptimal scheduling, redundant computation, or outright unsupported execution paths.

The cited non-standard cases are concrete: hybrid SSM-attention models, multimodal systems, Apple Silicon, specialized ASICs, speculative code edits, retrieval-augmented generation with long shared prefixes, and streaming ASR. Generic systems are said to incur per-token sampler and filter overhead because they cannot assume properties such as predicted-outputs speculation or deterministic spans in JSON schemas, to waste compute or memory under very long shared prompts such as $10$3k tokens, and to break under CUDA-centric assumptions when deployed on Apple Silicon or mixed AR-diffusion micro-architectures such as Show-o2.

Architecturally, VibeServe factors search into an outer planning loop and an inner implementation loop. The outer loop reasons over git-recorded commits and a persistent issue-tracker “long-term memory.” The inner loop edits code, checks correctness, and measures performance on the target benchmark. The roles are explicitly separated. The Implementer applies patches in a fresh agent context, drawing from a Skills Library of optimization patterns including CUDA graphs, paged KV caches, XGrammar masks, and MLX prefill-chunk tuning. The Accuracy Judge runs the user-provided checker, verifies acceptance criteria, and scans for reward-hacks such as constant templates and shortcut returns. The Performance Evaluator measures throughput $10$4, latency $10$5, or TTFT, uses profilers such as Nsight and PyTorch Profiler, isolates hotspots, and emits hints for the next planning round.
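The division of labour among planner, Implementer, Accuracy Judge, and Performance Evaluator can be sketched as a plain control loop. Every name below is hypothetical, the agents are reduced to callables, and a string list stands in for the git history and issue tracker, so this shows only the control flow, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Candidate:
    description: str
    score: float          # higher is better (e.g. throughput)


@dataclass
class SearchState:
    best: Optional[Candidate] = None
    history: List[str] = field(default_factory=list)   # stand-in for long-term memory


def run_search(plan: Callable[[SearchState], str],
               implement: Callable[[str], str],
               judge: Callable[[str], bool],
               evaluate: Callable[[str], float],
               rounds: int) -> SearchState:
    """Outer planning loop wrapped around an implement / judge / evaluate inner loop."""
    state = SearchState()
    for _ in range(rounds):
        issue = plan(state)                 # outer loop: choose the next optimization
        artifact = implement(issue)         # Implementer applies the patch
        if not judge(artifact):             # Accuracy Judge gates on correctness
            state.history.append(f"rejected: {issue}")
            continue
        score = evaluate(artifact)          # Performance Evaluator measures the result
        state.history.append(f"measured {score:.2f}: {issue}")
        if state.best is None or score > state.best.score:
            state.best = Candidate(issue, score)
    return state
```

Placing the correctness gate before the performance measurement means a fast but wrong candidate, such as a constant-template reward hack, never becomes the incumbent.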

Candidate synthesis starts from a HuggingFace reference and may rewrite batching, memory management, request scheduling, and hardware-specific code paths such as CUDA versus MLX on Apple Silicon. The issue specification can require interventions like CUDA-graph capture on prefill or integration of an XGrammar bitmask for JSON-schema-constrained decoding. Agents may also change configuration knobs including prefill chunk size, draft-model size $10$6, quantization schemes, and classifier-free guidance stride. The performance metrics are given as

$10$7

with TTFT measured per request end-to-end, and speedup reported either as

$10$8

The paper does not encode a dollar cost model, but states that hardware-time pricing could in principle be incorporated into the outer loop’s metric.
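The metrics named above can be sketched with their conventional definitions: throughput as generated tokens per wall-clock second, TTFT as the gap between request arrival and first token, and speedup as a ratio against a baseline. These definitions and function names are assumptions; the paper's exact formulas are not reproduced here.

```python
from statistics import median
from typing import Sequence


def throughput_tokens_per_s(total_output_tokens: int, wall_seconds: float) -> float:
    """Generated tokens divided by wall-clock benchmark duration."""
    return total_output_tokens / wall_seconds


def ttft_seconds(request_time: float, first_token_time: float) -> float:
    """Time to first token for one request, measured end-to-end."""
    return first_token_time - request_time


def p50(values: Sequence[float]) -> float:
    """Median, used here as the p50 of a latency sample."""
    return median(values)


def speedup_from_latency(baseline_s: float, candidate_s: float) -> float:
    """Ratio > 1 means the candidate is faster."""
    return baseline_s / candidate_s


def speedup_from_throughput(baseline_tps: float, candidate_tps: float) -> float:
    """Ratio > 1 means the candidate serves more tokens per second."""
    return candidate_tps / baseline_tps
```

The two speedup helpers invert each other's operands on purpose: lower is better for latency, higher is better for throughput.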

6. Benchmark regimes, empirical results, and the specialization frontier

The LLM-serving evaluation covers six scenarios, labeled A through F, each with model weights and HuggingFace reference, an accuracy checker, a workload benchmark script, and natural-language instructions (Kamahori et al., 7 May 2026). Scenario A evaluates standard Llama-3.1-8B on H100 against vLLM and SGLang using throughput $10$9, TTFT, and TPOT at arrival rates ±3\pm 30 requests per second. After ±3\pm 31 iterations, VibeServe matches vLLM with ±3\pm 32 and exceeds SGLang by approximately ±3\pm 33 on throughput and approximately ±3\pm 34 on TTFT. This is presented as evidence that generation-time specialization need not sacrifice performance even where existing stacks are already mature.

Scenarios B through F target opportunities that generic stacks miss. In B, Qwen3-32B code editing with predicted outputs is compared against vLLM autoregressive baseline ±3\pm 35 and vLLM speculative decoding baseline ±3\pm 36; VibeServe reaches ±3\pm 37 over ±3\pm 38 iterations by integrating a user-draft verifier in ±3\pm 39-token blocks and fine-tuning block size and acceptance bookkeeping. In C, Olmo-Hybrid-7B prompt caching on L4 exploits dual caches—paged KV for attention and snapshots for SSM layers—plus CUDA graphs, yielding 1×1\times0. In D, Moonshine Streaming ASR exposes a per-stream sliding-window encoder cache plus CUDA graphs, yielding a 1×1\times1 TTFT reduction.
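The idea in scenario B of verifying a user-supplied draft in fixed-size blocks can be sketched with greedy verification. This toy version is an assumption-laden simplification: `next_token` is a hypothetical stand-in for the model's greedy decoding step and is called once per position, whereas a real runtime would score a whole block in one batched forward pass; after the first mismatch the draft is simply abandoned rather than re-aligned.

```python
from typing import Callable, List, Sequence, Tuple


def verify_block(context: List[int], block: Sequence[int],
                 next_token: Callable[[List[int]], int],
                 eos: int) -> Tuple[List[int], bool]:
    """Emit the model's tokens while they agree with the drafted block.

    Returns (emitted tokens, whether the draft is still trustworthy).
    On a mismatch the model's own token is emitted as the correction.
    """
    emitted: List[int] = []
    for drafted in block:
        target = next_token(context + emitted)
        emitted.append(target)
        if target != drafted:
            return emitted, False
        if target == eos:
            break
    return emitted, True


def decode_with_draft(prompt: Sequence[int], draft: Sequence[int],
                      next_token: Callable[[List[int]], int],
                      block_size: int, eos: int,
                      max_new_tokens: int = 256) -> List[int]:
    """Greedy decoding that consumes a predicted output in blocks where it matches."""
    out: List[int] = []
    pos, draft_ok = 0, True
    while len(out) < max_new_tokens and (not out or out[-1] != eos):
        if draft_ok and pos < len(draft):
            block = list(draft[pos:pos + block_size])
            emitted, draft_ok = verify_block(list(prompt) + out, block,
                                             next_token, eos)
            pos += len(emitted)
        else:
            emitted = [next_token(list(prompt) + out)]   # plain autoregressive step
        for tok in emitted:
            out.append(tok)
            if tok == eos or len(out) >= max_new_tokens:
                break
    return out
```

Because every emitted token is the model's own greedy choice, the output is identical to plain greedy decoding regardless of draft quality; the draft only affects how many positions can be confirmed per verification step.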

The hardware-specific scenarios are equally explicit. In E, JSON constrained decoding on Apple M3 Pro improves over an mlx_lm baseline without XGrammar or speculation from 1×1\times2 s p50 latency to 1×1\times3 s, a 1×1\times4 speedup, by layering XGrammar masks, speculative decoding with a 1×1\times5B draft at 1×1\times6, and prefill chunk tuning. In F, Show-o2 unified vision-language serving improves on H100 from 1×1\times7 ms to 1×1\times8 ms, 1×1\times9 faster, via CUDA graphs, layout fusion, prefix trimming, and head trimming; on MacBook, the baseline PyTorch-MPS time of $1080$0 s is reduced to $1080$1 s, a $1080$2 speedup, via MLX port, prefix caches on body and head, classifier-free guidance stride $1080$3, and kernel-peak tuning.

The paper’s discussion frames these results as a trade-off between generation-time specialization and runtime generality. Search overhead ranges from $1080$4 h for JSON decoding to $1080$5 h for standard Llama on H100. Runtime generality still wins when a single stack must serve thousands of targets with minimal per-target overhead, whereas VibeServe is characterized as best when each deployment justifies its own search. The stated limitations are single-seed experiments, dependence on a user-provided correctness checker, and a non-trivial compute budget; open questions include direct optimization of dollar cost per token, concurrent scaling of one agentic pipeline to many diverse targets, and automatic detection of when a generic stack suffices versus when bespoke search is warranted.

A recurrent misconception would be to treat “VibeServe” as a single research artifact. The supplied literature instead uses the label for both a tele-immersive reference design associated with dynamic point clouds and vibrotactile floors, and an agentic system for synthesizing specialized LLM serving runtimes. This suggests that the term is best understood as context-dependent, with the substantive technical content determined by the surrounding architecture, metrics, and cited arXiv identifier.
