
Streaming XML Function Token Parser

Updated 13 August 2025
  • Streaming XML function token parser is a mechanism that incrementally extracts XML tokens from LLM outputs using event-driven, SAX-style parsing.
  • It dynamically maps XML tokens to typed function calls, ensuring smooth conversion of semantic text into executable actions.
  • The integrated multi-channel scheduler balances parallel and serial execution, significantly reducing latency in robotics and AI applications.

A streaming XML function token parser is a processing component designed to incrementally and efficiently parse and extract actionable function tokens from streaming XML, particularly in settings requiring immediate and concurrent execution of structured commands. Such architectures have risen to prominence in domains leveraging LLMs for embodied programming, robotic orchestration, and any scenario demanding rapid, on-the-fly conversion from semantic text outputs to executable function calls (Gong et al., 7 Aug 2025). This paradigm leverages event-driven parsing to transform raw token output—typically generated by autoregressive LLMs—into a temporally aligned sequence of XML-structured function tokens, which can then drive synchronous and asynchronous behavior in complex systems.

1. SAX-Style Event-Driven Streaming Parsing

The core of the streaming XML function token parser is a SAX-inspired, event-driven parser embedded directly within the execution shell of the hosting system (Gong et al., 7 Aug 2025). This parser operates over the output stream $O = [o_1, o_2, \ldots, o_n]$ produced in real time by an LLM or another streaming source.

  • Event Model: As each token arrives, the parser incrementally recognizes parse events (e.g., start-element, end-element, text, reference). In the GhostShell framework, this mapping is formalized as a transformation $\mathcal{P}(O) \to S$, where $S = [s_1, s_2, \ldots, s_l]$ with each $s_i$ drawn from $\{\text{AFToken}, \text{RFToken}, \text{SCFToken}, \text{FChar}, \text{FRef}\}$.
  • Streaming Buffering: Incomplete XML elements are buffered until sufficient tokens have arrived to construct a fully-formed function token or action element; no global document buffer is needed, ensuring minimal latency and constant or near-constant space in practice.

Token Types and Construction:

Token Type | Purpose | XML Analogy
AFToken | Activates a stateful function call | Start Tag
RFToken | Resets/releases a stateful function | End Tag
SCFToken | Atomic, self-terminating function call | Empty Element
FChar/FRef | Literal/encoded text or references | Character Data

Elements are constructed according to the grammar $E ::= \text{FChar} \mid \text{FRef} \mid \text{SCFToken} \mid (\text{AFToken} \cdot \text{Elements} \cdot \text{RFToken})$.

This event-driven mechanism allows the parser to initiate function calls or behaviors as soon as an actionable XML element is parse-complete—well before the full model output is available.
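
To make the event model concrete, the following minimal sketch (not the GhostShell implementation) uses Python's incremental SAX parser to emit function-token events as soon as each element boundary arrives; the element names in the example stream (actions, speak, stand_up) are illustrative assumptions rather than tags defined by the cited framework.

```python
# SAX-style streaming sketch: events fire per element boundary, so downstream
# stages can act before the full LLM output has arrived. Tag names are hypothetical.
import xml.sax

class FunctionTokenHandler(xml.sax.ContentHandler):
    """Prints the function-token type corresponding to each SAX event."""

    def startElement(self, name, attrs):
        # Start tag -> AFToken (activates a stateful function call).
        print(f"AFToken: {name} {dict(attrs.items())}")

    def endElement(self, name):
        # End tag -> RFToken (releases the stateful call). An empty element
        # such as <stand_up/> produces an immediate start/end pair, which is
        # the SCFToken case.
        print(f"RFToken: {name}")

    def characters(self, content):
        # Character data -> FChar tokens.
        if content.strip():
            print(f"FChar: {content.strip()!r}")

parser = xml.sax.make_parser()                 # expat-backed incremental parser
parser.setContentHandler(FunctionTokenHandler())

# Feed the LLM output chunk by chunk; no global document buffer is kept.
for chunk in ['<actions><speak voice="en">Hel', 'lo</speak>', '<stand_up/></actions>']:
    parser.feed(chunk)
parser.close()
```

Because the underlying expat parser reports events incrementally, the speak activation is already visible to downstream stages after the first chunk, even though its closing tag arrives later.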

2. Dynamic Function Interface Mapping

Following parsing, each XML element $E$ is mapped to a function interface through a dynamic interface mapper. This step translates structural XML token information (tag names, attributes, nesting) into concrete programming actions and arguments for an embodied system (Gong et al., 7 Aug 2025).

  • Attribute-to-Parameter Alignment: Each function parameter (FParam) extracted from XML attributes is typed and converted using an operator $T(\cdot)$ that ensures the stream-provided string matches the expected type of the callable interface.
  • State Transition Model: The mapper models the execution as a state transition:

$\Sigma_0 \xrightarrow{\text{AFToken}} \Sigma_\text{active} \xrightarrow{\text{RFToken}} \Sigma_0$

This enables proper tracking of function activation lifecycles, including nested or recursive calls (a minimal stack-based sketch follows this list).

  • Formal Mapping Expression: The mapping is denoted $M(E) = (F, \vec{\theta})$, where $F$ is the function and $\vec{\theta}$ is the vector of typed arguments.
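
The state-transition bookkeeping above can be sketched with a simple stack (an illustrative assumption about how nesting might be tracked, not the paper's code): an empty stack corresponds to $\Sigma_0$, and each AFToken/RFToken pair pushes and pops an activation.

```python
# Minimal lifecycle tracker for the Sigma_0 -> Sigma_active -> Sigma_0 model;
# a stack accommodates nested or recursive activations. Names are illustrative.
class LifecycleTracker:
    def __init__(self):
        self.stack: list[str] = []      # empty stack corresponds to Sigma_0

    def on_af_token(self, tag: str) -> None:
        self.stack.append(tag)          # enter Sigma_active (possibly nested)

    def on_rf_token(self, tag: str) -> None:
        opened = self.stack.pop()       # leave the innermost activation
        assert opened == tag, f"mismatched release: {tag} closes {opened}"

    @property
    def state(self) -> str:
        return "Sigma_active" if self.stack else "Sigma_0"

tracker = LifecycleTracker()
tracker.on_af_token("body"); tracker.on_af_token("arm")   # nested activation
tracker.on_rf_token("arm"); tracker.on_rf_token("body")
print(tracker.state)                                       # -> Sigma_0
```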

This layer decouples declarative semantic markup (XML) from the imperative function call interface, enabling flexible adaptation even as target APIs or action sets evolve.
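
A compact sketch of the mapping step $M(E) = (F, \vec{\theta})$ is shown below; the tag-to-callable registry, the set_led function, and its signature are hypothetical examples, and Python type annotations stand in for the $T(\cdot)$ conversion operator.

```python
# Dynamic interface mapping sketch: resolve the callable from the tag name and
# coerce stream-provided attribute strings to the declared parameter types.
import inspect

def set_led(color: str, brightness: int) -> None:
    """Hypothetical callable interface exposed by the embodied system."""
    print(f"LED -> {color} at brightness {brightness}")

REGISTRY = {"set_led": set_led}   # XML tag name -> callable interface

def map_element(tag: str, attrs: dict[str, str]):
    """Return (F, theta): the target function and its typed argument vector."""
    func = REGISTRY[tag]
    theta = {}
    for name, param in inspect.signature(func).parameters.items():
        raw = attrs[name]                       # string value from the stream
        convert = param.annotation              # T(.): the declared type
        if convert is inspect.Parameter.empty:  # fall back to plain strings
            convert = str
        theta[name] = convert(raw)
    return func, theta

func, theta = map_element("set_led", {"color": "red", "brightness": "128"})
func(**theta)   # -> LED -> red at brightness 128
```

Because the mapper consults the callable's signature at run time, the XML schema and the target API can evolve independently, which is the decoupling described above.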

3. Multi-Channel Scheduling and Serial-Parallel Coordination

After mapping, a multi-channel scheduler coordinates execution across multiple logical or physical subsystems ("channels") (Gong et al., 7 Aug 2025). This module is responsible for orchestrating concurrency and ensuring required synchronization semantics:

  • Intra-channel Synchronous Execution: Functions assigned to the same channel $C_i$ (e.g., a manipulator or speech engine) are queued and executed sequentially. This is represented as $\Gamma(F_i, F_j)$.
  • Inter-channel Asynchronous Execution: Functions mapped to disjoint channels can execute concurrently, denoted $\Lambda(F_i, F_j)$.
  • Global Blocking: The scheduler enforces blocking when functions are mapped to a main channel $C_0$, ensuring serializability when cross-component synchronization is critical:

$\forall F_i, F_j \text{ with } \alpha(F_i) = C_0 \land j > i \implies \Gamma(F_i, F_j)$

  • Textual Token Handling: Textual and reference data (from FChar, FRef) are scheduled as blocking actions on the main channel by default.

This enables execution plans in which, for example, "stand up" (on channel $\mathcal{B}$: body) and "play audio" (on channel $\mathcal{A}$: audio) can proceed in parallel, while complex textual output sequences are serialized as needed.
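
The coordination policy can be illustrated with a minimal asyncio sketch, assuming per-channel queues and hypothetical channel names ("body", "audio", and "C0" for the main channel); this is a simplified illustration of the policy, not the GhostShell scheduler.

```python
# Serial-parallel coordination sketch: same-channel calls run in submission
# order, different channels run concurrently, and the main channel "C0"
# blocks until all other channels have drained.
import asyncio

class MultiChannelScheduler:
    def __init__(self):
        self.queues: dict[str, asyncio.Queue] = {}
        self.workers: dict[str, asyncio.Task] = {}

    def _channel(self, name: str) -> asyncio.Queue:
        if name not in self.queues:
            q = asyncio.Queue()
            self.queues[name] = q
            self.workers[name] = asyncio.create_task(self._drain(q))
        return self.queues[name]

    async def _drain(self, q: asyncio.Queue):
        while True:                      # intra-channel: strictly sequential
            coro = await q.get()
            await coro
            q.task_done()

    async def submit(self, channel: str, coro):
        if channel == "C0":              # global blocking on the main channel
            await asyncio.gather(*(q.join() for q in self.queues.values()))
            await coro
        else:                            # inter-channel: asynchronous
            self._channel(channel).put_nowait(coro)

async def act(name: str, seconds: float):
    await asyncio.sleep(seconds)
    print("done:", name)

async def main():
    sched = MultiChannelScheduler()
    await sched.submit("body", act("stand up", 0.2))     # channel B: body
    await sched.submit("audio", act("play audio", 0.1))  # channel A: audio
    await sched.submit("C0", act("report status", 0.0))  # waits for both channels
    await asyncio.gather(*(q.join() for q in sched.queues.values()))

asyncio.run(main())
```

Here "stand up" and "play audio" overlap because they sit on different channels, while the C0 submission does not start until both channels have drained, mirroring the blocking rule above.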

4. Performance Evaluation and Quantitative Results

Comprehensive grounded evaluations were performed on real robotic hardware (COCO) across 34 interactive task types, spanning sequential, parallel, and event-driven flows (Gong et al., 7 Aug 2025). Key reported metrics:

  • Behavioral Correctness Metric (BCM): The system achieved a mean BCM of 0.85 with Claude-4 Sonnet; the metric weights correctness by task complexity, with task normalization defined by explicit equations (Equations 1 and 2 in the original).
  • Response Latency: Function invocation showed up to 66× lower response times than native batch-style LLM function-calling APIs.
  • Long-horizon Performance: The architecture proved robust and generalizable across long, multimodal task horizons.

These results support the claim that the streaming parser enables efficient "reasoning-while-acting," minimizing total system latency by beginning action execution as soon as the required tokens become available.

5. Broader Applications and Implications

The utility of the streaming XML function token parser extends beyond robotics:

  • Digital Human Avatars and Game Agents: Streaming parsing and function dispatch allow dynamic, context-sensitive behaviors controlled by real-time LLM output.
  • General AI Agents: The unified approach for text and function output via the XML stream enables "streaming programming" for complex, domain-agnostic settings, e.g., real-time web services, multi-agent systems.
  • Embodied Interaction and Feedback: The architecture is amenable to the integration of feedback or self-correction "loops" for error detection, retry, or adaptation—a direction flagged for future research.

Strategic extension of this architecture could encompass virtual embodiments, enriched event-driven error handling, and adversarial input detection.

6. Architectural Significance and Research Directions

The architectural composition—event-driven XML parsing, dynamic type-checked interface mapping, and multi-channel scheduling—realizes an efficient pipeline for low-latency, concurrent stream-to-function-call interpretation at scale (Gong et al., 7 Aug 2025). This paradigm:

  • Bridges semantic LLM output and immediate real-world execution.
  • Provides compositionality suitable for both tightly integrated embodied systems and distributed agent collectives.
  • Defines a template for streaming, incremental program synthesis and execution by integrating declarative data (XML tokens) with imperative programmatic control (function calls).

This suggests that future systems leveraging LLMs for closed-loop environments will broadly benefit from the event-driven, streaming token parser approach, particularly where low-latency, interleaved observation and actuation are critical. A plausible implication is a shift toward self-timing, streaming pipelines not just for embodied robotics but for any domain subject to continual reasoning and real-time program synthesis.

Summary Table: Streaming XML Function Token Parser Components

Component | Role | Distinctive Feature
Streaming Parser | Incremental SAX-style XML function token extraction | Minimal-latency, event-driven output
Interface Mapper | Dynamically maps XML tokens to concrete function calls | Type-checked, schema-aligned arguments
Scheduler | Orchestrates parallel and serial execution over channels | Serial-parallel hybrid scheduling/coordination

In summary, the streaming XML function token parser, as instantiated in the GhostShell system, is a key enabler of low-latency, concurrent behavioral programming from streaming LLM output to real-world or virtualized action (Gong et al., 7 Aug 2025). It defines a rigorous event-driven mechanism for parsing, mapping, and executing structured function calls in diverse high-throughput, interactive AI systems.
