User Journey Coverage Score (UJCS)
- User Journey Coverage Score (UJCS) is a metric suite that quantitatively evaluates agents' adherence to defined, multi-step processes in structured domains such as customer support and workflow automation.
- It employs graph-based SOP models to measure stepwise tool call accuracy, parameter matching, and policy compliance, ensuring rigorous process alignment.
- UJCS also supports use case recovery by aggregating actor correctness, naming accuracy, path fidelity, and behavioral coverage, offering granular insights into functional and sequential compliance.
The User Journey Coverage Score (UJCS) is a family of automated evaluation metrics designed to quantify the adherence of agents or generative systems to specified, multi-step processes (user journeys) over structured domains such as customer support, workflow automation, and use case recovery. UJCS accommodates both policy-aware agent evaluation and use case extraction by rigorously scoring system outputs against a reference, often graph-based, specification. By enforcing stepwise, parameter-sensitive, and coverage-aware correctness, UJCS enables granular assessments of functional compliance and process alignment. Notably, UJCS has been instantiated in two domains: policy execution for customer support LLM agents (Balaji et al., 2 Jan 2026) and use case recovery from software artifacts (Xiao et al., 15 Dec 2025), both featuring domain-appropriate mathematical formalizations unified by a focus on strict sequential and coverage fidelity.
1. Formal Definitions and Metric Variants
There exist two primary variants of UJCS:
- Policy Compliance for LLM Agents (Balaji et al., 2 Jan 2026): Here, UJCS quantifies the fidelity with which an agent executes a standard operating procedure (SOP), represented by a task graph, over simulated conversations.
- Use Case Recovery Coverage (Xiao et al., 15 Dec 2025): UJCS aggregates multiple aspects of alignment when comparing recovered use cases to ground truth, emphasizing actor, naming, path structure, and coverage.
1.1. Policy-Adherence UJCS
Given an SOP-encoded expected trace $E = (e_1, \ldots, e_n)$ of tool calls and an agent-produced actual trace $A = (a_1, \ldots, a_m)$, UJCS is computed by:
- Tool Call Accuracy per conversation $c$:
$$\mathrm{TCA}_c = \begin{cases} 0, & \text{if } m \neq n \text{ or } \operatorname{tool}(a_i) \neq \operatorname{tool}(e_i) \text{ for some } i, \\ \dfrac{1}{n}\sum_{i=1}^{n} \dfrac{|P_i^{\mathrm{correct}}|}{|P_i|}, & \text{otherwise,} \end{cases}$$
where $P_i$ denotes the expected parameters of call $e_i$ and $P_i^{\mathrm{correct}} \subseteq P_i$ those supplied correctly by the agent; any skipped, reordered, or misnamed call yields zero.
- UJCS Aggregation over $C$ evaluated conversations:
$$\mathrm{UJCS} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{TCA}_c.$$
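A minimal sketch of this computation, assuming each trace is represented as a list of (tool_name, params_dict) pairs; this schema is illustrative, not the paper's published data structure:

```python
def tool_call_accuracy(expected, actual):
    """Per-conversation TCA: 0 on any skipped, reordered, or misnamed
    tool call; otherwise the mean fraction of correctly supplied
    parameters. Traces are lists of (tool_name, params_dict) pairs."""
    if len(actual) != len(expected):
        return 0.0  # skipped or extra calls
    fractions = []
    for (exp_tool, exp_params), (act_tool, act_params) in zip(expected, actual):
        if exp_tool != act_tool:
            return 0.0  # misnamed or reordered call
        correct = sum(1 for k, v in exp_params.items() if act_params.get(k) == v)
        fractions.append(correct / len(exp_params) if exp_params else 1.0)
    return sum(fractions) / len(fractions) if fractions else 1.0

def ujcs(conversations):
    """Aggregate UJCS: mean TCA across (expected, actual) trace pairs."""
    scores = [tool_call_accuracy(e, a) for e, a in conversations]
    return sum(scores) / len(scores) if scores else 0.0
```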
1.2. Use Case Recovery UJCS
UJCS is defined as the equally-weighted mean of four sub-scores:
- Actor correctness ($S_{\text{actor}}$)
- Name accuracy ($S_{\text{name}}$)
- Path fidelity ($S_{\text{path}}$)
- Behavioral coverage ($S_{\text{cov}}$)
Let $\mathcal{U}$ be the set of reference use cases and $M$ the number of missed use cases. Then:
- Omission rate: $O = M / |\mathcal{U}|$.
- Behavioral coverage: $S_{\text{cov}} = 1 - O$.
- Sub-score aggregation:
$$\mathrm{UJCS} = \tfrac{1}{4}\left(\bar{S}_{\text{actor}} + \bar{S}_{\text{name}} + \bar{S}_{\text{path}} + S_{\text{cov}}\right),$$
with mean values $\bar{S}$ taken across matched use cases.
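A compact sketch of this aggregation, assuming per-match sub-scores have already been computed (the dict field names are illustrative, not from the source):

```python
def use_case_ujcs(matched_scores, num_reference, num_missed):
    """matched_scores: list of dicts with 'actor', 'name', 'path'
    sub-scores in [0, 1], one dict per matched use case."""
    coverage = 1.0 - num_missed / num_reference  # S_cov = 1 - omission rate
    if not matched_scores:
        return coverage / 4.0  # means over matches are undefined; treat as 0
    mean = lambda key: sum(s[key] for s in matched_scores) / len(matched_scores)
    return (mean("actor") + mean("name") + mean("path") + coverage) / 4.0
```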
2. Scoring Dimensions and Mathematical Details
UJCS metrics operate by enforcing both stepwise and holistic system compliance across policy and structure dimensions.
2.1. Tool Trace Alignment and Parameter Accuracy
UJCS for policy execution is highly sensitive to trace order and parameter matching. The metric assigns zero if the agent skips, reorders, or misnames any tool call; otherwise, it computes the fraction of correctly supplied parameters, allowing fine-grained error attribution (Balaji et al., 2 Jan 2026).
2.2. Semantic and Structural Alignment
In use case recovery:
- Actor correctness ($S_{\text{actor}}$): Weighted combination of SBERT-based semantic similarity and role taxonomy alignment.
- Name accuracy ($S_{\text{name}}$): Cosine similarity over verb and noun phrases.
- Path fidelity ($S_{\text{path}}$): Jaccard index over directory segments.
- Behavioral coverage ($S_{\text{cov}}$): Proportion of ground-truth use cases matched (Xiao et al., 15 Dec 2025).
Weighting for each sub-score, as well as detailed computation, is precisely defined in the original sources.
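As a minimal illustration of two sub-scores, assuming names are compared with an off-the-shelf SBERT model (all-MiniLM-L6-v2 is an assumed choice) and paths are split on "/"; the paper's exact weighting and phrase extraction are not reproduced here:

```python
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def path_fidelity(ref_path: str, cand_path: str) -> float:
    """Jaccard index over directory segments, e.g. 'src/auth/login.py'."""
    ref, cand = set(ref_path.split("/")), set(cand_path.split("/"))
    return len(ref & cand) / len(ref | cand) if ref | cand else 1.0

def name_accuracy(ref_name: str, cand_name: str) -> float:
    """Cosine similarity of embedded use-case names; the paper scores
    verb and noun phrases separately, which this sketch collapses."""
    emb = _model.encode([ref_name, cand_name])
    return float(util.cos_sim(emb[0], emb[1]))
```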
3. Interpretation and Diagnostic Guidance
UJCS scores span the unit interval $[0, 1]$, where higher values indicate stronger adherence or coverage.
| UJCS Range | Interpretation |
|---|---|
| 1.0 | Perfect compliance: no step skipped, reordered, or mis-parameterized |
| [0.8, 1.0) | Strong adherence, minor slips |
| [0.5, 0.8) | Moderate compliance, partial coverage |
| < 0.5 | Poor adherence, substantial errors or omissions |
A plausible implication is that UJCS thresholds (e.g., 0.9) can act as cutoffs for deployment readiness, as suggested in business-facing scenarios (Balaji et al., 2 Jan 2026), while for recovery tasks, diminishing UJCS strongly correlates with both process and semantic drift (Xiao et al., 15 Dec 2025).
4. Methodological Workflow
The computation of UJCS requires deterministic ground-truth modeling, structured output logging, and aggregation logic:
- Policy Evaluation:
- Encode the SOP as a DAG with tasks, inputs, and branching.
- Generate reference traces by BFS enumeration of root-to-leaf paths (sketched below, after this list).
- Simulate agent-user interaction and log tool calls.
- Compute per-conversation $\mathrm{TCA}_c$.
- Average across conversations to obtain UJCS (Balaji et al., 2 Jan 2026).
- Use Case Recovery:
- For each ground-truth use case, match to the most similar candidate.
- Compute $S_{\text{actor}}$, $S_{\text{name}}$, and $S_{\text{path}}$ for each match.
- Aggregate omission information into $S_{\text{cov}}$.
- Form UJCS via arithmetic mean of sub-scores (Xiao et al., 15 Dec 2025).
Automating trace extraction, branch-condition simulation, and systematic error analysis is a procedural recommendation of both sources.
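Reference-trace generation can be sketched as breadth-first enumeration of root-to-leaf paths over the task graph; the adjacency-dict schema below is an assumption, not a published data structure:

```python
from collections import deque

def enumerate_journeys(dag, root):
    """BFS enumeration of all root-to-leaf paths in a task DAG.
    dag: dict mapping task -> list of successor tasks (hypothetical schema)."""
    journeys, queue = [], deque([[root]])
    while queue:
        path = queue.popleft()
        successors = dag.get(path[-1], [])
        if not successors:  # leaf task: path is a complete journey
            journeys.append(path)
        for nxt in successors:
            queue.append(path + [nxt])
    return journeys

# Example: a hypothetical triage SOP with one branch point
sop = {"verify_identity": ["check_order"],
       "check_order": ["issue_refund", "escalate"]}
print(enumerate_journeys(sop, "verify_identity"))
# [['verify_identity', 'check_order', 'issue_refund'],
#  ['verify_identity', 'check_order', 'escalate']]
```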
5. Strengths, Limitations, and Cross-Domain Properties
UJCS enforces strict sequence and parameter validity, capturing multi-step logic and branching in complex SOPs (Balaji et al., 2 Jan 2026), and multidimensional fidelity in use case recovery (Xiao et al., 15 Dec 2025).
Strengths:
- High sensitivity to full-path adherence.
- Capability to dissect errors at parameter, action, or abstraction level.
- Applicability to deterministic workflow scenarios.
Limitations:
- Unforgiving of minor reordering or synonymous variants: any trace misalignment results in a zero score.
- Ignores aspects unrelated to structural compliance, such as dialog quality or recovery from errors.
- Assumes deterministic conditions; substantial real-world nondeterminism may require adaptive extensions.
- In use case extraction, high domain specificity and codebase modularity can depress UJCS by increasing omission and misalignment rates (Xiao et al., 15 Dec 2025).
This suggests UJCS is best used in concert with metrics targeting complementary desiderata such as conversational quality or user satisfaction.
6. Practical Application and Domain Insights
For policy-adherence evaluation, deployment requires: (i) reliable SOP graph construction; (ii) exhaustive generation of valid user journeys; (iii) deterministic simulation of agent interactions; and (iv) comprehensive, structured agent output logging (Balaji et al., 2 Jan 2026). Use case recovery demands multi-stage semantic, syntactic, and behavioral matching, with acknowledged sensitivity to abstraction variance and domain complexity (Xiao et al., 15 Dec 2025).
Empirically, well-structured, shallow task domains consistently yield high UJCS (≈ 0.8–0.9); deep, multi-module, or domain-specific projects typically realize significantly lower scores (0.3–0.6) due to increased path and role ambiguity (Xiao et al., 15 Dec 2025).
UJCS thresholds can serve as regression gates, deployment criteria, or drift monitors. It is recommended to validate SOP or reference-set accuracy whenever business or system changes occur.
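A hypothetical regression gate along these lines (the 0.9 threshold echoes the business-facing suggestion above; the function and message are illustrative):

```python
UJCS_GATE = 0.9  # assumed deployment-readiness threshold

def check_deployment_gate(ujcs_score: float, gate: float = UJCS_GATE) -> None:
    """Fail a CI run or release check when UJCS drops below the gate."""
    if ujcs_score < gate:
        raise RuntimeError(
            f"UJCS {ujcs_score:.3f} below deployment gate {gate}; "
            "inspect per-conversation TCA for skipped or misnamed calls.")
```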
7. Metric Comparisons and Future Directions
UJCS extends beyond conventional task-completion and tool-selection metrics by demanding stepwise and parameter fidelity rather than goal-only checks (Balaji et al., 2 Jan 2026). Tool Trace Alignment operates as a binary precursor, while UJCS admits partial parameter correctness, enabling finer-grained error attribution. In the current literature, no formal correlation between UJCS and external subjective metrics (e.g., conversational quality) has been established.
Future refinement directions may include adaptation for nondeterministic environments, incorporation of partial trace equivalence, or hybridization with quality-focused and human-centric ratings, particularly for mixed-initiative or dialog-mediated workflows.