Quantitatively evaluate the accuracy of the OnGoal goal pipeline

Design and conduct a quantitative evaluation of OnGoal's three-stage goal pipeline (infer, merge, evaluate) using expert-annotated benchmark datasets, validating goal identification, merge operations, and evaluation categories against ground-truth labels.
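
A minimal sketch of what such an evaluation harness could look like, assuming a hypothetical benchmark format in which each expert-annotated turn lists gold goals, gold merge decisions, and gold evaluation categories. The record fields and the `evaluate_stage_scores` function below are illustrative assumptions, not OnGoal's actual data model or code.

```python
from dataclasses import dataclass

# Hypothetical benchmark record: expert annotations for one conversation turn.
# Field names are illustrative assumptions, not OnGoal's actual schema.
@dataclass
class GoldTurn:
    turn_id: str
    gold_goals: set    # expert-identified goal strings
    gold_merges: set   # expert-approved (goal_a, goal_b) merge pairs
    gold_labels: dict  # goal -> evaluation category (e.g., "met", "unmet", "partial")

def prf(pred: set, gold: set):
    """Precision / recall / F1 for one set-valued prediction."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def evaluate_stage_scores(gold_turns, pipeline_outputs):
    """Aggregate per-stage scores across annotated turns.

    `pipeline_outputs` maps turn_id -> {"goals": ..., "merges": ..., "labels": ...},
    produced by the goal pipeline under test.
    """
    goal_f1, merge_f1, label_acc = [], [], []
    for turn in gold_turns:
        out = pipeline_outputs[turn.turn_id]
        goal_f1.append(prf(set(out["goals"]), turn.gold_goals)[2])
        merge_f1.append(prf(set(out["merges"]), turn.gold_merges)[2])
        # Evaluation-stage accuracy over goals identified by both sides.
        shared = turn.gold_labels.keys() & out["labels"].keys()
        if shared:
            correct = sum(out["labels"][g] == turn.gold_labels[g] for g in shared)
            label_acc.append(correct / len(shared))
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {
        "infer_f1": mean(goal_f1),
        "merge_f1": mean(merge_f1),
        "evaluate_accuracy": mean(label_acc),
    }
```

Exact set matching of goal strings is a placeholder; in practice, inferred goals would need fuzzy or semantic alignment to the expert annotations before scoring.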

Background

OnGoal employs a three-stage pipeline that infers, merges, and evaluates conversational goals during multi-turn dialogue with LLMs. The paper focused on user-reported accuracy and usability impacts rather than formal pipeline validation.

A quantitative, benchmark-driven assessment would clarify the pipeline's reliability and generalizability, enabling more rigorous comparisons across models, prompts, and task contexts.
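
Building on per-stage scores like those sketched above, comparisons across models and prompt variants could be organized as in the following sketch; `run_pipeline`, the model identifiers, and the prompt names are hypothetical placeholders, not part of OnGoal.

```python
import itertools

MODELS = ["model-a", "model-b"]       # hypothetical model identifiers
PROMPTS = ["prompt-v1", "prompt-v2"]  # hypothetical prompt variants

def run_pipeline(model, prompt, gold_turns):
    """Placeholder: run the infer/merge/evaluate pipeline with one configuration
    and return {turn_id: {"goals": ..., "merges": ..., "labels": ...}}."""
    raise NotImplementedError

def compare_configurations(gold_turns):
    """Score every model x prompt configuration on the same annotated turns."""
    rows = []
    for model, prompt in itertools.product(MODELS, PROMPTS):
        outputs = run_pipeline(model, prompt, gold_turns)
        scores = evaluate_stage_scores(gold_turns, outputs)  # from the sketch above
        rows.append((model, prompt, scores))
    # Print per-stage scores per configuration for side-by-side comparison.
    for model, prompt, s in rows:
        print(f"{model:10s} {prompt:10s} "
              f"infer_f1={s['infer_f1']:.2f} merge_f1={s['merge_f1']:.2f} "
              f"eval_acc={s['evaluate_accuracy']:.2f}")
    return rows
```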

References

"However, quantitatively evaluating our pipeline's accuracy, such as on expert-annotated benchmarks, remains untested."

OnGoal: Tracking and Visualizing Conversational Goals in Multi-Turn Dialogue with Large Language Models (2508.21061 - Coscia et al., 28 Aug 2025) in Section 7.2 Limitations and Future Work