- The paper introduces EXACT, a new video-language benchmark built from 3,521 VQA pairs covering 11 skilled activities across six domains.
- It reveals a marked performance gap: GPT-4o reaches 44.70% accuracy, versus 82.02% for human experts.
- The study underscores the need for VLMs capable of fine-grained action assessment and expert-aligned commentary in complex activity domains.
Expert Action Assessment: The EXACT Benchmark
The paper "EXACT: A Video-Language Benchmark for Expert Action Analysis" introduces a novel benchmark, EXACT, engineered to evaluate the expert-level understanding of skilled human activities. The authors present an extensive dataset comprising 3,521 video-question-answer (VQA) pairs across 11 skilled activities within six domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. This benchmark arises from the need for a more nuanced evaluation tool, enabling a sophisticated understanding of skilled human actions that traditional Vision-LLMs (VLMs) currently lack.
The benchmark targets two significant deficiencies in existing models: visual representations that fail to capture the expert-level knowledge crucial for skill learning, and the absence of benchmarks focused on fine-grained action understanding rather than generic activity recognition. Most previous datasets demand only coarse categorization of activities and rarely require a subtler grasp of skill.
The EXACT benchmark uses a multiple-choice question format, requiring models to pick the correct expert commentary out of a set of distractors. The yardstick is how closely a model matches human expert performance, measured by accuracy. Current state-of-the-art VLMs fall well short of this bar: GPT-4o achieves 44.70% accuracy, compared with 82.02% for human experts. Even non-expert humans surpass the best-performing VLMs, highlighting the gap between machine understanding and human expertise.
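A minimal evaluation loop for such a multiple-choice setup might look like the sketch below. It reuses the hypothetical `ExactSample` fields from the earlier snippet, and `McqVlm` is an assumed stand-in interface for whatever VLM wrapper is being scored; it is not a real API from the paper.

```python
from typing import Iterable, List, Protocol


class McqVlm(Protocol):
    # Assumed interface: any model wrapper that returns one option index
    # given a video, a question, and the candidate commentaries.
    def predict(self, video_path: str, question: str, options: List[str]) -> int: ...


def accuracy(samples: Iterable[ExactSample], model: McqVlm) -> float:
    """Fraction of multiple-choice questions where the model picks the expert answer."""
    samples = list(samples)
    correct = sum(
        model.predict(s.video_path, s.question, s.options) == s.answer_index
        for s in samples
    )
    return correct / len(samples)
```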
The paper details the construction of the EXACT benchmark through a four-stage pipeline designed to yield high-quality evaluation samples: pre-processing transcribed expert commentaries into concise, coherent feedback suitable for evaluation; generating multiple-choice question-answer pairs; filtering out biased or trivially answerable items; and, finally, validation by domain experts.
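The outline below sketches how such a four-stage flow could be wired together. Every helper here is a hypothetical placeholder standing in for the corresponding stage, not the authors' implementation; the final stage (review by domain experts) is a manual step and is only noted in a comment.

```python
from typing import Dict, List


def preprocess(commentary: str) -> str:
    """Stage 1: condense a raw transcribed commentary into concise feedback (placeholder)."""
    return " ".join(commentary.split())


def generate_mcq(feedback: str) -> Dict:
    """Stage 2: pair the feedback with a question and plausible distractors (placeholder)."""
    return {
        "question": "Which expert commentary best matches this performance?",
        "options": [feedback, "<distractor 1>", "<distractor 2>", "<distractor 3>"],
        "answer_index": 0,
    }


def passes_bias_filter(qa: Dict) -> bool:
    """Stage 3: drop degenerate pairs (real filters would also catch questions
    answerable from the text alone, without watching the video)."""
    return len(set(qa["options"])) == len(qa["options"])


def build_benchmark(raw_commentaries: List[str]) -> List[Dict]:
    """Stages 1-3 in sequence; stage 4 (expert validation) happens offline."""
    candidates = (generate_mcq(preprocess(c)) for c in raw_commentaries)
    return [qa for qa in candidates if passes_bias_filter(qa)]
```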
Experimental evaluations with state-of-the-art VLMs reveal consistent deficiencies when the models are tasked with expert-level understanding of physical human skills. GPT-4o and Gemini 1.5 Pro emerge as the leading models, yet they still lag substantially behind human experts in comprehending fine-grained action sequences, especially in domains like Sports, Music, and Dance. The results reflect the models' struggles with complex physical tasks that require precise temporal coordination and perceptual accuracy.
The categorization of expert feedback into 'Good Execution' and 'Tips for Improvement' highlights where current models either misinterpret skilled execution or fail to provide improvement advice that aligns with expert standards. Notably, proprietary models proved relatively stronger on the latter category, which demands a firmer grasp of execution errors and how to correct them.
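Reporting results along this split amounts to a per-category accuracy breakdown. The sketch below shows one way to compute it, again reusing the hypothetical `ExactSample` fields introduced earlier rather than any format from the paper.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_feedback_type(samples: List[ExactSample], predictions: List[int]) -> Dict[str, float]:
    """Accuracy split by feedback category ('Good Execution' vs. 'Tips for Improvement')."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for sample, predicted_index in zip(samples, predictions):
        totals[sample.feedback_type] += 1
        if predicted_index == sample.answer_index:
            hits[sample.feedback_type] += 1
    return {category: hits[category] / totals[category] for category in totals}
```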
A key observation is that model scale and architecture noticeably affect performance on the EXACT benchmark. Larger models process detailed video representations more capably, yet even the best remain far from expert-level action understanding.
Finally, the paper discusses the benchmark's limitations, such as its limited coverage of the full breadth of real-world activities and the ethical concerns of handling personally identifiable video data. Even so, EXACT sets the stage for future advances in VLMs aimed at closing the gap between machines and humans in expert-level skill understanding, supporting the development of virtual coaching systems and AI-based learning assistance.
In conclusion, the EXACT benchmark offers a rigorous evaluation framework that exposes significant room for progress in the VLM landscape, calling for models grounded in finer-grained, expert-aligned understanding of skilled human activities. Its findings should help steer research in this space toward capabilities that approach genuine human expertise.