GameSight: Knowledge-Enhanced Visual Reasoning for Soccer Commentary

This presentation explores GameSight, a groundbreaking two-stage framework that transforms automatic soccer commentary generation from simple anonymized text into entity-rich, statistically informed narratives rivaling human broadcasters. By solving the dual challenge of visual entity alignment and knowledge integration, GameSight demonstrates how AI can reason compositionally over video, context, and historical statistics to produce commentary with the depth and precision audiences expect from live sports coverage.
Script
When you watch a soccer match on television, commentators seamlessly weave together what's happening on the pitch with player statistics, team context, and tactical insight. Current AI commentary systems can't do this—they produce anonymized, shallow descriptions that miss the entities, context, and knowledge depth that make commentary informative. GameSight changes that by treating commentary generation as a knowledge-enhanced visual reasoning problem.
Traditional end-to-end models fail because they treat commentary as pure text generation, outputting generic descriptions like "a player scored" instead of "Martinez converts his third goal this season." The researchers found that in 84.5% of cases, humans could identify the correct entity only by analyzing complex visual cues (faces in close-ups, jersey numbers, and match context), not just by tracking players across frames.
GameSight solves this through a two-stage architecture, beginning with visual reasoning for entity alignment.
Stage one deploys fine-grained shot analysis across long views, medium shots, and crucial close-ups where faces and numbers become visible. A chain-of-thought process composes evidence step by step, weighing the visual signal, the current match timeline, player roles, and recent events. A Q-former-based mechanism grounds commentary to specific frames, handling the temporal offset when close-up shots arrive seconds after the actual event.
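The temporal offset mentioned above can be made concrete with a small sketch. This is not the paper's Q-former implementation, just an illustration, under assumed shot annotations, of why grounding must look past the event timestamp: a delayed close-up is often the only shot where faces and jersey numbers are legible.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    start: float  # seconds from kickoff
    end: float
    kind: str     # "long", "medium", or "close-up"

def shots_for_event(shots, event_time, lookahead=8.0):
    """Collect shots relevant to an event, including delayed close-ups.

    Close-up shots (faces, jersey numbers) often arrive a few seconds
    after the action, so we also search a window after the event.
    The lookahead value is an illustrative assumption.
    """
    relevant = []
    for shot in shots:
        overlaps_event = shot.start <= event_time <= shot.end
        delayed_closeup = (
            shot.kind == "close-up"
            and event_time < shot.start <= event_time + lookahead
        )
        if overlaps_event or delayed_closeup:
            relevant.append(shot)
    return relevant

shots = [
    Shot(100.0, 104.0, "long"),      # the goal itself, wide angle
    Shot(104.5, 107.0, "close-up"),  # scorer's face, 4.5 s later
    Shot(120.0, 125.0, "medium"),    # unrelated later shot
]
print(shots_for_event(shots, event_time=102.0))
# returns both the long shot and the delayed close-up
```

A purely frame-aligned system would keep only the long shot and discard the one view in which the scorer is actually identifiable.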
With entities correctly identified, stage two injects the knowledge that transforms description into true commentary.
The system mirrors how human commentators work, maintaining an internal database of match state and querying external statistics on demand. Commentary structure shifts from pure description—which human broadcasts keep below 50%—toward explanation and comment. For goals, external statistical accuracy reaches 81.8%, while internal context accuracy hits 98.76%, enabling references like "his second yellow card" or "their first corner in 20 minutes."
This example illustrates the compositional reasoning process in action. The system identifies the shot type, recognizes visual elements like jersey numbers or faces, considers which players are currently on the field, examines the event timeline, and synthesizes these cues step-by-step to assign the correct entity. This multi-step inference, trained through supervised fine-tuning and reinforcement learning, achieves 71.1% player alignment accuracy—an 18.5% improvement over the leading proprietary model.
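The cue-by-cue narrowing described above can be caricatured with hand-written rules. The paper's system performs this inference with learned chain-of-thought reasoning, not a rule list; the function, player names, and cue ordering below are illustrative assumptions.

```python
def align_entity(shot_kind, jersey_number, on_field, timeline):
    """Toy compositional inference: narrow candidates one cue at a time.

    on_field maps player name -> jersey number; timeline is a list of
    (player, event) pairs in match order. Returns the chosen player
    and a human-readable reasoning trace.
    """
    trace = [f"shot type: {shot_kind}"]
    # Cue 1: only players currently on the field are candidates.
    candidates = set(on_field)
    trace.append(f"{len(candidates)} players on the field")
    # Cue 2: a visible jersey number (close-up/medium shots) pins it down.
    if jersey_number is not None:
        candidates = {p for p in candidates if on_field[p] == jersey_number}
        trace.append(f"jersey #{jersey_number} narrows to {sorted(candidates)}")
    # Cue 3: players involved in recent timeline events are more likely.
    recent = [p for p, _ in timeline[-3:] if p in candidates]
    player = recent[-1] if recent else (sorted(candidates)[0] if candidates else None)
    trace.append(f"resolved to {player}")
    return player, trace

on_field = {"Martinez": 9, "Silva": 10, "Costa": 7}
timeline = [("Silva", "pass"), ("Martinez", "shot")]
player, trace = align_entity("close-up", 9, on_field, timeline)
print(player)  # Martinez
```

Each cue alone is ambiguous; it is the composition, visual evidence intersected with roster and timeline context, that yields a confident identification, which is what the supervised fine-tuning and reinforcement learning stages teach the model to do.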
GameSight demonstrates that generating commentary with the informativeness and precision of human broadcasters requires solving visual reasoning and knowledge integration as distinct, coupled problems. When AI can see the player, know the context, and retrieve the statistics, it begins to comment like it's watching the game. Visit EmergentMind.com to explore this research further and create your own video presentations.