
ShotQA: Cinematic Language Dataset

Updated 1 July 2025
  • ShotQA is a multimodal dataset comprising 70,000 QA pairs from acclaimed films that systematically cover eight key cinematic dimensions.
  • The dataset’s design emphasizes expert cinematic reasoning by requiring models to distinguish subtle visual and narrative film techniques.
  • ShotQA underpins advanced models like ShotVL, which achieve significant performance gains in cinematic understanding benchmarks.

ShotQA is a large-scale multimodal dataset and evaluation resource constructed specifically to advance expert-level cinematic language understanding in artificial intelligence. As the training foundation for the ShotVL model, ShotQA is the first dedicated corpus designed to probe and teach vision-language models (VLMs) the nuanced compositional, spatial, and stylistic grammar of professional film.

1. Definition, Structure, and Objectives

ShotQA comprises approximately 70,000 high-quality multiple-choice question-answer (QA) pairs based on 58,140 images and 1,200 video clips extracted from 243 critically acclaimed, predominantly Oscar-nominated films. Each QA pair is anchored in a key domain of cinematography, covering a diverse and representative set of professional film language concepts. Every entry includes corresponding metadata such as film title and source timestamp, enabling precise alignment and structured downstream evaluation.
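For illustration, the following Python sketch shows one way such an entry could be represented downstream; the field names and example values are assumptions for this sketch, not the official ShotQA schema.

```python
# A minimal sketch (assumed field names, not the official ShotQA schema) of how
# a single multiple-choice QA entry with its metadata could be represented.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ShotQAEntry:
    media_path: str                 # path to the source image or video clip
    film_title: str                 # metadata: originating film
    timestamp: Optional[str]        # metadata: source timestamp within the film
    dimension: str                  # one of the eight cinematographic dimensions
    question: str                   # multiple-choice question text
    choices: List[str] = field(default_factory=list)
    answer_index: int = 0           # index of the correct choice

# Hypothetical example probing the "Shot Size" dimension.
example = ShotQAEntry(
    media_path="frames/film_0042/shot_0137.jpg",
    film_title="Example Film",
    timestamp="00:42:13",
    dimension="Shot Size",
    question="What is the shot size of this frame?",
    choices=["Wide shot", "Medium shot", "Medium close-up", "Close-up"],
    answer_index=2,
)
```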

The primary objective of ShotQA is to address the deficit in existing VLM datasets and evaluation protocols relating to expert cinematic reasoning. Its design aims to facilitate models and systems capable not just of object recognition or generic scene understanding, but of analyzing, describing, and leveraging the sophisticated visual and narrative conventions of cinema.

2. Cinematographic Dimensions and Design Principles

ShotQA and its associated benchmark, ShotBench, are organized around eight mutually complementary dimensions of film language recognized by professional practitioners:

  1. Shot Size (e.g., wide, medium, close-up)
  2. Shot Framing (e.g., single, group, over-the-shoulder)
  3. Camera Angle (e.g., high angle, low angle, aerial, Dutch angle)
  4. Lens Size (e.g., wide, ultra-wide/fisheye, long lens, medium)
  5. Lighting Type (e.g., daylight, mixed light, firelight, artificial)
  6. Lighting Condition (e.g., backlight, high contrast, silhouette)
  7. Composition (e.g., centered, left-heavy, short side, symmetrical)
  8. Camera Movement (e.g., push in, pull out, pan/tilt, dolly, zoom)

Each QA pair directly probes one of these dimensions, requiring both fine-grained visual perception and the accurate application of professional terminology. The dataset's question templates and answers are carefully constructed to ensure a balanced and challenging coverage across these facets, with particular attention to fine distinctions (e.g., the difference between "medium shot" and "medium close-up").
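As a small illustration of the balanced coverage described above, the sketch below counts QA pairs per dimension; it assumes entries are plain dictionaries with a "dimension" key and is not project code.

```python
# A minimal sketch (assumed entry format) of checking per-dimension coverage
# balance over a collection of ShotQA-style entries.
from collections import Counter

# The eight cinematographic dimensions covered by ShotQA / ShotBench.
DIMENSIONS = [
    "Shot Size", "Shot Framing", "Camera Angle", "Lens Size",
    "Lighting Type", "Lighting Condition", "Composition", "Camera Movement",
]

def dimension_coverage(entries):
    """Count QA pairs per dimension; entries are dicts with a 'dimension' key."""
    counts = Counter(e["dimension"] for e in entries)
    return {dim: counts.get(dim, 0) for dim in DIMENSIONS}

# Example usage with two hypothetical entries:
print(dimension_coverage([{"dimension": "Shot Size"}, {"dimension": "Composition"}]))
```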

3. Empirical Evaluation: Limitations of Contemporary VLMs

Extensive benchmarking with ShotBench, leveraging these dimensions and a subset of ~3.5k expert-annotated QAs, revealed substantial deficiencies in the cinematic reasoning capacities of current leading VLMs:

  • The highest average accuracy attained by any model (GPT-4o) was 59.3%, with most open-source and proprietary models scoring ~50% or lower.
  • Models consistently struggled to distinguish visually similar but technically distinct categories (e.g., "pull out" versus "zoom out" camera movement), and often confused subtle spatial cues required for expert compositional analysis.
  • Over half of tested models failed to outperform random guessing (25%) on the most challenging dimensions.
  • Larger models generally performed better within model series, indicating that greater capacity alone is insufficient without domain-specific data.

A key finding was a lack of robust alignment between visual representations and expert film language, limiting both accuracy and reliability for professional-level cinematic analysis tasks.
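The following sketch (assumed result format; not the ShotBench evaluation code) illustrates how per-dimension accuracy can be computed and compared against the 25% random-guessing baseline cited above.

```python
# A minimal sketch (assumed data format) of computing per-dimension accuracy on
# ShotBench-style multiple-choice results and flagging dimensions where a model
# fails to beat the random-guessing baseline.
from collections import defaultdict

CHANCE_LEVEL = 0.25  # random-guessing baseline cited in the evaluation above

def per_dimension_accuracy(results):
    """results: iterable of dicts with 'dimension', 'predicted', and 'answer' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in results:
        total[r["dimension"]] += 1
        correct[r["dimension"]] += int(r["predicted"] == r["answer"])
    return {
        dim: {
            "accuracy": correct[dim] / total[dim],
            "beats_chance": correct[dim] / total[dim] > CHANCE_LEVEL,
        }
        for dim in total
    }
```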

4. Model Development: ShotVL and Supervised Cinematic Training

ShotQA serves as the central training resource for ShotVL, an open-source VLM achieving state-of-the-art results on cinematic understanding tasks. The ShotVL training regime consists of two principal steps:

  1. Supervised Fine-Tuning (SFT):
    • The base model (Qwen2.5-VL-3B-Instruct) is trained with cross-entropy loss over the predicted answer choice, using the full ~70k ShotQA QA pairs as input.
  2. Group Relative Policy Optimization (GRPO):

    • ShotVL undergoes reinforcement learning on a focused subset of ~8k QA pairs.
    • For each multimodal input $x$, multiple answer samples $\{o_1, \dots, o_G\}$ are generated; rewards $r(o, x)$ are binary (1 if correct, 0 otherwise).
    • The relative advantage $A_i$ for each output is

      $$A_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\}) + \delta}$$

    • The GRPO loss is

      $$\mathcal{L}_{\text{GRPO}} = \frac{1}{G} \sum_{i=1}^{G} \min\!\left( \frac{\pi_{\theta}(o_i \mid x)}{\pi_{\theta_{\text{old}}}(o_i \mid x)} A_i,\; \text{clip}\!\left( \frac{\pi_{\theta}(o_i \mid x)}{\pi_{\theta_{\text{old}}}(o_i \mid x)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right)$$

      where $\epsilon = 0.2$ and $\delta$ is a small constant added for numerical stability.
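As a rough illustration only (not the authors' training code), the PyTorch sketch below computes the group-relative advantage and the clipped surrogate defined above for one group of $G$ sampled answers; the tensor names and the value of $\delta$ are assumptions.

```python
# A minimal PyTorch sketch of the GRPO update described above (not the authors'
# implementation). Assumes one group of G > 1 sampled answers for a single input.
import torch

def grpo_loss(logp_new, logp_old, rewards, epsilon=0.2, delta=1e-6):
    """Clipped GRPO objective for one group of sampled answers.

    logp_new, logp_old: tensors of shape (G,) holding log pi_theta(o_i | x) and
                        log pi_theta_old(o_i | x) for each sampled answer.
    rewards:            float tensor of shape (G,) with binary correctness rewards.
    """
    # Group-relative advantage: standardize rewards within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + delta)

    # Importance ratio between the current and old policy.
    ratio = torch.exp(logp_new - logp_old.detach())

    # Clipped surrogate objective; negated so it can be minimized with standard
    # gradient descent (the surrogate itself is maximized).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```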

ShotVL outputs structured answers with explicit reasoning steps, wrapping the reasoning and the final choice in dedicated tags (the final choice appears in an <answer>...</answer> span), which further supports interpretability aligned with cinematic conventions.
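A simple way to consume such structured outputs is to extract the final tagged answer. The sketch below assumes the <answer>...</answer> tag shown above and is not part of the released codebase.

```python
# A minimal sketch (assumed tag format) of extracting the final choice from a
# ShotVL-style structured response ending in an <answer>...</answer> span.
import re

ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(response: str):
    """Return the text inside the last <answer>...</answer> span, if present."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].strip() if matches else None

# Example usage with a hypothetical model output:
print(extract_answer("The framing is tight on the face. <answer>Close-up</answer>"))
# -> "Close-up"
```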

5. Performance Gains and Implications

ShotVL, trained on ShotQA, achieves 65.1% average accuracy on the ShotBench benchmark, a +19 point improvement over its Qwen2.5-VL-3B-Instruct baseline, and notably surpasses both GPT-4o and Qwen2.5-VL-72B-Instruct despite a much smaller parameter count. This performance reflects consistent advances across all eight cinematic dimensions.

These results establish that open, expert-annotated, domain-specific data, combined with focused supervised and reinforcement training, yields significantly enhanced VLM perceptual reasoning for cinematic tasks and outperforms much larger models trained on generic data. This is the first demonstration, within a structured evaluation, of fine-tuned VLMs meeting and exceeding expert-defined film-language understanding requirements.

6. Open-Source Release and Research Facilitation

ShotQA, ShotBench, and ShotVL are made fully open-source to accelerate progress in this area. These resources:

  • Enable standardized, expert-level evaluation for cinematic visual understanding and reasoning.
  • Lower the barrier to entry for academic and industrial research in fine-grained video analysis and AI-driven content generation.
  • Support reproducibility and fair comparison across approaches.
  • Provide a valuable educational foundation via detailed reference annotations for professional filmmaking knowledge.

The project resources are accessible at https://vchitect.github.io/ShotBench-project/.

Summary Table: Representative Model Results (ShotBench)

Model                        Avg. Acc.   #Params   Availability
ShotVL (Ours, Qwen2.5-3B)    65.1%       3B        Open-source
GPT-4o                       59.3%       -         Proprietary
Qwen2.5-VL-72B-Instruct      59.1%       72B       Open-source

Conclusion

ShotQA defines the first comprehensive, expert-focused corpus for training and evaluating vision-language models in cinematic understanding. By aligning data and evaluation with professional film analysis, ShotQA, together with ShotVL and ShotBench, enables substantial, measurable progress in fine-grained, domain-critical visual reasoning while fostering open, reproducible research and practical cinematic AI development.