SVBench: Evaluation of Video Generation Models on Social Reasoning
Abstract: Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
Explain it Like I'm 14
Overview
This document is not a typical research paper. It’s a set of instructions (a template) that tells authors how to write a short “author response” or “rebuttal” to reviewers after their paper is reviewed. Think of it like a one-page letter to the judges, where you can correct misunderstandings and answer questions—without adding new, unrequested work.
Key Objectives and Questions
The template aims to answer simple, practical questions for authors:
- What is allowed in a rebuttal and what is not?
- How long can the rebuttal be?
- How should the rebuttal be formatted (fonts, columns, margins, figures, references)?
- How can authors keep the review process fair (for example, by staying anonymous)?
Methods or Approach
Instead of running experiments, this document sets rules and gives a ready-to-use format in LaTeX (a software tool many researchers use to write papers neatly and consistently).
Here are some key terms explained in everyday language:
- LaTeX: A tool that helps you write documents in a clean, professional style—like a more advanced version of a word processor.
- Rebuttal: A short response to reviewers where you correct factual mistakes or provide requested clarifications.
- Two-column format: The page is split into two narrow text columns, like a magazine or newspaper. This saves space and makes reading easier.
- Margins: The blank edges around the text. These must be kept at the specified size so nobody squeezes in extra text.
- Figures and tables: Pictures, graphs, or data boxes. They should be readable even when printed, not just when zoomed in on a screen.
- Numbered equations: If you show any math formulas, you add a number to each so people can point to them easily.
- Anonymity: Don’t include anything that reveals who the authors are. It’s like a blind audition so reviewers judge only the content.
- includegraphics: A LaTeX command that adds images to the document.
Main Points and Why They Matter
Below are the most important rules the template sets, explained simply:
- Keep it short: The rebuttal must be no longer than one page, including any figures and references. If it’s too long or the formatting is stretched to fit more, it won’t be reviewed.
- Focus on clarifications: Use the rebuttal to fix factual errors and answer reviewer questions. Don’t add brand-new ideas, big new experiments, or new theorems that weren’t in your original paper—unless reviewers specifically asked for them.
- Stay anonymous: Don’t include external links or information that could reveal your identity or sidestep the page limit.
- Use the given format:
  - Two columns, specific margins, and 10-point Times (or Times Roman) font for the main text.
  - Numbered equations.
  - Figures and tables with 9-point captions that match the document style.
  - Indent paragraphs slightly.
- Make visuals readable: Center graphics, use font sizes and line thickness that are clear on a printed page, and don’t rely on zooming to see tiny details.
- Reference properly: List references at the end in 9-point font, numbered, and cited with square brackets in the text (like [12]).
- Keep numbering distinct: If you refer to figures or equations, make sure their numbering doesn’t clash with your main paper, so reviewers aren’t confused.
- Don’t add new experiments unless asked: A community rule says reviewers shouldn’t demand big new experiments for the rebuttal, and authors shouldn’t include them unless requested.
These rules matter because they:
- Keep the review process fair (everyone follows the same limits and stays anonymous).
- Make responses easy to read and compare.
- Prevent authors from overwhelming reviewers with new, unreviewed material.
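The numeric rules listed above lend themselves to a mechanical check. The following is a minimal Python sketch, not part of the template: the names `RebuttalRules`, `RebuttalFacts`, and `check_rebuttal` are hypothetical, and it assumes some other tool has already measured the page count, column count, font sizes, and number of external links in the rebuttal.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RebuttalRules:
    """Limits taken from the summary above; the field names are hypothetical."""
    max_pages: int = 1
    columns: int = 2
    body_font_pt: float = 10.0
    caption_font_pt: float = 9.0
    reference_font_pt: float = 9.0
    allow_external_links: bool = False


@dataclass(frozen=True)
class RebuttalFacts:
    """Properties measured from a rebuttal by an external tool (assumed)."""
    pages: int
    columns: int
    body_font_pt: float
    caption_font_pt: float
    reference_font_pt: float
    external_links: int


def check_rebuttal(facts: RebuttalFacts,
                   rules: RebuttalRules = RebuttalRules()) -> list[str]:
    """Return one message per rule that the measured rebuttal breaks."""
    problems = []
    if facts.pages > rules.max_pages:
        problems.append(f"too long: {facts.pages} pages, limit is {rules.max_pages}")
    if facts.columns != rules.columns:
        problems.append(f"expected {rules.columns} columns, found {facts.columns}")
    # Compare each measured font size with the required size, allowing
    # a small tolerance for floating-point measurement noise.
    for label, found, wanted in (
        ("body", facts.body_font_pt, rules.body_font_pt),
        ("caption", facts.caption_font_pt, rules.caption_font_pt),
        ("reference", facts.reference_font_pt, rules.reference_font_pt),
    ):
        if abs(found - wanted) > 0.01:
            problems.append(f"{label} font is {found}pt, expected {wanted}pt")
    if facts.external_links > 0 and not rules.allow_external_links:
        problems.append(f"{facts.external_links} external link(s) found")
    return problems
```

A compliant rebuttal yields an empty list, so a caller can treat any non-empty result as a reason to revise before submitting. Margins and paragraph indentation are left out because the summary gives no numeric values for them.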
Implications and Impact
By following this template, authors can make clear, honest, and fair rebuttals that help reviewers understand the paper better without changing the game at the last minute. This leads to more consistent decisions, less confusion, and a smoother review process for everyone. In short, it’s a recipe for a respectful and efficient conversation between authors and reviewers.
Knowledge Gaps
Unresolved gaps and open questions
The paper provides high-level formatting and scope guidance for author rebuttals but leaves several practical, procedural, and edge-case questions unanswered. Future work could clarify the following points:
- Precise policy boundaries on “new contributions”: What constitutes a “new contribution” versus acceptable “additional information,” especially for minor experiments, ablations, error analyses, or updated proofs requested ambiguously by reviewers.
- Handling reviewer requests that violate the “no significant new experiments” motion: What authors should do when reviewers request substantial experiments—escalation process, how to document such requests in the rebuttal, and how ACs will adjudicate.
- Permissibility of adding new results derived from existing data: Whether simple recomputations, corrected metrics, or re-measured statistics from the original submission are allowed if not explicitly requested.
- Scope of comparison tables: Whether new comparison tables involving results from other papers not present in the original submission are allowed, and under what conditions (e.g., must be strictly drawn from already-cited sources).
- Use of external links: Clear rules on linking to code, demos, prior work, or artifacts (e.g., anonymous repositories) that do not reveal identity and do not circumvent page limits, including whether DOIs or arXiv links are permitted in references.
- Anonymity in references: Whether and how to cite the authors’ own prior work in an anonymized rebuttal, and if any special masking or third-person phrasing is required.
- Conflict resolution across reviewers: Guidance for responding when reviewers’ requests conflict (e.g., one requests new experiments, another forbids them), including recommended structure and prioritization.
- Best-practice structure beyond formatting: Recommended organization (e.g., point-by-point responses keyed to reviewer IDs), tone, and strategies for concisely addressing multiple detailed comments within one page.
- Clarification on figures: Minimum resolution/dpi, acceptable formats (PDF/PNG/JPG), color usage and print legibility standards (e.g., colorblind-safe palettes), and whether vector graphics are required.
- Font and layout constraints: Explicit enforcement checks (e.g., font embedding, font sizes, line spacing) and what “significantly altered” margins/formatting means (with measurable thresholds).
- A4 versus Letter specifics: Complete margin requirements for A4 (top margin, side margins), and whether the template auto-adjusts or if authors must switch geometry parameters manually.
- Equation numbering overlap “workaround”: The text references a LaTeX workaround without detailing it; provide explicit methods (e.g., prefixing counters or using separate numbering sequences) in the document.
- Referencing figures/tables/equations from the main paper: Whether it is acceptable to refer to main-paper items directly (e.g., “Figure 1 in the submission”), or if all references must be self-contained within the rebuttal.
- Page limit inclusions: Clarify whether footnotes, acknowledgments, and any author IDs/metadata count toward the one-page limit, and whether a title header or reviewer IDs are part of that limit.
- Inclusion of proofs: The text suggests proofs may be added; specify constraints on length, novelty, and whether new lemmas are allowed if they clarify existing results without introducing new claims.
- Use of smaller fonts: The template sets captions/references smaller; clarify whether authors may further reduce font sizes for figures/tables to fit content, or if this is prohibited.
- Submission logistics: Missing details on file requirements (PDF version, size limits, font embedding), submission portal steps, naming conventions, and deadlines; include a checklist to reduce technical rejections.
- Citations and style enforcement: Specify allowed citation styles (BibTeX vs. manual), whether URLs are permitted in references, and requirements for numbered vs. author-year citations to ensure consistency.
- Accessibility requirements: Any expectations for accessible rebuttals (e.g., alt text for figures, contrast ratios, readable line widths) to support diverse reviewer needs.
- Guidance on addressing known errors: Whether it is acceptable to acknowledge and correct factual errors from the original submission within the rebuttal, and how to document the impact without running afoul of “no new contributions.”
- Limits on the number/size of figures and tables: Provide concrete caps (e.g., max one figure or table, max width/height) to prevent formatting abuse while ensuring clarity.
- Clarification of “maintain anonymity” in practice: How to handle citations that might indirectly reveal identity (e.g., unique datasets or tools), and whether anonymized supplementary materials can be referenced by identifier.
- Template completeness and examples: Provide a complete minimal rebuttal example (with sections, equations, figures, and references) that compiles, including the “paper ID” field referenced but not shown in the text.
- Policy currency and applicability: The “2018 PAMI-TC motion” is cited—clarify whether this policy is universally applicable, updated for current conferences, and how it interacts with each venue’s specific rebuttal rules.
- Printing and legibility standards: Offer concrete numeric guidance for figure font sizes, line thicknesses, and minimum feature sizes to ensure readability when printed at 100% scale.
Glossary
- A4 paper: An international standard paper size measuring 210 mm × 297 mm, commonly used outside North America. "for A4 paper, approximately 1 5/8 inches (4.13 cm) from the bottom edge of the page."
- agentic capabilities: AI system features that enable autonomous, goal-directed planning and acting like an agent. "next generation agentic capabilities"
- autoregressive: A modeling approach where each output step conditions on previous outputs in a sequence. "Autoregressive Video Diffusion"
- benchmark suite: A standardized collection of tests and datasets used to evaluate and compare models. "Comprehensive benchmark suite for video generative models"
- cref (LaTeX command): A LaTeX macro (from the cleveref package) for automatic, context-aware cross-references. "as in \cref{fig:onecol}"
- egocentric video: Video captured from a first-person viewpoint showing the wearer's perspective. "egocentric video"
- false belief: A concept in cognitive science describing the understanding that others can hold beliefs that are incorrect. "false belief"
- FVD: Frechet Video Distance, a metric for assessing the quality of generated videos by comparing distributions of features. "FVD: A new metric for video generation"
- includegraphics (LaTeX command): A LaTeX command for inserting external images into a document. "\includegraphics[width=0.8\linewidth]"
- intrinsic faithfulness: The degree to which generated content adheres to intended or source semantics without external aids. "intrinsic faithfulness"
- joint action: Coordinated actions performed by two or more individuals to achieve a shared goal. "joint action and interpersonal coordination"
- LaTeX: A document preparation system widely used for academic and technical typesetting. "See \LaTeX\ template for a workaround."
- latent diffusion: A generative modeling technique that performs diffusion processes in a compressed latent space for efficiency. "Realtime video latent diffusion"
- linewidth (LaTeX length): A LaTeX length representing the width of the current line, commonly used to scale figures. "\includegraphics[width=0.8\linewidth]"
- Machiavellian intelligence: The hypothesis that complex social maneuvering drove the evolution of advanced cognition. "Machiavellian intelligence"
- mentalizing: Inferring and representing others’ mental states such as beliefs, desires, and intentions. "The neural basis of mentalizing"
- normative structure: The system of social rules and expectations that guide behavior within a context. "the normative structure of joint pretend games"
- optical flow: The pattern of apparent motion of objects between video frames used to estimate movement. "optical flow-guided frame prediction"
- PAMI-TC: The IEEE Pattern Analysis and Machine Intelligence Technical Committee, which sets community guidelines and policies. "Per a passed 2018 PAMI-TC motion"
- pedestrian dynamics: The study and modeling of how pedestrians move and interact in crowds. "pedestrian dynamics"
- pica: A typographic unit equal to 12 points, used to measure lengths in typesetting. "All paragraphs should be indented 1 pica (approx. 1/6 inch or 0.422 cm)."
- Roman type: Upright serif typeface style used in body text (as opposed to italic or sans-serif). "Figure and table captions should be 9-point Roman type"
- social force model: A mathematical model that represents pedestrian movement as resulting from social and physical forces. "Social force model for pedestrian dynamics"
- text-to-video generation: Automatically creating videos from textual descriptions using generative models. "Text-to-video generation without text-video data"
- theory of mind: The capacity to attribute mental states to oneself and others and understand they may differ. "Theory of mind may have spontaneously emerged in LLMs"
- two-column format: A document layout in which text is arranged in two vertical columns per page. "All text must be in a two-column format."
- video generative models: Machine learning models that synthesize video content, often conditioned on text or images. "video generative models"
- visual percepts: The immediate contents of what is perceived visually by an observer. "the visual percepts of others"
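The metric lengths quoted in the glossary entries for "pica" and "A4 paper" can be checked with a short unit conversion. This is a sketch for illustration only: the helper names are hypothetical, and it assumes TeX's definition of the point (72.27 points per inch), which is a property of TeX rather than something stated in the template text.

```python
TEX_POINTS_PER_INCH = 72.27  # TeX's point; an assumption about the template's units
POINTS_PER_PICA = 12         # as stated in the glossary entry for "pica"
CM_PER_INCH = 2.54


def pica_to_inch(picas: float) -> float:
    """Convert picas to inches using TeX points."""
    return picas * POINTS_PER_PICA / TEX_POINTS_PER_INCH


def pica_to_cm(picas: float) -> float:
    """Convert picas to centimetres using TeX points."""
    return pica_to_inch(picas) * CM_PER_INCH


def cm_to_inch(cm: float) -> float:
    """Convert centimetres to inches."""
    return cm / CM_PER_INCH


print(round(pica_to_cm(1), 3))    # 0.422, the paragraph indent quoted above
print(round(pica_to_inch(1), 3))  # 0.166, roughly one sixth of an inch
print(round(cm_to_inch(4.13), 3)) # 1.626, roughly 1 5/8 inches
```

Under that assumption, the 0.422 cm indent and the 4.13 cm A4 distance both agree with the definitions in the glossary.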
Practical Applications
The paper is a style and policy guide for one-page, anonymous author rebuttals in peer-review (e.g., CV/CVPR-style) workflows. Its practical value lies in standardizing formatting, scope, and policy compliance. Below are actionable applications derived from these guidelines.
Immediate Applications
The following can be deployed with current tooling and minimal process changes.
- Rebuttal formatting compliance checker (software, academic publishing)
  - What: Automated PDF/LaTeX linter to verify page limits (including figures/references), margins, column widths, fonts, centered graphics, equation numbering, and caption sizes.
  - Tools/Products/Workflows: Overleaf/VS Code extension; GitHub Action/CI; submission “preflight” validator.
  - Assumptions/Dependencies: Reliable PDF parsing or access to .tex; buy-in from authors/editors.
- Submission-portal preflight gate (academic publishing, policy)
  - What: Integrated check in OpenReview/CMT/Editorial Manager that blocks non-compliant rebuttals and provides actionable diagnostics.
  - Tools/Products/Workflows: “RebuttalCheck” API integrated with submission sites.
  - Assumptions/Dependencies: Program chairs’ approval; submission-platform APIs.
- Anonymity and external-link leakage scanner (academic publishing, software, policy)
  - What: NLP-based detector that flags names, affiliations, linkouts, or metadata that could deanonymize authors or circumvent length limits.
  - Tools/Products/Workflows: Privacy-preserving text and PDF metadata scanner; redaction suggestions.
  - Assumptions/Dependencies: Accurate PII/link detection; alignment with double-blind policies.
- Citation/figure/equation numbering disambiguator (software)
  - What: LaTeX package/script to ensure rebuttal numbering does not overlap with the main paper (e.g., auto-prefixing refs/figs).
  - Tools/Products/Workflows: Bib/aux isolation script; class/package option in the provided template.
  - Assumptions/Dependencies: Authors use the official template; simple build integration.
- Reviewer policy compliance monitor (policy, academic publishing)
  - What: Classifier that flags reviewer requests for “significant additional experiments” during rebuttal (contrary to stated policy), assisting ACs/PCs.
  - Tools/Products/Workflows: Review-text monitor and dashboard for area chairs; soft notifications to reviewers.
  - Assumptions/Dependencies: Access to review text; precision to avoid over-flagging; reviewer consent.
- One-page rebuttal writing assistant (education, software)
  - What: LLM-assisted drafting that focuses on rebutting factual errors, answering requested clarifications, and avoiding new, unsolicited contributions; dynamic length control.
  - Tools/Products/Workflows: Overleaf/Docs plug-in; structured prompt templates; “tighten-to-fit” rewrites.
  - Assumptions/Dependencies: Data privacy; opt-in use; domain-tuned writing prompts.
- Figure normalization utility (software)
  - What: Batch tool to harmonize figure font sizes (e.g., 9pt Roman), line widths, and print readability; auto-center and width conformance (e.g., 0.8×linewidth).
  - Tools/Products/Workflows: Inkscape/Matplotlib/LaTeX pipeline scripts; preflight “figure audit.”
  - Assumptions/Dependencies: Access to vector assets preferred; consistent figure sources.
- Lab/team rebuttal workflow and checklist (academia, daily life/skills training)
  - What: Rebuttal sprints with checklists for scope control, anonymity, numbering, figure readability, and length; internal mock review and preflight checks.
  - Tools/Products/Workflows: Shared checklists; templated timelines; “rebuttal captain” role.
  - Assumptions/Dependencies: Team adoption; minimal training; use of official template.
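As an illustration of the anonymity and external-link scanner described in the list above, here is a minimal Python sketch. It is a deliberately simple stand-in: it uses two regular expressions rather than the NLP-based detector the text envisions, it only looks for URLs and email addresses, and the names `PATTERNS` and `scan_rebuttal` are hypothetical.

```python
from __future__ import annotations

import re

# Simple patterns chosen for illustration; a real scanner would need many more
# cues (names, affiliations, PDF metadata) than these two.
PATTERNS = (
    ("url", re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)),
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
)


def scan_rebuttal(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, kind, matched_text) for each URL or email found."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS:
            for match in pattern.finditer(line):
                # Drop sentence punctuation that trails the match.
                findings.append((number, kind, match.group(0).rstrip(".,;)")))
    return findings


sample = (
    "We thank R2 for the careful reading.\n"
    "Code: https://example.org/repo.\n"
    "Contact first.last@example.edu\n"
)
for finding in scan_rebuttal(sample):
    print(finding)
# (2, 'url', 'https://example.org/repo')
# (3, 'email', 'first.last@example.edu')
```

Reporting line numbers alongside each match keeps the output actionable, which fits the "redaction suggestions" workflow mentioned in the list. Regular expressions like these will miss indirect identity cues, so the result is best read as a first-pass filter.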
Long-Term Applications
The following require broader adoption, standardization, or additional R&D.
- Machine-readable rebuttal policy standard (software, policy)
  - What: A common schema (e.g., “RebuttalML”) encoding length, anonymity, formatting, and content rules to enable fully automated compliance across venues.
  - Tools/Products/Workflows: Cross-venue policy registry; validator SDKs.
  - Assumptions/Dependencies: Community consensus; standards body or consortium.
- Structured rebuttal editor inside submission portals (software, academic publishing)
  - What: WYSIWYG editor with real-time page rendering, hard length caps, formatting locks, and live compliance hints; PDFless submission.
  - Tools/Products/Workflows: Portal-native editor; server-side LaTeX rendering.
  - Assumptions/Dependencies: Platform development; performance and UX at scale.
- Policy-aware reviewer copilot (policy, software)
  - What: Reviewer assistant that nudges against policy-violating requests (e.g., significant new experiments), suggests constructive queries, and highlights fairness issues.
  - Tools/Products/Workflows: Contextual LLM plug-in within review UIs; AC oversight dashboard.
  - Assumptions/Dependencies: Cultural acceptance; accurate policy grounding; transparency.
- PDF forensics and watermarking for template integrity (software, security)
  - What: Robust detection of margin tampering, hidden whitespace hacks, or font spoofing; watermark indicating unmodified template compliance.
  - Tools/Products/Workflows: PDF forensics engine; template-integrity seal.
  - Assumptions/Dependencies: Low false positives; resistance to adversarial edits.
- Cross-venue analytics for fairness and workload (policy, academia)
  - What: Aggregate metrics on rebuttal length, outcomes, reviewer load, and policy adherence to inform evidence-based policy updates.
  - Tools/Products/Workflows: Privacy-preserving analytics pipelines and dashboards.
  - Assumptions/Dependencies: Data sharing agreements; ethical governance.
- Accessibility-first rebuttal templates (academic publishing, software)
  - What: Built-in checks for print/readability and accessibility (e.g., color contrast in figures, alternative text support), adapted to visually diverse needs.
  - Tools/Products/Workflows: Accessibility linting; figure color-blindness simulators.
  - Assumptions/Dependencies: Consensus on accessibility standards; tooling integration.
- Multilingual, compliance-preserving translation (software, academia)
  - What: High-fidelity translation workflows that retain formatting, anonymity, length, and numbering constraints across languages.
  - Tools/Products/Workflows: Template-aware MT; post-editing interfaces with live compliance counters.
  - Assumptions/Dependencies: Domain-tuned MT; careful handling of name/locale cues.
- Enterprise adaptation of rebuttal practices (industry, education)
  - What: Applying the one-page, policy-bounded rebuttal format to internal R&D reviews, design critiques, compliance responses, and incident postmortem debates to improve clarity and speed.
  - Tools/Products/Workflows: Internal templates; coaching modules; automated compliance checks.
  - Assumptions/Dependencies: Change management; alignment with corporate review norms.
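To make the idea of a machine-readable rebuttal policy from the list above more concrete, the sketch below shows one way a venue's rules might be written as JSON and validated on load. No such standard exists yet, so everything here is an assumption: the field names, the `REQUIRED_FIELDS` table, and the `load_policy` helper are hypothetical, and the example values simply restate the limits summarized earlier on this page.

```python
import json

# Hypothetical field names; the text proposes a schema but does not define one.
REQUIRED_FIELDS = {
    "max_pages": int,
    "columns": int,
    "body_font_pt": (int, float),
    "anonymous": bool,
    "allow_external_links": bool,
}


def load_policy(raw: str) -> dict:
    """Parse a JSON policy and verify each required field has the expected type."""
    policy = json.loads(raw)
    if not isinstance(policy, dict):
        raise ValueError("policy must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in policy:
            raise ValueError(f"missing field: {name}")
        value = policy[name]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and expected is not bool:
            raise ValueError(f"wrong type for field: {name}")
        if not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
    return policy


EXAMPLE_POLICY = """
{
  "max_pages": 1,
  "columns": 2,
  "body_font_pt": 10,
  "anonymous": true,
  "allow_external_links": false
}
"""

policy = load_policy(EXAMPLE_POLICY)
print(policy["max_pages"], policy["allow_external_links"])  # 1 False
```

Failing fast on a missing or mistyped field is one possible design choice: it lets a validator reject an incomplete policy before any rebuttal is checked against it.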