CGBench: LM Reasoning in Genetics & Video
- CGBench is a comprehensive evaluation framework designed for assessing language models in both clinical genetics curation and long-video reasoning.
- It features tasks like evidence scoring, verification, and gene curation with specific metrics (e.g., Precision@5, F1 scores) to quantify model performance.
- In its video instantiation, CGBench is used to evaluate long-context frameworks whose adaptive token reduction improves QA accuracy and temporal localization, supporting robust multimodal evaluation.
CGBench is a name shared by multiple benchmarks and frameworks concerned with evaluation in computational genetics, video question answering, and high-performance graph-based scientific computing. At present, CGBench refers most prominently to a benchmark for assessing the scientific reasoning capabilities of LMs in clinical genetics, constructed from rigorously curated expert interpretations of the biomedical literature. In parallel, the term also designates a demanding large-scale VideoQA evaluation suite for long video understanding and temporal localization. Across these instantiations, CGBench embodies the principles of precise evidence extraction, structured assessment, and comparative analysis of advanced computational systems.
1. Overview and Motivation
The clinical genetics variant of CGBench was introduced to quantify the competence of generative LMs in automating the interpretation of variants and genes from primary biomedical literature. Traditionally, this process is manual, labor-intensive, and dependent on expert adherence to protocols such as those from ClinGen and ACMG/AMP. CGBench addresses the gap between narrowly defined machine-reading tasks and the real-world, multi-step reasoning required for clinical curation. Its significance lies in providing a robust, expert-validated foundation for evaluating LM performance directly on the literature and guidelines that comprise the daily workflow in translational genomics and medical diagnostics (Queen et al., 13 Oct 2025). In the context of video reasoning, CGBench is characterized as a grounded VideoQA benchmark, presenting challenges related to long-video temporal localization and multimodal evidence aggregation (Zhang et al., 30 May 2025).
2. Benchmark Architecture and Task Design
CGBench in clinical genetics is architected around three principal tasks, directly reflecting the workflows of ClinGen curators:
- VCI Evidence Scoring (E-Score): For each curation query tuple $q$, the LM must assign one or more evidence codes after reading the publication text $P$. Codes are drawn from a hierarchical set (primary, secondary, tertiary), with a structured textual explanation required for each assignment.
- VCI Evidence Verification (E-Ver): The LM assesses whether a specified evidence code $c$ is "met" or "not met" for a given query $q$ and publication $P$, outputting a binary label.
- GCI Experimental Evidence Extraction (Gene Curation): Given a gene-centric query tuple and a publication, the LM must extract structured experimental evidence as a (category, explanation, score, justification) record.
Each task utilizes the corresponding protocols and standard operating procedures (SOPs) that define clinical curation. Formally, the evidence scoring function can be written as

$$f_{\text{score}}(q, P) = \{(c_1, e_1), \ldots, (c_k, e_k)\}, \qquad c_i \in \mathcal{C},$$

where $q$ is the curation query, $P$ the publication text, $\mathcal{C}$ the hierarchical evidence-code set, and $e_i$ the structured explanation accompanying code $c_i$. Similar formalisms govern the verification and extraction sub-tasks, ensuring protocol fidelity and clear annotation boundaries.
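To make these task interfaces concrete, the minimal sketch below models the three I/O schemas as Python data structures. The class and field names (e.g., `CurationQuery`, `EvidenceAssignment`) are illustrative assumptions, not identifiers from the released benchmark.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical schema; the released benchmark may use different field names.

@dataclass
class CurationQuery:
    """Input tuple q: the entity under curation plus the publication text P."""
    entity_id: str          # variant or gene identifier
    condition: str          # disease/condition under curation
    publication_text: str   # full text of the cited publication

@dataclass
class EvidenceAssignment:
    """E-Score output: one hierarchical code plus its structured explanation."""
    code: str
    tier: Literal["primary", "secondary", "tertiary"]
    explanation: str

@dataclass
class VerificationResult:
    """E-Ver output: binary judgment for a specified evidence code."""
    code: str
    verdict: Literal["met", "not met"]

@dataclass
class ExperimentalEvidence:
    """GCI extraction output: (category, explanation, score, justification)."""
    category: str
    explanation: str
    score: float
    justification: str
```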
3. Evaluation Methodology and Metrics
CGBench implements evaluation criteria suited to the complexity of expert curation:
- Evidence Scoring is treated as a multilabel classification problem, evaluated with Precision@5, recall, and hierarchical stratification across primary, secondary, and tertiary codes. The best model achieves approximately 0.420 Precision@5 on tertiary codes, a recognized bottleneck (a computational sketch of these metrics follows this list).
- Evidence Verification employs true positive/negative rate and F1 score, noting that models systematically over-predict "met" relative to curated ground truth (positive rate ≈ 0.434).
- Experimental Extraction utilizes category-matching precision/recall, normalized mean absolute error (MAE) for evidence strength, and a strength metric that credits correct prediction of score deviations. Outputs are also checked for adherence to the structured output format.
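The headline metrics can be computed in a few lines of standard code. The following sketch assumes simple list/set representations of predicted and gold evidence codes and scores; it is not the benchmark's official scoring script.

```python
from collections.abc import Iterable, Sequence

def precision_at_k(predicted: Sequence[str], gold: Iterable[str], k: int = 5) -> float:
    """Fraction of the top-k predicted evidence codes found in the gold code set.
    Divides by k even when fewer than k codes are predicted (one common convention)."""
    gold_set = set(gold)
    top_k = list(predicted)[:k]
    return sum(code in gold_set for code in top_k) / k

def binary_f1(pred_met: Sequence[bool], gold_met: Sequence[bool]) -> float:
    """F1 score for the 'met' class in the evidence-verification task."""
    tp = sum(p and g for p, g in zip(pred_met, gold_met))
    fp = sum(p and not g for p, g in zip(pred_met, gold_met))
    fn = sum(g and not p for p, g in zip(pred_met, gold_met))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def normalized_mae(pred_scores: Sequence[float], gold_scores: Sequence[float],
                   score_range: float) -> float:
    """Mean absolute error on evidence-strength scores, scaled by the score range."""
    errors = [abs(p - g) for p, g in zip(pred_scores, gold_scores)]
    return (sum(errors) / len(errors)) / score_range
```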
Additionally, the LM explanations are compared against expert annotations using an LM-as-a-judge protocol. Agreement rates are measured under zero-shot and 30-shot (in-context) prompting conditions, with the latter achieving substantial improvements (e.g., GPT-4o agreement rises from ~48.6% to 70.4%).
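The LM-as-a-judge protocol can be sketched as a judge prompt with optional in-context exemplars (zero-shot when none are supplied, 30-shot when 30 are), followed by a simple agreement tally. The helper names and prompt wording below are hypothetical, not taken from the paper.

```python
from collections.abc import Sequence

def build_judge_prompt(lm_explanation: str, expert_annotation: str,
                       exemplars: Sequence[tuple[str, str, str]] = ()) -> str:
    """Assemble a judge prompt; exemplars are optional in-context
    (model_explanation, expert_annotation, verdict) triples for k-shot prompting."""
    shots = "\n\n".join(
        f"Model explanation: {m}\nExpert annotation: {e}\nAgree: {v}"
        for m, e, v in exemplars
    )
    question = (f"Model explanation: {lm_explanation}\n"
                f"Expert annotation: {expert_annotation}\n"
                "Do the two convey the same evidence interpretation? Answer Agree or Disagree.")
    return f"{shots}\n\n{question}" if shots else question

def agreement_rate(verdicts: Sequence[str]) -> float:
    """Fraction of judge verdicts marked 'Agree'."""
    return sum(v.strip().lower().startswith("agree") for v in verdicts) / len(verdicts)
```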
4. Empirical Model Comparisons
Eight LLMs, spanning closed-source (GPT-4o, Claude 3.7 Sonnet) and open-weight (Qwen2.5 72B, Llama 4) models as well as specialized reasoning models (DeepSeek R1, o3-mini, o4-mini), were benchmarked on CGBench:
- Evidence Scoring: Best outcomes are observed in reasoning-capable and larger models, especially for tertiary codes, but overall fine-grained performance is modest.
- Evidence Verification: GPT-4o attains highest F1 (≈0.634) but all models tend toward over-prediction.
- Experimental Extraction: Precision for category matching is highest in GPT-4o (≈0.493), while o4-mini achieves superior balance in evidence scoring. All models tend to over-extract evidence ("noise") compared to curated ground truth.
Qualitative assessment via LM-as-a-judge reveals frequent hallucinations and misinterpretations in LM-generated explanations, despite correct evidence classification, underscoring limitations in semantic alignment and interpretive nuance.
5. Video Reasoning Instantiation of CGBench
In video reasoning, CGBench serves as a test bed for large-scale VideoQA tasks with long temporal contexts. Frameworks such as SiLVR use CGBench to benchmark the integration of multisensory video inputs (visual, speech) transformed into dense language representations. Critically, an adaptive token reduction algorithm iteratively condenses the input so as to maximize information density within the LM context window. The process enhances both QA accuracy and temporal localization (mean Intersection over Union, mIoU), with SiLVR outperforming previous state-of-the-art baselines (6–7% improvement in mIoU, up to 51.8% QA accuracy). Ablation confirms that adaptive token reduction outpaces fixed-clip baselines by approximately 2.5% in accuracy (Zhang et al., 30 May 2025).
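A minimal sketch of the two quantitative ingredients named above, assuming clip-level captions and interval predictions: an iterative caption-merging loop standing in for adaptive token reduction, and temporal IoU as used for mIoU. The merging heuristic and helper signatures are assumptions; SiLVR's actual procedure may differ.

```python
from collections.abc import Callable, Sequence

def adaptive_token_reduction(
    captions: Sequence[str],
    count_tokens: Callable[[str], int],
    summarize_pair: Callable[[str, str], str],
    context_budget: int,
) -> list[str]:
    """Iteratively merge adjacent clip captions until the joint description fits
    within the LM context budget. `summarize_pair` is a placeholder for an LM call
    that condenses two neighboring captions into one."""
    captions = list(captions)
    while sum(count_tokens(c) for c in captions) > context_budget and len(captions) > 1:
        # Merge the adjacent pair whose combined caption is currently longest,
        # so each step removes as many redundant tokens as possible.
        i = max(range(len(captions) - 1),
                key=lambda j: count_tokens(captions[j]) + count_tokens(captions[j + 1]))
        captions[i:i + 2] = [summarize_pair(captions[i], captions[i + 1])]
    return captions

def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0
```

mIoU is then the mean of `temporal_iou` over all grounded questions in the evaluation set.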
6. Implications, Limitations, and Prospects
CGBench reveals significant progress and persistent gaps in LM-driven scientific interpretation. While models automate extraction and categorization, they face challenges in fine-grained evidence assessment and explanation fidelity. The tendency to hallucinate or loosely paraphrase the source undermines adoption in high-stakes contexts.
Advances in prompting (role-playing, in-context), structured output adherence, and agentic multi-document aggregation are proposed as future directions. In video comprehension, dynamic input condensation is essential for long-context reasoning. The benchmark elucidates the need for stricter semantic coupling between LM-generated outputs and curated ground truth to facilitate reliable deployment in biomedical, video, and graph-based scientific domains.
7. Technical Summary and Formalisms
Key technical aspects include the explicit mapping of input queries, publication text, and code hierarchies to formal evaluation functions, e.g.,

$$f_{\text{score}} : (q, P) \longrightarrow 2^{\mathcal{C} \times \mathcal{E}}$$

and

$$f_{\text{ver}} : (q, P, c) \longrightarrow \{\text{met}, \text{not met}\},$$

where $\mathcal{C}$ denotes the hierarchical evidence-code set and $\mathcal{E}$ the space of structured explanations.
These function-based definitions ensure transparent and reproducible benchmarking with clear boundaries for automated and manual review. The use of hierarchical code sets, structured explanations, and protocol-driven extraction distinguishes CGBench from prior narrow benchmarks.
CGBench, in both clinical genetics and video reasoning domains, serves as a rigorous, protocol-aligned evaluation framework for measuring the limits and potential of advanced LMs and multi-stage reasoning systems. It establishes a foundation for future research targeting precise, high-stakes scientific interpretation and multimodal understanding.