QLEVR: Diagnostic Benchmark for VQA Reasoning

Updated 29 October 2025
  • QLEVR is a synthetic benchmark designed to assess VQA models' abilities to reason with complex quantifiers using over one million unique question-image pairs and detailed scene graphs.
  • It introduces 27 distinct quantifier types, including numerical, proportional, exception, and embedded forms, to rigorously test compositional reasoning beyond simple counting and attribute matching.
  • QLEVR employs acceptance-rejection filtering and balanced answer protocols to minimize bias, establishing a critical testbed for advancing vision-language architectures.

QLEVR is a large-scale synthetic diagnostic benchmark designed to probe the capacity of visual question answering (VQA) models to perform advanced reasoning over quantificational language. Unlike CLEVR and related datasets, which focus primarily on existence, counting, and simple attribute-based inferences, QLEVR introduces formally compositional, minimally biased visual scenes and questions targeting the full spectrum of linguistic quantifiers, including numerical, proportional, exception, set-relational, and complex embedded forms. These quantifier phenomena are central to natural language semantics but are rarely represented in machine learning or VQA evaluation protocols.

1. Dataset Construction and Scene Generation

QLEVR comprises 100,000 unique images, each depicting a synthetic scene with a desk-like surface that holds one to five geometric planes (triangular, rectangular, or circular) plus a non-geometric background; each plane or the background hosts 1–12 diverse objects. Objects span seven shapes, five materials, eight colors, and two sizes; the spatial layout is designed for maximal clarity and explicit location cues, with no occlusions and clear desk edges.

Image generation is performed programmatically in Blender with three camera angles per scene; detailed scene graphs are provided with ground-truth segmentation, attribute, and spatial metadata for every entity. This allows precise control over the properties referenced in each question and supports compositional reasoning.
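
The exact scene-graph schema is not reproduced here, but the sketch below illustrates the kind of per-scene structure the description above implies: attributed planes on a desk-like surface, each holding attributed objects with segmentation and spatial metadata. All field names are illustrative assumptions, not the dataset's actual keys.

```python
# Illustrative sketch (not the dataset's exact schema) of a QLEVR-style scene graph.
scene = {
    "image_id": 17,
    "camera": "view_0",                      # one of three camera angles per scene
    "planes": [
        {
            "plane_id": 0,
            "shape": "triangular",           # triangular / rectangular / circular
            "color": "black",
            "material": "wooden",
            "objects": [
                {
                    "object_id": 3,
                    "shape": "cube",         # one of seven object shapes
                    "color": "red",          # one of eight colors
                    "material": "rubber",    # one of five materials
                    "size": "small",         # small or large
                    "position_3d": [0.42, -0.13, 0.05],
                    "mask": None,            # per-object segmentation in the real data
                },
            ],
        },
    ],
    "background_objects": [],                # objects placed on the background, not on a plane
}
```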

2. Quantificational Language and Question Design

QLEVR’s central innovation is the formalization and automated realization of quantificational language for VQA. Each question is constructed from extensive templates, incorporating both object and plane attributes with slots for quantifier, number, spatial relation, and exception types. There are 671 distinct question templates, generating over one million unique question-image pairs.
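
A minimal sketch of slot-based template instantiation in the spirit of the description above; the template string, slot names, and helper function are illustrative assumptions rather than the benchmark's actual generation code.

```python
import random

# Hypothetical template with typed slots; QLEVR's real templates are more elaborate
# and also carry the machinery needed to compute the ground-truth answer.
TEMPLATE = ("Are there exactly {n} {size} {color} {material} {shape}s "
            "on the {plane_color} {plane_material} {plane_shape} plane?")

def instantiate(template, slot_values, rng=random):
    """Fill each slot by sampling one of the admissible values for that slot."""
    return template.format(**{slot: rng.choice(values)
                              for slot, values in slot_values.items()})

slot_values = {
    "n": ["1", "2", "3"],
    "size": ["small", "large"],
    "color": ["red", "blue", "gray"],
    "material": ["rubber", "metal"],
    "shape": ["cube", "sphere"],
    "plane_color": ["black"],
    "plane_material": ["wooden"],
    "plane_shape": ["triangular"],
}

print(instantiate(TEMPLATE, slot_values))
# e.g. "Are there exactly 2 small red rubber cubes on the black wooden triangular plane?"
```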

The dataset covers 27 quantifier types:

  • Standard (all, some, no, every, each, most)
  • Numerical (exactly N, at least/at most N, between N and M)
  • Proportional (more than half, at least X% of)
  • Difference (more than/fewer than N)
  • Exception (all but at most/least N, every/no object except C)
  • Compound and negated forms
  • Embedded quantifiers requiring deep logical inference.

Quantifiers are formally specified, e.g.:

$$\begin{align*}
\text{all}_P(A,B) &\Leftrightarrow A \subseteq B \\
\text{some}_P(A,B) &\Leftrightarrow A \cap B \neq \emptyset \\
(\text{between } n_1 \text{ and } n_2)_P(A,B) &\Leftrightarrow n_1 \leq |A \cap B| \leq n_2 \\
(\text{more than half})_P(A,B) &\Leftrightarrow |A \cap B| > 0.5\,|A|
\end{align*}$$

where $A$ and $B$ are subsets defined by question scopes over attributes and scene structure.
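
These denotations reduce to simple set operations once the restrictor and scope sets have been extracted from the scene graph. The sketch below mirrors the definitions above; the function names and toy sets are illustrative, not the dataset's code.

```python
# Quantifier denotations as set operations. A is the restrictor set and B the scope
# set (e.g., A = red objects on a given plane, B = small objects), both sets of ids.

def all_(A, B):
    return A <= B                      # A ⊆ B

def some(A, B):
    return bool(A & B)                 # A ∩ B ≠ ∅

def exactly(n, A, B):
    return len(A & B) == n

def between(n1, n2, A, B):
    return n1 <= len(A & B) <= n2

def more_than_half(A, B):
    return len(A & B) > 0.5 * len(A)

def all_but_at_most(n, A, B):
    return len(A - B) <= n             # exception quantifier: all but at most n

# Toy check: objects {1, 2, 3} are red; {2, 3} of them are also small.
red, small = {1, 2, 3}, {2, 3}
assert more_than_half(red, small) and not all_(red, small)
assert between(1, 2, red, small) and all_but_at_most(1, red, small)
```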

Questions are longer than those of prior diagnostic benchmarks (30–40 words on average) and often contain up to four quantifiers, imposing layered compositional semantic demands unprecedented in VQA benchmarks.

3. Quality Control, Minimal Bias, and Answer Balancing

QLEVR’s construction employs acceptance-rejection procedures to prevent statistical answer biases and spurious correlations previously shown to undermine VQA generalization performance. Trivial or ill-formed questions (e.g., those referencing non-existent object classes or asking redundant comparisons) are programmatically filtered; answer balancing is enforced to maintain a uniform True/False distribution across questions and quantifier types.
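
One simple way to realize such acceptance-rejection balancing is to track running True/False counts per quantifier type and reject candidate questions that would skew the distribution. The sketch below is an illustrative assumption about how such a filter could look, not the paper's actual generation pipeline.

```python
from collections import defaultdict

# Running True/False counts per quantifier type, updated as candidates are accepted.
counts = defaultdict(lambda: {"True": 0, "False": 0})

def accept(quantifier_type, answer, tolerance=0.05, min_total=20):
    """Accept a candidate (question, answer) pair only if it keeps the
    True/False ratio for its quantifier type close to 50/50."""
    c = counts[quantifier_type]
    total = c["True"] + c["False"]
    share = (c[answer] + 1) / (total + 1)
    if total >= min_total and share > 0.5 + tolerance:
        return False                      # reject: this answer is over-represented
    c[answer] += 1                        # accept and update the running counts
    return True
```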

Template instantiation is randomized to maximize diversity while maintaining uniformity across splits (70K train, 15K val, 15K test), with virtually no duplicate question-image pairs between sets.

4. Scope of Quantificational Reasoning and Logical Coverage

QLEVR’s question suite intensively exercises the following dimensions:

  • Set Cardinality and Comparison: Numerical quantifiers, difference, fractions.
  • Generalized Quantifiers: Most, more than half, all but N, each, only.
  • Set Relations: Subset, intersection, disjointness, exception.
  • Logical Formulas: Compound, multi-clause, embedded negation, “square of opposition” entailments.
  • Cross-modal Integration: Visual grounding of abstract quantifier semantics (e.g., “Are there more than two red balls smaller than at least three blue balls?”).

A typical template is “Are there exactly <n> <size> <color> <material> <shape>s on the <plane-color> <plane-material> <plane-shape> plane?”, which instantiates as, for example, “Are there exactly 2 small red rubber objects on the black wooden triangular plane?”
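
Combining the scene-graph sketch from Section 1 with the quantifier denotations from Section 2, the ground-truth answer for such an instantiated question could be computed roughly as follows. All function names here are hypothetical, for illustration only.

```python
# Hypothetical ground-truth answering for the example question above: restrict the
# scene graph to the referenced plane, filter its objects by the mentioned
# attributes, and apply the "exactly n" denotation to the filtered set.

def objects_on_plane(scene, color, material, shape):
    """Return the objects of the plane matching the given attributes, if any."""
    for plane in scene["planes"]:
        if (plane["color"], plane["material"], plane["shape"]) == (color, material, shape):
            return plane["objects"]
    return []

def matches(obj, **attrs):
    return all(obj.get(key) == value for key, value in attrs.items())

def answer_exactly_n(scene, n, plane_attrs, object_attrs):
    restrictor = [o for o in objects_on_plane(scene, **plane_attrs)
                  if matches(o, **object_attrs)]
    return len(restrictor) == n

# answer_exactly_n(scene, 2,
#                  {"color": "black", "material": "wooden", "shape": "triangular"},
#                  {"size": "small", "color": "red", "material": "rubber"})
```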

5. Model Evaluation Protocols and Baseline Results

QLEVR includes comprehensive evaluation using both text-only and image-language neural architectures:

  • Text-only: Q-type, LSTM, BERT.
  • Vision+Language: CNN+LSTM, MAC (Memory, Attention, Composition network with ResNet-101 visual features).

The experimental protocol involves training each model on QLEVR and reporting test-set accuracy averaged over three random seeds.

Model Overall Accuracy (%)
Q-type 50.0
LSTM 64.6
BERT 65.8
CNN+LSTM 65.9
MAC 66.5
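
For reference, a minimal sketch of the seed-averaged accuracy computation implied by the protocol above, assuming per-seed prediction lists and gold answers are available as Python lists.

```python
from statistics import mean, stdev

def accuracy(predictions, answers):
    """Fraction of yes/no questions answered correctly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

def seed_averaged_accuracy(runs, answers):
    """Mean (and spread) of test accuracy over several seeds' prediction lists."""
    accs = [accuracy(preds, answers) for preds in runs]
    spread = stdev(accs) if len(accs) > 1 else 0.0
    return mean(accs), spread
```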

Notable findings:

  • Text-only models outperform the chance-level Q-type baseline (50%), indicating residual linguistic priors despite the bias-mitigation efforts in dataset construction.
  • Vision+language models provide only a marginal accuracy lift, particularly for multi-plane, multi-attribute, and quantifier-rich questions.
  • Accuracy drops sharply with increasing number of quantifiers and with questions targeting “most,” “not all,” or embedded negatives—these receive the lowest scores.
  • Spatial relationship and set comparison reasoning consistently challenge all evaluated models.

6. Comparative Analysis, Benchmark Positioning, and Research Directions

QLEVR substantially extends the CLEVR paradigm, moving beyond existence, counting, and value comparison into advanced compositional quantifier logic, exception handling, and set-theoretic inferences. No prior dataset offers its breadth or systematic coverage of quantificational semantics.

Empirical results on QLEVR highlight the performance gap between current neural methods—even those excelling on CLEVR-like counting—and true quantificational/logical reasoning, particularly in visually grounded contexts. The dataset exposes fundamental limitations in pattern-based VQA and motivates the development of new architectures incorporating explicit compositional and symbolic reasoning capabilities, improved logical and set-theoretic representation, and cross-modal grounding.

QLEVR is available with the full codebase, image scenes, detailed annotations, and question templates at https://github.com/zechenli03/QLEVR.

7. Implications and Prospects

QLEVR establishes a new benchmark for quantificational language reasoning in visual contexts, serving as a rigorous diagnostic suite for elementary compositional inference, set relations, and predicate logic. Its construction and results urge the field to move beyond counting and simple attribute matching towards deeper integration of logical semantics—potentially with hybrid neural-symbolic or logic-augmented architectures.

Extension avenues include multi-language support, addition of further quantifier and logical operations, and targeted studies of model generalization between synthetic diagnostic and naturalistic datasets. QLEVR positions itself as a critical testbed for the next stages in VQA research, especially for models claiming human-level compositional understanding.
