Visual Question Answering Format

Updated 21 November 2025
  • Visual Question Answering Format is a standardized schema that integrates structured annotations, data types, and evaluation protocols to assess multimodal reasoning.
  • It encompasses diverse dataset categories such as authentic, synthetic, diagnostic, and knowledge-based, each tailored with specific annotation requirements.
  • The format supports various VQA tasks—including open-ended, multiple-choice, and knowledge-based—facilitating reproducible and unbiased performance comparisons.

Visual Question Answering (VQA) Format refers to the schema, data structures, and evaluation protocols used to represent, store, and benchmark datasets for the task of answering natural language questions about images. VQA combines challenges in vision and language modeling, demanding explicit alignment between visual inputs and textual queries, sophisticated reasoning, and adaptable annotation structures to support various task formulations such as open-ended, multiple-choice, compositional, knowledge-based, or multi-modal question answering (Kabir et al., 17 Nov 2024, Srivastava et al., 2019, Agrawal et al., 2015).

1. Dataset Categories and Core Data Structures

VQA datasets are classified into four principal categories, each with distinct structural features and annotation requirements (Kabir et al., 17 Nov 2024):

  • Authentic (Real Image) Datasets: Comprised of natural photographs (e.g., VQA v1.0/v2.0, COCO-QA), each datum structured in JSON includes:
    • image_id: integer; unique image identifier.
    • image_url/path: string; pointer to image resource.
    • question_id: integer; unique question identifier.
    • question: string; natural-language query.
    • answer_type: string; coarse category (“yes/no”, “number”, “other”).
    • multiple_choice: array of candidate answer strings (optional).
    • answers: array; each object contains:
      • answer: string; human-provided response.
      • answer_confidence: string/probability (e.g., “yes”, “maybe”, “no”).
    • multiple_choice_answer: string; modal human answer for scoring.
  • Synthetic Datasets: Feature programmatically generated images (e.g., CLEVR), with schema extended by:
    • scene_graph: object; symbolic scene description.
    • functional_program: array/tree; reasoning steps encoding.
    • question_family: string; template identifier.
  • Diagnostic Datasets: Target specific challenges (e.g., CLEVR-Dialog), structured with all synthetic fields plus:
    • history: array; prior (question, answer) pairs.
    • ground_truth_scene: optional scene description for evaluation.
  • Knowledge-Based Datasets: Measure external fact retrieval (e.g., OK-VQA, FVQA):
    • knowledge_base_entries: array; each entry contains:
      • kb_source: string, e.g., “DBpedia”.
      • kb_id: resource identifier.
      • kb_relation: predicate.
      • kb_retrieved_text: supporting evidence.

Concrete example entry:

{
  "image_id": 123456,
  "image_url": "http://images.cocodataset.org/train2014/000000123456.jpg",
  "question_id": 98765,
  "question": "What country is famous for inventing sushi?",
  "answer_type": "other",
  "answers": [
    {"answer": "Japan", "answer_confidence": "yes"},
    {"answer": "Japan", "answer_confidence": "yes"},
    {"answer": "South Korea", "answer_confidence": "maybe"},
    {"answer": "China", "answer_confidence": "no"}
  ],
  "multiple_choice": ["Japan","China","South Korea","Thailand"],
  "multiple_choice_answer": "Japan",
  "knowledge_base_entries": [
    {
      "kb_source": "DBpedia",
      "kb_id": "Q17",
      "kb_relation": "countryOfOrigin",
      "kb_retrieved_text": "Sushi is a traditional Japanese dish of prepared vinegared rice..."
    }
  ]
}
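
As an illustration of how entries in this layout might be consumed, the following minimal Python sketch loads a JSON list of such records and checks the core fields described above. The field names follow the schema in this section; the helper names (load_vqa_records, REQUIRED_FIELDS) and the example path are illustrative rather than part of any official toolkit.

import json

# Core fields every authentic-dataset entry is expected to carry, per the
# schema described above (illustrative, not an official specification).
REQUIRED_FIELDS = {"image_id", "question_id", "question", "answers"}

def load_vqa_records(path):
    """Load a JSON file containing a list of VQA entries and lightly validate them."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            raise ValueError(f"entry {rec.get('question_id')} missing fields: {missing}")
        # Human answers: typically 10 objects with 'answer' and 'answer_confidence'.
        for ans in rec["answers"]:
            assert "answer" in ans, "each answer object needs an 'answer' string"
    return records

# Example usage (path is hypothetical):
# records = load_vqa_records("vqa_train_annotations.json")
# print(records[0]["question"], "->", records[0].get("multiple_choice_answer"))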

2. Question and Answer Typologies

VQA formats standardize both question classes and answer annotation protocols (Srivastava et al., 2019, Kabir et al., 17 Nov 2024):

  • Open-ended: Answers are unconstrained, typically brief text (≤3 words); each question is annotated with 10 human responses carrying confidence/frequency tags.
  • Multiple-choice: The model selects from pre-listed candidates (multiple_choice) and is scored against the modal human answer.
  • Question Partitioning (Multi-task): Questions are partitioned by type (e.g., object, count, colour, position/spatial), enabling multi-head architectures and multi-task learning (Pollard et al., 2020); a partitioning sketch follows the table below.
  • Knowledge-based Answers: Accompanied by explicit KB links or external knowledge tags; answers may not be visually inferable.
  • Textual Grounding and Cross-modal Reasoning: Emerging datasets (e.g., VTQA) include paired “context” paragraphs (≥100 words) and require entity alignment across modalities (Chen et al., 2023).

Table: Typical Question Types

Type | Example Question | Annotation Format
Yes/No | Is there a dog? | Binary label ("yes"/"no")
Count/Number | How many apples? | Integer answer, often treated as a class
Attribute/Object | What color is the hydrant? | Short open-vocabulary text, modal label
Knowledge-based | Who invented sushi? | Text + KB link
Reasoning/Dialog | What did she do before lunch? | Dialog history and context required
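
To make the coarse types in the table concrete, the following rule-based sketch buckets raw question strings into them. Real datasets store the type explicitly (answer_type, question_family), so keyword heuristics like these are purely illustrative.

import re

# Illustrative keyword heuristics for coarse question types; datasets such as
# VQA v2.0 store an explicit answer_type instead of inferring it from text.
def coarse_question_type(question: str) -> str:
    q = question.lower().strip()
    if re.match(r"(is|are|was|were|does|do|did|can|could|has|have)\b", q):
        return "yes/no"
    if q.startswith("how many") or q.startswith("how much"):
        return "count"
    if q.startswith("what color") or q.startswith("what colour"):
        return "attribute"
    if q.startswith("who") or "invented" in q:
        return "knowledge"
    return "other"

print(coarse_question_type("How many apples?"))             # count
print(coarse_question_type("Is there a dog?"))              # yes/no
print(coarse_question_type("What color is the hydrant?"))   # attribute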

3. Annotation Conventions and Best Practices

Annotation protocols are designed for reproducibility, capture of answer ambiguity, and robust evaluation (Agrawal et al., 2015, Kabir et al., 17 Nov 2024, Wang et al., 2022):

  • Confidence and Frequency: Store all human answers (typically 10) alongside “confidence” or explicit probability.
  • Field Naming: Use consistent keys (image_id, question_id, answers).
  • Versioning and Documentation: Include top-level version and schema_description fields.
  • Bias Mitigation: Balance answer classes (e.g., yes/no) during dataset construction; use active learning to mine hard negatives (Wang et al., 2022). A balance-audit sketch follows this list.
  • Multimodal Context: For text-augmented VQA (VTQA), pair each image with a free-form paragraph and ensure each question depends on both modalities.
  • External Knowledge Integration: Link out to KB entries when required for answer generation.
  • Schema for Multi-Image or Gallery Tasks: Represent as arrays of image objects with per-image relevance labels (e.g., ChiQA: y ∈ {0,1,2} per image-query pair).
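
As a quick audit of the yes/no balance recommended above, the sketch below tallies modal answers over records in the Section 1 layout; the field names are assumed as described there, and the function is illustrative rather than a standard tool.

from collections import Counter

def answer_balance(records):
    """Count modal answers for yes/no questions to audit class balance."""
    counts = Counter()
    for rec in records:
        if rec.get("answer_type") == "yes/no":
            counts[rec.get("multiple_choice_answer", "").lower()] += 1
    total = sum(counts.values()) or 1
    return {ans: n / total for ans, n in counts.items()}

# e.g. {'yes': 0.52, 'no': 0.48} would indicate a roughly balanced split.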

4. Evaluation Metrics and Protocols

VQA adopts canonical consensus-based, similarity, and ranking metrics tailored to answer ambiguity (Agrawal et al., 2015, Kabir et al., 17 Nov 2024, Srivastava et al., 2019, Wang et al., 2022, Chen et al., 2023):

  • Consensus-based Accuracy (VQA Standard), computed against the 10 human answers per question (a computational sketch for this metric and NDCG follows at the end of this section):

\text{Acc}(\hat{a}) = \min\left(1, \; \frac{\text{number of matches among humans}}{3}\right)

\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \min\left(\frac{\sum_{j=1}^{10} [\hat{a}_i = a_{ij}]}{3}, 1\right)

  • Similarity Metrics: WUPS@τ (Wu-Palmer similarity), BLEU, METEOR, ROUGE-L; less common in VQA, typically borrowed from captioning-style text-overlap evaluation.
  • Ranking Metrics (Multi-image tasks):

\text{DCG}@k = \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2(i + 1)}

\text{NDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k}

where r_i is the relevance label of the image at rank i (Wang et al., 2022).

  • Multi-task Losses: Question-partitioned (multi-task) formats combine per-type cross-entropy losses with task weights:

L = \sum_{i=1}^{T} \lambda_i L_i, \qquad L_i = -\sum_{k=1}^{K_i} y^i_k \log \hat{y}^i_k

where T is the number of question types, \lambda_i is the weight of type i, and K_i is its answer-class vocabulary size.
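
The metrics above can be computed directly from the stored annotations. The sketch below follows the formulas (10 human answers per question for consensus accuracy, graded relevance labels per ranked image for NDCG); it is illustrative and omits the additional answer-string normalization that official evaluation scripts typically apply.

import math

def vqa_consensus_accuracy(pred: str, human_answers: list[str]) -> float:
    """min(1, number of matching human answers / 3), over the (typically 10) annotations."""
    matches = sum(1 for a in human_answers if a.strip().lower() == pred.strip().lower())
    return min(1.0, matches / 3.0)

def dcg_at_k(relevances: list[int], k: int) -> float:
    # Position i (0-based) contributes (2^r - 1) / log2(i + 2).
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances[:k]))

def ndcg_at_k(relevances: list[int], k: int) -> float:
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Prediction agrees with 4 of 10 annotators -> accuracy saturates at 1.0:
print(vqa_consensus_accuracy("japan", ["Japan"] * 4 + ["China"] * 6))
# Gallery of 5 images with relevance labels in ranked order:
print(round(ndcg_at_k([2, 0, 1, 0, 0], k=5), 3))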

5. Notable Schema Variants and Extensions

Advanced VQA efforts extend or modify the baseline schema to support new modalities, multi-image/compositional questions, instruction-driven prompts, and multi-task fusion (Lee et al., 13 Feb 2024, Chen et al., 2023):

  • Unified Instruction-based Format (VQA-IN): Each record is a tuple (instruction, image, question, answer), enabling domain-specific MLLM training across vision-language and custom tasks (Lee et al., 13 Feb 2024); a conversion sketch follows this list. This is formalized as:

D_\text{VQA-IN} = \{ (\text{instr}_i, x_i, q_i, a_i) \}

  • Gallery-based Format (ChiQA): Each query paired with 5 candidate images, each with a 2/1/0 answerability label; stored as JSON arrays with per-image label objects (Wang et al., 2022).
  • Text-augmented Cross-media Schema (VTQA): Each example contains image, paired paragraph-length “context”, open-ended question, and answer with explicit answer_type; multi-hop entity alignment enforced via attention mechanisms (Chen et al., 2023).
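
As a sketch of the instruction-based conversion referenced above, the function below maps a standard VQA record (Section 1 layout) into an (instruction, image, question, answer) record. The instruction template and the function name are assumptions for illustration, not taken from the cited work.

def to_vqa_in(record, instruction="Answer the question about the image in a single word or phrase."):
    """Convert a standard VQA entry into an (instruction, image, question, answer) record."""
    answer = record.get("multiple_choice_answer") or record["answers"][0]["answer"]
    return {
        "instruction": instruction,  # illustrative template, not from the cited paper
        "image": record.get("image_url") or record.get("image_path"),
        "question": record["question"],
        "answer": answer,
    }

# Example, using the concrete entry from Section 1:
# to_vqa_in(records[0]) -> {"instruction": "...", "image": ".../000000123456.jpg",
#                           "question": "What country is famous for inventing sushi?",
#                           "answer": "Japan"}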

6. Impact, Best Practices, and Benchmarking Protocols

Rigorous application of the VQA format across new datasets and methods ensures fair benchmarking and enables algorithmic advances (Kabir et al., 17 Nov 2024, Agrawal et al., 2015). Recommended practices include:

  • Releasing complete code for evaluation metric computation (e.g., consensus-based accuracy, NDCG).
  • Public splits for train/val, private test server for blind leaderboard ranking.
  • Consistent annotation schemas across splits and dataset types.
  • Programmatic APIs exposing scene graphs, functional programs, and KB links for diagnostic and knowledge-aware tasks.
  • Explicit documentation of annotation and quality-control protocols, including crowdworker instructions and control procedures.
  • Adoption of multi-task schema or instruction-based conversion for scalable benchmarking of multitask MLLM architectures.

This comprehensive schema and protocol facilitate cross-architecture compatibility, unbiased comparison of novel VQA models, and rapid, reproducible progress in research involving multimodal machine reasoning and vision-language understanding (Kabir et al., 17 Nov 2024, Srivastava et al., 2019, Agrawal et al., 2015, Wang et al., 2022, Chen et al., 2023, Lee et al., 13 Feb 2024, Pollard et al., 2020).
