
AGT Dataset for Autism Gaze Analysis

Updated 21 November 2025
  • The AGT dataset is a large-scale resource providing frame-level annotations that capture social and non-social gaze behaviors in children with ASD.
  • Annotation follows a detailed CVAT-based protocol with rigorous quality control, including inter-annotator agreement of Cohen’s κ = 0.757 and IoU-based spatial agreement checks.
  • Standardized splits support robust benchmarking, and the dataset’s pronounced class imbalance highlights a key challenge for automated gaze detection methods.

The Autism Gaze Target (AGT) dataset is a large-scale resource developed to advance automated gaze target detection for young children diagnosed with Autism Spectrum Disorder (ASD). Designed to address foundational bottlenecks in joint attention measurement and social gaze analysis, the AGT dataset underpins state-of-the-art machine learning research in ASD-specific visual behavior analysis by providing well-annotated activity footage, rigorous quality control, and standardized evaluation benchmarks (Deng et al., 14 Nov 2025).

1. Data Collection: Participants, Sessions, and Recording Protocol

The AGT dataset comprises 59 session-length video recordings, each corresponding to an individual young autistic child. These sessions were conducted under institutional review board (IRB) approval during the administration of the Communication and Symbolic Behavior Scales–Developmental Profile (CSBS-DP) by trained clinicians. Sessions consistently employ the Toddler Module, situating the participant pool in the early-childhood developmental stage. Each recording typically includes the child, a clinician, and often a parent, with all visible in a laboratory playroom environment. The single static camera configuration ensures full coverage of the child, adults, and relevant objects while preserving naturalistic interaction dynamics. Elicited behaviors include both social attention (e.g., face-directed gaze during naming or pointing tasks) and non-social attention (e.g., object-focused play), supporting comprehensive analysis of gaze behaviors intrinsic to ASD (Deng et al., 14 Nov 2025).

2. Annotation Scheme and Protocol

Gaze targets are annotated at the frame level using the Computer Vision Annotation Tool (CVAT). The protocol requires bounding boxes around the child’s face (“head box”), all adult faces, and the ground-truth gaze target region, which is classified into four categories: Object, Face, Person-Non-Face (adult body parts excluding the face), and Noninclusive (gaze outside the frame or without a discernible target). Annotation covers 16,582 frames, grouped into five-frame clips sampled at 1 fps to preserve temporal context. Each frame was labeled by a single trained annotator, and a randomly selected subset of 800 frames was doubly annotated for quality control. Inter-annotator agreement on category labels, measured by Cohen’s κ, reaches 0.757, indicating substantial consistency. For bounding box overlap, spatial agreement is defined as:

$$
\text{agreement} =
\begin{cases}
1, & \text{if } \text{IoU} > \tau \\
0, & \text{otherwise}
\end{cases}
$$

where IoU denotes the Intersection over Union of the two bounding boxes and τ is the acceptance threshold (Deng et al., 14 Nov 2025).
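
As a concrete illustration, the following Python sketch computes both quality-control measures: Cohen’s κ over category labels and the thresholded IoU agreement over bounding boxes. It is not taken from the AGT toolkit; the (x_min, y_min, x_max, y_max) box format, the τ = 0.5 default, and the example labels are assumptions for demonstration.

```python
# Illustrative sketch (not from the AGT toolkit) of the two quality-control
# measures described above.
from sklearn.metrics import cohen_kappa_score

def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def spatial_agreement(box_a, box_b, tau=0.5):
    """Binary agreement: 1 if IoU exceeds the acceptance threshold tau."""
    return 1 if iou(box_a, box_b) > tau else 0

# Category agreement on a doubly annotated subset (example labels only).
annotator_1 = ["Face", "Object", "Object", "Noninclusive"]
annotator_2 = ["Face", "Object", "Person-Non-Face", "Noninclusive"]
kappa = cohen_kappa_score(annotator_1, annotator_2)  # AGT reports 0.757
```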

3. Dataset Statistics, Challenges, and Splits

After filtering out unresolvable frames (those with no detectable gaze target), the curated AGT dataset consists of 16,582 frame-level samples. The gaze target categories are highly imbalanced: the face-directed (social) class appears in 1,088 frames (6.6%), while the remaining 15,494 frames (93.4%) contain non-face-directed gaze (Object and Person-Non-Face). This pronounced skew is a significant obstacle for standard deep learning techniques, which tend to overfit to the majority (non-social) class and consequently miss rare but clinically critical social gaze events.
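
One standard mitigation for this kind of skew (a generic technique, not something the AGT paper prescribes) is to reweight the classification loss by inverse class frequency, as in the minimal PyTorch sketch below, which uses the published Face/Not-Face frame counts.

```python
# Generic imbalance mitigation for the 6.6% / 93.4% Face vs. Not-Face skew:
# inverse-frequency class weights in the classification loss.
import torch
import torch.nn as nn

counts = torch.tensor([1088.0, 15494.0])          # Face, Not-Face frame counts
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weighting
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 2)             # dummy model outputs for a batch of 4
targets = torch.tensor([0, 1, 1, 1])   # 0 = Face, 1 = Not-Face
loss = criterion(logits, targets)      # errors on the rare Face class cost more
```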

Dataset splits are as follows:

Split        Frames
Training      9,874
Validation    3,344
Test          3,364

This split structure supports robust model benchmarking and fair comparison across studies. The low prevalence of face-directed gaze is consistent with established behavioral patterns in ASD (Deng et al., 14 Nov 2025).

4. Data Format, Availability, and Toolkit

Frames are provided as image files in JPEG or PNG format, downsampled to 224×224 resolution for experimental consistency. Annotation files exported from CVAT follow JSON or PASCAL VOC XML conventions, specifying the coordinates and sizes of the child’s head box, the adult face boxes, and the labeled gaze target region together with its category. The public GitHub repository (https://github.com/ShijianDeng/AGT) offers comprehensive PyTorch data loaders, image preprocessing scripts (including cropping and blur augmentation), and training/evaluation pipelines that facilitate direct benchmarking of new methods against published baselines. Download details and access protocols are communicated in the repository README (Deng et al., 14 Nov 2025).
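
For orientation, a minimal loader might look like the following sketch. It assumes PASCAL VOC XML exports, a flat images/annotations directory layout, and a .jpg extension, all of which are assumptions on our part; the official loaders in the AGT repository should be preferred for real experiments.

```python
# Minimal illustrative loader for VOC-style AGT exports (layout assumed).
import os
import xml.etree.ElementTree as ET

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class AGTFrames(Dataset):
    def __init__(self, image_dir, ann_dir):
        self.image_dir = image_dir
        self.ann_dir = ann_dir
        self.ids = [f[:-4] for f in sorted(os.listdir(ann_dir))
                    if f.endswith(".xml")]
        self.to_tensor = transforms.Compose([
            transforms.Resize((224, 224)),  # matches the released resolution
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        frame_id = self.ids[idx]
        image = Image.open(
            os.path.join(self.image_dir, frame_id + ".jpg")).convert("RGB")
        root = ET.parse(os.path.join(self.ann_dir, frame_id + ".xml")).getroot()
        boxes, labels = [], []
        for obj in root.iter("object"):  # head box, adult faces, gaze target
            bb = obj.find("bndbox")
            boxes.append([float(bb.find(tag).text)
                          for tag in ("xmin", "ymin", "xmax", "ymax")])
            labels.append(obj.find("name").text)
        return self.to_tensor(image), {"boxes": boxes, "labels": labels}
```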

5. Model Benchmarks and Performance Metrics

Multiple state-of-the-art models are evaluated on the AGT benchmark:

  • GazeLLE: Designed for neurotypical pediatric gaze estimation.
  • Sharingan: Transformer-based and pre-trained on the Childplay dataset.
  • Qwen2.5-VL-7B-Instruct: Multimodal LLM fine-tuned for Face vs. Not-Face classification.

Performance is measured by L2 error (the normalized Euclidean distance between the predicted and ground-truth gaze points), together with macro-averaged precision, recall, and F1 score for Face vs. Not-Face discrimination.
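
A minimal rendering of these metrics is sketched below; normalizing the gaze point by the frame dimensions is an assumption and may differ from the benchmark’s exact implementation, and the labels are toy values.

```python
# Sketch of the two evaluation metrics as described above.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def normalized_l2(pred_xy, gt_xy, width=224, height=224):
    """Euclidean error between predicted and ground-truth gaze points,
    scaled by frame size so the score is resolution-independent."""
    dx = (pred_xy[0] - gt_xy[0]) / width
    dy = (pred_xy[1] - gt_xy[1]) / height
    return float(np.hypot(dx, dy))

# Macro-averaged precision / recall / F1 for Face vs. Not-Face (toy labels).
y_true = ["Face", "Not-Face", "Not-Face", "Face"]
y_pred = ["Face", "Not-Face", "Face", "Face"]
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
```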

Key results from the test set:

Model            L2 (↓)   F1 (↑)
Sharingan        0.0615   0.5660
GazeLLE          0.0670   0.4865
Sharingan-AGT    0.0486   0.7531
GazeLLE-AGT      0.0460   0.6684
Qwen2.5-VL-AGT   0.0475   0.6630
Sharingan-SACF   0.0480   0.7610
GazeLLE-SACF     0.0453   0.6866

Face-class L2 error is reduced by 13.9% with Sharingan-SACF and by 9.8% with GazeLLE-SACF relative to their respective AGT-tuned baselines. An oracle gate module attains an upper-bound F1 near 0.99, underscoring the remaining headroom in social-context routing (Deng et al., 14 Nov 2025).
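
A plausible reading of the oracle gate, sketched below, is a router that dispatches each frame to a class-specialized predictor using the ground-truth Face/Not-Face label; both branch predictors here are hypothetical placeholders, not the paper’s modules.

```python
# Hedged sketch of an oracle gate of the kind evaluated above: each frame is
# routed by its ground-truth Face / Not-Face label to a class-specialized
# branch, bounding what a learned social-context gate could achieve.
def oracle_gate(frame, gt_is_face, face_branch, general_branch):
    """Dispatch to the face-specialized branch iff the true label is Face."""
    return face_branch(frame) if gt_is_face else general_branch(frame)

# Toy usage: with perfect routing, the Face vs. Not-Face decision is always
# correct, so classification F1 approaches 1.0 (~0.99 reported on AGT),
# isolating gaze-point localization as the remaining source of error.
pred = oracle_gate(
    "frame_0001.jpg", gt_is_face=True,
    face_branch=lambda f: (0.42, 0.17),      # placeholder gaze-point predictor
    general_branch=lambda f: (0.63, 0.55),   # placeholder gaze-point predictor
)
```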

6. Implications for Autism Research and Automated Social Gaze Analysis

The AGT dataset addresses the absence of large, diagnostic-context gaze datasets annotated specifically for autistic children. Its careful construction and rigorous annotation set a domain standard for ASD-focused gaze analysis. With the inclusion of social and non-social target categories and a severe class imbalance that mirrors real-world ASD attentional distributions, the dataset enables the development and evaluation of AI systems capable of sensitive, context-aware joint attention measurement.

A plausible implication is that novel architectures, particularly those leveraging explicit social context (e.g., the Socially Aware Coarse-to-Fine [SACF] framework introduced in the reference paper), may realize further gains in minority-class detection. This is clinically relevant because capturing rare social gaze events is critical for differentiating typical from atypical social attention patterns in ASD. The upper bound provided by the oracle gate indicates that improved social-context awareness remains a primary lever for further progress (Deng et al., 14 Nov 2025).

7. Access, Community Tools, and Research Directions

AGT, together with its open-source codebase, offers an extensible platform for benchmarking and cross-comparison of ASD-specific gaze detection models. By introducing robust pipelines and reproducible evaluation criteria, it invites further methodological advances in handling severe class imbalances and contextual attention modeling. The AGT infrastructure is positioned to support large-scale, generalizable research into social cognition measurement, with immediate translation potential for both scientific inquiry and clinical tool development in ASD (Deng et al., 14 Nov 2025).
