
SCB-dataset: A Multidomain AI Benchmark

Updated 30 November 2025
  • SCB-dataset is a collection of diverse, domain-specific datasets used as benchmarks for tasks such as classroom behavior analysis, scene text editing, machine translation, and cultural reasoning.
  • These datasets feature rigorous annotation protocols, standardized labeling (e.g., YOLO formats), and detailed metrics like mAP, ensuring reliable evaluation across various modalities.
  • They support practical applications including real-time classroom analytics, adaptive instructional design, disentangled representation learning, and cross-cultural visual reasoning.

The term SCB-dataset refers to several distinct, domain-specific datasets widely used in machine learning, computer vision, computational linguistics, and multimodal reasoning. Most SCB-datasets are original, large-scale resources that serve as benchmarks or training sources for specialized tasks: classroom behavior detection, scene text editing, English–Thai machine translation, and cultural visual reasoning. Each dataset is characterized by careful annotation protocols, explicit compositional structure, and public distribution for research purposes, but they differ markedly in data modality, granularity, and use case.

1. Student Classroom Behavior (SCB) Datasets: Scope and Purpose

A dominant use of the SCB-dataset name is within educational computer vision, specifically as annotated image and video frame corpora for the detection and analysis of student or teacher behaviors in classroom settings. These datasets are constructed to enable automated recognition of classroom engagement indicators such as hand-raising, writing, reading, speaking, and other pedagogically relevant actions. Resources include multiple releases of "SCB-Dataset" (Yang et al., 2023; Yang, 2023), SCB-ST-Dataset4 (Yang et al., 2023), and closely related variants.

Their primary application domains are:

  • Real-time participation monitoring and spatio-temporal engagement tracking
  • Training and benchmarking object detection models for classroom analytics
  • Teacher feedback optimization and instructional design
  • Smart classroom solutions and adaptive tutoring systems

Student classroom behavior SCB-datasets fill a gap in educational AI, addressing the absence of high-quality, publicly available annotated resources for fine-grained behavior detection in crowded, dynamic classroom scenarios.

2. Technical Composition and Annotation Protocols

SCB-datasets for classroom behavior are typically compiled from recorded classroom videos (e.g., sources: "bjyhjy," "1s1k" web platforms). Annotation pipelines prioritize axis-aligned bounding boxes per student instance, assigning discrete behavior IDs based on standardized taxonomies (range: 3 to 19 classes).

Key characteristics from leading SCB-datasets:

| Dataset | Images/Frames | Ann. Instances | Behavior Classes | Split |
| --- | --- | --- | --- | --- |
| SCB-Dataset (Yang et al., 2023) | 4,001 | 11,248 | 8 (standing, sitting, …) | 80% train, 20% val |
| SCB-Dataset (Yang, 2023) | 4,200 | 18,400 | 3 (hand_raising, reading, …) | 80% train, 20% val |
| SCB-Dataset3 (Yang et al., 2023) | 5,686 | 45,578 | 6 (hand-raising, reading, …) | 80% train, 20% val |
| SCB-Dataset (Yang, 2023) | 13,330 | 122,977 | 12 (detection), 14 (classification) | Custom per class |
| SCBehavior (Wang et al., 10 Oct 2024) | 1,346 | 9,911 | 7 (writing, reading, …) | Class-wise split |
| SCB-ST-Dataset4 (Yang et al., 2023) | 757,265 | 25,810 | 3 (hand-raising, reading, …) | 80% train, 20% val |

The behavior definitions are detailed, accounting for pose, limb configuration, and interaction context. Annotation is executed by expert teams via image labeling tools, with quality control via cross-review and senior spot checking (Yang et al., 2023, Yang et al., 2023, Yang, 2023). Data are stored as JPEG/PNG images and YOLO-format text files (one file per image; normalized coordinates). Datasets may use additional metadata (e.g., frame rates, camera angle) and are distributed under open academic licenses.
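To make the YOLO label format described above concrete, the helper below parses one normalized label line ("class_id cx cy w h") into a class ID and an absolute pixel box. This is a minimal sketch; the function name and example values are illustrative, not part of any SCB-dataset release.

```python
def parse_yolo_line(line, img_w, img_h):
    """Parse one YOLO-format label line: 'class_id cx cy w h' (normalized).

    Returns the integer class ID and the box as absolute (x1, y1, x2, y2)
    pixel coordinates, converting from normalized centre/size format.
    """
    parts = line.split()
    cls = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return cls, (x1, y1, x2, y2)


# Example: a hand-raising instance in a 1000x500 frame (hypothetical values).
cls, box = parse_yolo_line("0 0.5 0.5 0.2 0.4", 1000, 500)
```

One text file per image holds one such line per annotated student instance, which is why the per-image label files pair one-to-one with the images.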

3. Benchmarking Protocols and Evaluation Metrics

Benchmark protocols across SCB-dataset variants leverage object detection architectures (e.g., YOLOv5/7/8, SlowFast, Deformable DETR), reporting both aggregate and per-class performance metrics:

  • Precision: $P = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$
  • Recall: $R = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$
  • Intersection-over-Union: $\mathrm{IoU}(A,B) = \frac{|A\cap B|}{|A\cup B|}$
  • Average Precision (AP): $AP = \int_{0}^{1} P(R)\,\mathrm{d}R$
  • Mean Average Precision (mAP), reported at a fixed IoU threshold (e.g., mAP@0.5)
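A minimal sketch of the first three metrics, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the function names are illustrative and not tied to any particular benchmark implementation.

```python
def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    return tp / (tp + fp), tp / (tp + fn)


# Example: two unit-overlap boxes give IoU = 1/7.
score = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

In detection benchmarking, a prediction counts as a true positive only when its IoU with a ground-truth box exceeds the chosen threshold (0.5 for mAP@0.5), which is how the box metric and the counting metrics connect.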

Performance typically peaks at mAP@0.5 ≈ 87% for YOLOv7-BRA on the original SCB-Dataset (Yang et al., 2023). Augmentation with attention (Bi-level Routing Attention, Wise-IoU) and multi-model fusion (YOLOv7 + CrowdHuman, SlowFast, DeepSort) further enhance mAP. Long-tailed class imbalance is endemic (e.g., "looking up" ≫ "standing"); class-balanced losses and sampling schemes are recommended (Yang, 2023, Wang et al., 10 Oct 2024, Yang et al., 2023).

The spatio-temporal SCB-ST-Dataset4 (Yang et al., 2023) enables evaluation of video-based models, with SlowFast achieving mAP@0.5 = 82.3% but underperforming on sparse classes due to the absence of class weighting.

4. Data Structure, Access, and Licensing

Standardized directory layouts facilitate reproducibility: images and label files are split by set, and per-image YOLO-format TXT corresponds bijectively via naming conventions (e.g., image_00001.jpg ↔ image_00001.txt). Classes.txt indexes behavior IDs; supporting scripts and metadata files are provided in the respective repositories (Yang et al., 2023, Yang et al., 2023).
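The bijective image↔label correspondence can be checked mechanically by comparing file stems. The sketch below is an illustration of that convention, not a script shipped with the datasets; the example file names are hypothetical.

```python
from pathlib import Path


def check_pairing(image_names, label_names):
    """Verify the image <-> label naming convention by comparing file stems.

    Returns (images missing a label file, label files with no image),
    each as a sorted list of bare stems.
    """
    img_stems = {Path(n).stem for n in image_names}
    lbl_stems = {Path(n).stem for n in label_names}
    return sorted(img_stems - lbl_stems), sorted(lbl_stems - img_stems)


# Example: one image lacks its label file.
missing, orphans = check_pairing(
    ["image_00001.jpg", "image_00002.jpg"],
    ["image_00001.txt"],
)
```

In a real checkout, the name lists would come from globbing the images/ and labels/ subdirectories of each split.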

All major SCB-datasets are distributed for research/public use, typically under MIT or CC BY(-NC-SA) licenses. Users are directed to the specific GitHub repository for licensing details, dependencies, and training/evaluation scripts (Yang et al., 2023, Yang et al., 2023, Yang, 2023, Yang et al., 2023).

5. Extensions Beyond Classroom Behavior

5.1 Scene Text Editing: SCB Synthesis Dataset

The "SCB-dataset" in the context of scene text editing denotes the SCB Synthesis dataset (Bao et al., 17 Nov 2025), constructed around the notion of an SCB Group—eight images formed by the combinatorial crossing of two distinct styles, contents, and backgrounds. Each synthetic text image is generated as $I = \mathcal{G}(S, C, B)$, supporting explicit disentanglement of text style, content, and background for robust scene text editing via the TripleFDS framework.

The dataset comprises 1,000,000 training images (125,000 groups) and 80,000 validation images (10,000 groups), spanning:

  • Styles: 500 clustered fonts, multiple color and geometric transforms
  • Contents: 95-character set, string length 3–14
  • Backgrounds: Crops from the SceneVTG-Erase pool, two fusion modes (splicing, Poisson blending)

Attributes and ground truths (text string, style ID, background ID, fusion type) are tracked per image/group. The dataset explicitly supports contrastive and orthogonality-based feature disentanglement for state-of-the-art editing performance.
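The SCB Group construction (2 styles × 2 contents × 2 backgrounds = 8 images) can be sketched as a simple enumeration; the function below is a schematic stand-in for the generator $\mathcal{G}$, and the attribute labels are placeholders.

```python
from itertools import product


def scb_group(styles, contents, backgrounds):
    """Enumerate the eight (style, content, background) triples of one SCB Group.

    Each triple would be rendered into one synthetic image by the generator.
    """
    assert len(styles) == 2 and len(contents) == 2 and len(backgrounds) == 2
    return list(product(styles, contents, backgrounds))


# Example with placeholder attribute IDs: yields all 8 combinations.
group = scb_group(("s1", "s2"), ("c1", "c2"), ("b1", "b2"))
```

Because every group covers the full combinatorial cross, any two images within a group differ in a known subset of attributes, which is what makes contrastive and orthogonality-based disentanglement objectives well-posed.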

5.2 English–Thai Parallel Corpus: SCB-mt-en-th-2020

In computational linguistics, SCB-dataset refers to a large bilingual resource of 1,056,743 parallel English–Thai sentence pairs (Lowphansirikul et al., 2020). It draws from diverse domains (news, Wikipedia, SMS, task dialogue, web-crawled data, government documents), with rigorous normalization, length/script filtering, and stratified data splits (train/val/test). Released under CC-BY-SA 4.0, SCB supports both neural and statistical MT benchmarks, outperforming Google Translate when combined with OPUS data and serving as the current standard for English–Thai machine translation evaluation.
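The corpus construction involves length and script filtering; a plausible filter in that spirit keeps a pair only when both sides are non-empty and within a length budget and the Thai side is predominantly Thai-script. The thresholds below are illustrative assumptions, not the values used in the original pipeline.

```python
def keep_pair(en, th, max_chars=500, min_thai_ratio=0.5):
    """Hypothetical length/script filter for an English-Thai sentence pair.

    Rejects empty or overlong sides, and Thai sides whose fraction of
    characters in the Thai Unicode block (U+0E00-U+0E7F) is too low.
    """
    if not en.strip() or not th.strip():
        return False
    if len(en) > max_chars or len(th) > max_chars:
        return False
    thai_chars = sum(1 for ch in th if "\u0e00" <= ch <= "\u0e7f")
    return thai_chars / len(th) >= min_thai_ratio


# Example: a well-formed pair passes; a pair with a Latin-only "Thai" side fails.
ok = keep_pair("Hello world.", "สวัสดีชาวโลก")
bad = keep_pair("Hello world.", "Hello world.")
```

Filters of this kind remove misaligned or mis-crawled pairs before the stratified train/val/test split is drawn.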

5.3 Cultural Visual Reasoning: Seeing Culture Benchmark (SCB)

The "Seeing Culture Benchmark" (SCB) (Satar et al., 20 Sep 2025) is a multimodal vision-language resource evaluating cross-cultural reasoning via two-stage protocols (VQA and segmentation) over 1,065 images from seven Southeast Asian countries. It explicitly focuses on cultural artifacts across music, dance, games, celebration, and wedding contexts, annotated with polygonal evidence masks and question rationales. This SCB is publicly released for zero-shot VLM benchmarking.

6. Methodological Notes and Research Impact

SCB-datasets catalyze advances in domain-specific machine learning by providing richly annotated, large-scale, and research-permissive corpora. In student behavior analysis, their release has enabled direct performance comparisons among detection, transformer, and video architectures under controlled yet realistic, high-occlusion scenarios. The SCB Synthesis dataset has advanced disentangled representation learning in scene text editing, while the SCB-mt-en-th-2020 corpus underpins competitive neural machine translation for low-resource language pairs.

Several datasets introduce novel metrics to quantify domain-specific challenges (e.g., Behavior Similarity Index for action visual overlap (Yang et al., 2023, Yang et al., 2023)), and their composite structure (class splits, groupings) enables ablation studies on imbalance and feature generalization.

7. Limitations, Recommendations, and Future Directions

  • Most SCB-datasets exhibit significant class imbalance; recommended mitigations include class-balanced sampling, focal loss, and instance weighting (Yang, 2023, Yang et al., 2023, Wang et al., 10 Oct 2024).
  • SCB resources are often limited to static frames; several works note the need for additional temporal (video) annotation, segmentation, and keypoint labels to move beyond per-frame recognition (Yang, 2023, Yang et al., 2023).
  • Domain restriction is common: classroom behavior datasets are predominantly Chinese, with generalizability to other educational contexts untested (Yang, 2023).
  • For scene text editing, the main limitation of SCB Synthesis is the synthetic-real gap and the absence of multilingual/3D text augmentation (Bao et al., 17 Nov 2025).
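One common mitigation listed above, class-balanced weighting, can be sketched as inverse-frequency weights normalized so the average per-sample weight is 1. This is a generic recipe, not the exact scheme used in any of the cited works.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class loss weights inversely proportional to class frequency.

    Normalized so that the mean weight over all samples equals 1, which
    keeps the overall loss scale comparable to unweighted training.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return {c: total / (n_classes * cnt) for c, cnt in counts.items()}


# Example: a 3:1 imbalanced label list gives the rare class a 3x larger weight.
weights = inverse_frequency_weights([0, 0, 0, 1])
```

The resulting dictionary can be passed as per-class weights to a cross-entropy or focal loss, or converted into per-sample probabilities for balanced sampling.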

Ongoing work seeks to address these deficiencies by expanding label schemas, integrating temporal and keypoint data, and releasing standard train/val/test benchmarks for robust and reproducible comparison (Yang et al., 2023, Yang, 2023, Bao et al., 17 Nov 2025).

