
CHIMERA Dataset Overview

Updated 17 March 2026
  • "CHIMERA" denotes a family of synthetic benchmarks spanning LLM reasoning, compositional image synthesis, and insider threat simulation for robust ML evaluation.
  • Its reasoning component employs an advanced three-stage LLM-driven pipeline to generate, validate, and label challenging chain-of-thought problems.
  • The image and insider threat datasets offer fine-grained compositional control and realistic multi-modal logs, advancing research in vision and security domains.

CHIMERA refers to several distinct datasets in the machine learning and security literature, each addressing specific challenges in synthetic data generation for reasoning, compositional image synthesis, and insider threat simulation. The term encompasses (1) the CHIMERA reasoning dataset for LLMs (Zhu et al., 1 Mar 2026), (2) the Chimera compositional image dataset (Singh et al., 20 Oct 2025), and (3) the ChimeraLog multi-agent insider threat simulation dataset (Yu et al., 11 Aug 2025). Each is notable for its methodological rigor, construction pipeline, and targeted use cases.

1. CHIMERA Reasoning Dataset for LLMs

The CHIMERA reasoning dataset is a compact synthetic dataset comprising 9,225 samples intended to bootstrap and generalize the reasoning capabilities of LLMs (Zhu et al., 1 Mar 2026). Primarily designed to overcome three data-centric bottlenecks—cold-start data scarcity, limited domain coverage, and expert annotation cost—it consists of richly annotated multi-step chain-of-thought (CoT) trajectories automatically generated and validated using LLMs.

Construction Pipeline

CHIMERA's construction is orchestrated in a fully automated three-stage LLM-driven synthesis pipeline:

  1. Subject Expansion: Eight coarse subjects (mathematics, physics, computer science, chemistry, biology, history, literature, linguistics) are expanded into 1,179 unique fine-grained topics using a state-of-the-art LLM (gpt-5). The hierarchy is two-level: subject → topic.
  2. Problem Generation and Validation: For each topic, gpt-5 drafts PhD-level reasoning questions and answers, ensuring they are self-contained, unambiguous, and verifiable. Two independent validators (gpt-5 and o4-mini) filter for well-posedness and correctness, discarding incoherent or invalid items.
  3. Solution Synthesis and Labeling: Qwen3-235B-A22B-Thinking-2507 generates detailed CoT solutions. An LLM-based correctness verifier compares the generated CoT's final answer to the reference, labeling each reasoning trajectory as correct (y=1) or not (y=0). Correct CoTs are used for supervised fine-tuning, while unsolved problems are reserved for reinforcement learning.
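The three stages above can be sketched as a single loop. This is a hypothetical outline, not the authors' implementation: `generate`, `draft`, the validators, `solve`, and `verify` are placeholder callables standing in for the LLM calls (gpt-5, o4-mini, Qwen3-235B-A22B-Thinking-2507) described in the paper.

```python
# Hypothetical sketch of the three-stage CHIMERA synthesis pipeline.
# All callables are stand-ins for the LLM API calls named in the text.

def expand_topics(subjects, generate):
    """Stage 1: expand coarse subjects into fine-grained topics."""
    return {s: generate(f"List fine-grained topics in {s}") for s in subjects}

def build_dataset(topics, draft, validators, solve, verify):
    """Stages 2-3: draft, dually validate, solve, and label problems."""
    dataset = []
    for topic in topics:
        problem = draft(topic)                       # gpt-5 drafts Q&A
        if not all(v(problem) for v in validators):  # both validators must agree
            continue
        cot = solve(problem)                         # long-form CoT solution
        label = 1 if verify(problem, cot) else 0     # correctness verifier
        dataset.append({"topic": topic, "problem": problem,
                        "cot": cot, "label": label})
    return dataset
```

In this scheme, records with label 1 feed supervised fine-tuning, while rejected or unsolved problems can be routed to the reinforcement-learning pool.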

Domain and Taxonomy Coverage

CHIMERA aims for systematic coverage across major scientific domains. The subject breakdown is as follows:

| Subject | Proportion (%) | Example Topics |
|---|---|---|
| Mathematics | 48.3 | Additive Combinatorics, p-adic Hodge Theory, Algebraic Topology, etc. |
| Computer Science | 15 | Software systems, ML, theory |
| Chemistry | 12 | Organic, inorganic, physical |
| Physics | 10 | Classical, quantum, astro |
| Other (biology, history, literature, linguistics) | ~14 | Niche subfields |

All prompts and CoTs are free-form and synthetically annotated; no human annotation is involved.

Automated Evaluation and Contamination Checks

Quality control is strictly model-based:

  • Dual validators (V₁: gpt-5, V₂: o4-mini) must agree for acceptance.
  • A correctness verifier (o4-mini) labels solution trajectories.
  • To prevent benchmark contamination, n-gram Jaccard overlaps between CHIMERA and evaluation benchmarks (e.g., GPQA-Diamond, HLE) are computed, using:

\mathrm{Score}_n = \frac{1}{|\mathcal{T}|} \sum_{t\in \mathcal{T}} \max_{s\in\mathcal{S}} \frac{|G_n(t)\cap G_n(s)|}{|G_n(t)\cup G_n(s)|}

For GPQA-Diamond and HLE, 8-gram and 13-gram overlaps are essentially zero, indicating negligible leakage.
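The score above is straightforward to compute directly. In this sketch, one corpus plays the role of the training set S and the other the benchmark set T; tokenization by whitespace is an assumption, as the paper's exact preprocessing is not specified here.

```python
# n-gram Jaccard contamination score: for each benchmark item, take the
# maximum Jaccard overlap with any training item, then average.

def ngrams(text, n):
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(train_set, bench_set, n):
    total = 0.0
    for t in bench_set:
        gt = ngrams(t, n)
        best = 0.0
        for s in train_set:
            gs = ngrams(s, n)
            if gt or gs:
                best = max(best, len(gt & gs) / len(gt | gs))
        total += best
    return total / len(bench_set)
```

A score near zero at n = 8 or n = 13, as reported for GPQA-Diamond and HLE, indicates no long shared token spans between the two corpora.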

Dataset Characteristics and Difficulty

  • Number of problems: 9,225
  • Subjects: 8; Topics: 1,179
  • Avg. prompt length: 211.1 words
  • Avg. CoT length: 1,121.4 words

Data difficulty is high: Qwen3-4B-Thinking-2507 achieves only 37.5% on CHIMERA (vs. ~88% on previous synthetic sets), affirming utility for robust generalization.

Impact on Model Performance

Post-training Qwen3-4B-Thinking-2507 on CHIMERA yields notable benchmark improvements:

| Model | GPQA-D | AIME24 | HMMT Feb'25 | HLE |
|---|---|---|---|---|
| Qwen3-4B-base | 65.8 | 81.6 | 59.2 | 7.3 |
| Qwen3-4B-base + CHIMERA | 70.1 | 86.9 | 65.7 | 9.0 |

Compared to larger models (e.g., DeepSeek-R1 235B), the 4B model post-trained on CHIMERA approaches or matches their performance, supporting the efficacy of the dataset's synthesis and validation strategies.

2. CHIMERA Compositional Image Synthesis Dataset

The CHIMERA dataset for compositional image generation supports the training and evaluation of models composing novel objects from semantically meaningful parts, termed "semantic atoms" (Singh et al., 20 Oct 2025). The goal is to enable explicit, fine-grained control over the assembly of object parts from multiple categories without the necessity for user-specified masks.

Semantic Atom Taxonomy

  • Part: A localized, semantically meaningful component (e.g., "wheel", "beak").
  • Subject: The object or species contributing the part (e.g., "sports car", "lion").
  • Semantic atom: Ordered pair (part, subject), e.g., "tail_of_lion".

The dataset spans six high-level domains:

| Domain | # Parts | # Subjects | Atom Examples |
|---|---|---|---|
| Creatures | 8 | 19 | "tail_of_lion" |
| Vehicles | 8 | 14 | "wheel_of_motorcycle" |
| Furniture | 8 | 6 | "leg_of_stool" |
| Plants | 8 | 12 | "leaf_of_palm" |
| Electronics | 8 | 10 | "screen_of_laptop" |
| Instruments | 8 | 7 | "key_of_piano" |
  • Total unique semantic atoms: 464.
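A semantic atom is just an ordered (part, subject) pair; the `part_of_subject` string convention mirrors the examples in the table. This tiny helper is illustrative only:

```python
# Encode a (part, subject) pair as a semantic-atom identifier,
# following the "part_of_subject" naming seen in the dataset examples.

def make_atom(part, subject):
    return f"{part}_of_{subject.replace(' ', '_')}"
```

For multi-word subjects the spaces are replaced with underscores, e.g. "sports car" yields `wheel_of_sports_car`.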

Data Generation Pipeline

  1. Prompt Sampling: For each example, K distinct semantic atoms (K ∈ {2, 3, 4}) are sampled uniformly. Prompts are generated using domain-specific templates (e.g., "A creature with...").
  2. Image Synthesis: Images (1024x1024 px) are generated using HiDream-I1-Full (open-source diffusion transformer) with deterministic prompt-based seeding. All 37,000 images generated are retained.
  3. Mask Annotation: For each image, Florence v2 or an internal annotator extracts per-part binary masks at the atom level.
  4. Schema: Each record includes the prompt, K-atom list, image file, binary masks, and part/subject labels.

No explicit train/val/test splits are specified; all data is used for training the diffusion prior.
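The prompt-sampling step (stage 1 above) can be sketched as follows. The template string and record fields are assumptions for illustration; only the uniform draw of K ∈ {2, 3, 4} distinct atoms comes from the source.

```python
import random

# Hypothetical sketch of prompt sampling: draw K distinct semantic atoms
# uniformly and slot them into a domain-specific template.

def sample_prompt(atoms, rng, template="A creature with {parts}"):
    k = rng.choice([2, 3, 4])          # K ranges over {2, 3, 4}
    chosen = rng.sample(atoms, k)      # K distinct semantic atoms
    parts = ", ".join(a.replace("_", " ") for a in chosen)
    return {"atoms": chosen, "prompt": template.format(parts=parts)}
```

Each resulting record would then be paired with the generated image and its per-atom masks to complete the schema in step 4.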

Part-Conditional Conditioning

During training, each part crop is embedded via IP-Adapter into the IP+ space. Part embeddings c₁…c_K, along with the textual prompt, are inputs. The loss includes a multi-part prior term:

L_{\mathrm{prior}} = \mathbb{E}_{e, y, t} \left\| e - P_\theta(e_t, t, y, c_1,\dots,c_K) \right\|^2

No mask-IoU or part-perceptual losses are introduced.
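Stripped of the diffusion machinery, the prior term above is a mean squared error between the target embedding e and the prediction of P_theta given the noised embedding, timestep, prompt, and part embeddings. A minimal sketch of just the loss, with the network left abstract:

```python
# Mean squared error over embedding dimensions, as in L_prior above.
# The prediction e_pred stands for P_theta(e_t, t, y, c_1..c_K).

def prior_loss(e, e_pred):
    return sum((a - b) ** 2 for a, b in zip(e, e_pred)) / len(e)
```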

Data Access

  • Data/files: PNG images, masks, per-record JSON annotations.
  • Directory structure is not strictly specified in the reference.
  • The dataset is publicly shareable under CC-BY-4.0; links to code and data are made available post-publication.

3. ChimeraLog Multi-Agent Insider Threat Dataset

ChimeraLog is a synthetic insider-threat detection dataset generated using a multi-agent LLM framework simulating both benign and malicious enterprise activities (Yu et al., 11 Aug 2025). The simulation captures realistic organizational dynamics, producing large-scale, labeled, and diverse behavioral logs for use in ITD research.

Simulation Framework

  • Each "organization under simulation" (OUS) comprises 20 employee agents, each an "agent bundle" of two LLMs (User Agent for scheduling/communication; Assistant Agent for tool execution/memory).
  • Agents have structured profiles: role, personality traits, responsibilities, and domain knowledge.
  • Behavioral modules: meeting scheduler, pairwise interaction, dynamic scheduling (solving a time-allocation knapsack problem per agent per day).
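The per-agent, per-day time-allocation knapsack can be sketched with a small dynamic program. The activity tuples, priority scores, and DP formulation here are illustrative assumptions; the source only states that scheduling is posed as a knapsack problem.

```python
# 0/1 knapsack over a day's candidate activities: pick the subset that
# maximizes total priority without exceeding the agent's time budget.

def schedule_day(activities, budget):
    """activities: list of (name, minutes, priority); budget in minutes."""
    best = {0: (0, [])}                       # minutes used -> (score, picks)
    for name, minutes, priority in activities:
        for used, (score, picks) in list(best.items()):
            t = used + minutes
            if t <= budget and (t not in best or best[t][0] < score + priority):
                best[t] = (score + priority, picks + [name])
    return max(best.values())[1]              # picks of the best-scoring plan
```

Running each agent's scheduler daily yields the varied, load-constrained activity traces that drive the simulated logs.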

Malicious Behavior Modeling

  • Agents are benign or assigned as Traitors, Masqueraders, or Unintentional Insiders:
    • Each malicious agent is assigned a 5W1H-style attack profile describing operational tactics (see 1.3 in (Yu et al., 11 Aug 2025)).
    • 15 insider attack types, covering atomic and hybrid (multi-vector) scenarios.

Data Modalities and Schema

| Modality | Example Fields |
|---|---|
| Login Records | timestamp, user_id, host, success |
| Email Communications | message_id, timestamp, from, to, subject, body |
| Web History | timestamp, user_id, url, category |
| File Operations | timestamp, user_id, file_path, op_type, result |
| Network Traffic | standard PCAP (packets: src/dst IP, ports, lengths, times) |
| System Calls | timestamp, container_id, pid, syscall_name, args, return_value |

Metadata is provided in agents.json, scenario.json, and attacks.json.
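As a concrete illustration of the schema table, two example records are shown below. The field names follow the table; the exact value types and the one-JSON-object-per-line serialization are assumptions.

```python
import json

# Illustrative records for two modalities; fields follow the schema table.
login = {"timestamp": "2025-08-11T09:00:00Z", "user_id": "u017",
         "host": "ws-17", "success": True}
file_op = {"timestamp": "2025-08-11T09:05:00Z", "user_id": "u017",
           "file_path": "/srv/share/report.docx", "op_type": "read",
           "result": "ok"}

line = json.dumps(login)   # logs are commonly serialized one JSON object per line
```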

Statistical Overview

  • Three organizational domains: Technology, Finance, Medical
  • Simulation: 30 days per domain, 20 agents per org
  • Event counts (aggregate):
    • Application-level: 2.0B
    • Network packets: 4.5B
    • System calls: 18.2B
    • Total: ≈25B events
  • Malicious events: ≈20% (5B), approx. uniformly across 15 attack types (~333M/type)

Labeling, Splits, and Human/Quantitative Evaluation

  • All agent-generated entries are automatically labeled as benign or malicious according to agent profiles; no manual annotation is conducted.
  • Splits: stratified 80/10/10% (train/val/test), maintaining benign/malicious ratio.
  • Human expert surveys rate ChimeraLog (4.20/5 realism) as comparable to TWOS (4.25), significantly above CERT (1.78).
  • Detection experiments: F1 on ChimeraLog is 0.83 (vs. 0.99 for CERT), with pronounced domain-shift effects in cross-domain transfer.
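A stratified 80/10/10 split as described above can be produced by partitioning each label's events separately, so the benign/malicious ratio carries over to every partition. This is a generic sketch, not the authors' splitting code:

```python
import random

# Stratified 80/10/10 split: shuffle within each label bucket, then cut
# each bucket by the target fractions so class ratios are preserved.

def stratified_split(events, labels, seed=0, fracs=(0.8, 0.1, 0.1)):
    rng = random.Random(seed)
    buckets = {}
    for ev, lab in zip(events, labels):
        buckets.setdefault(lab, []).append(ev)
    train, val, test = [], [], []
    for items in buckets.values():
        rng.shuffle(items)
        n = len(items)
        a, b = int(n * fracs[0]), int(n * (fracs[0] + fracs[1]))
        train += items[:a]; val += items[a:b]; test += items[b:]
    return train, val, test
```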

Usage Guidelines

  • Standard feature extraction: event counts and timing per user-day across modalities.
  • Evaluation: within-domain and cross-domain, using F1 metric as primary.
  • Baselines: SVM (RBF), 1D-CNN, GCN, DS-IID.
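The standard feature extraction noted above reduces to counting events per (user, day) across modalities, yielding one feature vector per user-day. A minimal sketch, with the event-tuple shape as an assumption:

```python
from collections import Counter

# Per-user-day modality counts: the basic feature representation for
# insider-threat detection baselines.

def user_day_counts(events):
    """events: iterable of (user_id, day, modality) tuples."""
    feats = {}
    for user, day, modality in events:
        feats.setdefault((user, day), Counter())[modality] += 1
    return feats
```

Each resulting Counter can be flattened into a fixed-order vector (plus timing statistics, if desired) and fed to the listed baselines.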

4. Comparative Dataset Properties

| CHIMERA Dataset | Domain | Modality/Schema | Size/Scale | Annotation | Access/License |
|---|---|---|---|---|---|
| Reasoning (Zhu et al., 1 Mar 2026) | LLM reasoning | QA-CoT samples, 1,121.4-word avg. CoT | 9,225 samples; 8 subjects, 1,179 topics | LLM-validated | Synthetic, open (CC-BY) |
| Compositional Image (Singh et al., 20 Oct 2025) | Vision | Images with masks, 464 semantic atoms | 37,000 images; 6 domains, 2-4 parts/img | Synthetic | Synthetic, open (CC-BY-4.0) |
| Insider Threat (Yu et al., 11 Aug 2025) | Security/ITD | 6 log modalities, labeled attacks | ≈25B events; 3 orgs, 30 days, 20 agents | Simulation-based | Synthetic, open |

A plausible implication is that the CHIMERA nomenclature is used for datasets that emphasize compositional diversity, synthetic provenance, and automated, scalable evaluation.

5. Significance and Research Impact

The CHIMERA datasets have advanced state-of-the-art data generation in their respective domains:

  • In LLM reasoning, CHIMERA demonstrates that compact yet challenging synthetic datasets with long CoTs can enable smaller models to approach large-model performance, especially when post-training combines supervised and RL methods (Zhu et al., 1 Mar 2026).
  • In vision, the compositional CHIMERA dataset facilitates part-level compositionality, enabling robust benchmarking of part-based generative controls (Singh et al., 20 Oct 2025).
  • In security, ChimeraLog sets a new bar for multi-agent, multi-modal, labeled, and diverse insider-threat simulation, supporting realistic detection research and avoiding overfitting documented with legacy datasets like CERT (Yu et al., 11 Aug 2025).

The methodological commonality is the exclusive reliance on automated, model-based data synthesis and validation, scalable annotation pipelines, and explicit focus on coverage and dataset difficulty. This suggests that "CHIMERA" is emerging as a label for datasets that are synthetic, compositional, and designed for resilient model generalization.
