
CHIMERA Dataset Overview

Updated 17 March 2026
  • "CHIMERA" denotes a family of synthetic benchmarks spanning LLM reasoning, compositional image synthesis, and insider threat simulation for robust ML evaluation.
  • Its reasoning component employs an advanced three-stage LLM-driven pipeline to generate, validate, and label challenging chain-of-thought problems.
  • The image and insider threat datasets offer fine-grained compositional control and realistic multi-modal logs, advancing research in vision and security domains.

CHIMERA refers to several distinct datasets in the machine learning and security literature, each addressing specific challenges in synthetic data generation for reasoning, compositional image synthesis, and insider threat simulation. The term encompasses (1) the CHIMERA reasoning dataset for LLMs (Zhu et al., 1 Mar 2026), (2) the Chimera compositional image dataset (Singh et al., 20 Oct 2025), and (3) the ChimeraLog multi-agent insider threat simulation dataset (Yu et al., 11 Aug 2025). Each is notable for its methodological rigor, construction pipeline, and targeted use cases.

1. CHIMERA Reasoning Dataset for LLMs

The CHIMERA reasoning dataset is a compact synthetic dataset comprising 9,225 samples intended to bootstrap and generalize the reasoning capabilities of LLMs (Zhu et al., 1 Mar 2026). Primarily designed to overcome three data-centric bottlenecks—cold-start data scarcity, limited domain coverage, and expert annotation cost—it consists of richly annotated multi-step chain-of-thought (CoT) trajectories automatically generated and validated using LLMs.

Construction Pipeline

CHIMERA's construction is orchestrated in a fully automated three-stage LLM-driven synthesis pipeline:

  1. Subject Expansion: Eight coarse subjects (mathematics, physics, computer science, chemistry, biology, history, literature, linguistics) are expanded into 1,179 unique fine-grained topics using a state-of-the-art LLM (gpt-5). The hierarchy is two-level: subject → topic.
  2. Problem Generation and Validation: For each topic, gpt-5 drafts PhD-level reasoning questions and answers, ensuring they are self-contained, unambiguous, and verifiable. Two independent validators (gpt-5 and o4-mini) filter for well-posedness and correctness, discarding incoherent or invalid items.
  3. Solution Synthesis and Labeling: Qwen3-235B-A22B-Thinking-2507 generates detailed CoT solutions. An LLM-based correctness verifier compares the generated CoT's final answer to the reference, labeling each reasoning trajectory as correct (y=1) or not (y=0). Correct CoTs are used for supervised fine-tuning, while unsolved problems are reserved for reinforcement learning.
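The three stages above can be sketched as a single loop. This is a hypothetical outline, not the authors' implementation: `generate`, `draft`, the validators, `solve`, and `verify` are placeholder callables standing in for the LLM calls (gpt-5, o4-mini, Qwen3-235B-A22B-Thinking-2507) described in the paper.

```python
# Hypothetical sketch of the three-stage CHIMERA synthesis pipeline.
# All callables are stand-ins for the LLM API calls named in the text.

def expand_topics(subjects, generate):
    """Stage 1: expand coarse subjects into fine-grained topics."""
    return {s: generate(f"List fine-grained topics in {s}") for s in subjects}

def build_dataset(topics, draft, validators, solve, verify):
    """Stages 2-3: draft, dually validate, solve, and label problems."""
    dataset = []
    for topic in topics:
        problem = draft(topic)                       # gpt-5 drafts Q&A
        if not all(v(problem) for v in validators):  # both validators must agree
            continue
        cot = solve(problem)                         # long-form CoT solution
        label = 1 if verify(problem, cot) else 0     # correctness verifier
        dataset.append({"topic": topic, "problem": problem,
                        "cot": cot, "label": label})
    return dataset
```

In this scheme, records with label 1 feed supervised fine-tuning, while rejected or unsolved problems can be routed to the reinforcement-learning pool.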

Domain and Taxonomy Coverage

CHIMERA aims for systematic coverage across major scientific domains. The subject breakdown is as follows:

| Subject | Proportion (%) | Example Topics |
|---|---|---|
| Mathematics | 48.3 | Additive Combinatorics, p-adic Hodge Theory, Algebraic Topology, etc. |
| Computer Science | 15 | Software systems, ML, theory |
| Chemistry | 12 | Organic, inorganic, physical |
| Physics | 10 | Classical, quantum, astro |
| Other (biology, history, literature, linguistics) | ~14 | Niche subfields |

All prompts and CoTs are free-form and synthetically annotated; no human annotation is involved.

Automated Evaluation and Contamination Checks

Quality control is strictly model-based:

  • Dual validators (V₁: gpt-5, V₂: o4-mini) must agree for acceptance.
  • A correctness verifier (o4-mini) labels solution trajectories.
  • To prevent benchmark contamination, n-gram Jaccard overlaps between CHIMERA and evaluation benchmarks (e.g., GPQA-Diamond, HLE) are computed, using:

\mathrm{Score}_n = \frac{1}{|\mathcal{T}|} \sum_{t\in \mathcal{T}} \max_{s\in\mathcal{S}} \frac{|G_n(t)\cap G_n(s)|}{|G_n(t)\cup G_n(s)|}

For GPQA-Diamond and HLE, 8-gram and 13-gram overlaps are essentially zero, indicating negligible leakage.
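The score above is straightforward to compute directly. In this sketch, one corpus plays the role of the training set S and the other the benchmark set T; tokenization by whitespace is an assumption, as the paper's exact preprocessing is not specified here.

```python
# n-gram Jaccard contamination score: for each benchmark item, take the
# maximum Jaccard overlap with any training item, then average.

def ngrams(text, n):
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(train_set, bench_set, n):
    total = 0.0
    for t in bench_set:
        gt = ngrams(t, n)
        best = 0.0
        for s in train_set:
            gs = ngrams(s, n)
            if gt or gs:
                best = max(best, len(gt & gs) / len(gt | gs))
        total += best
    return total / len(bench_set)
```

A score near zero at n = 8 or n = 13, as reported for GPQA-Diamond and HLE, indicates no long shared token spans between the two corpora.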

Dataset Characteristics and Difficulty

  • Number of problems: 9,225
  • Subjects: 8; Topics: 1,179
  • Avg. prompt length: 211.1 words
  • Avg. CoT length: 1,121.4 words

Data difficulty is high: Qwen3-4B-Thinking-2507 achieves only 37.5% on CHIMERA (vs. ~88% on previous synthetic sets), affirming utility for robust generalization.

Impact on Model Performance

Post-training Qwen3-4B-Thinking-2507 on CHIMERA yields notable benchmark improvements:

| Model | GPQA-D | AIME24 | HMMT Feb'25 | HLE |
|---|---|---|---|---|
| Qwen3-4B-base | 65.8 | 81.6 | 59.2 | 7.3 |
| Qwen3-4B-base + CHIMERA | 70.1 | 86.9 | 65.7 | 9.0 |

Compared to larger models (e.g., DeepSeek-R1 235B), the 4B model post-trained on CHIMERA approaches or matches their performance, supporting the efficacy of the dataset's synthesis and validation strategies.

2. CHIMERA Compositional Image Synthesis Dataset

The CHIMERA dataset for compositional image generation supports the training and evaluation of models composing novel objects from semantically meaningful parts, termed "semantic atoms" (Singh et al., 20 Oct 2025). The goal is to enable explicit, fine-grained control over the assembly of object parts from multiple categories without the necessity for user-specified masks.

Semantic Atom Taxonomy

  • Part: A localized, semantically meaningful component (e.g., "wheel", "beak").
  • Subject: The object or species contributing the part (e.g., "sports car", "lion").
  • Semantic atom: Ordered pair (part, subject), e.g., "tail_of_lion".

The dataset spans six high-level domains:

| Domain | # Parts | # Subjects | Atom Examples |
|---|---|---|---|
| Creatures | 8 | 19 | "tail_of_lion" |
| Vehicles | 8 | 14 | "wheel_of_motorcycle" |
| Furniture | 8 | 6 | "leg_of_stool" |
| Plants | 8 | 12 | "leaf_of_palm" |
| Electronics | 8 | 10 | "screen_of_laptop" |
| Instruments | 8 | 7 | "key_of_piano" |
  • Total unique semantic atoms: 464.
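A semantic atom is just an ordered (part, subject) pair; the `part_of_subject` string convention mirrors the examples in the table. This tiny helper is illustrative only:

```python
# Encode a (part, subject) pair as a semantic-atom identifier,
# following the "part_of_subject" naming seen in the dataset examples.

def make_atom(part, subject):
    return f"{part}_of_{subject.replace(' ', '_')}"
```

For multi-word subjects the spaces are replaced with underscores, e.g. "sports car" yields `wheel_of_sports_car`.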

Data Generation Pipeline

  1. Prompt Sampling: For each example, K distinct semantic atoms (K ∈ {2, 3, 4}) are sampled uniformly. Prompts are generated using domain-specific templates (e.g., "A creature with...").
  2. Image Synthesis: Images (1024x1024 px) are generated using HiDream-I1-Full (open-source diffusion transformer) with deterministic prompt-based seeding. All 37,000 images generated are retained.
  3. Mask Annotation: For each image, Florence v2 or an internal annotator extracts per-part binary masks at the atom level.
  4. Schema: Each record includes the prompt, K-atom list, image file, binary masks, and part/subject labels.

No explicit train/val/test splits are specified; all data is used for training the diffusion prior.
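The prompt-sampling step (stage 1 above) can be sketched as follows. The template string and record fields are assumptions for illustration; only the uniform draw of K ∈ {2, 3, 4} distinct atoms comes from the source.

```python
import random

# Hypothetical sketch of prompt sampling: draw K distinct semantic atoms
# uniformly and slot them into a domain-specific template.

def sample_prompt(atoms, rng, template="A creature with {parts}"):
    k = rng.choice([2, 3, 4])          # K ranges over {2, 3, 4}
    chosen = rng.sample(atoms, k)      # K distinct semantic atoms
    parts = ", ".join(a.replace("_", " ") for a in chosen)
    return {"atoms": chosen, "prompt": template.format(parts=parts)}
```

Each resulting record would then be paired with the generated image and its per-atom masks to complete the schema in step 4.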

Part-Conditional Conditioning

During training, each part crop is embedded via IP-Adapter into the IP+ space. Part embeddings c₁…c_K, along with the textual prompt, are inputs. The loss includes a multi-part prior term:

L_{\mathrm{prior}} = \mathbb{E}_{e, y, t} \left\| e - P_\theta(e_t, t, y, c_1,\dots,c_K) \right\|^2

No mask-IoU or part-perceptual losses are introduced.
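Stripped of the diffusion machinery, the prior term above is a mean squared error between the target embedding e and the prediction of P_theta given the noised embedding, timestep, prompt, and part embeddings. A minimal sketch of just the loss, with the network left abstract:

```python
# Mean squared error over embedding dimensions, as in L_prior above.
# The prediction e_pred stands for P_theta(e_t, t, y, c_1..c_K).

def prior_loss(e, e_pred):
    return sum((a - b) ** 2 for a, b in zip(e, e_pred)) / len(e)
```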

Data Access

  • Data/files: PNG images, masks, per-record JSON annotations.
  • Directory structure is not strictly specified in the reference.
  • The dataset is publicly shareable under CC-BY-4.0; links to code and data are made available post-publication.

3. ChimeraLog Multi-Agent Insider Threat Dataset

ChimeraLog is a synthetic insider-threat detection dataset generated using a multi-agent LLM framework simulating both benign and malicious enterprise activities (Yu et al., 11 Aug 2025). The simulation captures realistic organizational dynamics, producing large-scale, labeled, and diverse behavioral logs for use in ITD research.

Simulation Framework

  • Each "organization under simulation" (OUS) comprises 20 employee agents, each an "agent bundle" of two LLMs (User Agent for scheduling/communication; Assistant Agent for tool execution/memory).
  • Agents have structured profiles: role, personality traits, responsibilities, and domain knowledge.
  • Behavioral modules: meeting scheduler, pairwise interaction, dynamic scheduling (solving a time-allocation knapsack problem per agent per day).
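The per-agent, per-day time-allocation knapsack can be sketched with a small dynamic program. The activity tuples, priority scores, and DP formulation here are illustrative assumptions; the source only states that scheduling is posed as a knapsack problem.

```python
# 0/1 knapsack over a day's candidate activities: pick the subset that
# maximizes total priority without exceeding the agent's time budget.

def schedule_day(activities, budget):
    """activities: list of (name, minutes, priority); budget in minutes."""
    best = {0: (0, [])}                       # minutes used -> (score, picks)
    for name, minutes, priority in activities:
        for used, (score, picks) in list(best.items()):
            t = used + minutes
            if t <= budget and (t not in best or best[t][0] < score + priority):
                best[t] = (score + priority, picks + [name])
    return max(best.values())[1]              # picks of the best-scoring plan
```

Running each agent's scheduler daily yields the varied, load-constrained activity traces that drive the simulated logs.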

Malicious Behavior Modeling

  • Agents are benign or assigned as Traitors, Masqueraders, or Unintentional Insiders:
    • Each malicious agent is assigned a 5W1H-style attack profile describing operational tactics (see 1.3 in (Yu et al., 11 Aug 2025)).
    • 15 insider attack types, covering atomic and hybrid (multi-vector) scenarios.

Data Modalities and Schema

| Modality | Example Fields |
|---|---|
| Login Records | timestamp, user_id, host, success |
| Email Communications | message_id, timestamp, from, to, subject, body |
| Web History | timestamp, user_id, url, category |
| File Operations | timestamp, user_id, file_path, op_type, result |
| Network Traffic | standard PCAP (packets: src/dst IP, ports, lengths, times) |
| System Calls | timestamp, container_id, pid, syscall_name, args, return_value |

Metadata is provided in agents.json, scenario.json, and attacks.json.
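As a concrete illustration of the schema table, two example records are shown below. The field names follow the table; the exact value types and the one-JSON-object-per-line serialization are assumptions.

```python
import json

# Illustrative records for two modalities; fields follow the schema table.
login = {"timestamp": "2025-08-11T09:00:00Z", "user_id": "u017",
         "host": "ws-17", "success": True}
file_op = {"timestamp": "2025-08-11T09:05:00Z", "user_id": "u017",
           "file_path": "/srv/share/report.docx", "op_type": "read",
           "result": "ok"}

line = json.dumps(login)   # logs are commonly serialized one JSON object per line
```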

Statistical Overview

  • Three organizational domains: Technology, Finance, Medical
  • Simulation: 30 days per domain, 20 agents per org
  • Event counts (aggregate):
    • Application-level: 2.0B
    • Network packets: 4.5B
    • System calls: 18.2B
    • Total: ≈25B events
  • Malicious events: ≈20% (5B), approx. uniformly across 15 attack types (~333M/type)

Labeling, Splits, and Human/Quantitative Evaluation

  • All agent-generated entries are automatically labeled as benign or malicious according to agent profiles; no manual annotation is conducted.
  • Splits: stratified 80/10/10% (train/val/test), maintaining benign/malicious ratio.
  • Human expert surveys rate ChimeraLog (4.20/5 realism) as comparable to TWOS (4.25), significantly above CERT (1.78).
  • Detection experiments: F1 on ChimeraLog is 0.83 (vs. 0.99 for CERT), with pronounced domain-shift effects in cross-domain transfer.
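A stratified 80/10/10 split as described above can be produced by partitioning each label's events separately, so the benign/malicious ratio carries over to every partition. This is a generic sketch, not the authors' splitting code:

```python
import random

# Stratified 80/10/10 split: shuffle within each label bucket, then cut
# each bucket by the target fractions so class ratios are preserved.

def stratified_split(events, labels, seed=0, fracs=(0.8, 0.1, 0.1)):
    rng = random.Random(seed)
    buckets = {}
    for ev, lab in zip(events, labels):
        buckets.setdefault(lab, []).append(ev)
    train, val, test = [], [], []
    for items in buckets.values():
        rng.shuffle(items)
        n = len(items)
        a, b = int(n * fracs[0]), int(n * (fracs[0] + fracs[1]))
        train += items[:a]; val += items[a:b]; test += items[b:]
    return train, val, test
```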

Usage Guidelines

  • Standard feature extraction: event counts and timing per user-day across modalities.
  • Evaluation: within-domain and cross-domain, using F1 metric as primary.
  • Baselines: SVM (RBF), 1D-CNN, GCN, DS-IID.
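The standard feature extraction noted above reduces to counting events per (user, day) across modalities, yielding one feature vector per user-day. A minimal sketch, with the event-tuple shape as an assumption:

```python
from collections import Counter

# Per-user-day modality counts: the basic feature representation for
# insider-threat detection baselines.

def user_day_counts(events):
    """events: iterable of (user_id, day, modality) tuples."""
    feats = {}
    for user, day, modality in events:
        feats.setdefault((user, day), Counter())[modality] += 1
    return feats
```

Each resulting Counter can be flattened into a fixed-order vector (plus timing statistics, if desired) and fed to the listed baselines.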

4. Comparative Dataset Properties

| CHIMERA Dataset | Domain | Modality/Schema | Size/Scale | Annotation | Access/License |
|---|---|---|---|---|---|
| Reasoning (Zhu et al., 1 Mar 2026) | LLM reasoning | QA-CoT samples, 1,121.4-word avg. CoT | 9,225 samples; 8 subjects, 1,179 topics | LLM-validated | Synthetic, open (CC-BY) |
| Compositional Image (Singh et al., 20 Oct 2025) | Vision | Images with masks, 464 semantic atoms | 37,000 images; 6 domains, 2-4 parts/img | Synthetic | Synthetic, open (CC-BY-4.0) |
| Insider Threat (Yu et al., 11 Aug 2025) | Security/ITD | 6 log modalities, labeled attacks | ≈25B events; 3 orgs, 30 days, 20 agents | Simulation-based | Synthetic, open |

A plausible implication is that the CHIMERA nomenclature is used for datasets that emphasize compositional diversity, synthetic provenance, and automated, scalable evaluation.

5. Significance and Research Impact

The CHIMERA datasets have advanced state-of-the-art data generation in their respective domains:

  • In LLM reasoning, CHIMERA demonstrates that compact yet challenging synthetic datasets with long CoTs can enable smaller models to approach large-model performance, especially when post-training combines supervised and RL methods (Zhu et al., 1 Mar 2026).
  • In vision, the compositional CHIMERA dataset facilitates part-level compositionality, enabling robust benchmarking of part-based generative controls (Singh et al., 20 Oct 2025).
  • In security, ChimeraLog sets a new bar for multi-agent, multi-modal, labeled, and diverse insider-threat simulation, supporting realistic detection research and avoiding overfitting documented with legacy datasets like CERT (Yu et al., 11 Aug 2025).

The methodological commonality is the exclusive reliance on automated, model-based data synthesis and validation, scalable annotation pipelines, and explicit focus on coverage and dataset difficulty. This suggests that "CHIMERA" is emerging as a label for datasets that are synthetic, compositional, and designed for resilient model generalization.
