FLAMES Dataset: Multimodal AI Benchmarks
- The FLAMES Dataset family is a collection of diverse, domain-specific benchmarks spanning wildfire imagery, industrial flame segmentation, federated robotic manipulation, math reasoning, and LLM value alignment.
- It integrates varied modalities such as high-resolution RGB, thermal, synthetic images, and adversarial text prompts to challenge and refine classical and deep learning methods.
- The datasets provide actionable insights through precise annotations, sensor data, and policy-driven frameworks, driving innovation in computer vision, robotics, natural language processing, and combustion diagnostics.
The term "FLAMES Dataset" encompasses several distinct datasets and benchmarks introduced under the acronym FLAMES, each contributing to different research domains such as wildfire image analysis, industrial flame segmentation, federated robotic manipulation, mathematical reasoning with LLMs, and adversarial value alignment for LLMs. These datasets are unified by their focus on challenging real-world or synthetic image, video, or text domains but differ fundamentally in modality, scope, and application. The following sections provide a systematic and comprehensive survey of the principal FLAMES datasets referenced in the academic literature.
1. Wildfire Image and Video Datasets
Multiple FLAME datasets have been established to support AI-driven wildfire detection, segmentation, and monitoring tasks, utilizing data from unmanned aerial vehicles (UAVs), thermal sensors, and generative models.
Aerial Visual and Thermal Imagery (FLAME, FLAME 3, FLAME Diffuser, AccSampler):
- The original FLAME dataset comprises aerial imagery (RGB and thermal video recordings) of prescribed pile burns in pine forest environments (Shamsoshoara et al., 2020). Imagery is acquired via drones equipped with visible-spectrum cameras (DJI Zenmuse X4S, Phantom 3) and thermal sensors (FLIR Vue Pro R). The data are annotated frame-wise for both binary fire/no-fire classification and pixel-level fire segmentation, with ground truth masks available for benchmarking deep neural models.
- FLAME 3 (Hopkins et al., 3 Dec 2024) extends this approach, providing paired high-resolution RGB and radiometric thermal TIFFs. The thermal data is not limited to color-mapped JPEGs but includes raw temperature values per pixel, facilitating precise segmentation and temperature regression tasks. The collection includes nadir thermal plots, 3D point clouds, and geo-referenced orthomosaics.
- FLAME Diffuser (Wang et al., 6 Mar 2024) introduces a synthetic image generation framework based on mask-guided diffusion. It operates training-free, generating annotated wildfire imagery by fusing real or mathematically generated masks (augmented with Perlin noise) with raw images via a variational autoencoder. CLIP-based filtering ensures only contextually pertinent, high-quality images are retained in the synthesized dataset.
- AccSampler (Zhao et al., 31 Aug 2024) repurposes the FLAME dataset for video classification, employing a lightweight deep-learning pipeline for real-time UAV monitoring. A policy network and station point concept guide adaptive frame selection and clip mixup compression, streamlining dataset creation for downstream models by distilling videos into salient representative frames.
| Dataset / Method | Modality | Notable Features |
|---|---|---|
| FLAME (Shamsoshoara et al., 2020) | Aerial RGB/thermal video | Annotated frames, pixel-wise fire masks |
| FLAME 3 (Hopkins et al., 3 Dec 2024) | Paired RGB/radiometric thermal | Raw temperature TIFFs, nadir plots, 3D data |
| FLAME Diffuser (Wang et al., 6 Mar 2024) | Mask-guided synthetic imagery | VAE + CLIP filtering, paired annotations |
| AccSampler (Zhao et al., 31 Aug 2024) | UAV video (compressed) | Policy-driven frame distillation/compression |
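To make the mask-augmentation idea behind FLAME Diffuser concrete, the sketch below perturbs a binary fire mask with low-frequency noise and re-thresholds it. This is a simplified illustration, not the paper's pipeline: plain value noise stands in for Perlin noise, and every name and parameter here is our own.

```python
import numpy as np

def value_noise(h, w, grid=8, seed=0):
    """Low-frequency value noise (a stand-in for Perlin noise):
    random values on a coarse grid, bilinearly upsampled to (h, w)."""
    rng = np.random.default_rng(seed)
    coarse = rng.random((grid + 1, grid + 1))
    ys = np.linspace(0, grid, h, endpoint=False)
    xs = np.linspace(0, grid, w, endpoint=False)
    y0, x0 = ys.astype(int), xs.astype(int)
    fy, fx = ys - y0, xs - x0
    # Bilinear interpolation between the four surrounding grid values.
    top = coarse[y0][:, x0] * (1 - fx) + coarse[y0][:, x0 + 1] * fx
    bot = coarse[y0 + 1][:, x0] * (1 - fx) + coarse[y0 + 1][:, x0 + 1] * fx
    return top * (1 - fy)[:, None] + bot * fy[:, None]

def augment_mask(mask, strength=1.2, seed=0):
    """Perturb a binary fire mask with noise, then re-threshold:
    extreme noise values flip pixels, yielding irregular mask shapes."""
    noise = value_noise(*mask.shape, seed=seed)
    return ((mask + strength * (noise - 0.5)) > 0.5).astype(np.uint8)

# Toy example: a square mask becomes a noise-perturbed binary mask.
mask = np.zeros((64, 64)); mask[16:48, 16:48] = 1
aug = augment_mask(mask)
```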
2. Segmentation Benchmark for Industrial Burner Flames
The FLAMES dataset (Landgraf et al., 2023) also refers to a benchmark collection used in comparative studies of image segmentation methods for industrial burner flames.
- Based on a public dataset of 3,000 grayscale images of burner flames (resolution 552 × 552), the benchmark evaluates algorithms ranging from traditional image processing (Global Thresholding, Region Growing) to machine learning (SVM, RF, MLP) and contemporary deep learning approaches (U-Net, DeepLabV3+). Labels are refined through manual curation to improve segmentation reliability. Performance is measured via intersection over union (IoU), with top scores (>93%) achieved by DeepLabV3+ architectures. Inference time and robustness against reduced dataset sizes are also documented.
| Segmentation Method | IoU (approximate) | Inference Time |
|---|---|---|
| Global Thresholding | 80.3% | ~0.1 ms (CPU) |
| Random Forest | ~87% | 8–125 ms (CPU) |
| DeepLabV3+ (ResNet-18) | 93.2% | 4.6 ms (GPU) |
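Both the benchmark's IoU metric and its simplest baseline are easy to state; a minimal numpy sketch (the toy image and threshold value are ours, not from the benchmark):

```python
import numpy as np

def global_threshold(img, t=128):
    """Classical baseline: label every pixel brighter than t as flame."""
    return (img >= t).astype(np.uint8)

def iou(pred, gt):
    """Intersection over union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

# Toy 4x4 "grayscale burner image": a bright flame blob on a dark background.
img = np.array([[10, 10, 10, 10],
                [10, 200, 210, 10],
                [10, 220, 230, 10],
                [10, 10, 10, 10]], dtype=np.uint8)
gt = np.zeros((4, 4), dtype=np.uint8); gt[1:3, 1:3] = 1
print(iou(global_threshold(img), gt))  # → 1.0
```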
3. Federated Learning Benchmark for Robotic Manipulation
The FLAME benchmark (Betran et al., 3 Mar 2025) is designed to evaluate federated learning algorithms within the robotic manipulation domain.
- Composed of over 160,000 expert demonstration episodes spanning multiple manipulation tasks (e.g., Slide Block to Target, Close Box, Insert Onto Square Peg, Scoop With Spatula) in diverse simulated environments. Each client represents a specific environment, varying color, texture, distractors, physical properties, and camera viewpoints. The federated learning process is orchestrated via the FLOWER framework, utilizing aggregation algorithms such as FedAvg, FedAvgM, FedOpt, and Krum.
- Policy networks process both RGB images and state vectors to predict actions. Evaluation combines offline RMSE against expert demonstrations with online deployment in RLBench to measure task success rates. Scalability, privacy, and adaptation to non-IID data are critical research axes. Ablation studies confirm that increased client diversity and episode counts generally enhance policy robustness.
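The FedAvg aggregation step (orchestrated in practice via the FLOWER framework) reduces to a size-weighted average of client parameters; a minimal numpy sketch with toy client weights, not the benchmark's actual implementation:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: average each named parameter across clients, weighted by
    the number of local examples each client trained on."""
    total = sum(client_sizes)
    avg = {}
    for name in client_weights[0]:
        avg[name] = sum(w[name] * (n / total)
                        for w, n in zip(client_weights, client_sizes))
    return avg

# Two simulated clients sharing a single 2-element parameter vector;
# the second client holds three times as much data.
clients = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
print(fedavg(clients, [1, 3])["w"])  # → [2.5 3.5]
```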
4. FLAMES Dataset for Mathematical Reasoning
The FLAMES dataset (Seegmiller et al., 22 Aug 2025) is a curated synthetic dataset created to improve LLM math reasoning by systematic analysis of data synthesis strategies.
- Synthesized with dynamic difficulty calibration and diversity augmentation, FLAMES integrates techniques from ten existing strategies and introduces new methods to balance complexity and coverage. The dataset contains a mix of elementary and high-difficulty mathematical problems, multi-step prompts, and variants to enhance both reasoning and generalization. Key insights reveal that maximizing coverage with problem diversity positively impacts LLM performance more than aggressive filtering for solution reliability.
- Fine-tuning Qwen2.5-Math-7B on FLAMES yields an 81.4% accuracy on the MATH benchmark, outperforming several much larger models (Llama3 405B, GPT-4o, Claude 3.5 Sonnet). FLAMES delivers substantial gains on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1).
| Benchmark | FLAMES Δ vs. Baseline | Comparison Models |
|---|---|---|
| OlympiadBench | +15.7 | Llama3 405B, GPT-4o, Claude 3.5 Sonnet |
| CollegeMath | +4.5 | Fine-tuned Qwen2.5-Math-7B |
| GSMPlus | +6.5 | Fine-tuned Qwen2.5-Math-7B |
| MATH | +3.1 | Fine-tuned Qwen2.5-Math-7B |
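One way to picture difficulty-aware synthesis is stratified sampling across difficulty buckets, so the training mix spans easy through hard problems rather than concentrating on one level. The sketch below is purely illustrative, not the FLAMES pipeline; the bucket scheme and field names are hypothetical.

```python
import random

def stratified_sample(problems, k, seed=0):
    """Illustrative difficulty-balanced sampling: draw roughly
    k / num_buckets problems from each difficulty bucket."""
    rng = random.Random(seed)
    buckets = {}
    for p in problems:
        buckets.setdefault(p["difficulty"], []).append(p)
    per = max(1, k // len(buckets))
    out = []
    for level in sorted(buckets):
        out.extend(rng.sample(buckets[level], min(per, len(buckets[level]))))
    return out[:k]

# Hypothetical pool: many easy problems, few hard ones.
pool = ([{"q": f"easy-{i}", "difficulty": 1} for i in range(50)]
        + [{"q": f"hard-{i}", "difficulty": 3} for i in range(5)])
mix = stratified_sample(pool, 8)  # 4 easy + 4 hard despite the skewed pool
```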
5. Adversarial Value Alignment Benchmark for LLMs
The Flames benchmark (Huang et al., 2023) evaluates LLM value alignment, especially in the Chinese context.
- Comprising 2,251 manually crafted adversarial prompts, Flames targets five dimensions: Fairness, Safety, Morality (with Chinese-specific cultural components), Data Protection, and Legality. The dataset is annotated with a fine-grained scoring system, where model responses are assessed via both manual and automated scorers (Flames-scorer trained on 22.9K responses).
- Experiments on 17 mainstream LLMs reveal significant vulnerability in the Safety and Fairness dimensions, with low harmless rates (e.g., Claude at ~63.77%). The scorer backend achieves higher accuracy than GPT-4 used as a judge and supports ongoing model evaluation on a standardized subset (https://github.com/AIFlames/Flames). The benchmark exposes limitations in current model alignment protocols and sets challenges for systematic improvements toward cultural and ethical alignment.
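A harmless rate like the one reported for Claude is simply the fraction of responses judged fully harmless; a minimal sketch with mock scores (the label scheme here is illustrative, Flames' actual rubric is finer-grained and per-dimension):

```python
def harmless_rate(scores, harmless_label=3):
    """Fraction of responses receiving the top (harmless) label.
    The 3 = harmless convention is an assumption for this sketch."""
    return sum(s == harmless_label for s in scores) / len(scores)

# 100 mock scored responses: 64 judged harmless, 36 flagged.
scores = [3] * 64 + [1] * 36
print(f"{harmless_rate(scores):.2%}")  # → 64.00%
```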
6. Characterization of Turbulent Flame Structure
The original "Data-driven Analysis of Turbulent Flame Images" dataset (Roncancio et al., 2020) investigates transient combustion phenomena via OH-PLIF imaging.
- High-speed laser-induced fluorescence images (480×640, 9 kHz) of methane/air premixed flames under varying CO₂ dilution (0%, 5%, 10%) are resized and augmented for CNN-based binary classification (presence/absence of unburned material pockets). Training utilizes augmentation, regularization, and RMSProp optimization. Achieved test accuracies are 91.72%, 89.35%, and 85.80% for increasing CO₂, with more complex flame topologies resulting in higher false-negative rates.
- Applications span real-time combustion diagnostics, pollutant control, and integration into turbulent combustion simulations (e.g., LES), with further work required to improve classification beyond the flame tip.
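The RMSProp optimization used in training keeps an exponential moving average of squared gradients and normalizes each step by its square root; a minimal numpy sketch on a toy quadratic, not the paper's CNN setup:

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSProp update: accumulate an EMA of squared gradients,
    then scale the step by its inverse square root."""
    cache = rho * cache + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Minimize f(w) = w^2 starting from w = 5.0; the gradient is 2w.
w, cache = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, cache = rmsprop_step(w, 2 * w, cache, lr=0.1)
# w ends up near the minimum at 0.
```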
7. Summary and Research Implications
The FLAMES family of datasets and benchmarks provides crucial infrastructure for research in real-world visual understanding, robotic manipulation, mathematical reasoning, and LLM value alignment. Each variant is defined by modality, annotation, and benchmarking protocol targeting domain-specific challenges:
- Wildfire datasets offer multi-modal, high-fidelity imagery with precise temperature mapping and annotation systems for detection and segmentation.
- Industrial burner segmentation benchmarks allow rigorous comparison of classical and deep-learning-based binary segmentation methods.
- Robotic manipulation datasets enable scalable, privacy-preserving federated learning across distributed environments.
- Synthetic data pipelines for LLMs demonstrate measurable advances in reasoning generalization, informing model training at scale.
- Adversarial value alignment datasets integrate culturally and ethically nuanced evaluation metrics for LLM safety research.
The proliferation and evolution of FLAMES datasets reflect a broader methodological trend toward domain-adapted, highly annotated benchmarks that drive innovation in deep learning, computer vision, robotic control, and language modeling. This foundation supports robust evaluation, facilitates cross-domain transfer, and continually informs architectural and policy directions in applied artificial intelligence research.