BRACE Benchmark: Audio, Code & Dance
- BRACE Benchmark is a family of rigorously designed evaluation suites that test model robustness in audio captioning, code language modeling, and audio-conditioned dance motion synthesis.
- Each benchmark employs tailored methodologies and metrics to assess performance in reference-free evaluation, energy efficiency, and generative realism.
- The benchmarks reveal critical model weaknesses and inspire new research directions through precise error analysis and innovative rating systems.
The term "BRACE Benchmark" refers to multiple, independently developed benchmarks in machine learning that share the acronym BRACE but address distinct research problems: (1) robust evaluation of reference-free audio captioning, (2) energy-efficient and functionally accurate code LLMs, and (3) the challenge of audio-conditioned complex dance motion synthesis. Each benchmark is foundational in its respective subfield and establishes new standards for meta-evaluation, sustainability-aware model selection, or data-driven generative modeling. The following summary systematically delineates each of the three principal BRACE Benchmarks, their structure, evaluation methodology, and the key research insights they enable.
1. BRACE for Reference-Free Audio Caption Evaluation
BRACE (Benchmark for Robust Audio Caption Quality Evaluation) provides the first systematic meta-evaluation suite for testing reference-free Audio Caption Evaluation Metrics (ACEMs), especially CLAPScore variants, and for measuring audio–text modality alignment in Large Audio Language Models (LALMs) (Guo et al., 11 Dec 2025). The goal is to probe the robustness and granularity of model judgments in the absence of gold captions.
Sub-Benchmarks and Construction
BRACE comprises two orthogonal tracks:
- BRACE-Main: Fine-grained, pairwise caption comparison.
- BRACE-Hallucination: Detection of subtle, object-level caption hallucinations.
Dataset assembly is as follows:
- Source: AudioCaps (765 clips) and Clotho (1,262 clips) test sets.
- Filtering: Qwen2.5-7B-Instruct removes clips with high semantic variation across annotators.
- BRACE-Main (2,496 pairs): Human–human, human–LALM, human–corrupted, generated–generated, generated–corrupted, and corrupted–corrupted pairs. LLM-based corruption (shortening, fluency errors) creates negative examples; triple-expert annotation ensures strong reliability (Fleiss’ κ up to 0.98).
- BRACE-Hallucination (10,315 pairs): GPT-4o replaces nouns within captions, generating semantically plausible but factually incorrect object entities. Human captions are paired against their hallucinated versions, with all other text held constant (see the sketch below).
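A minimal sketch of how such a hallucination pair might be produced, assuming the OpenAI chat completions API; the prompt wording and the `hallucinate_caption` helper are illustrative assumptions, not the authors' construction code:

```python
# Hypothetical BRACE-Hallucination pair construction: ask an LLM to swap exactly one
# noun in a human caption for a plausible but absent object. Prompt text, model name,
# and helper names are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hallucinate_caption(caption: str) -> str:
    """Return the caption with exactly one noun replaced by an incorrect object."""
    prompt = (
        "Replace exactly one noun in the following audio caption with a different, "
        "semantically plausible object that is NOT present in the audio. Keep every "
        f"other word unchanged and return only the modified caption.\n\nCaption: {caption}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

# Each benchmark item pairs a human caption with its hallucinated variant.
original = "A dog barks while a car passes by."
pair = (original, hallucinate_caption(original))
```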
Task Definition and Metrics
- BRACE-Main: Model must select which of two captions better matches a given audio.
- BRACE-Hallucination: Identify whether the object-level hallucination is present.
The evaluation is cast as binary preference selection, scored with precision, recall, and F₁:

$$F_1 = \frac{2PR}{P + R}.$$

CLAPScore's sliding-window variant, SLIDE-CLAP, averages windowed audio embeddings before scoring against the caption embedding:

$$\text{SLIDE-CLAP}(a, c) = \cos\!\Big(\tfrac{1}{N}\textstyle\sum_{i=1}^{N}\phi_a(a_i),\; \phi_t(c)\Big),$$

where $a_1, \dots, a_N$ are the audio windows and $\phi_a$, $\phi_t$ denote the CLAP audio and text encoders.
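A minimal sketch of this computation, assuming generic `embed_audio` and `embed_text` functions (e.g., wrapping a CLAP checkpoint); the window and hop lengths are illustrative placeholders, not the benchmark's settings:

```python
# Sketch of a SLIDE-CLAP-style score: embed overlapping audio windows, average the
# embeddings, and score the caption against the averaged embedding. The embedding
# functions and window/hop sizes are assumptions, not the benchmark's exact setup.
import numpy as np

def slide_clap_score(audio, sr, caption, embed_audio, embed_text,
                     win_s=10.0, hop_s=5.0):
    """Cosine similarity between the caption embedding and the mean windowed audio embedding."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    starts = range(0, max(len(audio) - win, 0) + 1, hop)
    audio_emb = np.mean([embed_audio(audio[s:s + win], sr) for s in starts], axis=0)
    text_emb = embed_text(caption)
    return float(np.dot(audio_emb, text_emb)
                 / (np.linalg.norm(audio_emb) * np.linalg.norm(text_emb)))

# Pairwise preference: the caption with the higher score is selected, and
# precision, recall, and F1 are computed against the human-annotated preference.
```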
Baseline Results and Observed Limitations
| Model | BRACE-Main F₁ (All) | BRACE-Halluc. Avg F₁ |
|---|---|---|
| SLIDE-CLAP (LAION) | 74 | 82.9 |
| CLAP (LAION) | 73 | 82.9 |
| LALM (AF2) | 65 | 76.2 |
| LALM (GPT-4o) | 58 | 96.4 |
Key findings:
- SLIDE-CLAP stabilizes CLAPScore and achieves marginal improvements.
- Even top CLAP variants fall below 70.01 F₁ (mean across runs) on finely discriminative tasks.
- CLAPScore is insensitive to subtle errors (e.g., syntactic errors or minor acoustic distinctions).
- LALMs underperform due to prompt-induced position bias and inability to reliably localize hallucinations. Closed-source GPT-4o outperforms in hallucination detection but not main alignment selection.
Future Research Recommendations
- For CLAP: Incorporate fine-grained acoustic event features and syntax-aware losses to address missing background/low-energy sounds and attenuate sensitivity to syntactically broken text.
- For LALMs: Develop adversarial prompt training, prompt-debiasing, and audio-token multimodal reasoning to overcome position bias and instruction-following failures.
Public resources: https://github.com/HychTus/BRACE_Evaluation; HuggingFace dataset (Guo et al., 11 Dec 2025).
2. BRACE for Energy-Efficient and Accurate Code LLMs
BRACE (Benchmarking LLMs on Functional Accuracy and Energy Efficiency) provides an integrated framework for evaluating code LLMs (CLMs) along both functional correctness and inference-time energy consumption (Mehditabar et al., 10 Nov 2025). The benchmark enables multi-criteria model selection using two novel 1–5 scale rating strategies.
Rating Methodologies
- CIRC (Concentric Incremental Rating Circles): a deterministic, outlier-robust mapping of normalized efficiency and accuracy to a 1–5 rating via the Euclidean distance to the ideal point (1, 1) in [0, 1]²; the distance range [0, √2] is partitioned into five equal-width rings, with rings closer to the ideal point mapped to higher ratings (see the sketch after this list).
- OTER (Observation to Expectation Rating): a peer-aware, dynamic normalization that compares each model's observed accuracy with the accuracy expected at its efficiency level under a convex fit to the empirical energy–accuracy tradeoff; raw OTER scores are quantized into five equal intervals.
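A minimal sketch of the CIRC assignment under the description above, assuming min–max-normalized accuracy and efficiency in [0, 1] and that the innermost ring maps to rating 5; the helper name is ours:

```python
# CIRC rating sketch: distance to the ideal point (1, 1) in [0, 1]^2, with the
# distance range [0, sqrt(2)] split into five equal-width rings. Mapping the
# innermost ring to rating 5 is an assumption consistent with the reported examples.
import math

def circ_rating(acc_norm: float, eff_norm: float) -> int:
    """Map normalized (accuracy, efficiency) to a 1-5 rating by distance to (1, 1)."""
    d = math.hypot(1.0 - acc_norm, 1.0 - eff_norm)   # Euclidean distance to ideal
    ring_width = math.sqrt(2.0) / 5.0                # five equal-width rings
    ring = min(int(d / ring_width), 4)               # 0 = innermost ring
    return 5 - ring

print(circ_rating(1.0, 0.88))  # -> 5 (matches Seed-Coder-8B-Instruct below)
print(circ_rating(1.0, 0.14))  # -> 2 (matches granite-8b-code-base-4k below)
```

Both checks reproduce the CIRC values reported in the table below, which supports this reading of the ring construction; OTER, which depends on a convex fit to the empirical energy–accuracy frontier, is not sketched here.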
Data and Protocol
- 22 CLMs, including Seed-Coder, Qwen2.5-Coder, deepseek-coder, CodeLlama, and StarCoder.
- Tasks: Code generation (LiveCodeBench, pass@1), code summarization (CodeXGLUE, smoothed BLEU).
- Energy: measured via CodeCarbon at a 1 s sampling interval, summing GPU, CPU, and RAM consumption; min–max normalization is applied to both the accuracy and energy axes (see the measurement sketch after this list).
- Each model yields (acc_norm, eff_norm). Ratings are produced via CIRC and OTER for both tasks.
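A hedged sketch of the measurement and normalization step, assuming CodeCarbon's `EmissionsTracker` (which samples power roughly every `measure_power_secs` seconds and writes per-component energy columns to its `emissions.csv` output); `run_inference` and the normalization helper are placeholders:

```python
# Energy measurement and normalization sketch. CodeCarbon samples GPU/CPU/RAM power
# while the model runs; per-model energy totals and accuracy scores are then
# min-max normalized across all evaluated CLMs. Helper names are illustrative.
from codecarbon import EmissionsTracker

def measure(run_inference):
    """Run one model's benchmark pass under CodeCarbon's power sampling."""
    tracker = EmissionsTracker(measure_power_secs=1)
    tracker.start()
    run_inference()   # e.g., generate solutions for LiveCodeBench or CodeXGLUE
    tracker.stop()    # energy breakdowns are appended to CodeCarbon's emissions.csv

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# acc_norm comes from pass@1 or smoothed BLEU; eff_norm is derived so that lower
# energy maps to higher efficiency, e.g. eff_norm = 1 - min_max(energy)[i].
```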
Empirical Insights
- BLEU scores (summarization) consistently exceed pass@1 (generation), since smoothed BLEU awards partial credit for n-gram overlap while pass@1 gives none for non-compilable or failing code.
- Model size is not predictive of BRACE rating (Kruskal–Wallis test, non-significant for both rating scales and both tasks).
- Models maximizing both dimensions (e.g., Seed-Coder-8B-Instruct: (1.0, 0.88) ⇒ CIRC=5, OTER=5) outperform large but inefficient models (e.g., Yi-Coder-9B: CIRC=1, OTER=1).
- CIRC is robust and static; OTER is sensitive to the empirical Pareto frontier and may shift with dataset/model changes.
| Model | Task | acc_norm | eff_norm | CIRC | OTER |
|---|---|---|---|---|---|
| Seed-Coder-8B-Instruct | Generation | 1.0 | 0.88 | 5 | 5 |
| granite-8b-code-base-4k | Summarization | 1.0 | 0.14 | 2 | 5 |
| deepseek-coder-1.3b-base | Summarization | mod. | mod. | 4 | 5 |
Selection Guidance
- CIRC: Use for stable, catalog-based or regulatory reporting (persistent ring boundaries).
- OTER: Use to reward models that perform unusually well at their efficiency level (dynamic frontier-based normalization).
Summary: BRACE (CLM) uniquely enables Pareto-optimal, evidence-based, and sustainability-aware model selection for computational research and deployment (Mehditabar et al., 10 Nov 2025).
3. BRACE Benchmark for Audio-Conditioned Dance Motion Synthesis
BRACE (Breakdancing Competition Dataset for Dance Motion Synthesis) is a high-fidelity, in-the-wild video dataset designed specifically to challenge and evaluate audio-to-motion generative models on complex, acrobatic dance styles (Moltisanti et al., 2022). It provides rigorous ground truth for 2D skeleton-based modeling under highly dynamic and occluded conditions.
Dataset Composition
- Sourced from Red Bull BC One competitions (2011–2020): 81 videos, 465 sequences, 3 h 32 m, 334,538 frames, 64 unique dancers.
- Keypoints: 2D skeleton, 17 COCO joints per frame; per-frame normalization via enclosing box.
- Annotation Pipeline: hybrid (automatic: HTC, HRNet, NMS, IoU-based tracking; manual: expert correction and segmentation), yielding high-confidence tracks and temporally smoothed, corrected poses. Outlier correction and degree-7 Bézier interpolation address discontinuities (illustrated in the sketch after this list).
- Quality: on a 1,472-frame manually annotated test set, MAE = 60 px (raw) and 35 px (interpolated) at 1920×1080 resolution; error rates are 0.63% (automatic) and 0.12% (manual).
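An illustrative sketch of gap-filling with a Bézier curve, assuming control points drawn from reliable frames around the gap; the control-point selection and sampling are our simplification, not the exact pipeline of Moltisanti et al.:

```python
# Fill a short keypoint gap by evaluating a Bézier curve (degree = number of control
# points - 1) via Bernstein polynomials. Control-point choice is an assumption.
import numpy as np
from scipy.special import comb

def bezier_curve(control_points: np.ndarray, n_samples: int) -> np.ndarray:
    """Sample n_samples points along the Bézier curve defined by control_points (K, D)."""
    n = len(control_points) - 1
    t = np.linspace(0.0, 1.0, n_samples)[:, None]                 # (n_samples, 1)
    curve = np.zeros((n_samples, control_points.shape[1]))
    for i, point in enumerate(control_points):
        bernstein = comb(n, i) * t**i * (1.0 - t)**(n - i)        # Bernstein basis
        curve += bernstein * point
    return curve

# Example: reconstruct one joint's (x, y) positions across a 5-frame gap using
# 8 control points (degree 7) taken from trusted frames on either side of the gap.
control = np.array([[100, 200], [105, 190], [112, 185], [120, 180],
                    [130, 178], [138, 182], [144, 190], [150, 200]], dtype=float)
filled = bezier_curve(control, n_samples=5)
```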
Benchmarking Protocols & Evaluation Metrics
- Primary task: audio-conditioned 2D skeleton synthesis (inputs: Mel-spectrograms/MFCCs, optionally with a pose seed; outputs: 2D keypoint sequences).
- Segment labels: Toprock (~25.5%), Footwork (~39.7%), Powermove (~34.8%).
- Metrics:
  - Per-joint L2 error: mean Euclidean distance between generated and ground-truth joint positions, $\frac{1}{TJ}\sum_{t,j}\lVert \hat{p}_{t,j} - p_{t,j}\rVert_2$.
  - Velocity error: the same L2 distance computed on first-order frame differences of the joint trajectories.
  - Pose distribution Fréchet distance (pose-FID).
  - Beat alignment: fraction of kinematic peaks falling within a fixed tolerance (in ms) of audio beats (see the sketch after this list).
  - Beat DTW: time-warping cost between audio and motion beat sequences.
  - Element distribution matching: classifier-driven framewise segment matching.
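A rough sketch of the beat-alignment metric as described above, assuming librosa beat tracking for the audio and kinematic peaks taken from mean joint speed; the tolerance value is a placeholder rather than the benchmark's threshold:

```python
# Beat alignment sketch: fraction of kinematic (joint-speed) peaks that fall within
# tol_s seconds of a detected audio beat. Tolerance and peak-detection settings are
# illustrative assumptions.
import numpy as np
import librosa
from scipy.signal import find_peaks

def beat_alignment(keypoints, fps, audio, sr, tol_s=0.1):
    """keypoints: (T, J, 2) per-frame 2D joint positions; audio: mono waveform."""
    # Mean joint speed per frame transition, then its local maxima as kinematic peaks.
    speed = np.linalg.norm(np.diff(keypoints, axis=0), axis=-1).mean(axis=-1)  # (T-1,)
    peak_idx, _ = find_peaks(speed)
    peak_times = (peak_idx + 1) / fps

    # Audio beat times from librosa's beat tracker.
    _, beat_frames = librosa.beat.beat_track(y=audio, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)

    if len(peak_times) == 0 or len(beat_times) == 0:
        return 0.0
    nearest = np.min(np.abs(peak_times[:, None] - beat_times[None, :]), axis=1)
    return float(np.mean(nearest <= tol_s))
```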
Baseline and Observed Challenges
| Method | pose-FID↓ | Beat Align↑ | Beat DTW↓ |
|---|---|---|---|
| Dance Revolution (’21) | 0.5158 | 0.264 | 11.88 |
| AIST++ (’21) | 0.5743 | 0.136 | 12.92 |
| Dancing2Music (’19) | 0.5884 | 0.129 | 11.60 |
| GT Reference | 0.0032 | 0.451 | 36.50 |
- State-of-the-art methods perform far worse than ground truth on pose-FID, indicating generated breakdance motions lack realism and diversity when facing extreme poses and in-the-wild visuals.
- All models underperform in beat alignment relative to GT, sometimes achieving artificially low DTW costs due to over-regularization.
- Output distributions exhibit bias toward easier (toprock) segments and away from powermoves.
Structural, Algorithmic, and Domain Challenges
- Extreme acrobatics, self-occlusion, and tangled postures push both pose estimation pipelines and generative architectures beyond conventional capabilities.
- Camera motion (panning, zooming), lighting variability, and multi-camera transitions present normalization and tracking difficulties not present in existing dance datasets.
- Music–movement connection is weaker than assumed, necessitating strong motion priors and explicit rhythmic constraints in generative models.
Suggested Research Directions
- Hierarchical/segment-level models: Conditioning by segment type and enforcing plausible global orderings.
- Kinematic/dynamic constraints: Incorporation of physics-motivated objectives (e.g., contact, energy, SMPL parameterization).
- Rhythm-aware architectures: Learning relative positional encodings of music–motion beat events.
- User control: Interactive interfaces for specifying segment order/type.
- Pose estimation adaptation: Domain-specific tuning for breakdancing's unique motion set.
BRACE establishes a new challenge standard for robust evaluation and development in dance motion synthesis under real-world complexity (Moltisanti et al., 2022).
4. Related Benchmarks and Applications
Each BRACE instantiation frames new directions for model evaluation:
- Audio–Language Alignment (BRACE–audio-caption): Crucial for algorithmic oversight, accessibility, and scalable evaluation in domains lacking gold-standard references.
- Machine-generated Code (BRACE–CLM): Satisfies the increasing demand to balance sustainability with task performance in software engineering and automated programming.
- Embodied Music–Motion Synthesis (BRACE–dance): Enables research into structure-aware, realistic movement generation applicable to virtual avatars, robotics, and expressive media.
This taxonomy indicates that “BRACE Benchmark” is not a unified protocol but a family of rigorously curated, high-impact evaluation standards for disparate machine learning subfields.
5. Public Access and Impact
All described BRACE resources are publicly available. For the audio captioning suite, the GitHub repository (https://github.com/HychTus/BRACE_Evaluation) and an accompanying HuggingFace dataset support direct community adoption (Guo et al., 11 Dec 2025). The code-language BRACE publishes its full rating formulas and leaderboard data (Mehditabar et al., 10 Nov 2025). The dance motion dataset provides annotated videos and recommended splits (Moltisanti et al., 2022).
BRACE benchmarks have catalyzed new research on model robustness, cross-modal alignment, energy–accuracy tradeoffs, and generative realism in complex structured data. The composition and outcomes of each suite systematically expose previously hidden model weaknesses, enabling more informed and scientifically grounded model development across domains.