StyleBench-S: Human-Calibrated Synthetic Supervision

Updated 4 July 2026

StyleBench-S is a human-calibrated synthetic supervision set defined in the StyleID framework for stylization-agnostic facial identity recognition.
It comprises approximately 224,000 stylized images across 4,073 identities, restricted to stylization settings at which two-alternative forced-choice (2AFC) psychophysical experiments show human recognition accuracy of roughly 90% or higher.
The dataset supports training with angular and contrastive loss components, connecting human perceptual data to scalable synthetic supervision.

The label StyleBench-S appears with several meanings in recent arXiv literature, but only one of them is a formally defined dataset. In the StyleID framework, StyleBench-S is the synthetic supervision set used to train a stylization-agnostic facial identity encoder from human-calibrated psychometric recognition statistics (Yun et al., 23 Apr 2026). By contrast, "StyleBench: Evaluating thinking styles in LLMs" does not introduce the term as an official benchmark split; the paper defines only StyleBench, and any use of “StyleBench-S” in that context refers informally to evaluation restricted to the small-scale model group below 5B parameters (Guo et al., 25 Sep 2025). More speculative uses of the label appear as conceptual extrapolations in discussions of cross-style language understanding and web interface evaluation, not as formal benchmark names (Kang et al., 2019; Lai et al., 5 Oct 2025).

1. Formal status and nomenclature

The clearest formal definition of StyleBench-S appears in the StyleID paper, which introduces StyleBench as a two-part framework for stylization-agnostic facial identity recognition under human perception (Yun et al., 23 Apr 2026). In that framework, StyleBench-H is the human-annotated benchmark of same/different identity judgments for stylized faces, whereas StyleBench-S is the synthetic supervision set derived from human recognition-strength curves. Its function is explicitly supervisory rather than evaluative: it is used to calibrate an identity encoder so that similarity orderings align with human perception across stylization methods, artistic styles, and strengths (Yun et al., 23 Apr 2026).

In the LLM reasoning benchmark "StyleBench: Evaluating thinking styles in LLMs," the situation is different. The paper introduces a single benchmark, StyleBench, and never defines a split named StyleBench-S, StyleBench-L, or any analogous variant; the “S” label is not used or defined anywhere in the paper or its appendices (Guo et al., 25 Sep 2025). Accordingly, “StyleBench-S” in that literature is an interpretive shorthand for StyleBench evaluation restricted to the small-scale model regime, not a formally named sub-benchmark.

The term also appears in neighboring discussions as an inferred label rather than a canonical dataset name. The xSLUE paper does not define StyleBench-S at all; it is introduced only as a hypothetical style-benchmarking analogy in secondary explanation (Kang et al., 2019). Likewise, WebRenderBench does not define a component named StyleBench-S, though its style-consistency axis has been interpreted as supporting a style-focused sub-benchmark concept (Lai et al., 5 Oct 2025). This suggests that the term is polysemous across communities and must be interpreted by paper context.

2. StyleBench-S in StyleID

Within StyleID, StyleBench-S is the “supervision half” of the overall framework. The paper defines it as “a large-scale synthetic training set derived from human recognition statistics” that preserves the trends observed in StyleBench-H while enabling efficient training of a style-robust identity encoder (Yun et al., 23 Apr 2026). Its purpose is to scale supervision beyond what could be obtained from direct human annotation alone while retaining perceptual alignment.

The dataset is organized by identity. It contains 4,073 identities, each associated with 55 stylized images, for approximately 224,000 stylized images in total (Yun et al., 23 Apr 2026). These images are generated from FFHQ source faces and included only when stylization settings fall within strength ranges where humans still recognize identity with high probability, approximately at or above 90% recognition (Yun et al., 23 Apr 2026). The resulting supervision set is synthetic in image generation but human-calibrated in its inclusion criteria.

The distinction between supervision and benchmark is central. StyleBench-H stores human same/different verification judgments and is used only for evaluation. StyleBench-S stores stylized images with identity labels and method/style/strength metadata, but not raw human judgments. Human psychophysical measurements determine which stylized images are safe to include as positive same-identity examples, after which the dataset functions as a scalable training set for identity calibration (Yun et al., 23 Apr 2026).

A concise summary of the two components is as follows:

Component	Role	Contents
StyleBench-H	Benchmark	Human same/different identity judgments
StyleBench-S	Supervision set	Synthetic stylized faces filtered by human-calibrated recognition statistics

This division underlies the StyleID pipeline: first measure human judgments, then derive recognition-strength curves, then construct StyleBench-S using only high-recognition stylization conditions, and finally train on StyleBench-S and evaluate on StyleBench-H (Yun et al., 23 Apr 2026).

3. Psychophysics protocol and construction

StyleBench-S is derived from a dedicated psychophysics-style calibration study based on two-alternative forced-choice, or 2AFC, experiments (Yun et al., 23 Apr 2026). The participant pool comprised 76 recruited participants, of whom 72 were retained after consistency checks. Each participant answered 61 queries, yielding 4,315 valid 2AFC trials after latency filtering (Yun et al., 23 Apr 2026).

Each trial contains a source photo from FFHQ and two stylized candidate images. One stylized image depicts the same identity as the source, and the other is a distractor showing a different identity but matched in stylization method, artistic style, and strength so that low-level cues remain comparable (Yun et al., 23 Apr 2026). Participants select which stylized face depicts the same person as the source. For a given stylization method $m$, style $t$, and strength $s \in \{1/7, 2/7, \dots, 7/7\}$, the fraction of correct selections defines the empirical recognition probability $P_{\text{recog}}(m,t,s)$ (Yun et al., 23 Apr 2026).

These empirical points define recognition-strength curves for each method-style pair. The paper does not fit an explicit parametric form such as a logistic curve; instead, it uses the measured per-strength recognition probabilities directly as psychometric functions (Yun et al., 23 Apr 2026). The authors compare different thresholds for defining perceptual identity preservation and conclude that 70% recognition is too permissive because participants can succeed using coarse semantic cues such as gender, despite visible identity drift. They therefore choose settings where human recognition remains high, approximately above 90%, and retain the highest and second-highest recognition strengths above that threshold for each method-style pair (Yun et al., 23 Apr 2026).
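The curve construction and thresholding described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the trial tuple layout, and the toy counts are all hypothetical. It also assumes one particular reading of the selection rule, namely that for each method-style pair the two largest strength levels whose recognition is still at or above 90% are kept.

```python
from collections import defaultdict
from fractions import Fraction


def recognition_curve(trials):
    """Map (method, style, strength) to the fraction of correct 2AFC choices."""
    counts = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_total]
    for method, style, strength, correct in trials:
        entry = counts[(method, style, strength)]
        entry[0] += int(correct)
        entry[1] += 1
    return {key: c / n for key, (c, n) in counts.items()}


def select_strengths(curve, threshold=0.9, keep=2):
    """Per (method, style), keep up to `keep` strengths with recognition >= threshold.

    Assumption: among admissible strengths, the strongest stylizations are kept.
    """
    by_pair = defaultdict(list)
    for (method, style, strength), p in curve.items():
        if p >= threshold:
            by_pair[(method, style)].append(strength)
    return {pair: sorted(s, reverse=True)[:keep] for pair, s in by_pair.items()}


def toy_trials():
    """Hypothetical data: 10 trials per strength for one method-style pair."""
    correct_out_of_10 = {1: 10, 2: 10, 3: 9, 4: 9, 5: 8, 6: 6, 7: 5}
    trials = []
    for k, n_correct in correct_out_of_10.items():
        for i in range(10):
            trials.append(("IP-Adapter", "sketch", Fraction(k, 7), i < n_correct))
    return trials


curve = recognition_curve(toy_trials())
selected = select_strengths(curve)
print(selected)
```

In the toy data, recognition falls below 0.9 from strength 5/7 onward, so the pair keeps strengths 4/7 and 3/7. No curve is fitted: the measured per-strength fractions are used directly, matching the non-parametric treatment described in the text.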

This thresholding step is the mechanism by which raw human perception becomes scalable supervision. The selected stylized images are not merely synthetic positives; they are positives constrained to lie within a human-validated perceptual safe zone. A plausible implication is that the supervision signal is less about absolute style realism than about preserving identity under stylistic variation.

4. Data composition and supervision semantics

The source identities for StyleBench-S come from FFHQ, filtered to include single-person, high-quality portraits without large head rotations (Yun et al., 23 Apr 2026). Identity overlap is explicitly removed across StyleBench-S, StyleBench-H, and SKSF-A, the artist-drawn evaluation set, so the supervision set and evaluation sets are disjoint at the identity level (Yun et al., 23 Apr 2026).

Stylized images are generated using three controllable stylization frameworks: IP-Adapter, InstantID, and InfiniteYou (Yun et al., 23 Apr 2026). Each method is paired with 10 artistic styles, and stylization strength is discretized into seven levels from $1/7$ to $7/7$. Before thresholding, this yields 210 stylization configurations per identity. After psychometric filtering, each identity retains 55 stylized images generated under method-style-strength configurations that satisfy the human-calibrated threshold (Yun et al., 23 Apr 2026).
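The counts in this paragraph follow from simple arithmetic, reproduced below as a check. The style names are placeholders, since the text does not list the ten artistic styles.

```python
from fractions import Fraction
from itertools import product

METHODS = ["IP-Adapter", "InstantID", "InfiniteYou"]
STYLES = [f"style_{i}" for i in range(10)]            # placeholder names
STRENGTHS = [Fraction(k, 7) for k in range(1, 8)]     # 1/7 ... 7/7

# Every method-style-strength combination considered before thresholding.
configs = list(product(METHODS, STYLES, STRENGTHS))
n_before = len(configs)

n_identities = 4073
per_identity_after = 55
total_images = n_identities * per_identity_after

# Share of the configuration grid that survives psychometric filtering.
retention = Fraction(per_identity_after, n_before)

print(n_before, total_images, float(retention))
```

The grid has 3 × 10 × 7 = 210 configurations, and 4,073 × 55 = 224,015 images, which matches the "approximately 224,000" figure. Roughly a quarter of the grid (55 of 210) is retained per identity.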

Each sample in StyleBench-S carries an identity label and metadata specifying stylization method, artistic style, and stylization strength (Yun et al., 23 Apr 2026). The dataset does not store per-image scalar human recognition probabilities or fitted psychometric parameters. Instead, human recognition statistics determine inclusion, after which the included stylized images function as identity-preserving exemplars.

The supervision semantics are twofold. First, the dataset can be treated as a multi-class identity dataset in which every stylized image is labeled with one of 4,073 identity IDs. Second, it induces positive and negative pair structure for contrastive learning: any same-identity pair is positive, even if generated by different methods, styles, or strengths, and any different-identity pair is negative (Yun et al., 23 Apr 2026). This means that the human psychophysics enters not as direct pairwise annotation, but as a principled filter that defines which stylized samples are admissible for learning style-agnostic identity.
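The pair structure described above depends only on identity labels, which a short sketch makes concrete. The record layout and field names here are hypothetical; the text specifies only that each sample carries an identity label plus method, style, and strength metadata.

```python
from itertools import combinations


def label_pairs(samples):
    """Return (i, j, is_positive) for every unordered pair of samples.

    A pair is positive iff both samples share an identity; method, style,
    and strength are deliberately ignored.
    """
    return [
        (i, j, samples[i]["identity"] == samples[j]["identity"])
        for i, j in combinations(range(len(samples)), 2)
    ]


# Hypothetical records for two identities under different stylizations.
samples = [
    {"identity": 0, "method": "IP-Adapter", "style": "style_a", "strength": 2 / 7},
    {"identity": 0, "method": "InstantID", "style": "style_b", "strength": 3 / 7},
    {"identity": 1, "method": "IP-Adapter", "style": "style_a", "strength": 2 / 7},
    {"identity": 1, "method": "InfiniteYou", "style": "style_c", "strength": 1 / 7},
]

pairs = label_pairs(samples)
positives = [(i, j) for i, j, pos in pairs if pos]
negatives = [(i, j) for i, j, pos in pairs if not pos]
print(positives, len(negatives))
```

Samples 0 and 2 share method, style, and strength but form a negative pair, while samples 0 and 1 differ in all three and form a positive pair. That asymmetry is what pushes the encoder toward style-agnostic identity.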

5. Training StyleID with StyleBench-S

StyleBench-S is the sole training set used to adapt CLIP into the StyleID encoder (Yun et al., 23 Apr 2026). The backbone is the CLIP-L ViT image encoder, which remains frozen. LoRA modules are inserted into attention and linear layers and trained as a lightweight adaptation (Yun et al., 23 Apr 2026). On top of the embedding, the model uses an ArcFace-style angular classification head with identity centers for the 4,073 classes.

Training batches are identity-balanced. The reported batch size is 112, with 56 identities sampled per minibatch and 2 stylized images per identity (Yun et al., 23 Apr 2026). This construction guarantees at least one same-identity positive for supervised contrastive learning.
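An identity-balanced minibatch of this shape can be drawn as sketched below. The sampler is an illustration under assumptions, not the paper's implementation: the function name, the dictionary layout, and the uniform sampling without replacement are all hypothetical choices.

```python
import random
from collections import Counter


def sample_balanced_batch(images_by_identity, n_identities=56, per_identity=2, rng=None):
    """Draw `n_identities` identities and `per_identity` images from each."""
    rng = rng or random.Random()
    # Only identities with enough images can contribute a full group.
    eligible = [i for i, imgs in images_by_identity.items() if len(imgs) >= per_identity]
    chosen = rng.sample(eligible, n_identities)
    batch = []
    for identity in chosen:
        for image in rng.sample(images_by_identity[identity], per_identity):
            batch.append((identity, image))
    return batch


# Hypothetical toy index: 100 identities with 55 stylized images each.
images_by_identity = {
    i: [f"id{i:04d}_img{j:02d}" for j in range(55)] for i in range(100)
}

batch = sample_balanced_batch(images_by_identity, rng=random.Random(0))
per_identity_counts = Counter(identity for identity, _ in batch)
print(len(batch), len(per_identity_counts))
```

With 56 identities and 2 images each, the batch holds 112 samples, and every sample has exactly one same-identity partner in the batch, which is what the supervised contrastive term needs.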

The total loss combines three terms: angular identity loss, supervised contrastive loss, and embedding regularization (Yun et al., 23 Apr 2026). The ArcFace term imposes class-level angular discrimination across identities, the supervised contrastive term tightens instance-level clustering of stylized images from the same identity, and the regularizer constrains adapted embeddings to remain near the frozen original CLIP representation. The total objective is

$\mathcal{L} = \mathcal{L}_{\text{ang}} + \lambda_{\text{scon}} \mathcal{L}_{\text{scon}} + \lambda_{\text{reg}} \mathcal{L}_{\text{reg}},$

with $\lambda_{\text{scon}} = 0.6$ and $\lambda_{\text{reg}} = 0.1$ (Yun et al., 23 Apr 2026). For the ArcFace head, the paper reports margin $m = 0.5$ and scale $t$ 0 (Yun et al., 23 Apr 2026).
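As a small numeric illustration, the weighted sum and the margin can be written out directly. The sketch assumes the standard ArcFace form for the target-class logit, scale · cos(θ + margin); the paper is described only as using an "ArcFace-style" head, so this form is an assumption. The scale is left as a required argument instead of being given a default, and the value 1.0 used in the example is a placeholder with no connection to the paper's setting. Function names are hypothetical.

```python
import math

LAMBDA_SCON = 0.6   # weight of the supervised contrastive term
LAMBDA_REG = 0.1    # weight of the embedding regularizer
ARC_MARGIN = 0.5    # additive angular margin m


def arcface_target_logit(cos_theta, scale, margin=ARC_MARGIN):
    """Target-class logit under the standard ArcFace form (assumed here)."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp for numerical safety
    return scale * math.cos(theta + margin)


def total_loss(l_ang, l_scon, l_reg, lam_scon=LAMBDA_SCON, lam_reg=LAMBDA_REG):
    """Weighted sum L = L_ang + lam_scon * L_scon + lam_reg * L_reg."""
    return l_ang + lam_scon * l_scon + lam_reg * l_reg


# Placeholder inputs chosen only to show the arithmetic.
example_total = total_loss(1.0, 1.0, 1.0)
example_logit = arcface_target_logit(0.8, scale=1.0)
print(example_total, example_logit)
```

The margin lowers the target-class logit relative to the plain cosine (here below 0.8), so an embedding must sit closer to its identity center to reach the same score. Handling of angles where θ + margin exceeds π, which some ArcFace implementations treat specially, is omitted.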

The training setup uses AdamW with learning rate $t$ 1, LoRA rank 8, and 30,000 training iterations on a single NVIDIA A6000 GPU (Yun et al., 23 Apr 2026). The design rationale is explicit: StyleBench-S provides identity-labeled, style-diverse, human-calibrated positives, while the three-part loss exploits that structure at class, instance, and regularization levels.

6. Empirical effects and evaluation outcomes

The empirical significance of StyleBench-S is assessed indirectly through the performance of models trained on it. On the human benchmark StyleBench-H, StyleID trained on StyleBench-S achieves strong results across Cross-ID, Cross-Style, and Cross-Method evaluations (Yun et al., 23 Apr 2026). In the main comparison on StyleBench-H Cross-ID, the paper reports TPR@1e-2 of 0.9020, AUROC of 0.9711, and accuracy up to 0.9347 at threshold 0.3 for StyleID (Yun et al., 23 Apr 2026). The same training regime also generalizes to SKSF-A, where StyleID reaches TPR 0.8891 and AUROC 0.9922 on artist-drawn sketches (Yun et al., 23 Apr 2026).
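For readers unfamiliar with the verification metrics quoted here, the sketch below computes TPR at a fixed false-positive rate and AUROC from similarity scores. It uses the textbook definitions; the paper's exact evaluation protocol is not given in this article, and the function names and toy scores are hypothetical.

```python
def tpr_at_fpr(genuine, impostor, target_fpr=1e-2):
    """TPR at the loosest threshold whose FPR stays at or below `target_fpr`.

    A pair is accepted when its score is strictly greater than the threshold.
    """
    ranked = sorted(impostor, reverse=True)
    allowed = int(target_fpr * len(ranked) + 1e-9)  # impostors allowed through
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    tpr = sum(s > threshold for s in genuine) / len(genuine)
    fpr = sum(s > threshold for s in impostor) / len(impostor)
    return tpr, fpr, threshold


def auroc(genuine, impostor):
    """Probability that a genuine pair outscores an impostor pair (ties count half)."""
    wins = sum((g > i) + 0.5 * (g == i) for g in genuine for i in impostor)
    return wins / (len(genuine) * len(impostor))


# Toy similarity scores for same-identity and different-identity pairs.
genuine_scores = [0.5, 0.6, 0.7, 0.95]
impostor_scores = [0.1, 0.2, 0.3, 0.9]

strict = tpr_at_fpr(genuine_scores, impostor_scores, target_fpr=0.0)
loose = tpr_at_fpr(genuine_scores, impostor_scores, target_fpr=0.25)
area = auroc(genuine_scores, impostor_scores)
print(strict, loose, area)
```

A figure such as TPR@1e-2 therefore reports how many same-identity pairs are accepted when at most 1% of different-identity pairs are allowed through, whereas AUROC summarizes ranking quality over all thresholds.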

These gains are attributed to StyleBench-S because it is the only training signal used in the calibration phase. The paper also reports backbone replacement experiments in which ArcFace* and AdaFace* fine-tuned on StyleBench-S improve modestly, while CLIP-based StyleID trained on the same supervision set performs best on TPR and AUROC across StyleBench-H and SKSF-A (Yun et al., 23 Apr 2026). This indicates that StyleBench-S supervision helps across backbones but does not by itself determine performance; backbone choice remains consequential.

Ablation results reinforce this interpretation. All ablated models are trained on StyleBench-S, and removing either the angular loss or the supervised contrastive loss reduces TPR and AUROC, while the full combination of ArcFace, contrastive, and regularization performs best overall (Yun et al., 23 Apr 2026). This suggests that StyleBench-S contains sufficient diversity within each identity to benefit from both class-level and instance-level supervision.

The broader implication is that a human-calibrated synthetic dataset can bridge the gap between small human-labeled perceptual benchmarks and scalable training objectives. In StyleID, StyleBench-S operationalizes that bridge.

7. Other uses of the name and interpretive cautions

The term StyleBench-S should not be assumed to have a universal meaning across arXiv papers. In the LLM reasoning benchmark StyleBench, no named split with that title exists (Guo et al., 25 Sep 2025). The closest concept is the small-scale model regime, comprising Gemma3-270M, Qwen2.5-0.5B, DeepSeek-R1-Distill-Qwen-1.5B, Gemma-2B, Qwen2.5-3B, and Phi-3-Mini-3.8B, evaluated across five reasoning styles and five tasks (Guo et al., 25 Sep 2025). In that regime, the paper reports that small models struggle on AIME and Game of 24, often default to guessing, frequently violate formatting constraints, and generally do not benefit much from search-based prompting styles such as AoT and ToT (Guo et al., 25 Sep 2025). But these findings concern StyleBench restricted to small models, not a formal benchmark called StyleBench-S.

The xSLUE paper offers a different kind of caution. It studies cross-style language understanding and argues that style is multi-dimensional rather than a single variable, covering 15 styles and 23 sentence-level tasks with a diagnostic set annotated for all styles on the same text (Kang et al., 2019). References to a hypothetical “StyleBench-S” in explanatory material around xSLUE are therefore analogical rather than terminological. This suggests that “StyleBench-S” may sometimes function as an editor’s shorthand for style-focused benchmarking rather than a published artifact.

A similar issue arises in WebRenderBench. The paper introduces a layout-style consistency metric consisting of RDA, GDA, and SDA, where SDA measures style difference of associated elements based on rendered DOM and CSS attributes such as foreground color, background color, font size, and border radius (Lai et al., 5 Oct 2025). Although one can plausibly interpret SDA as a style-benchmarking axis, the paper does not define an official sub-benchmark named StyleBench-S (Lai et al., 5 Oct 2025).

Taken together, these usages establish an important editorial principle: in current arXiv literature, “StyleBench-S” is formally defined only in the StyleID framework. Elsewhere it is either absent, informal, or purely conceptual.

8. Limitations and future directions

For the StyleID meaning of StyleBench-S, the paper identifies several limitations. The dataset inherits synthetic bias because stylized images are generated only by IP-Adapter, InstantID, and InfiniteYou, even though evaluation includes additional stylization mechanisms in Cross-Method testing (Yun et al., 23 Apr 2026). Demographic skew is also a concern because the source images derive from FFHQ, and the paper notes skew toward young, white subjects in the broader framework (Yun et al., 23 Apr 2026). Stylization coverage remains limited to three families, and robustness to extreme pose and occlusion is not directly supervised because such cases were filtered out to stabilize psychophysical annotation (Yun et al., 23 Apr 2026).

These limitations motivate several future directions explicitly discussed in the paper: broader demographic coverage, more stylization families, hybrid human-synthetic supervision, and expanded robustness evaluation beyond identity under style variation (Yun et al., 23 Apr 2026). A plausible implication is that future versions of StyleBench-S may move from thresholded inclusion toward richer supervision that preserves more of the psychometric curve rather than only safe positive regions.

For the LLM-related informal use of StyleBench-S, the limitations are different. Since the term is unofficial, comparisons using it risk obscuring the fact that the published benchmark is StyleBench, not a sanctioned family of S/L splits (Guo et al., 25 Sep 2025). For cross-style language and web UI contexts, the limitation is terminological drift: the label can be rhetorically useful but may conceal differences in modality, task, and supervision assumptions (Kang et al., 2019; Lai et al., 5 Oct 2025).

In current usage, then, StyleBench-S is best understood as a context-dependent label. In facial identity recognition, it names a concrete large-scale human-calibrated supervision dataset central to the StyleID pipeline (Yun et al., 23 Apr 2026). In reasoning-style evaluation for LLMs, it denotes only an informal restriction of StyleBench to small models (Guo et al., 25 Sep 2025). In other domains, it remains a conceptual extension rather than a formal benchmark artifact (Kang et al., 2019; Lai et al., 5 Oct 2025).