Saliency Benchmark (SalBench) Overview

Updated 4 July 2026

SalBench is a collection of evaluation protocols that standardize saliency prediction across different modalities and task settings.
It encompasses single-image SOD, RGB-T, low-level pop-out evaluation, and LVLM assessments, highlighting varied datasets and metrics.
SalBench exposes challenges like center bias, scene complexity, and model failure modes, shaping future research in saliency detection.

“Saliency Benchmark (SalBench)” refers not to a single universally fixed resource, but to a family of benchmarks and benchmark-centered formulations used across several subareas of saliency research. In the literature, the name has been applied to at least four distinct but related settings: single-image salient object detection and segmentation, RGB-T saliency detection, low-level saliency evaluation for large vision-language models, and synthetic pop-out evaluation of bottom-up attention models (Borji et al., 2015). Across these usages, the common function of SalBench is to provide a controlled evaluation protocol for saliency-related predictions, but the operational definition of “saliency,” the task format, the datasets, and the metrics vary substantially with the target problem (Dahou et al., 7 Jul 2025).

1. Historical emergence and conceptual scope

The 2015 benchmark “Salient Object Detection: A Benchmark” established one of the most widely cited uses of SalBench in the salient object detection literature (Borji et al., 2015). Its stated purpose was “a systematic, large-scale, and fair comparison of single-image salient object detection (SOD) and segmentation algorithms,” with two central tasks: salient object detection, in which a model outputs a continuous saliency map $S(x,y)\in[0,255]$, and salient object segmentation, in which that map is converted into a binary mask $M(x,y)$ that delineates foreground from background. The benchmark formalized the distinction between salient object detection and related areas such as fixation prediction and objectness proposals by showing that methods designed specifically for salient object detection “generally work better than models in closely related areas” (Borji et al., 2015).

A later and conceptually different use of the same name appears in “Vision-Language Models Can’t See the Obvious,” which introduces SalBench as “a novel benchmark designed to assess the capability of Large Vision-Language Models (LVLM) in detecting visually salient features that are readily apparent to humans” (Dahou et al., 7 Jul 2025). Here the emphasis is not figure/ground segmentation but pop-out effects in low-level visual dimensions such as color, intensity, and orientation. This benchmark is explicitly positioned as complementary to VQA, captioning, and reasoning benchmarks, which focus on high-level semantics rather than whether a model can detect “the obvious.”

The name also appears in the RGB-T saliency literature. “A Unified RGB-T Saliency Detection Benchmark: Dataset, Baselines, Analysis and A Novel Approach” uses SalBench to denote a benchmark of 821 spatially aligned RGB-T image pairs with challenge annotations for evaluating saliency detection under complementary visible and thermal modalities (Li et al., 2017). In this usage, saliency benchmarking is tied to multi-modal fusion and challenge-sensitive analysis rather than to single-image RGB-only SOD.

A further usage appears in work based on SID4VAM, where “SalBench” denotes a benchmark on synthetic pop-out patterns for bottom-up attention models (Berga et al., 2019). This version emphasizes full control over low-level features, psychophysical consistency, and target-distractor pop-out phenomena rather than natural-image semantics.

This multiplicity suggests that “SalBench” functions as a benchmark label shared across adjacent research programs rather than as a singular canonical dataset. A plausible implication is that careful disambiguation is necessary whenever the term appears in the literature.

2. Single-image salient object detection benchmark

In the 2015 SOD benchmark, SalBench is organized around seven publicly available datasets selected to stress different aspects of model performance (Borji et al., 2015). These are MSRA10K, THUR15K, ECSSD, JuddDB, DUT-OMRON, SED2, and PASCAL-S. The benchmark description identifies dataset-specific properties such as “mostly single large object, strong center bias” for MSRA10K, “structurally complex natural images” for ECSSD, “multiple objects, low center bias” for JuddDB, “high scene complexity” for DUT-OMRON, and “multiple-object scenes, measured by eye-tracking” for PASCAL-S. The paper also notes that average annotation maps illustrate varying degrees of center bias, ranging from highly center-biased datasets to SED2, which is described as “almost anti-centered.”

A total of 41 models are evaluated, divided into four categories: 29 salient object detection methods, 10 fixation prediction methods, 1 objectness proposal method, and 1 baseline (Borji et al., 2015). The baseline is AAM, the “Average Annotation Map,” which captures dataset center bias. The benchmark thereby makes explicit that saliency performance can be confounded by positional priors and that a center-bias baseline can remain surprisingly competitive on center-biased data.
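The AAM baseline described above can be illustrated with a minimal sketch. It assumes all ground-truth masks have already been brought to one common size (the resizing step is not specified here and is left out), and the function name `average_annotation_map` is hypothetical rather than taken from the benchmark's code.

```python
def average_annotation_map(masks):
    """Average equally sized binary masks (nested lists of 0/1) pixel by pixel.

    Assumption: masks were resized to a shared height and width beforehand.
    """
    if not masks:
        raise ValueError("need at least one mask")
    height, width = len(masks[0]), len(masks[0][0])
    total = [[0.0] * width for _ in range(height)]
    for mask in masks:
        if len(mask) != height or any(len(row) != width for row in mask):
            raise ValueError("all masks must share one size")
        for y in range(height):
            for x in range(width):
                total[y][x] += mask[y][x]
    count = len(masks)
    # Each pixel now holds the fraction of images in which it is foreground.
    return [[value / count for value in row] for row in total]


# Toy example: two 3x3 masks whose objects overlap at the center pixel.
toy_masks = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 1, 1], [0, 0, 0]],
]
aam = average_annotation_map(toy_masks)
```

Because the resulting map depends only on the dataset and not on the test image, any score it achieves reflects positional prior alone, which is what makes it useful as a reference point.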

The quantitative summary reports top-five rankings under several criteria. By mean Max $F_\beta$ across datasets, the top five are DRFI, QCUT, RBD, ST, and DSR. By AUC, the top five are DRFI, DSR, QCUT, RBD, and MC. By smallest MAE, the top five are RBD, DSR, DRFI, QCUT, and HDCT. When averaging six primary scores (Max $F_\beta$, adaptive-threshold $F_\beta$, $F_\beta^w$, AUC, $1-\mathrm{MAE}$, and SaliencyCut $F_\beta$), the overall top six are DRFI, QCUT, RBD, ST, DSR, and MC (Borji et al., 2015).

The benchmark also reports a runtime–accuracy trade-off on MSRA10K for $400\times300$ images using an Intel Xeon E5645 2.4 GHz processor. HC, GC, and SR are the fastest methods at 0.017 s, 0.037 s, and 0.040 s per image, respectively, whereas DRFI, the top-accuracy method, runs at approximately 0.70 s per image. The benchmark notes that region-based methods such as DRFI, MC, and RBD strike “a favorable balance of accuracy vs. speed” (Borji et al., 2015).

These results are used to support a broader conclusion: bespoke salient object detection methods outperform fixation and objectness methods, and region-based, background-prior, data-driven designs dominate the leaderboard. The paper further suggests that “fully supervised or deep-learning architectures will be even more powerful,” while noting early promise from deep convolutional networks fine-tuned for SOD (Borji et al., 2015).

3. Evaluation metrics and benchmark methodology

The 2015 SalBench adopts a metric suite centered on thresholded saliency maps, binary masks, ROC behavior, and continuous-map error (Borji et al., 2015). Let $S$ be the predicted saliency map normalized into $[0,255]$, $G$ the binary ground-truth mask, and $M$ the binarization of $S$ at threshold $T$. The benchmark reports Precision and Recall for each threshold, with

$$\mathrm{Precision}=\frac{|M\cap G|}{|M|},\qquad \mathrm{Recall}=\frac{|M\cap G|}{|G|}.$$

It then defines

$$F_\beta=\frac{(1+\beta^2)\,\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^2\,\mathrm{Precision}+\mathrm{Recall}},$$

with $\beta^2=0.3$, and reports both Max $F_\beta$ over thresholds and adaptive-threshold $F_\beta$ at $T_a=\frac{2}{W\times H}\sum_{x=1}^{W}\sum_{y=1}^{H}S(x,y)$.
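As a rough illustration of the thresholded scores, the sketch below computes Precision, Recall, $F_\beta$, Max $F_\beta$, and an adaptive-threshold $F_\beta$ for one image. It assumes the conventional $\beta^2=0.3$, an adaptive threshold of twice the mean saliency, and a "greater than or equal" binarization rule; all function names are hypothetical and the benchmark's own code may differ in these details.

```python
def precision_recall(saliency, truth, threshold):
    """Binarize `saliency` (values in 0..255) at `threshold`, compare with `truth` (0/1)."""
    tp = fp = fn = 0
    for s_row, g_row in zip(saliency, truth):
        for s, g in zip(s_row, g_row):
            predicted = s >= threshold  # assumed binarization rule
            if predicted and g:
                tp += 1
            elif predicted:
                fp += 1
            elif g:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall (beta_sq = 0.3 is an assumed default)."""
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0


def max_f_beta(saliency, truth):
    """Best F_beta over all integer thresholds 0..255."""
    return max(f_beta(*precision_recall(saliency, truth, t)) for t in range(256))


def adaptive_f_beta(saliency, truth):
    """F_beta at an image-dependent threshold of twice the mean saliency (assumption)."""
    pixels = [s for row in saliency for s in row]
    threshold = 2 * sum(pixels) / len(pixels)
    return f_beta(*precision_recall(saliency, truth, threshold))


# Toy 2x2 example: the top row is the salient object.
toy_saliency = [[250, 200], [40, 10]]
toy_truth = [[1, 1], [0, 0]]
best = max_f_beta(toy_saliency, toy_truth)
adaptive = adaptive_f_beta(toy_saliency, toy_truth)
```

In the toy case the map separates foreground from background perfectly, so Max $F_\beta$ reaches its ceiling, while the adaptive threshold (250) keeps only one of the two foreground pixels and scores lower. The gap between the two numbers is one reason the benchmark reports both.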

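The same benchmark includes ROC and AUC, where TPR equals Recall and FPR is computed as $\mathrm{FPR}=|M\cap\bar{G}|/|\bar{G}|$, with $M$ the thresholded saliency map and $\bar{G}$ the complement of the binary ground-truth mask $G$, as well as Mean Absolute Error between the saliency map $S$, rescaled to $[0,1]$, and $G$ over a $W\times H$ image:

$$\mathrm{MAE}=\frac{1}{W\times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|S(x,y)-G(x,y)\right|$$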
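The continuous-map and ROC scores can be sketched in the same style. The example assumes a saliency map stored as integers in 0..255 and rescaled to $[0,1]$ for MAE, thresholds swept over integer levels, and trapezoidal integration for AUC; the helper names are hypothetical.

```python
def mean_absolute_error(saliency, truth):
    """MAE between a 0..255 saliency map rescaled to [0, 1] and a 0/1 mask."""
    diffs = [
        abs(s / 255.0 - g)
        for s_row, g_row in zip(saliency, truth)
        for s, g in zip(s_row, g_row)
    ]
    return sum(diffs) / len(diffs)


def roc_points(saliency, truth):
    """(FPR, TPR) pairs for thresholds 256 down to 0, so FPR is non-decreasing."""
    pairs = [(s, g) for s_row, g_row in zip(saliency, truth) for s, g in zip(s_row, g_row)]
    positives = sum(1 for _, g in pairs if g)
    negatives = len(pairs) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ROC needs both foreground and background pixels")
    points = []
    for threshold in range(256, -1, -1):
        tp = sum(1 for s, g in pairs if s >= threshold and g)
        fp = sum(1 for s, g in pairs if s >= threshold and not g)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under ROC points ordered by increasing FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy 2x2 example: the top row is the salient object.
toy_saliency = [[250, 200], [40, 10]]
toy_truth = [[1, 1], [0, 0]]
toy_mae = mean_absolute_error(toy_saliency, toy_truth)
toy_auc = auc(roc_points(toy_saliency, toy_truth))
```

Here the map ranks every foreground pixel above every background pixel, so AUC is at its maximum even though MAE is nonzero. AUC rewards correct ordering, whereas MAE also penalizes values that are merely far from 0 or 1.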

It additionally uses Margolin et al.’s weighted $F_\beta$-measure, $F_\beta^w$, described as an extension of $F_\beta$ to continuous saliency maps that reweights errors by spatial significance and neighborhood (Borji et al., 2015).

The benchmark explicitly discusses weaknesses in classical score choices. It states that “the classic $F_\beta$ and ROC/AUC can be misleading when the negative set vastly exceeds positives,” and that PR curves are more informative in such cases. It presents adoption of $F_\beta^w$ as a way to “unify continuous/binary map evaluation and penalize boundary and spatial errors more appropriately” (Borji et al., 2015).

A closely related but more general metric argument appears in “Saliency Benchmarking Made Easy: Separating Models, Maps and Metrics,” which is concerned with fixation prediction rather than salient object detection (Kümmerer et al., 2017). That work formalizes a distinction among a saliency model, a saliency map, and a saliency metric, and argues that no single saliency map can perform well under all metrics. Using Bayesian decision theory, it defines a saliency model as a predicted probability density over fixation locations and a saliency map as a metric-specific prediction derived from that density. It then derives optimal maps for AUC, sAUC, NSS, CC, SIM, KL-Div, and IG. The resulting claim is that benchmarking should evaluate each model at its own metric-optimal operating point rather than with a single hand-tuned map (Kümmerer et al., 2017).

This suggests that the methodological role of SalBench is not limited to publishing datasets and leaderboards. It also includes a recurring concern with what, precisely, is being evaluated: a probabilistic model, a thresholded segmentation, a continuous score field, or a metric-specific decision surface.

4. Bias, failure modes, and diagnostic analysis

A major analytical contribution of the 2015 SalBench is its examination of center bias and scene complexity (Borji et al., 2015). Center bias is operationalized through AAM, which “crystallizes the dataset’s positional prior.” On a subset of 1,000 off-center MSRA10K images, in which the distance from the object centroid to the image center exceeds 0.247, overall performance drops only modestly, but models with a strong location prior suffer large degradations, whereas DRFI and DSR remain robust. SED2, a naturally off-center dataset, also yields lower precision and recall overall, yet the same top models retain their ranking.
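One way to picture the off-center selection is the sketch below, which filters masks by centroid distance. The 0.247 cutoff comes from the text, but how the distance is normalized is not stated here; the sketch assumes coordinates scaled to the unit square (width and height each mapped to 1), and the function names are hypothetical.

```python
import math


def centroid_offset(mask):
    """Distance from the mask centroid to the image center in unit-square coordinates.

    Assumption: x is divided by width and y by height, so the center is (0.5, 0.5).
    """
    height, width = len(mask), len(mask[0])
    sum_x = sum_y = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        raise ValueError("mask has no foreground pixels")
    # Add 0.5 so each pixel is represented by its center, then normalize.
    cx = (sum_x / count + 0.5) / width
    cy = (sum_y / count + 0.5) / height
    return math.hypot(cx - 0.5, cy - 0.5)


def off_center_subset(masks, min_offset=0.247):
    """Keep only masks whose centroid lies farther than `min_offset` from the center."""
    return [mask for mask in masks if centroid_offset(mask) > min_offset]


centered = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
corner = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
subset = off_center_subset([centered, corner])
```

Evaluating on such a subset separates models that genuinely localize the object from those that rely on a central prior.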

Scene complexity is measured by the average number of superpixels on foreground versus background using Felzenszwalb segmentation (Borji et al., 2015). JuddDB and DUT-OMRON exhibit the highest background clutter, approximately 493 superpixels, and correspondingly produce uniformly lower scores across all models. Conversely, SED2 is described as an “easy” set in terms of segmentation because it has few superpixels, though it still challenges models by always containing two salient objects.

The benchmark’s qualitative failure analysis is equally diagnostic. Models perform well when “a single, well-contrasted object dominates a relatively uniform background,” but fail under several recurrent conditions: low foreground–background contrast, salient objects touching image borders, scenes with multiple small objects or complex topology, and semantically salient items such as faces that lack the low-level contrast cues exploited by many models (Borji et al., 2015). The note that MC’s pseudo-background prior breaks when the salient object touches image borders is particularly relevant because it ties a failure mode directly to a specific modeling assumption.

Later benchmark work generalized these concerns about dataset bias. “Salient Objects in Clutter” identifies a “serious design bias” in existing SOD datasets: the assumption that each image contains at least one clear and uncluttered salient object (Fan et al., 2021). Its SOC dataset therefore includes both salient and non-salient images and adds challenge annotations such as clutter, occlusion, out-of-view, and shape complexity. The paper argues that improving the dataset can yield larger gains than focusing only on decoder design, and it provides an updated benchmark of 100 models (Fan et al., 2021).

Taken together, these analyses establish that benchmark design in saliency is inseparable from the statistical properties of the data. A plausible implication is that leaderboard comparisons without bias diagnostics may overstate progress.

5. Extensions of SalBench beyond RGB still images

The RGB-T SalBench extends saliency benchmarking to paired visible and thermal imagery (Li et al., 2017). It contains 821 RGB-T image pairs captured using a SONY CCD camera and a FLIR thermal imager under approximately 60 different scenes, including sunny, snowy, and nighttime conditions. Spatial alignment is obtained via manually picked point correspondences and a least-squares homography, yielding pixel-level alignment accuracy sufficient for saliency annotation. Ground truth consists of binary masks annotated at the superpixel or pixel level by manually tracing object boundaries on the modality where the object is most clearly visible.

This benchmark also introduces 11 challenge attributes: BSO, SSO, MSO, LI, BW, CB, CIB, SA, TC, IC, and OF (Li et al., 2017). These attributes support challenge-sensitive analysis under conditions such as low illumination, similar appearance, thermal crossover, clutter, and cross-image-boundary cases. The benchmark implements three categories of baselines (RGB-only, thermal-only, and RGB-T fusion), each built by applying 12 standard saliency detectors with default parameters. On the full benchmark, the best RGB-only baseline is CA, the best thermal-only baseline is GMR, and the best RGB-T fusion baseline is GMR on concatenated features; each is reported on four scores, as is the proposed multi-task manifold ranking method (Li et al., 2017).

Video-based saliency benchmarking introduces a different temporal definition of saliency (Li et al., 2016). The benchmark dataset in “A Benchmark Dataset and Saliency-guided Stacked Autoencoders for Video-based Salient Object Detection” consists of 200 videos totaling 64 minutes, with 7,650 sampled keyframes and 7,467 retained keyframes after discarding frames containing only background or heavily split or occluded targets. The central definition is that salient objects in video are those that “consistently pop out” throughout the video by receiving high fixation density across time. Object selection combines manual mask annotation with eye-tracking data from 23 subjects, leading to 7,467 binary ground-truth masks (Li et al., 2016).

Another extension appears in salient object ranking. “Relative Saliency and Ranking: Models, Metrics, Data, and Benchmarks” defines saliency as inherently relative because multiple observers do not always agree on which objects are salient (Kalash et al., 2018). The associated CocoSalRank dataset is built from Microsoft-COCO instance segmentations with observer click data, and saliency is represented through per-object scores and ground-truth ranks. The benchmark evaluates predictions using Spearman’s rank correlation and Mean Absolute Rank Error rather than binary segmentation scores. This redefines the saliency problem from “is salient” to “how salient relative to other objects” (Kalash et al., 2018).
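Rank-based evaluation can be sketched as follows. Spearman's coefficient is computed here in its standard form, as the Pearson correlation of ranks with ties given average ranks. The `mean_absolute_rank_error` function is only one plausible reading of the metric's name (mean absolute difference between predicted and ground-truth ranks) and is an assumption, not the paper's stated definition.

```python
import math


def ranks(scores):
    """Rank 1 is the most salient object; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(pred_scores, true_scores):
    """Spearman's rank correlation: Pearson correlation computed on the ranks."""
    rp, rt = ranks(pred_scores), ranks(true_scores)
    n = len(rp)
    mean_p, mean_t = sum(rp) / n, sum(rt) / n
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(rp, rt))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_t = sum((b - mean_t) ** 2 for b in rt)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / math.sqrt(var_p * var_t)


def mean_absolute_rank_error(pred_scores, true_scores):
    """Assumed definition: average absolute gap between predicted and true ranks."""
    rp, rt = ranks(pred_scores), ranks(true_scores)
    return sum(abs(a - b) for a, b in zip(rp, rt)) / len(rp)


# Per-object saliency scores for three objects in one image (toy values).
true_scores = [0.8, 0.1, 0.5]
agreeing = spearman([0.9, 0.2, 0.6], true_scores)
reversed_order = spearman([0.1, 0.8, 0.5], true_scores)
rank_error = mean_absolute_rank_error([0.1, 0.8, 0.5], true_scores)
```

Both scores depend only on the ordering of objects, not on the absolute saliency values, which matches the shift from “is salient” to “how salient relative to other objects.”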

These variants show that the benchmark label “SalBench” has been adapted to multi-modal, temporal, and rank-based formulations. This suggests that the benchmark concept is portable across problem definitions so long as the operational notion of saliency is made explicit.

6. Low-level pop-out benchmarks and LVLM evaluation

A recent use of SalBench addresses a different failure mode: whether large vision-language models can detect low-level saliency cues that humans find obvious (Dahou et al., 7 Jul 2025). This benchmark repurposes two sources. The synthetic split, P3, contains 2,514 images organized into 810 color, 864 orientation, and 840 size cases, rendered as 7×7 grids that contain one odd-one-out item, with pixel jitter. The natural split, O3, contains 2,001 images, with 37% color singletons alone and 47% color combined with other features, alongside shape, size, orientation, texture, focus, location, and pattern as further feature classes.

SalBench defines three tasks, all framed as multi-label classification over feature classes (Dahou et al., 7 Jul 2025). Odd-One-Out Detection takes an image only and asks for all feature labels in which the singleton differs. Referring Odd-One-Out adds a text-provided bounding box identifying the odd object. Visual Referring Odd-One-Out instead highlights the odd object with a red box. For synthetic images the class set is $\{\text{color}, \text{orientation}, \text{size}\}$, whereas for natural images it covers the O3 feature classes (color, shape, size, orientation, texture, focus, location, and pattern). Evaluation uses exact-match accuracy and macro-averaged multi-label F1.
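The two evaluation scores can be illustrated with a minimal sketch over label sets. It assumes each prediction and target is a set of feature names, and that a class with no true or predicted instances contributes an F1 of 0 to the macro average (conventions differ on this point); the function names are hypothetical.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of samples whose predicted label set equals the target set exactly."""
    hits = sum(1 for p, t in zip(predictions, targets) if set(p) == set(t))
    return hits / len(targets)


def macro_f1(predictions, targets, classes):
    """Unweighted mean of per-class F1 over `classes`.

    Assumption: a class with no true or predicted instances scores 0.
    """
    per_class = []
    for label in classes:
        tp = sum(1 for p, t in zip(predictions, targets) if label in p and label in t)
        fp = sum(1 for p, t in zip(predictions, targets) if label in p and label not in t)
        fn = sum(1 for p, t in zip(predictions, targets) if label not in p and label in t)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


synthetic_classes = ["color", "orientation", "size"]
targets = [{"color"}, {"size"}, {"color", "orientation"}]
predictions = [{"color"}, {"orientation"}, {"color", "orientation"}]

accuracy = exact_match_accuracy(predictions, targets)
f1 = macro_f1(predictions, targets, synthetic_classes)
```

In this toy case the size anomaly is mislabeled as orientation, which costs one exact match and drives the size F1 to zero. Macro averaging makes that single-class failure visible even when the other classes are predicted well.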

The benchmark controls anomaly difficulty explicitly. For orientation, the singleton's rotation angle falls into one of three ranges, corresponding to hard, medium, and easy cases. For size, the singleton area ratio falls into one interval for hard cases, another for medium cases, and below a lower bound or above an upper bound for easy cases (Dahou et al., 7 Jul 2025).

Quantitative results show a sharp synthetic–natural gap. On the natural split, zero-shot F1 scores are 47.6, 47.3, and 42.6 for GPT-4o on Detection, Referring, and Visual Referring, respectively; 41.6, 44.6, and 41.7 for Qwen-72B; and 48.2, 51.1, and 53.9 for Claude (Dahou et al., 7 Jul 2025). On the synthetic split, the same models reach near-ceiling F1 in zero-shot detection and referring, with GPT-4o at 89.2% and 88.4%, and Qwen2-72B at 88.8% and 93.6%. Breakdown by feature for GPT-4o on synthetic detection shows that color is easiest and size is hardest: color yields 99.8%, 100.0%, and 66.1% on easy, medium, and hard cases, whereas size yields 93.3%, 59.1%, and 36.8% (Dahou et al., 7 Jul 2025).

The benchmark reports that naïve human subjects score approximately 100% on synthetic and above 95% on natural splits. It also identifies failure modes: subtle size anomalies are often misclassified as orientation or color, models neglect focus and depth cues in natural images, and F1 drops by approximately 7–10% when distractor count exceeds 25 (Dahou et al., 7 Jul 2025). Within this formulation, SalBench acts less as a segmentation benchmark than as a perceptual probe for whether LVLMs encode low-level visual features.

7. Significance, controversies, and future directions

Across its different instantiations, SalBench serves as an instrument for clarifying what is meant by saliency in computational vision. In the 2015 SOD benchmark, it differentiates salient object detection from fixation prediction and objectness, and concludes that methods designed specifically for salient object detection outperform neighboring paradigms (Borji et al., 2015). In the metric-centered benchmarking literature, the core controversy is that inconsistent metrics can produce inconsistent rankings unless models, maps, and metrics are explicitly separated (Kümmerer et al., 2017). In RGB-T and video settings, benchmark design emphasizes modality complementarity and temporal persistence rather than single-frame contrast (Li et al., 2017). In relative saliency, the benchmark controversy concerns whether binary saliency is ill-posed when multiple observers disagree, motivating object ranking instead of object detection (Kalash et al., 2018).

Open problems are similarly benchmark-dependent but recurrent in theme. The 2015 SOD benchmark calls for more challenging datasets containing multi-object, off-center, cluttered scenes and “background-only” images to test “no-salient-object” cases (Borji et al., 2015). SOC advances this agenda by explicitly including non-salient images and challenge-specific attributes, while arguing that dataset improvement can produce larger gains than decoder redesign alone (Fan et al., 2021). The RGB-T benchmark identifies future directions such as more scenes, more modalities, high-level priors or deep features, adaptive graph construction, and faster solvers (Li et al., 2017). The LVLM SalBench argues that “seeing the obvious” remains a major gap in current models and suggests that training curricula should explicitly include saliency-driven tasks and that neuroscience-inspired attention modules may help mimic early visual processing (Dahou et al., 7 Jul 2025).

A broad synthesis is possible without collapsing these benchmarks into one object. SalBench, across its usages, denotes a benchmark-oriented effort to isolate, measure, and stress-test saliency mechanisms under controlled assumptions. Those assumptions may concern binary foreground masks, fixation densities, relative rank, RGB-T complementarity, video-wide consistency, or low-level odd-one-out perception. The shared lesson is that benchmark construction is not ancillary to saliency research; it is one of the principal ways the field defines the problem itself.