Size-Invariant Evaluation (SIEva)

Updated 4 July 2026

Size-Invariant Evaluation (SIEva) is an evaluation protocol that partitions images into object-level parts and neutralizes size bias by ensuring each region contributes equally.
It reformulates traditional metrics by replacing size-dependent weights with a constant, allowing fair assessment of both large and small salient objects.
SIEva was introduced for salient object detection, where it improves assessment in multi-object and small-object scenarios; the same principle also appears in graph drawing and optimizer comparisons.

Size-Invariant Evaluation (SIEva) denotes an evaluation principle in which performance is judged after removing or neutralizing variation caused solely by the size, scale, or partitioning of the representation rather than by the underlying quality of the object being assessed. In the arXiv literature, the term is used most explicitly for salient object detection (SOD), where SIEva reformulates metrics so that each salient object is evaluated separately and then aggregated with equal weight, instead of allowing large regions to dominate the score (Bao et al., 19 Sep 2025). The same methodological logic also appears in other domains: graph drawing metrics that should not change under uniform layout scaling, optimizer comparisons that should not depend on mini-batch partitioning, and representation analyses that test whether semantic features remain stable across size regimes (Ahmed et al., 2024).

1. Definition and conceptual scope

SIEva is a size-invariant evaluation protocol for SOD. Its core idea is to split an image into object-level parts, evaluate each part independently, and then aggregate the per-part scores with equal weight. The framework is motivated by the observation that standard SOD metrics are intrinsically size-sensitive: in multi-object scenes, larger salient regions dominate the final score, while smaller yet potentially important objects can be missed with relatively little penalty (Bao et al., 19 Sep 2025).

In this formulation, size invariance does not mean discarding object structure or suppressing background. It means replacing the size-related contribution term in the metric by a constant so that evaluation depends on object-wise quality rather than pixel mass. The same papers emphasize that this is not merely a cosmetic change in reporting. Because standard losses are usually aligned with standard metrics, size-sensitive evaluation can induce size-sensitive optimization and thereby bias model development toward large objects (Li et al., 2024).

A broader interpretation of SIEva is suggested by related work outside SOD. In graph drawing, for example, a layout metric is considered unfair if uniformly scaling a geometrically identical drawing changes the score; in that setting, scale-normalized stress is proposed precisely because evaluation should reflect geometric fidelity rather than arbitrary coordinate units (Ahmed et al., 2024). This suggests a general principle: a valid evaluation protocol should factor out nuisance transformations that do not alter the intrinsic object of interest.

2. Mathematical basis: why conventional metrics are size-sensitive

The SIEva literature formalizes evaluation by partitioning an image into non-overlapping parts and then asking whether a metric decomposes into part-wise terms with region-size-dependent weights. A metric is called separable if

$v(f(\boldsymbol{X}), \boldsymbol{Y})=\sum_{k=1}^K \lambda(\boldsymbol{X}_k)\cdot v(f(\boldsymbol{X}_k), \boldsymbol{Y}_k),$

where $\lambda(\boldsymbol{X}_k)$ depends on the size of region $k$ (Bao et al., 19 Sep 2025).

This decomposition exposes the source of size bias. For MAE, the weight is explicitly proportional to region size. In the notation of the SOD papers, the global metric can be rewritten as a weighted sum of per-part MAEs with

$\lambda(\boldsymbol{X}_k)=\mathbb{P}_{X_k}=\frac{S_k}{S},$

so larger regions contribute more to the total score (Bao et al., 19 Sep 2025). The earlier formulation of the same idea states the same result as

$MAE(f)=\sum_{i=1}^{N_c}\frac{S_i}{S}\cdot MAE(f_i),$

again showing that larger regions dominate (Li et al., 2024).
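The decomposition can be checked numerically. The following is a minimal sketch, assuming a flattened toy image of eight pixels split into two hypothetical parts (the pixel values and the helper names `mae` and `weighted_part_mae` are illustrative and not taken from the papers). It shows that the global MAE equals the sum of per-part MAEs weighted by $S_k/S$.

```python
def mae(pred, gt):
    """Mean absolute error between two equal-length sequences of pixel values."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def weighted_part_mae(pred, gt, parts):
    """Recombine per-part MAEs using the size weights S_k / S.

    `parts` is a list of index lists that partition range(len(pred)).
    """
    total = len(pred)
    score = 0.0
    for idx in parts:
        part_pred = [pred[i] for i in idx]
        part_gt = [gt[i] for i in idx]
        score += (len(idx) / total) * mae(part_pred, part_gt)
    return score


# Toy data (assumed): a 6-pixel part predicted well and a 2-pixel part predicted badly.
pred = [0.9, 0.8, 1.0, 0.7, 0.9, 1.0, 0.1, 0.2]
gt = [1, 1, 1, 1, 1, 1, 1, 1]
parts = [[0, 1, 2, 3, 4, 5], [6, 7]]

global_mae = mae(pred, gt)
recombined = weighted_part_mae(pred, gt, parts)
small_part_mae = mae(pred[6:], gt[6:])
print(global_mae, recombined, small_part_mae)
```

In this toy case the small part has a per-part MAE of 0.85, yet the global score stays near 0.3 because the small part carries only a quarter of the weight.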

The same argument is extended to composite metrics, defined as compositions of separable functions,

$V(f(\boldsymbol{X}), \boldsymbol{Y})=(v_1\circ v_2 \circ \dots \circ v_T)(f(\boldsymbol{X}),\boldsymbol{Y}).$

In the SOD setting, $F$-measure and AUC inherit size dependence through foreground-region weights proportional to salient-region size. Even rank-based AUC remains size-sensitive in this decomposition because each positive region contributes in proportion to the number of salient pixels it contains (Bao et al., 19 Sep 2025).

The practical consequence is a systematic misalignment between metric value and object-balanced judgment. The SOD papers give the canonical failure mode: a prediction that captures only the large object can obtain a better standard score than a prediction that detects both large and small objects more faithfully (Li et al., 2024). An analogous failure appears in graph drawing, where normalized stress can rank a random layout better than stress-optimized layouts over wide scale ranges simply because the drawing was output at a more favorable size (Ahmed et al., 2024).

3. Partitioning strategy and size-invariant metric construction

SIEva corrects size sensitivity in two steps. First, it removes the size-dependent weight by setting it to a constant:

$\lambda(\boldsymbol{X}_k)\equiv 1.$

For separable metrics, the size-invariant form becomes

$v_{SI}(f)=\frac{1}{|\mathcal{C}(\boldsymbol{X})|}\sum_{k=1}^{|\mathcal{C}(\boldsymbol{X})|} v(f_k),$

so each component contributes equally and the factor $1/|\mathcal{C}(\boldsymbol{X})|$ keeps the score in $[0,1]$ (Bao et al., 19 Sep 2025).

Second, the image must be partitioned. The default SIEva partition is object bounding-box based. Connected component analysis is run on the ground-truth saliency mask; each connected salient region becomes a proxy object; each object is enclosed in its minimum rectangle bounding box; and the remaining pixels form a background region. Because many SOD datasets provide only binary masks rather than instance labels, connected components are used as object proxies and can be computed offline (Bao et al., 19 Sep 2025).
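The partition step can be sketched in a few lines. This is an illustrative sketch only: it assumes 4-connectivity (the connectivity rule is not stated here), represents the mask as a list of lists, and does not treat overlapping boxes specially. The function names are hypothetical.

```python
from collections import deque


def connected_components(mask):
    """Collect 4-connected foreground regions of a binary mask (list of lists)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                queue = deque([(r, c)])
                seen[r][c] = True
                pixels = []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < h and 0 <= nx < w
                        if inside and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    return components


def bounding_box(pixels):
    """Minimum axis-aligned box as (top, left, bottom, right), inclusive."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def partition(mask):
    """Return one box per connected component plus the pixels outside every box."""
    boxes = [bounding_box(p) for p in connected_components(mask)]
    h, w = len(mask), len(mask[0])
    background = [
        (y, x)
        for y in range(h)
        for x in range(w)
        if not any(t <= y <= b and l <= x <= r for t, l, b, r in boxes)
    ]
    return boxes, background


# Toy ground-truth mask (assumed): a 2x2 object and a single-pixel object.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
boxes, background = partition(mask)
print(boxes, len(background))
```

Because the partition depends only on the ground-truth mask, it can be precomputed once per image, which matches the offline computation mentioned above.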

The resulting metrics are direct object-wise analogues of standard SOD scores.

Metric	Standard weighting	Size-invariant form
MAE	Weighted by region size	Average per part, with background factor
F-measure	Weighted by salient-region mass	Average per foreground object
AUC	Weighted by positive-region contribution	Average per foreground object

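For MAE, the size-invariant version with adaptive background weighting, over $K$ foreground object regions and one background region indexed $K+1$, is

$MAE_{SI}(f)=\frac{1}{K+\alpha}\left[\sum_{k=1}^{K} MAE(f_k^{fore})+\alpha\cdot MAE(f_{K+1}^{back})\right],$

with

$\alpha=\frac{S_{K+1}^{back}}{\sum_{k=1}^{K} S_k^{fore}}.$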

This design preserves equal treatment among salient objects while still discouraging false positives in the background (Bao et al., 19 Sep 2025).
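As a rough illustration, the sketch below computes a size-invariant MAE under stated assumptions: foreground regions are averaged with equal weight, the background term is weighted by $\alpha$ (background size divided by total foreground size), and the sum is normalized by $K+\alpha$. The region contents and the function name `si_mae` are hypothetical.

```python
def mae(pred, gt):
    """Mean absolute error over one region."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def si_mae(fore_parts, back_part):
    """Size-invariant MAE sketch.

    fore_parts: list of (pred, gt) pairs, one per foreground object region.
    back_part:  a single (pred, gt) pair for the background region.
    Assumed form: (sum_k MAE_k + alpha * MAE_back) / (K + alpha).
    """
    k = len(fore_parts)
    fore_size = sum(len(gt) for _, gt in fore_parts)
    alpha = len(back_part[1]) / fore_size  # background size / total foreground size
    fore_sum = sum(mae(p, g) for p, g in fore_parts)
    return (fore_sum + alpha * mae(*back_part)) / (k + alpha)


# Toy case (assumed): large object detected perfectly, small object missed entirely.
large = ([1] * 6, [1] * 6)
small = ([0, 0], [1, 1])
back = ([0] * 8, [0] * 8)

si_score = si_mae([large, small], back)
std_score = mae(large[0] + small[0] + back[0], large[1] + small[1] + back[1])
print(std_score, si_score)
```

The standard MAE of 0.125 barely registers the missed object, while the size-invariant score of about 0.333 reflects that one of the two objects was lost.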

For the $F$-measure and AUC, the $K$ foreground objects are averaged directly:

$F_{SI}(f)=\frac{1}{K}\sum_{k=1}^{K} F(f_k^{fore})$

and

$AUC_{SI}(f)=\frac{1}{K}\sum_{k=1}^{K} AUC(f_k^{fore}).$

In the AUC definition, each object’s positive pixels are compared against all negative pixels in the full image (Bao et al., 19 Sep 2025).
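The object-wise AUC can be sketched as follows. This is a minimal illustration that uses a naive pairwise AUC with ties counted as one half (a common convention, assumed here) and made-up scores. Each object's positive pixels are ranked against the full set of negative pixels, and the per-object AUCs are averaged.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def si_auc(object_pos_scores, neg_scores):
    """Average per-object AUC, each object compared against all negatives."""
    per_object = [auc(pos, neg_scores) for pos in object_pos_scores]
    return sum(per_object) / len(per_object)


# Toy scores (assumed): a large object scored high, a small object scored low.
large_obj = [0.9] * 6
small_obj = [0.1, 0.1]
negatives = [0.2, 0.3, 0.2, 0.3]

pooled_auc = auc(large_obj + small_obj, negatives)
object_auc = si_auc([large_obj, small_obj], negatives)
print(pooled_auc, object_auc)
```

Pooling all positives gives 0.75 because six of the eight positive pixels belong to the large object, while the object-wise average is 0.5, since one object is ranked perfectly and the other not at all.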

A recurrent misconception is that size invariance requires ignoring background or treating the whole image as a set of isolated crops. The SIEva papers explicitly reject both readings. Background is retained through the background weight $\alpha$, and the partition is a structured decomposition of the original image rather than a replacement of global context (Li et al., 2024).

4. Optimization frameworks and theoretical analysis

The SIEva principle is paired with a training framework that applies the same object-wise logic to losses. In the earlier SOD paper this framework is called SI-SOD; in the later formulation it is called SIOpt. In both cases, the generic objective decomposes the image into object-level foreground regions and background, then aggregates their losses with balancing coefficients (Li et al., 2024).

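A representative form, with $K$ foreground regions and a background balancing coefficient $\alpha$, is

$\mathcal{L}_{SI}(f)=\sum_{k=1}^{K}\mathcal{L}(f_k^{fore})+\alpha\cdot\mathcal{L}(f_{K+1}^{back}).$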

This formulation is model-agnostic in the sense stated by the later paper: only the loss is modified, not the backbone architecture, so it can be integrated with CNNs, transformers, and SAM-based models (Bao et al., 19 Sep 2025).

For separable losses, the background balancing coefficient mirrors the metric design. For composite region-aware losses such as DiceLoss and IoU Loss, the earlier paper sets the background coefficient to zero ($\alpha=0$) because the background has no positive salient pixels and the region-level foreground objective is the main focus (Li et al., 2024). The same line of work also proposes a size-invariant AUC surrogate and a hybrid objective, together with Pixel-level Bipartite Acceleration (PBAcc), which rewrites the naive pairwise objective as a quadratic form so that the per-image time and space cost becomes nearly linear rather than quadratic in the number of pixels (Bao et al., 19 Sep 2025).
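The general idea behind this kind of acceleration can be illustrated with a standard identity. The sketch below is not PBAcc itself: it assumes a squared pairwise surrogate $(1-(s_i-s_j))^2$ averaged over positive-negative pixel pairs, and shows that expanding the square lets the double sum be computed from per-set moments in a single pass. The function names and scores are hypothetical.

```python
def pairwise_square_loss_naive(pos, neg):
    """Quadratic-cost reference: average of (1 - (s_i - s_j))^2 over all pairs."""
    total = 0.0
    for s_i in pos:
        for s_j in neg:
            total += (1.0 - (s_i - s_j)) ** 2
    return total / (len(pos) * len(neg))


def pairwise_square_loss_fast(pos, neg):
    """Linear-cost version using first and second moments.

    With a_i = 1 - s_i and b_j = s_j, the pair average of (a_i + b_j)^2 equals
    mean(a^2) + 2 * mean(a) * mean(b) + mean(b^2).
    """
    a = [1.0 - s for s in pos]
    mean_a = sum(a) / len(a)
    mean_a2 = sum(x * x for x in a) / len(a)
    mean_b = sum(neg) / len(neg)
    mean_b2 = sum(x * x for x in neg) / len(neg)
    return mean_a2 + 2.0 * mean_a * mean_b + mean_b2


# Toy saliency scores (assumed) for positive and negative pixels.
pos_scores = [0.9, 0.7, 0.2]
neg_scores = [0.1, 0.4]

naive = pairwise_square_loss_naive(pos_scores, neg_scores)
fast = pairwise_square_loss_fast(pos_scores, neg_scores)
print(naive, fast)
```

The fast version visits each pixel once, so its cost grows with the number of pixels rather than with the number of positive-negative pairs.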

The theory in these papers has two parts. First, they show that the object-wise reformulations better reflect the intended ranking of predictions in multi-object scenes. For instance, when one prediction perfectly detects only a large object and another partially detects both large and small objects, standard MAE or standard $F$-measure can assign equal scores, whereas the size-invariant versions prefer the prediction that covers both objects (Li et al., 2024). Second, they give generalization analyses: the earlier paper reports a theorem with an explicit rate, and the later paper states a bound of the same general type, depending on notation (Li et al., 2024). A key claim is that object-localized evaluation can yield a smaller Lipschitz constant for composite losses because the denominator is taken over an object box rather than a full image, which sharpens the bound (Bao et al., 19 Sep 2025).
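The ranking argument can be made concrete with a toy example. This is a sketch under assumptions: binary predictions, the common SOD choice $\beta^2=0.3$ for the $F$-measure (not stated in this article), a per-object score of 0 when an object has no true positives, and hypothetical region contents. Prediction A detects only the large object perfectly, while prediction B partially detects both objects.

```python
def f_measure(pred, gt, beta2=0.3):
    """F-measure for binary pred/gt lists; returns 0 when there are no true positives."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


def standard_f(regions):
    """Pool every region into one image-level F-measure."""
    pred = [p for r_pred, _ in regions for p in r_pred]
    gt = [g for _, r_gt in regions for g in r_gt]
    return f_measure(pred, gt)


def si_f(fore_regions):
    """Average the F-measure over foreground object regions with equal weight."""
    return sum(f_measure(p, g) for p, g in fore_regions) / len(fore_regions)


gt_large, gt_small, gt_back = [1] * 8, [1, 1], [0] * 10
back = ([0] * 10, gt_back)

# Prediction A: large object perfect, small object missed.
a_fore = [([1] * 8, gt_large), ([0, 0], gt_small)]
# Prediction B: one pixel missed in each object.
b_fore = [([1] * 7 + [0], gt_large), ([1, 0], gt_small)]

std_a, std_b = standard_f(a_fore + [back]), standard_f(b_fore + [back])
si_a, si_b = si_f(a_fore), si_f(b_fore)
print(std_a, std_b, si_a, si_b)
```

Both predictions have eight true positives and two false negatives, so the pooled $F$-measure is identical. The object-wise average separates them, with A at 0.5 and B at roughly 0.89.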

5. Empirical behavior in salient object detection

The empirical evidence is concentrated on SOD datasets with multiple objects and broad size variation. The earlier paper evaluates on MSOD, DUTS, ECSSD, DUT-OMRON, HKU-IS, PASCAL-S, SOD, and XPIE, with backbones including EDN, ICON, GateNet, LDF, and PoolNet (Li et al., 2024). The later paper extends reporting across RGB, RGB-D, RGB-T, and SAM-based settings, including DUTS, DUT-OMRON, MSOD, ECSSD, HKU-IS, SOD, PASCAL-S, XPIE, NJUD-TE, NLPR-TE, STERE, VT821, VT1000, VT5000, and TS-SAM fine-tuning (Bao et al., 19 Sep 2025).

The main empirical pattern is stable across versions of the framework. Gains are strongest on multi-object datasets and in the smallest object-size bins. On MSOD, the earlier paper reports average improvements on several metrics, including two SI metric variants, after SI optimization (Li et al., 2024). In fine-grained analysis, objects in the smallest size range benefit most, as illustrated by the gain EDN obtains on the SI metric after training with the SI loss. Performance gains also increase with the number of salient objects: on MSOD, the SI-metric gain for EDN after SI optimization is largest for images that contain many salient objects (Li et al., 2024).

The later paper reports the same qualitative structure on a broader experimental base. It states that SIOpt generally improves a broad set of size-invariant and standard metrics, AUC-type scores among them, with especially strong gains on multi-object datasets such as MSOD and HKU-IS (Bao et al., 19 Sep 2025). The papers also report ablations on background weighting: setting the background weight to zero causes poor background suppression and many false positives, a fixed nonzero weight is better, and the adaptive choice based on the ratio of background size to total foreground size performs best overall (Li et al., 2024).

These results are consistent with the motivating argument. Standard metrics already favor large objects, so changing the evaluation and loss has limited effect when one dominant object occupies most of the image. The effect becomes much larger when several salient objects coexist or when small objects occupy only a small fraction of image area. This suggests that SIEva is most consequential precisely in the imbalanced regimes for which it was designed.

6. Related size-invariant principles in other domains

Although the exact abbreviation SIEva is most closely associated with SOD, the underlying principle appears across several research areas.

Domain	Nuisance “size” variable	Invariant mechanism
Graph drawing	Uniform drawing scale	Scale-normalized stress
Optimizer comparison	Mini-/micro-batch partition size	Average squared micro-gradients in Adam
Deepfake detection	Face size relative to frame	Size Embedding with multi-identity aggregation
DNN representation analysis	Pixel size of depicted person	Size-binned concept embeddings
Spectropolarimetry	Incident beam diameter	Circular-sector power balance
Open quantum systems	Aggregate size	Adaptive basis with size-invariant scaling

In graph drawing, the paper “Size Should not Matter: Scale-invariant Stress Metrics” argues that normalized stress is not actually normalized with respect to drawing size. If a layout is uniformly scaled by a factor $\alpha>0$, the stress value can change drastically even though the drawing is geometrically identical up to scale. Writing normalized stress as $NS(\boldsymbol{X})=\sum_{i<j}(\|x_i-x_j\|-d_{ij})^2/d_{ij}^2$, where $d_{ij}$ is the graph-theoretic distance between nodes $i$ and $j$, the proposed scale-normalized stress (SNS) minimizes normalized stress over all uniform rescalings,

$SNS(\boldsymbol{X})=\min_{\alpha>0} NS(\alpha\boldsymbol{X}),$

with a closed-form optimizer

$\alpha^{*}=\frac{\sum_{i<j}\|x_i-x_j\|/d_{ij}}{\sum_{i<j}\|x_i-x_j\|^2/d_{ij}^2},$

and is presented as the fair comparison metric because two drawings that differ only by uniform scaling receive the same score (Ahmed et al., 2024). This is a direct instance of size-invariant evaluation in practice.
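The scale invariance can be checked with a short sketch. It assumes normalized stress of the form $\sum_{i<j}(\|x_i-x_j\|-d_{ij})^2/d_{ij}^2$ and obtains the optimal scale by setting the derivative with respect to the scale factor to zero. The toy graph, layout, and function names are illustrative, and the sketch does not handle the degenerate layout in which all points coincide.

```python
import math


def pairs(n):
    """All index pairs i < j."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def normalized_stress(points, dist, scale=1.0):
    """Sum over pairs of ((scale * ||x_i - x_j|| - d_ij) / d_ij)^2."""
    return sum(
        ((scale * math.dist(points[i], points[j]) - dist[i][j]) / dist[i][j]) ** 2
        for i, j in pairs(len(points))
    )


def optimal_scale(points, dist):
    """Closed-form minimizer of normalized stress over uniform rescalings."""
    num = den = 0.0
    for i, j in pairs(len(points)):
        ratio = math.dist(points[i], points[j]) / dist[i][j]
        num += ratio
        den += ratio * ratio
    return num / den


def scale_normalized_stress(points, dist):
    """Normalized stress evaluated at the optimal uniform scale."""
    return normalized_stress(points, dist, optimal_scale(points, dist))


# Toy input (assumed): a 3-node path graph drawn with a bend.
graph_dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
layout = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
layout_x10 = [(10 * x, 10 * y) for x, y in layout]

ns_small, ns_big = normalized_stress(layout, graph_dist), normalized_stress(layout_x10, graph_dist)
sns_small, sns_big = scale_normalized_stress(layout, graph_dist), scale_normalized_stress(layout_x10, graph_dist)
print(ns_small, ns_big, sns_small, sns_big)
```

The plain normalized stress differs greatly between the two drawings, while the scale-normalized value agrees up to floating-point error.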

In large-scale optimization, “Batch size invariant Adam” changes only the second-moment update: instead of squaring the average mini-batch gradient, it averages the squared micro-gradients. The paper’s point is that the raw second moment then no longer depends on how the data are grouped into micro-batches, at least in the idealized proof setting. The invariance claim is local in update size and uses linear hyperparameter scaling, but it is explicitly framed as a way to make training comparisons less sensitive to arbitrary batch partitioning (Wang et al., 2024).
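The difference between the two second-moment statistics can be seen on scalar toy gradients. This is a simplified sketch: it ignores Adam's exponential moving average and bias correction, takes a plain average of the statistic over mini-batches, and uses made-up micro-gradients. The function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def squared_mean(micro_grads):
    """Standard statistic: square the averaged mini-batch gradient."""
    return mean(micro_grads) ** 2


def mean_of_squares(micro_grads):
    """Batch-size-invariant statistic: average the squared micro-gradients."""
    return mean([g * g for g in micro_grads])


def averaged_over_batches(stat, micro_grads, batch_size):
    """Group micro-gradients into equal mini-batches and average the statistic."""
    batches = [
        micro_grads[i:i + batch_size]
        for i in range(0, len(micro_grads), batch_size)
    ]
    return mean([stat(b) for b in batches])


# Toy micro-gradients (assumed) for a single scalar parameter.
micro_grads = [1.0, -1.0, 2.0, 0.0]

std_one_batch = averaged_over_batches(squared_mean, micro_grads, 4)
std_two_batches = averaged_over_batches(squared_mean, micro_grads, 2)
inv_one_batch = averaged_over_batches(mean_of_squares, micro_grads, 4)
inv_two_batches = averaged_over_batches(mean_of_squares, micro_grads, 2)
print(std_one_batch, std_two_batches, inv_one_batch, inv_two_batches)
```

Squaring the averaged gradient gives 0.25 with one mini-batch of four and 0.5 with two mini-batches of two, whereas averaging the squared micro-gradients gives 1.5 under both groupings.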

In video deepfake detection, MINTIME is described as size-invariant and multi-identity aware because it preserves information about each face’s area relative to the frame rather than normalizing it away. Its Size Embedding maps the face-frame area ratio into one of a fixed number of discrete bins, and the model combines this with Temporal Coherent Positional Embedding and Identity-aware Attention so that multiple identities can be aggregated without reducing the video to the largest face (Coccomini et al., 2022). Here the invariant idea is not metric reformulation but explicit encoding of size so that predictions remain robust when size varies.

At the representation-analysis level, “Verification of Size Invariance in DNN Activations using Concept Embeddings” defines size invariance operationally as stability of concept-embedding directions and segmentation performance across person-size bins. Using concept models on intermediate activations, the study finds that body-part representations in AlexNet, VGG16, and Mask R-CNN are mostly size invariant, although the far category deviates the most and category-specific embeddings are not perfectly identical (Schwalbe, 2021). This work is closely aligned with SIEva in spirit because it evaluates invariance at the level of internal representation rather than only final predictions.

Other works show that the notion has domain-specific limits. Beam-size invariant spectropolarimetry depends on a centered beam with angle-independent intensity distribution; otherwise the equal-power argument across circular sectors breaks down (Ding et al., 2017). The adHOPS extension in MesoHops achieves size-invariant scaling in runtime for localized open-quantum-system dynamics, but this is a scaling methodology rather than an evaluation protocol (Citty et al., 2024). RISWIE provides rigid-transformation-invariant comparison of distributions and point clouds, yet it is explicitly not scale-invariant by default, so it is better viewed as an instructive invariant-comparison template than as a direct SIEva method (He et al., 11 Oct 2025).

Taken together, these works establish a consistent methodological theme. A metric, optimizer, representation probe, or physical instrument should not change its judgment merely because the same underlying object is represented at a different scale, size, or partition granularity. In SOD, this theme is formalized under the name SIEva; elsewhere, it appears as scale-normalized stress, batch-size-invariant optimization, size-aware representation learning, and beam-size-invariant measurement. The common criterion is that evaluation should track intrinsic structure rather than arbitrary size conventions.