Applicability of ShufflEval to image captioning

Determine whether and how the ShufflEval evaluation methodology, which relies on segment-by-segment translation and order-based plausibility comparisons, can be applied to image captioning, where the source modality lacks an inherent linear order. If it can, construct a concrete adaptation that enables reference-free evaluation of image captions.

Background

ShufflEval evaluates translators by checking whether the original order of translated segments is more plausible than shuffled permutations, leveraging temporal order as implicit grounding. The paper notes that this relies on a common linear order between source and target, which holds naturally for sequential modalities (e.g., dialogue turns, paragraphs, video frames) but not for static images.
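The order-based comparison described above can be illustrated with a minimal sketch. It is not the paper's exact procedure: the function name `order_preference_rate`, the exhaustive enumeration of permutations, and the toy scorer `toy_plausibility` are all assumptions made for illustration. In practice the plausibility function would presumably be something like a language model's likelihood of the concatenated segments.

```python
from itertools import permutations
from typing import Callable, Sequence


def order_preference_rate(
    segments: Sequence[str],
    plausibility: Callable[[Sequence[str]], float],
) -> float:
    """Fraction of non-identity orderings that score strictly below the original.

    Hypothetical helper: enumerates every permutation, so it is only
    practical for a handful of segments.
    """
    n = len(segments)
    if n < 2:
        raise ValueError("need at least two segments to compare orderings")
    identity = tuple(range(n))
    original_score = plausibility(list(segments))
    wins = 0
    total = 0
    for order in permutations(range(n)):
        if order == identity:
            continue  # skip the original ordering itself
        total += 1
        shuffled = [segments[i] for i in order]
        if plausibility(shuffled) < original_score:
            wins += 1
    return wins / total


def toy_plausibility(segments: Sequence[str]) -> float:
    """Toy stand-in scorer (assumption): counts adjacent pairs whose leading
    step numbers increase. A real scorer would judge the text itself."""
    steps = [int(s.split()[0]) for s in segments]
    return float(sum(a < b for a, b in zip(steps, steps[1:])))


translated = ["1 the animal approaches", "2 it calls out", "3 the group answers"]
rate = order_preference_rate(translated, toy_plausibility)
print(rate)
```

A rate near 1 means the original order beats almost every shuffle under the chosen scorer, and a rate near 0 means the order carries no detectable signal. The sketch also shows why a single image is problematic: without an ordered sequence of source segments there is no original order to compare against.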

Because image captioning lacks an inherent temporal sequence in the source, directly applying ShufflEval is non-obvious. Establishing an adaptation (or determining infeasibility) would clarify whether reference-free, order-based evaluation can extend beyond sequential data to single-image inputs.

References

Put another way, it is not clear how to apply ShufflEval to image captioning, but it could be applied to describing videos.

— On Non-interactive Evaluation of Animal Communication Translators (2510.15768 - Paradise et al., 17 Oct 2025) in Section 2.4 (Implications for ShufflEval)
