Cause of performance decrease on Webis Touché-2020

Determine the underlying cause of the recurrent decrease in effectiveness of learned ranking models, including the SPLADE-v3 sparse retriever, on the Webis Touché-2020 argument retrieval dataset. Characterize the dataset-specific or model-specific factors responsible for this outcome, so that potential remedies can be identified.

Background

In their meta-analysis across 44 query sets, the SPLADE-v3 authors (Lassance et al., 2024) report statistically significant improvements over BM25 for most datasets, but note a performance decrease on a small subset. One such dataset is Webis Touché-2020, on which learned ranking models often underperform BM25.

The issue does not appear to be unique to SPLADE-v3: the authors note that the same observation is recurrent with learned ranking models, which suggests a systematic problem tied to the dataset’s characteristics or to the behavior of such models. They state that the actual cause remains unknown.
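One possible first step toward characterizing the decrease is to look at per-query differences between BM25 and a learned ranker, rather than only the dataset-level average. The sketch below is illustrative and is not the authors' evaluation code: the function names, the choice of nDCG@10 with linear gains, and the toy relevance judgments and rankings are all assumptions made for the example.

```python
import math
from typing import Dict, List

# Hypothetical in-memory formats, chosen for this sketch only.
Qrels = Dict[str, Dict[str, int]]  # query id -> {doc id: graded relevance}
Run = Dict[str, List[str]]         # query id -> ranked doc ids, best first


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain with linear gains and a log2 discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranking: List[str], judged: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query; unjudged documents count as non-relevant."""
    gains = [judged.get(doc, 0) for doc in ranking[:k]]
    ideal_dcg = dcg(sorted(judged.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def per_query_delta(qrels: Qrels, baseline: Run, learned: Run,
                    k: int = 10) -> Dict[str, float]:
    """nDCG@k of the learned run minus nDCG@k of the baseline, per query."""
    return {
        qid: ndcg_at_k(learned.get(qid, []), judged, k)
        - ndcg_at_k(baseline.get(qid, []), judged, k)
        for qid, judged in qrels.items()
    }


# Toy data (made up): the learned run ranks an unjudged document first for q1.
qrels = {"q1": {"d1": 2, "d2": 1}, "q2": {"d3": 1}}
bm25_run = {"q1": ["d1", "d2", "d9"], "q2": ["d3", "d8"]}
learned_run = {"q1": ["d9", "d2", "d1"], "q2": ["d3", "d8"]}

deltas = per_query_delta(qrels, bm25_run, learned_run)
# List queries from the largest loss to the largest gain.
for qid, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{qid}: {delta:+.4f}")
```

Sorting by the delta puts the queries on which the learned run loses the most at the top, which is where a manual inspection of the retrieved documents would likely start. Treating unjudged documents as non-relevant is itself an assumption that can affect such a comparison.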

References

For Touché-2020, we are still unsure what is the actual issue, but this observation is recurrent with learned ranking models.

SPLADE-v3: New baselines for SPLADE (2403.06789 - Lassance et al., 11 Mar 2024) in Section 3, Comparison to BM25 paragraph