Effect of Text Length on Embedding Inversion and Information Retention

Determine whether longer MS-Marco passages encoded by the Contriever text encoder do in fact contain more information, and whether this both makes exact reconstruction harder and causes the resulting embeddings to discard more details from the input, thereby reducing inversion performance as text length increases.

Background

The authors evaluate how the length of original texts affects inversion performance using the Contriever encoder on MS-Marco passages. They bucket texts by token count and observe that both F1 scores and cosine similarity generally decrease as length increases up to 64 tokens.
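The length-bucketed evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the bucket edges (8, 16, 32, 64), the whitespace tokenization (a stand-in for the encoder's real tokenizer), the token-overlap definition of F1, and the names token_f1, cosine, and summarize_by_length are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict


def token_f1(reference, prediction):
    """Token-overlap F1 between two whitespace-tokenized strings (assumed metric)."""
    ref, pred = reference.split(), prediction.split()
    overlap = sum((Counter(ref) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def summarize_by_length(records, edges=(8, 16, 32, 64)):
    """Group records into length buckets and average both metrics per bucket.

    records: iterable of (original_text, reconstructed_text,
                          original_embedding, reconstructed_embedding).
    edges:   hypothetical upper bounds (inclusive) on token count per bucket.
    Returns {edge: (mean_f1, mean_cosine, count)}; texts longer than the
    last edge are skipped.
    """
    buckets = defaultdict(list)
    for original, reconstructed, emb_orig, emb_recon in records:
        n_tokens = len(original.split())
        for edge in edges:
            if n_tokens <= edge:
                buckets[edge].append(
                    (token_f1(original, reconstructed), cosine(emb_orig, emb_recon))
                )
                break
    summary = {}
    for edge, scores in buckets.items():
        count = len(scores)
        mean_f1 = sum(f1 for f1, _ in scores) / count
        mean_cos = sum(cos for _, cos in scores) / count
        summary[edge] = (mean_f1, mean_cos, count)
    return summary
```

Given such a summary, the reported trend would appear as mean F1 and mean cosine similarity that shrink from the shortest bucket to the longest. Note that this only measures the trend; it does not by itself test the conjectured cause.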

They propose a conjectured explanation for the observed trend, suggesting that longer texts carry more information, which hampers exact reconstruction and leads embeddings to discard more input details. This explanation remains conjectural and invites empirical verification or theoretical justification.

References

Our conjectured explanation is that longer texts contain more information, making exact reconstruction harder and leading to embeddings that discard more details from the input.

Universal Zero-shot Embedding Inversion (2504.00147 - Zhang et al., 31 Mar 2025) in Subsection "Effect of Text Length" (Section 5); Table \ref{tab:length_effect}