Answer is All You Need: Instruction-following Text Embedding via Answering the Question (2402.09642v1)

Published 15 Feb 2024 in cs.CL

Abstract: This work aims to build a text embedder that can capture characteristics of texts specified by user instructions. Despite its tremendous potential to deploy user-oriented embeddings, none of previous approaches provides a concrete solution for it. This paper offers a new viewpoint, which treats the instruction as a question about the input text and encodes the expected answers to obtain the representation accordingly. Intuitively, texts with the same (implicit) semantics would share similar answers following the instruction, thus leading to more similar embeddings. Specifically, we propose InBedder that instantiates this embed-via-answering idea by only fine-tuning LLMs on abstractive question answering tasks. InBedder demonstrates significantly improved instruction-following capabilities according to our proposed instruction awareness tests and instruction robustness tests, when applied to both LLMs (e.g., llama-2-7b) and smaller encoder-based LMs (e.g., roberta-large). Additionally, our qualitative analysis of clustering outcomes, achieved by applying different instructions to the same corpus, demonstrates a high degree of interpretability.


Summary

  • The paper introduces the InBedder framework, a method that generates text embeddings by answering user-defined instructions to capture semantic nuances.
  • The authors demonstrate that encoding strategies like avg-gen and 1st-gen outperform traditional prompt-based methods on instruction-driven benchmarks.
  • This approach enables personalized search, text clustering, and interpretable AI, paving the way for scalable and efficient NLP systems.

Answer is All You Need: Instruction-following Text Embedding via Answering the Question

The paper "Answer is All You Need: Instruction-following Text Embedding via Answering the Question" presents a novel approach to text embedding that addresses a limitation of existing models: they cannot capture user-specified characteristics of a text. Traditional text embedders are designed to encode general textual similarity and have no mechanism for following user-defined instructions. This research introduces InBedder, a framework that creates text embeddings by generating responses to user-defined instructions.

The InBedder Framework

The authors propose a paradigm shift in text embedding by treating the instruction as a query and using the generated answers to derive the embeddings. The underlying hypothesis is that texts with similar semantics will yield similar responses to the same instructions, thereby leading to embeddings that are closer in the vector space.

To validate this hypothesis, the authors fine-tune existing LLMs on a union of 11 abstractive question-answering datasets, amounting to approximately 200,000 paragraph-question-answer triplets. The fine-tuning objective is to generate concise, informative responses; extraneous content is filtered out to keep answers brief, yielding an average answer length of 2.89 words.
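The summary does not give the paper's exact prompt template, but the fine-tuning setup can be sketched as follows: each paragraph-question-answer triplet is rendered into an instruction-style prompt, with the loss typically applied only to the answer tokens. The template wording below is an illustrative assumption, not the paper's verbatim format.

```python
def format_qa_example(paragraph: str, question: str, answer: str) -> dict:
    """Render one paragraph-question-answer triplet for causal-LM fine-tuning.

    Hypothetical template: the paper's actual prompt wording may differ.
    During training, the loss would be computed only on the `completion`.
    """
    prompt = (
        f"### Input:\n{paragraph}\n\n"
        f"### Instruction:\n{question}\n\n"
        f"### Response:\n"
    )
    return {"prompt": prompt, "completion": answer}


example = format_qa_example(
    "The cafe on Elm Street serves excellent espresso.",
    "What is the sentiment of this text?",
    "positive",
)
```

Because the target answers are short (2.89 words on average), the model learns to compress the instruction-relevant content of the input into a few tokens, which is what makes the answer-side hidden states usable as embeddings.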

Methodology and Evaluation

Encoding Methods: The paper explores several encoding strategies from the hidden states of LLMs:

  1. Direct Encoding:
    • Average of generation (avg-gen)
    • Average of prompt hidden states (avg-ppt)
    • Hidden states used to predict the first token in generations (1st-gen)
    • Last generation hidden states (last-gen)
    • Average of all hidden states (avg-all)
  2. Re-encoding:
    • Generating answers and re-encoding them using another lightweight sentence transformer.
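The direct-encoding strategies above are all pooling choices over the model's hidden states. A toy sketch with random vectors standing in for real hidden states (shapes and pooling only; no actual LLM is run here) makes the distinctions concrete: the "1st-gen" state is the hidden state at the last prompt position, since that is the state used to predict the first generated token.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                      # toy hidden size
prompt_hidden = rng.normal(size=(5, d))    # hidden states over 5 prompt tokens
gen_hidden = rng.normal(size=(3, d))       # hidden states over 3 generated tokens

# avg-gen: average over the generated answer tokens
avg_gen = gen_hidden.mean(axis=0)

# avg-ppt: average over the prompt tokens
avg_ppt = prompt_hidden.mean(axis=0)

# 1st-gen: the state that predicts the first generated token,
# i.e. the hidden state at the final prompt position
first_gen = prompt_hidden[-1]

# last-gen: hidden state at the final generated token
last_gen = gen_hidden[-1]

# avg-all: average over prompt and generation together
avg_all = np.concatenate([prompt_hidden, gen_hidden]).mean(axis=0)
```

With a real model, `prompt_hidden` and `gen_hidden` would come from the final-layer hidden states of a forward pass and generation loop; the pooling arithmetic is unchanged.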

Observations: The experiments reveal that generated answers (avg-gen) hold significantly more information pertinent to the instruction than the prompt-based hidden states (avg-ppt). Additionally, re-encoding approaches enhance embedding quality, with hidden states corresponding to the first generated token (1st-gen) demonstrating particularly strong performance in the fine-tuned models.

Performance Evaluation: The paper introduces new benchmarks for evaluating instruction-following capabilities, which include:

  • IntentEmotion: A triplet task evaluating embeddings based on intent and emotion.
  • InstructSTSB: An instruction-based semantic textual similarity task.
  • NYTClustering: Clustering tasks that vary the instruction over the same corpus to test model adaptability.
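A triplet task such as IntentEmotion can be scored by checking, for each (anchor, positive, negative) triple, whether the anchor embedding is closer to the text sharing the instructed property than to the one that differs. The exact metric used in the paper is not spelled out in this summary; the sketch below shows one standard cosine-similarity formulation with toy 2-D embeddings.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_accuracy(anchors, positives, negatives):
    """Fraction of triplets where the anchor is closer to the positive
    (same property under the instruction) than to the negative."""
    hits = sum(
        cosine(a, p) > cosine(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return hits / len(anchors)

# Toy embeddings: the anchor points the same way as the positive.
anchors = [np.array([1.0, 0.0])]
positives = [np.array([0.9, 0.1])]
negatives = [np.array([0.0, 1.0])]
print(triplet_accuracy(anchors, positives, negatives))  # → 1.0
```

The same anchor text can appear in two triples with different instructions (e.g. intent vs. emotion), which is precisely what makes the benchmark a test of instruction-following rather than generic similarity.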

The InBedder framework outperforms both traditional sentence transformers and other LLM-based embeddings across these benchmarks. Importantly, InBedder demonstrates a robust understanding of correct and implicit instructions while maintaining high-quality embeddings even under incorrect instructions.

Implications and Future Developments

Practical Implications: InBedder's ability to generate high-quality, instruction-specific text embeddings presents significant advancements for user-driven applications such as personalized search engines, customized text clustering, and interpretable AI systems. It offers a flexible tool for aligning text embeddings with user-defined criteria, thereby enhancing the contextual relevance and utility of NLP systems in complex, application-specific scenarios.

Theoretical Implications: The proposed instruction-following framework leverages the interpretability of LLMs by utilizing expected answer distributions rather than concatenated instruction-text pairs. This novel approach promotes a deeper semantic understanding and more effective embeddings in diverse NLP tasks.

Future Work: Future investigations could explore more efficient solutions for large-scale retrieval systems, potentially by integrating InBedder with query-dependent reranker systems to minimize latency. Additionally, enhancing the effectiveness of InBedder in generic embedding tasks through optimized prompt designs remains an open research area.

Conclusion

The "Answer is All You Need" paper introduces a substantial advancement in the domain of text embeddings by leveraging instruction-following capabilities. Through the InBedder framework, the paper demonstrates superior performance in instruction-awareness and robustness while maintaining competitive results in traditional tasks. This research sets a new trajectory for developing user-oriented embedding models, encouraging further exploration into more efficient, scalable, and effective solutions in NLP.

[Link to the paper's repository: https://github.com/zhang-yu-wei/InBedder]
