RecIF-Bench: Generative Recommender Benchmark
- RecIF-Bench is an evaluation suite designed to assess generative recommender systems through diverse tasks and reproducible protocols.
- It incorporates eight tasks spanning semantic alignment, core recommendation, instruction following, and reasoning across short video, advertising, and e-commerce domains.
- The benchmark offers a large-scale dataset with 96 million interactions and open-sourced tooling to ensure robust and transparent model evaluation.
RecIF-Bench (“Recommendation Instruction-Following Benchmark”) is a holistic evaluation suite for generative recommender foundation models. Its authors present it as the first comprehensive testbed for assessing a wide spectrum of capabilities, from fundamental recommendation and semantic alignment to instruction following and reasoning, across multiple industry domains. Developed to address the limitations of narrow, specialist benchmarks, RecIF-Bench combines diverse tasks, interleaved user and item representations, and reproducible protocols for the emerging class of large-scale generative recommender systems (Zhou et al., 31 Dec 2025).
1. Motivation and Scope
RecIF-Bench was developed to bridge the gap between conventional recommendation systems, which excel at pattern matching within narrowly defined domains, and large language models (LLMs), which can exhibit world knowledge, reasoning, and instruction-following behavior. Prevailing recommender benchmarks typically focus on a single domain or task and do not evaluate instruction following, multi-behavior modeling, or explicit reasoning over user context.
The benchmark covers three major industrial domains:
- Short Video
- Advertising (Ad)
- E-commerce Product
The benchmark includes eight tasks spanning four capability layers: semantic alignment, fundamental recommendation, instruction following, and reasoning. It is released together with a large-scale training dataset of approximately 96 million interactions from 160,000 users (about 600 interactions per user on average) and open-source tooling, including pre-tokenized "itemic" codes, to support reproducibility.
2. Task Taxonomy
All tasks in RecIF-Bench are formulated in sequence-to-sequence format: given an instruction and a user context, the model predicts a token sequence. Key user representations include the user history and a composite user portrait.
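The sequence-to-sequence formulation above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class and field names, the prompt layout, the plain-text portrait, and the `<item_N>` token strings are hypothetical and are not taken from the RecIF-Bench release.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Seq2SeqExample:
    """One (instruction, user context) -> target pair.

    All field names are hypothetical, chosen only for this sketch.
    """
    instruction: str
    history: List[str]   # itemic tokens of previously interacted items
    portrait: str        # composite user portrait, assumed here to be plain text
    target: List[str]    # token sequence the model is expected to predict


def build_source(example: Seq2SeqExample) -> str:
    """Flatten the instruction and user context into one source string."""
    history_text = " ".join(example.history)
    return (
        f"Instruction: {example.instruction}\n"
        f"User portrait: {example.portrait}\n"
        f"User history: {history_text}"
    )


def build_target(example: Seq2SeqExample) -> str:
    """Join the target tokens into the string the model should generate."""
    return " ".join(example.target)


# Hypothetical example; the item tokens are placeholders, not real itemic codes.
example = Seq2SeqExample(
    instruction="Recommend the next short video for this user.",
    history=["<item_12>", "<item_87>", "<item_3>"],
    portrait="Enjoys cooking and travel content.",
    target=["<item_45>"],
)
source_text = build_source(example)
target_text = build_target(example)
```

Keeping the instruction, portrait, and history as separate fields until the final flattening step means the same record could be reused across tasks that differ only in the instruction, which is one plausible way to organize a multi-task suite of this kind.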
The eight tasks are organized across four layers of capability:
| Layer | Task Name | Input Modality | Target | Metric(s) |
|---|---|---|---|---|
| 0 | Item Understanding | Instruction ("Describe item"), itemic tokens | Item metadata (text) | LLM-as-Judge (F1) |
| 1 | Short Video Rec | | Next video item | Pass@1/32, Recall@32 |
| 1 | Ad Rec (cross-domain) | | Next ad item | Pass@1/