RecIF-Bench: Generative Recommender Benchmark

Updated 17 March 2026
  • RecIF-Bench is an evaluation suite designed to assess generative recommender systems through diverse tasks and reproducible protocols.
  • It incorporates eight tasks spanning semantic alignment, core recommendation, instruction following, and reasoning across short video, advertising, and e-commerce domains.
  • The benchmark offers a large-scale dataset with 96 million interactions and open-sourced tooling to ensure robust and transparent model evaluation.

RecIF-Bench (“Recommendation Instruction-Following Benchmark”) is a holistic evaluation suite for generative recommender foundation models. It provides the first comprehensive testbed for rigorously assessing a wide spectrum of capabilities, from fundamental recommendation and semantic alignment to instruction following and reasoning, across multiple industry domains. Developed to address the limitations of narrow, specialist benchmarks, RecIF-Bench combines diverse tasks, interleaved user and item representations, and reproducible protocols for the emerging class of large-scale generative recommender systems (Zhou et al., 31 Dec 2025).

1. Motivation and Scope

RecIF-Bench was developed to bridge the gap between conventional recommendation systems, which excel at pattern matching within narrowly defined domains, and large language models (LLMs), which can exhibit world knowledge, reasoning, and instruction-following behavior. Prevailing recommender-system benchmarks typically focus on a single domain or task and do not evaluate instruction following, multi-behavior modeling, or explicit reasoning over user context.

The benchmark covers three major industrial domains:

  • Short Video
  • Advertising (Ad)
  • E-commerce Product

Eight tasks are included, spanning four capability layers—semantic alignment, fundamental recommendation, instruction following, and reasoning. The suite is accompanied by the release of a large-scale dataset for training, consisting of approximately 96 million interactions from 160,000 users, and open-sourced tooling, including pre-tokenized "itemic" codes to ensure robust reproducibility.

2. Task Taxonomy

All tasks in RecIF-Bench are formulated in sequence-to-sequence format: given an instruction $\mathcal{I}$ and a user context $\mathcal{C}$, the model predicts a token sequence $Y$. Key user representations include the user history $\mathcal{H}$ and a composite user portrait $\mathcal{P}$.
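The sequence-to-sequence formulation above can be illustrated with a small sketch. The class names, field names, and prompt layout below are hypothetical and chosen only for illustration; the actual RecIF-Bench data schema and prompt templates are not described in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UserContext:
    """User context C: history H plus an optional portrait P (hypothetical layout)."""
    history: List[str]   # H: itemic tokens of past interactions, oldest first
    portrait: str = ""   # P: composite user portrait as free text


@dataclass
class Example:
    """One task instance: instruction I, context C, and target sequence Y."""
    instruction: str
    context: UserContext
    target: str


def build_prompt(ex: Example) -> str:
    """Serialize (I, C) into one input string; the model is trained to emit Y."""
    parts = [ex.instruction]
    if ex.context.portrait:
        parts.append("Portrait: " + ex.context.portrait)
    if ex.context.history:
        parts.append("History: " + " ".join(ex.context.history))
    return "\n".join(parts)


example = Example(
    instruction="Recommend the next video.",
    context=UserContext(history=["<item_12>", "<item_7>"], portrait="likes cooking"),
    target="<item_3>",
)
prompt = build_prompt(example)
```

Under this framing, every task differs only in what goes into the instruction and context fields and what the target sequence contains, which is what lets a single generative model be evaluated across all tasks.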

The eight tasks are organized across four layers of capability:

Layer | Task Name | Input Modality | Target | Metric(s)
0 | Item Understanding | $\mathcal{I}$ ("Describe item"), itemic tokens | Item metadata (text) | LLM-as-Judge (F1)
1 | Short Video Rec | $\mathcal{H}^{video}$ | Next video item | Pass@1/32, Recall@32
1 | Ad Rec (cross-domain) | $\mathcal{H}^{video} \cup \mathcal{H}^{ad}$ | Next ad item | Pass@1/
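
The Pass@k and Recall@k metrics named in the table are not defined in this section. The sketch below assumes commonly used definitions: Pass@k is 1 if any of the first k generated candidates is a ground-truth item, and Recall@k is the fraction of ground-truth items found among the first k candidates. The function names are hypothetical, and the benchmark's official scoring may differ.

```python
from typing import Sequence, Set


def pass_at_k(candidates: Sequence[str], relevant: Set[str], k: int) -> float:
    """Assumed definition: 1.0 if any of the first k candidates is relevant."""
    return 1.0 if any(c in relevant for c in candidates[:k]) else 0.0


def recall_at_k(candidates: Sequence[str], relevant: Set[str], k: int) -> float:
    """Assumed definition: share of relevant items found in the first k candidates."""
    if not relevant:
        return 0.0
    hits = len(set(candidates[:k]) & relevant)
    return hits / len(relevant)


# Toy example: four ranked candidates, two ground-truth items.
ranked = ["a", "b", "c", "d"]
truth = {"c", "e"}
p1 = pass_at_k(ranked, truth, 1)    # "a" is not relevant
p3 = pass_at_k(ranked, truth, 3)    # "c" appears within the first three
r3 = recall_at_k(ranked, truth, 3)  # one of two relevant items retrieved
```

Scores such as Pass@1/32 and Recall@32 would then be averages of these per-example values over the evaluation set, with k set to 1 or 32.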
