RecIF-Bench: Generative Recommender Benchmark

Updated 17 March 2026

RecIF-Bench is an evaluation suite designed to assess generative recommender systems through diverse tasks and reproducible protocols.
It incorporates eight tasks spanning semantic alignment, core recommendation, instruction following, and reasoning across short video, advertising, and e-commerce domains.
The benchmark offers a large-scale dataset with 96 million interactions and open-sourced tooling to ensure robust and transparent model evaluation.

RecIF-Bench (“Recommendation Instruction-Following Benchmark”) is a holistic evaluation suite for generative recommender foundation models. It is presented as the first comprehensive testbed for rigorously assessing a wide spectrum of capabilities, from fundamental recommendation and semantic alignment to instruction following and reasoning, across multiple industry domains. Developed to address the limitations of narrow, specialist benchmarks, RecIF-Bench combines diverse tasks, interleaved user and item representations, and reproducible protocols for the emerging class of large-scale generative recommender systems (Zhou et al., 31 Dec 2025).

1. Motivation and Scope

RecIF-Bench was developed to bridge the gap between conventional recommendation systems, which excel at pattern matching within narrowly defined domains, and large language models (LLMs), which can exhibit world knowledge, reasoning, and instruction-following behaviors. Prevailing recommender benchmarks typically focus on a single domain or task and do not evaluate instruction following, multi-behavior modeling, or explicit reasoning over user context.

The benchmark covers three major industrial domains:

Short Video
Advertising (Ad)
E-commerce Product

The benchmark includes eight tasks spanning four capability layers: semantic alignment, fundamental recommendation, instruction following, and reasoning. It is released together with a large-scale training dataset of approximately 96 million interactions from 160,000 users, and with open-sourced tooling, including pre-tokenized "itemic" codes, to support reproducibility.

2. Task Taxonomy

All tasks in RecIF-Bench are formulated in sequence-to-sequence format: given an instruction $\mathcal I$ and a user context $\mathcal C$, the model predicts a token sequence $Y$. Key user representations include the user history $\mathcal H$ and a composite user portrait $\mathcal P$.
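As a rough illustration of this formulation, the sketch below packs an instruction, a history, and a portrait into a single prompt string and keeps the target sequence alongside it. Everything here is hypothetical: the class name `RecIFExample`, the `[portrait]` and `[history]` markers, and the `<item_N>` token spelling are assumptions made for the example, not the benchmark's actual data format. It also assumes the context $\mathcal C$ is simply built from $\mathcal P$ and $\mathcal H$.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecIFExample:
    """One hypothetical sequence-to-sequence example (not the official schema)."""

    instruction: str                                   # I: natural-language instruction
    history: List[str] = field(default_factory=list)   # H: itemic tokens, oldest first
    portrait: str = ""                                 # P: textual user portrait
    target: List[str] = field(default_factory=list)    # Y: tokens the model should emit

    def context(self) -> str:
        """Assemble the user context C from the portrait and history (assumed layout)."""
        parts = []
        if self.portrait:
            parts.append("[portrait] " + self.portrait)
        if self.history:
            parts.append("[history] " + " ".join(self.history))
        return "\n".join(parts)

    def prompt(self) -> str:
        """Model input: the instruction followed by the user context."""
        return self.instruction + "\n" + self.context()


example = RecIFExample(
    instruction="Recommend the next video.",
    history=["<item_1>", "<item_2>"],
    portrait="likes cooking",
    target=["<item_3>"],
)
print(example.prompt())
```

Keeping the target separate from the prompt mirrors the input/output split of the formulation; a real loader would follow whatever layout the released tooling defines.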

The eight tasks are organized across four layers of capability:

Layer	Task Name	Input Modality	Target	Metric(s)
0	Item Understanding	$\mathcal I$ ("Describe item"), itemic tokens	Item metadata (text)	LLM-as-Judge (F1)
1	Short Video Rec	$\mathcal H^{video}$	Next video item	Pass@1/32, Recall@32
1	Ad Rec (cross-domain)	$\mathcal H^{video} \cup \mathcal H^{ad}$	Next ad item	Pass@1/
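The table names Pass@k and Recall@k without defining them in this section. The sketch below uses one common reading, stated here as an assumption rather than the benchmark's official definition: Pass@k is 1 if any of the first k generated items is a ground-truth item, and Recall@k is the fraction of distinct ground-truth items found among the first k generated items. The function names and the string item IDs are illustrative only.

```python
from typing import Sequence


def pass_at_k(generated: Sequence[str], targets: Sequence[str], k: int) -> float:
    """Assumed definition: 1.0 if any of the first k generations hits a target."""
    return 1.0 if set(generated[:k]) & set(targets) else 0.0


def recall_at_k(generated: Sequence[str], targets: Sequence[str], k: int) -> float:
    """Assumed definition: share of distinct targets found in the first k generations."""
    unique_targets = set(targets)
    if not unique_targets:
        return 0.0
    return len(set(generated[:k]) & unique_targets) / len(unique_targets)


generated = ["a", "b", "c"]  # model outputs, in generation order
print(pass_at_k(generated, ["c"], k=1))         # no hit in the first generation
print(pass_at_k(generated, ["c"], k=3))         # hit within the first three
print(recall_at_k(generated, ["c", "d"], k=3))  # one of two targets recovered
```

Per-example scores like these would typically be averaged over the evaluation set.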


References (1)

OpenOneRec Technical Report (2025)
