InnoEval: Multi-Criteria Idea Evaluation

Updated 4 July 2026

InnoEval is a structured framework for research idea evaluation that leverages evidence from literature, web discourse, and code to assess ideas across five key criteria.
It employs a heterogeneous deep search engine with semantic and LLM-based relevance, grounding evidence and ensuring temporal fairness in evaluations.
The framework integrates multi-perspective reviews from diverse academic personas, enabling decoupled scoring and consensus-based decisions for enhanced innovation assessment.

InnoEval is a framework for research idea evaluation that treats assessment as a knowledge-grounded, multi-perspective reasoning problem with decoupled, multi-criteria decision-making. In its principal formulation, an idea is represented as a structured object, evaluated against dynamically retrieved evidence from literature, web discourse, and code repositories, reviewed by multiple academic personas across five criteria—Clarity, Novelty, Feasibility, Validity, and Significance—and then synthesized into a meta-review with a final score and decision in $\{\text{Reject}, \text{Poster}, \text{Spotlight}, \text{Oral}\}$ (Qiao et al., 16 Feb 2026). The term has also been used more broadly for innovation-oriented evaluation in information retrieval, AI-agent benchmarking, and platform design, but the most explicit and unified technical formulation is the 2026 research-idea evaluation framework (Qiao et al., 16 Feb 2026).

1. Conceptual basis

InnoEval is motivated by the claim that scientific idea evaluation should be grounded in a living ecosystem of theory and practice, should reflect collective deliberation among diverse expert perspectives, and should follow multi-criteria decision-making rather than compressing judgments into a single dimension (Qiao et al., 16 Feb 2026). The framework is positioned against three recurrent limitations in automated idea assessment: narrow knowledge horizons, flattened evaluation dimensions, and the bias of treating a single LLM as an all-purpose judge.

The system formalizes an idea as a structured tuple

$I = (\mathrm{TLDR}, \mathrm{Motis}, \mathrm{ResQues}, \mathrm{Meths}, \mathrm{ExpSets}, \mathrm{ExpRes})$

with timestamp $t$. A point-wise evaluation returns

$P_{\text{point}} = \{K, V, E_{\text{point}}\} = F(I),$

where $K$ is background knowledge, $V$ is revision suggestions, and $E_{\text{point}}$ contains the meta-review, final score $s_{\text{point}}$, and final decision $d_{\text{point}}$. For sets of ideas, the framework produces

$P_{\text{group}} = \{\{P_{\text{point}}^{I_i}\}_{i=1}^{n}, E_{\text{group}}\} = F(\{I_i\}_{i=1}^{n}),$

where $E_{\text{group}}$ contains comparative analyses and a final ranking (Qiao et al., 16 Feb 2026).

This design makes idea evaluation explicitly evidential rather than purely parametric. Knowledge available before timestamp $t$ is used for evaluation, whereas knowledge after $t$ is reserved for revision suggestions. The separation is intended to preserve temporal fairness while still allowing forward-looking feedback (Qiao et al., 16 Feb 2026).
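The structured idea and the timestamp split can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class names, field names, and `split_by_timestamp` are hypothetical, and treating items dated exactly at the timestamp as "after" is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Idea:
    # Field names mirror the six parts of the tuple I; the names are illustrative.
    tldr: str
    motivations: str
    research_questions: str
    methods: str
    experiment_settings: str
    experiment_results: str
    timestamp: date  # the timestamp t attached to the idea


@dataclass(frozen=True)
class KnowledgeItem:
    title: str
    source: str  # "literature", "web", or "code"
    published: date


def split_by_timestamp(idea, items):
    """Return (evaluation_knowledge, revision_knowledge).

    Items published strictly before the idea's timestamp may inform the
    evaluation; everything else is held back for revision suggestions.
    Assumption: an item dated exactly at t counts as "after".
    """
    before = [k for k in items if k.published < idea.timestamp]
    after = [k for k in items if k.published >= idea.timestamp]
    return before, after


idea = Idea("tldr", "motivation", "question", "method", "setup", "result",
            timestamp=date(2025, 1, 1))
items = [
    KnowledgeItem("earlier paper", "literature", date(2024, 6, 1)),
    KnowledgeItem("later repository", "code", date(2025, 3, 1)),
]
for_evaluation, for_revision = split_by_timestamp(idea, items)
```

Keeping the two lists separate at the data level makes it harder for later evidence to leak into the score by accident.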

2. Knowledge retrieval and grounding pipeline

The first major subsystem is a heterogeneous deep knowledge search engine. An Extraction Agent converts the raw proposal into the structured representation $I$, after which a Search Agent performs multi-round, source-specific retrieval over literature, web opinions, and code. The literature sources are arXiv, Semantic Scholar, and Google Scholar; web opinions are obtained through Google Search restricted to research discourse domains such as blogs and discussion forums; code sources are GitHub and Kaggle (Qiao et al., 16 Feb 2026).

Fast retrieval issues tailored, synonym-expanded queries for each idea part and each source. The retrieved results are partitioned into literature, web, and code sets, which are then reranked by a hybrid scoring rule that combines semantic similarity with an LLM-based relevance judgment. The specific settings used for this rule are given in the paper (Qiao et al., 16 Feb 2026).
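One way to picture hybrid reranking is a weighted blend of an embedding similarity and an LLM relevance score. The sketch below assumes a convex combination; the paper's exact rule and parameter values are not reproduced here, and `hybrid_rerank`, `weight`, and `top_k` are hypothetical names and defaults. The LLM relevance scores are passed in as plain numbers rather than obtained from a model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_rerank(query_vec, candidates, llm_relevance, weight=0.5, top_k=3):
    """Rank candidates by weight * semantic + (1 - weight) * LLM relevance.

    candidates: dict mapping doc id -> embedding vector
    llm_relevance: dict mapping doc id -> relevance score in [0, 1]
    Assumption: a convex combination; weight and top_k are illustrative.
    """
    scored = []
    for doc_id, vec in candidates.items():
        semantic = cosine(query_vec, vec)
        score = weight * semantic + (1 - weight) * llm_relevance[doc_id]
        scored.append((score, doc_id))
    # Highest score first; ties broken by doc id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
llm_relevance = {"a": 0.2, "b": 0.9, "c": 0.5}
ranking = hybrid_rerank([1.0, 0.0], candidates, llm_relevance, top_k=2)
```

In this toy input, document "c" outranks "a" even though "a" is the closest embedding match, because the LLM relevance term pulls it up. That is the kind of correction a hybrid rule is meant to allow.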

Slow retrieval then enriches the top-ranked evidence. Papers are parsed from PDFs into structured representations, web pages are summarized, and repositories are analyzed through file/function graphs and synthesized core snippets. Query refinement proceeds iteratively over a fixed default number of refinement rounds (Qiao et al., 16 Feb 2026). The stated goal is to expand coverage without drifting off-topic.

Grounding is handled by a separate Grounding Agent. For each idea part and each retrieved knowledge item, it extracts evidence together with a relevance analysis. The resulting grounded evidence set serves as the citation-like factual substrate for downstream reviewers. This grounding step is central to the framework’s attempt to reduce noisy retrieval and unsupported judgments (Qiao et al., 16 Feb 2026).

3. Multi-perspective review board and decoupled criteria

InnoEval’s second defining subsystem is an innovation review board composed of academic personas with distinct backgrounds, goals, constraints, and source-specific familiarity profiles. The paper gives examples such as a senior researcher, a creative-oriented reviewer, a theoretical scientist, an industry engineer, and an empirical experimentalist (Qiao et al., 16 Feb 2026). For each idea, five personas are sampled.

A notable design choice is familiarity-based evidence masking. Each persona carries a knowledge familiarity vector across literature, web, and code, and a proportion of grounded evidence from each source is randomly masked according to that familiarity. The example given is a Literature Familiarity value that implies 20% random masking of literature grounding. The intent is to simulate realistic partial knowledge rather than omniscient review (Qiao et al., 16 Feb 2026).
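Familiarity-based masking can be sketched as per-source random subsampling. The example below assumes familiarity is a fraction in [0, 1] and that the masked share is one minus familiarity, so that 0.8 corresponds to the 20% masking example. The scale, the function name `mask_evidence`, and the rounding rule are assumptions, not details taken from the paper.

```python
import random


def mask_evidence(evidence_by_source, familiarity, seed=None):
    """Return the evidence a persona can see after random masking.

    evidence_by_source: dict mapping source name -> list of evidence items
    familiarity: dict mapping source name -> fraction in [0, 1] kept visible
                 (assumed scale; sources missing from the dict stay unmasked)
    """
    rng = random.Random(seed)
    visible = {}
    for source, items in evidence_by_source.items():
        keep_fraction = familiarity.get(source, 1.0)
        # Python's round() rounds halves to the nearest even integer.
        n_keep = round(len(items) * keep_fraction)
        kept = sorted(rng.sample(range(len(items)), n_keep))
        visible[source] = [items[i] for i in kept]  # original order preserved
    return visible


evidence = {
    "literature": [f"paper-{i}" for i in range(10)],
    "web": [f"post-{i}" for i in range(4)],
    "code": [f"repo-{i}" for i in range(5)],
}
persona_familiarity = {"literature": 0.8, "web": 0.5, "code": 1.0}
visible = mask_evidence(evidence, persona_familiarity, seed=0)
```

With these inputs the persona sees 8 of 10 literature items, 2 of 4 web items, and all code items. Which items are dropped depends on the seed.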

Evaluation is decoupled across five metrics,

$\{\text{Clarity}, \text{Novelty}, \text{Feasibility}, \text{Validity}, \text{Significance}\},$

and each metric is scored independently by a dedicated evaluator agent. Scores lie in a fixed range and are accompanied by detailed rationales. A Report Agent then synthesizes the full set of reviewer-dimension judgments into a meta-review, final score, and decision (Qiao et al., 16 Feb 2026).

Consensus is therefore procedural rather than closed-form. The paper does not specify a weighted voting rule, Borda aggregation, or any other explicit formula for reviewer fusion. Instead, consensus emerges from diversity in personas, independent per-metric assessments, and the language-based meta-reviewer that integrates the set of reviewer-dimension judgments into the actionable outcome (Qiao et al., 16 Feb 2026). This distinguishes the framework from single-judge LLM pipelines and from scalar scoring schemes that flatten idea quality into one latent dimension.
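The decoupled structure can be illustrated as a grid of independent (persona, metric) judgments that is handed to a meta-reviewer as text. The sketch below is hypothetical throughout: `Judgment`, `run_review_board`, and `format_for_meta_review` are illustrative names, the evaluator is a stub rather than an LLM call, and no numeric aggregation is performed because the paper does not specify one.

```python
from typing import Callable, List, NamedTuple

METRICS = ("Clarity", "Novelty", "Feasibility", "Validity", "Significance")


class Judgment(NamedTuple):
    persona: str
    metric: str
    score: float
    rationale: str


def run_review_board(personas: List[str],
                     evaluate: Callable[[str, str], Judgment]) -> List[Judgment]:
    """Call the evaluator once per (persona, metric) pair, independently."""
    return [evaluate(persona, metric) for persona in personas for metric in METRICS]


def format_for_meta_review(judgments: List[Judgment]) -> str:
    """Serialize all judgments as text for a language-based meta-reviewer."""
    return "\n".join(
        f"[{j.persona} | {j.metric}] score={j.score}: {j.rationale}"
        for j in judgments
    )


def stub_evaluator(persona: str, metric: str) -> Judgment:
    # Placeholder standing in for a dedicated evaluator agent.
    return Judgment(persona, metric, 0.0, "placeholder rationale")


personas = ["senior researcher", "creative-oriented reviewer", "theoretical scientist",
            "industry engineer", "empirical experimentalist"]
judgments = run_review_board(personas, stub_evaluator)
meta_review_input = format_for_meta_review(judgments)
```

Five personas and five metrics yield 25 judgments. The final score and decision are left to the meta-reviewer rather than to an averaging formula.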

4. Benchmark construction and evaluation tasks

The empirical evaluation relies on three datasets derived from authoritative, peer-reviewed OpenReview submissions. The point-wise dataset contains 217 ideas extracted from ICLR 2025 and NeurIPS 2025 submissions, manually corrected and verified for fidelity. The label distribution is Reject 138, Poster 66, Spotlight 9, and Oral 4, with “Highlight” defined as Spotlight plus Oral (Qiao et al., 16 Feb 2026).

Dataset	Composition	Tasks and metrics
Point-wise	217 ideas from ICLR 2025 (136) and NeurIPS 2025 (81)	Binary and ternary classification; Accuracy and macro-F1
Pair-wise	372 pairs: 172 easy and 200 hard	Pair-wise preference Accuracy
Group-wise	172 grouped instances built from topically similar papers across label strata	Best selection Accuracy; full ranking via LIS and Accuracy

The pair-wise dataset distinguishes easy cases, such as Reject versus Highlight, from hard cases, such as Poster versus Highlight. The group-wise dataset is formed by retrieving topically similar papers and selecting one per label stratum, then ranking the group by the papers’ ground-truth labels (Qiao et al., 16 Feb 2026).

For full ranking, the framework uses a metric based on the longest increasing subsequence (LIS). It is defined from two quantities: the length of the longest increasing subsequence relative to the predicted order, and the gold ranking length (Qiao et al., 16 Feb 2026).
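A minimal sketch of an LIS-style ranking score follows. It assumes the score is the LIS length divided by the gold ranking length and that the subsequence must be strictly increasing. Both are assumptions about how the two quantities are combined, and `lis_length` and `lis_score` are hypothetical names.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (O(n log n))."""
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)


def lis_score(predicted_order, gold_order):
    """Share of the gold ranking recovered in the correct relative order.

    Assumption: score = LIS length / gold ranking length.
    """
    gold_position = {item: i for i, item in enumerate(gold_order)}
    positions = [gold_position[item] for item in predicted_order]
    return lis_length(positions) / len(gold_order)


gold = ["oral", "spotlight", "poster", "reject"]
predicted = ["spotlight", "oral", "poster", "reject"]
score = lis_score(predicted, gold)  # 3 of 4 items in correct relative order
```

Unlike exact-match ranking accuracy, a score of this kind gives partial credit: swapping the top two items above still leaves three of four items in the right relative order.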

The paper also defines retrieval-quality measures for the search module—Relevance Density, Topic Coverage, Diversity, and Quality—to compare search engines and retrieval pipelines. These metrics are not the main end-task targets, but they are used to analyze whether the retrieval subsystem balances topical focus with breadth (Qiao et al., 16 Feb 2026).

5. Empirical performance, ablations, and alignment

In the main quantitative evaluation, InnoEval outperforms the reported baselines across point-wise, pair-wise, and group-wise tasks. Point-wise results are reported for both the binary and the ternary setting (Qiao et al., 16 Feb 2026). On pair-wise evaluation, it achieves 80.81 on easy pairs and 63.00 on hard pairs. On group-wise evaluation, it reaches 65.12 for Best, 76.03 for LIS, and 22.09 for Accuracy (Qiao et al., 16 Feb 2026).

Relative to the strongest baseline, ScholarEval, the reported gains are substantial. They cover F1 on ternary point-wise evaluation, Accuracy on easy and hard pair-wise comparisons, and group-wise Best selection, LIS, and Accuracy (Qiao et al., 16 Feb 2026). The paper attributes part of this advantage to multi-source evidence and decoupled criteria, noting that several baselines exhibit label collapse, with F1 materially below Accuracy.
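The gap between Accuracy and macro-F1 under label collapse can be illustrated with the point-wise label distribution (Reject 138, Poster 66, Spotlight 9, Oral 4). The sketch below assumes a binary task of Reject versus Accept, with Accept pooling the other three labels, and a degenerate evaluator that always predicts Reject. Both are illustrative assumptions, not results from the paper.

```python
def accuracy(gold, pred):
    """Fraction of exact matches."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    per_class = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Assumed binary split: Reject vs. Accept (Poster + Spotlight + Oral).
gold = ["Reject"] * 138 + ["Accept"] * (66 + 9 + 4)
collapsed = ["Reject"] * len(gold)  # evaluator that always predicts the majority label

acc = accuracy(gold, collapsed)
f1 = macro_f1(gold, collapsed)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")  # accuracy=0.636 macro_f1=0.389
```

A collapsed evaluator looks passable on Accuracy only because Reject is the majority class. Macro-F1 exposes the failure because the Accept class contributes an F1 of zero.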

Qualitative comparison using o4-mini as judge reports that InnoEval’s reviews beat CoT, RAG, ResearchAgent, InternAgent, and ScholarEval on Overall Quality in 90.70%, 90.32%, 89.86%, 85.71%, and 71.89% of cases, respectively. The advantages are also reported along Rationality, Supportiveness, Depth, and Constructiveness (Qiao et al., 16 Feb 2026).

Human evaluation on 60 sampled instances finds strong positive Pearson correlations of at least 0.5 between InnoEval’s scores and both human scores and LLM-extracted scores from peer-review comments. Clarity shows the highest correlation, whereas Significance is lower, which the paper attributes to the inherent difficulty of estimating broad impact (Qiao et al., 16 Feb 2026).

The ablation studies identify three components as especially consequential. Removing the grounding agent degrades performance, indicating that fine-grained evidence selection matters. Disabling persona personalization reduces both point-wise and group-wise performance, supporting the claim that multi-perspective review mitigates single-judge subjectivity. Restricting retrieval to literature only also harms results, particularly for pair-wise and group-wise comparisons, where web and code evidence add discriminative context (Qiao et al., 16 Feb 2026). Additional analysis reports that multi-perspective test-time scaling improves results more than vanilla test-time scaling without personas, and that the main setting of five personas balances effectiveness and inference cost.

The system is also used as feedback for idea generation. When its reviews are inserted into ResearchAgent’s idea iteration pipeline, idea quality improves across problem formulation, methodology, and experimental design, more than with ScholarEval’s feedback. Linear regression on point-wise predictions further suggests that Novelty is the most decisive predictor of acceptance, while Feasibility becomes more important when distinguishing Poster from Highlight; Clarity has the weakest effect among the five dimensions for accept/highlight outcomes (Qiao et al., 16 Feb 2026).

6. Operational characteristics, limitations, and safeguards

The reported reference configuration uses DeepSeek-V3.2 as backbone, with bge-base-en-v1.5 embeddings and bge-reranker-base for retrieval and reranking. Mean per-sample cost is reported as \$0.42, and runtime is about half an hour per idea, although parallel processing is said to support approximately 100 samples per hour (Qiao et al., 16 Feb 2026). Code and data are open-sourced, and a live demo is provided.
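The cost and throughput figures can be related with simple arithmetic. The sketch below treats "sample" and "idea" as the same unit, takes "about half an hour" as exactly 0.5 hours, and uses the 217-idea point-wise dataset as an example workload. These are simplifying assumptions, and the derived numbers are back-of-the-envelope estimates rather than reported results.

```python
COST_PER_SAMPLE_USD = 0.42   # reported mean per-sample cost
HOURS_PER_IDEA = 0.5         # "about half an hour", taken as exact here
SAMPLES_PER_HOUR = 100       # reported approximate parallel throughput
N_POINTWISE_IDEAS = 217      # size of the point-wise dataset

# Estimated cost of one pass over the point-wise dataset.
total_cost_usd = round(COST_PER_SAMPLE_USD * N_POINTWISE_IDEAS, 2)

# Little's law: jobs in flight = throughput * time per job.
implied_concurrency = SAMPLES_PER_HOUR * HOURS_PER_IDEA

# Wall-clock time for the dataset at the parallel throughput.
wall_clock_hours = N_POINTWISE_IDEAS / SAMPLES_PER_HOUR

print(total_cost_usd, implied_concurrency, round(wall_clock_hours, 2))
```

Under these assumptions, the two runtime figures are consistent if roughly 50 ideas are processed concurrently.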

The framework includes several explicit safeguards. Hybrid ranking is intended to reduce both embedding-only fragility and LLM-judge bias. Grounding is used to suppress irrelevant or noisy retrieval. Persona diversity addresses single-judge subjectivity. Timestamp-aware retrieval separates fair historical evaluation from future-oriented revision suggestions (Qiao et al., 16 Feb 2026).

The stated limitations are equally explicit. The current scope is AI-focused, and generalization to biology, medicine, physics, and other sciences is left to future work. The system is text-only rather than multimodal. Its deep heterogeneous search and multi-perspective review are computationally heavier than simpler evaluators. The paper also emphasizes that InnoEval should augment, not replace, human expert judgment, since hallucinations and uneven source quality remain risks even under grounding and reviewer diversification (Qiao et al., 16 Feb 2026).

These limitations define the framework’s present boundary conditions. A plausible implication is that InnoEval is best understood not as an autonomous referee but as a structured, evidence-heavy decision support layer for idea screening, comparison, and revision.

Outside research-idea evaluation, “InnoEval” and closely related formulations have been used to denote innovation-oriented evaluation in several adjacent literatures. In the InnoGym framework, InnoEval refers to evaluating the innovation potential of AI agents through two complementary quantities, performance gain and novelty, using 18 curated Improvable Tasks and a unified execution environment called iGym (Zhang et al., 1 Dec 2025). That formulation emphasizes the joint assessment of creativity and effectiveness rather than idea review.

In information retrieval, innovation-oriented evaluation has been operationalized through rareness-based modifications of classical ranking metrics. A document rareness function is defined and then used to derive rareness-based variants of those metrics. These metrics reward systems that retrieve rare relevant documents missed by most competing runs, thereby encouraging methodological diversity across participants (Türkmen et al., 2023).

A separate line of work uses “InnoEval” as the name of a proposed evaluation infrastructure rather than a fixed metric. A synthesis based on EvalAI describes such a framework as a containerized, queue-decoupled platform supporting static tasks, interactive agents, human-in-the-loop evaluation, remote evaluation, arbitrary phases and splits, and public/private leaderboards (Yadav et al., 2019). Another proposal, derived from a blockchain-based decentralized framework for collaborative LLM evaluation, argues that InnoEval should use multi-run, multi-environment evaluation, commit-reveal protocols, median-based consensus, staking, redundancy, and confidence intervals. In that setting, the reported standard deviation on HumanEval drops from 1.67 in centralized repeated runs to 0.28 under the decentralized framework (Yang et al., 9 Feb 2026).

TaskEval, implemented as GenValidator, is also presented as relevant to an InnoEval initiative for foundation-model applications. It contributes a task-agnostic meta-model, an interaction protocol for eliciting evaluation requirements, and an eval synthesiser that selects or generates task-specific evaluators and UIs. Its preliminary “evaluating the eval” results report 93% accuracy for chart data extraction and 90% for document question answering (Widanapathiranage et al., 4 Dec 2025).

Taken together, these usages reveal that “InnoEval” has become an umbrella label for evaluation schemes that try to reward innovation, recover methodological diversity, or improve evidential and statistical rigor. The 2026 research-idea framework remains the most direct and fully specified use of the name, but the broader family shares a common orientation: evaluation should not reduce to a single score produced by a single judge under a single source of evidence (Qiao et al., 16 Feb 2026).