PaperRegister: Hierarchical Academic Search

Updated 19 August 2025

The paper introduces a hierarchical register index that decomposes academic papers into multiple layers for precise retrieval at varying granularities.
It employs an adaptive retrieval mechanism with view-based matching to effectively process both broad and fine-grained queries.
Empirical evaluations show substantial recall improvements over abstract-based approaches, with recall@5 reaching up to 84.1.

PaperRegister is a flexible-grained academic paper search system featuring hierarchical register-based indexing and an adaptive retrieval mechanism. Designed to address shortcomings of prior systems—most of which rely on superficial abstract-based indices—PaperRegister indexes papers at multiple levels of detail, enabling retrieval both for broad topics and extremely fine-grained aspects such as method configuration or training procedures. Its architecture combines offline full-text parsing, semantic aggregation, and a lightweight, modular online retrieval pipeline adaptable to arbitrary query granularity. Empirical evaluations demonstrate significant gains over conventional approaches, especially for fine-grained search scenarios.

1. Hierarchical Indexing Structure

PaperRegister abandons the traditional abstract-based indexing paradigm in favor of a hierarchical register index tree, constructed for each document as follows:

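Hierarchical Schema: Each paper is decomposed into a register structure defined by a schema with $L$ layers. Each node $N_{(i, j)}$ (the $j$-th node in layer $i$) specifies a particular information unit (e.g., Method Implementation, Module Configuration, Training Operation) and pairs a node name $n_{(i, j)}$ with its content $c_{(i, j)}$ and its child nodes in layer $i+1$: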

$N_{(i, j)} = \{ n_{(i, j)} : (c_{(i, j)}, \{ N_{(i+1, j')} \}_{j'=1}^z ) \}$

Fine-Grained Content Extraction: An LLM (Qwen3-32B) is used to extract content for all leaf nodes (deepest layer), ensuring detailed implementation-specific aspects are captured.
Bottom-Up Content Aggregation: Higher-level (coarser) nodes are obtained by recursively aggregating their children’s content, effectively summarizing details to support queries at higher granularity.
Corpus Index Construction: The register trees from all papers are merged and referenced in a global multi-level index $\mathcal{J}_h$:

$\mathcal{J}_h = \left\{ \left\{ I_{(i, j)} \right\}_{j=1}^z \right\}_{i=1}^L$

This design systematically preserves both high-level overviews and deep technical specifics, allowing subsequent retrieval to dynamically target any layer as dictated by the query's informational granularity.
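The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PaperRegister implementation: `RegisterNode`, `aggregate`, and `add_to_index` are hypothetical names, the schema fragment and node contents are invented, and plain concatenation stands in for the LLM-based aggregation described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegisterNode:
    """One node N_(i,j): a name, its content, and its child nodes."""
    name: str
    content: str = ""
    children: List["RegisterNode"] = field(default_factory=list)


def aggregate(node: RegisterNode) -> str:
    """Fill non-leaf content bottom-up from the children's content.

    Plain concatenation is a placeholder (assumption) for the
    summarizing aggregation step described in the text.
    """
    if node.children:
        node.content = " ".join(aggregate(child) for child in node.children)
    return node.content


def add_to_index(index: Dict[str, Dict[str, str]], paper_id: str,
                 node: RegisterNode, prefix: str = "") -> None:
    """Record the content of every node under its schema path."""
    path = f"{prefix}/{node.name}" if prefix else node.name
    index.setdefault(path, {})[paper_id] = node.content
    for child in node.children:
        add_to_index(index, paper_id, child, path)


# Hypothetical two-layer schema fragment for a single paper.
tree = RegisterNode("method", children=[
    RegisterNode("module_configuration", "12-layer encoder, hidden size 768"),
    RegisterNode("training_operation", "AdamW, learning rate 1e-4"),
])
aggregate(tree)

corpus_index: Dict[str, Dict[str, str]] = {}
add_to_index(corpus_index, "paper-001", tree)
print(sorted(corpus_index))
```

Keying the corpus index by schema path means a later lookup for a given node only touches the content stored at that level, for every paper that has it.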

2. Adaptive Retrieval Mechanism

The online retrieval engine uses a two-stage process to handle queries of arbitrary specificity:

View Identifying (Adaptive Query Analysis):
- A specialized, efficient LLM (0.6B parameters, fine-tuned via supervised training and hierarchical-reward Group Relative Policy Optimization) analyzes the incoming query $q$ to infer the most relevant node paths (views) $\{v_k\}_{k=1}^K$ in the register schema.
- Beam search and a restricted decoding strategy (using prefix trees) ensure valid paths corresponding to semantic levels in the hierarchical index.
View-Based Matching:
- For each candidate view $v_k$, a lookup retrieves the relevant document node contents $\{I_k\}_{k=1}^K = M_{\text{lookup}}(\mathcal{J}_h, \{v_k\})$.
- A matching module (BM25 for sparse retrieval, DPR with gte-Qwen2-7B-instruct for dense retrieval) computes, for each document $p^{(m)}$ ,
$s(q, p^{(m)}) = \max_k \left\{ M_{\text{rel}}(q, c_k^{(m)}) \right\}$

where $c_k^{(m)}$ is the content at node $v_k$ for paper $m$.
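The max-over-views score can be sketched as follows. This is an illustration only: `token_overlap` is a toy stand-in for $M_{\text{rel}}$ (the system described here uses BM25 or a dense retriever), and the index layout, view names, and paper contents are assumptions made for the example.

```python
from typing import Callable, Dict, List

# index[view][paper_id] -> content stored at that view for that paper.
Index = Dict[str, Dict[str, str]]


def token_overlap(query: str, content: str) -> float:
    """Toy stand-in for M_rel: fraction of query tokens found in content."""
    q_tokens = set(query.lower().split())
    c_tokens = set(content.lower().split())
    return len(q_tokens & c_tokens) / len(q_tokens) if q_tokens else 0.0


def score_papers(query: str, views: List[str], index: Index,
                 rel: Callable[[str, str], float] = token_overlap
                 ) -> Dict[str, float]:
    """s(q, p) = max over the identified views of rel(q, content at view)."""
    scores: Dict[str, float] = {}
    for view in views:
        for paper_id, content in index.get(view, {}).items():
            s = rel(query, content)
            scores[paper_id] = max(scores.get(paper_id, float("-inf")), s)
    return scores


# Hypothetical index contents for two papers and two views.
index: Index = {
    "method/training_operation": {
        "paper-001": "adamw optimizer with cosine schedule",
        "paper-002": "sgd with momentum",
    },
    "method/module_configuration": {
        "paper-001": "twelve layer encoder",
        "paper-002": "cosine similarity head",
    },
}
views = ["method/training_operation", "method/module_configuration"]
scores = score_papers("cosine schedule", views, index)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Taking the maximum rather than the sum means a paper ranks highly when any one identified view matches the query well, even if its other views are irrelevant.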

This view-centric matching enables flexible operation, supporting both coarse-topic queries (matching higher nodes like “method overview”) and highly technical sub-aspect queries (e.g., “module configuration” or “training operation” in deep layers).
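The restricted decoding used in view identifying can be illustrated with a small prefix tree over schema paths. The sketch below is a simplification under stated assumptions: it decodes one node name per step rather than model tokens, uses greedy selection instead of beam search, and replaces the view recognizer's scores with a hypothetical lookup table. `build_trie` and `constrained_decode` are illustrative names, not part of the PaperRegister codebase.

```python
from typing import Callable, Dict, List, Sequence

Trie = Dict[str, "Trie"]
END = "<end>"  # marks that a path may stop at this node


def build_trie(paths: Sequence[Sequence[str]]) -> Trie:
    """Build a prefix tree over valid schema paths, one node name per step.

    Every prefix of a path is marked as a valid view (assumption), so
    decoding can stop at a coarse node or continue to a deeper one.
    """
    root: Trie = {}
    for path in paths:
        node = root
        for name in path:
            node = node.setdefault(name, {})
            node[END] = {}
    return root


def constrained_decode(trie: Trie,
                       score: Callable[[List[str], str], float]) -> List[str]:
    """Greedy decoding restricted to continuations allowed by the trie."""
    path: List[str] = []
    node = trie
    while node:
        # Only children of the current trie node are legal next steps.
        best = max(node, key=lambda name: score(path, name))
        if best == END:
            break
        path.append(best)
        node = node[best]
    return path


schema_paths = [
    ["method", "module_configuration"],
    ["method", "training_operation"],
    ["experiment", "dataset"],
]
trie = build_trie(schema_paths)

# Hypothetical scores standing in for the view recognizer's outputs.
fine_query = {"method": 2.0, "experiment": 1.0, "training_operation": 3.0,
              "module_configuration": 0.5, "dataset": 0.1, END: 1.0}
coarse_query = dict(fine_query, **{END: 5.0})

fine_view = constrained_decode(trie, lambda path, name: fine_query[name])
coarse_view = constrained_decode(trie, lambda path, name: coarse_query[name])
print(fine_view, coarse_view)
```

Because candidates are drawn only from the trie, the decoder cannot emit a path that does not exist in the schema, and stopping early yields a coarser view for broader queries.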

3. Performance Evaluation

Experimental results are reported across LitSearch (coarse) and F.g.Search (fine to very fine granularity) datasets. The key findings are:

Superior Fine-Grained Performance: Under DPR-based matching, PaperRegister achieves recall@5 scores of 81.0 (LitSearch), 84.1 (F.g.Search-1), 79.9 (F.g.Search-2), and 80.8 (F.g.Search-3), substantially outperforming abstract-based baselines (which drop to 58.2 recall@5 on very fine-grained queries).
Granularity-Dependent Gains: Performance improvements increase as queries become more fine-grained; for example, on very fine-grained queries, PaperRegister surpasses abstract-only retrieval by more than 22 percentage points in recall@5.
Ablation Evidence: Removing any layer from the hierarchical register substantially degrades performance, indicating that multi-granularity organization is essential for flexible-grained retrieval.
Efficiency: Online query response time is under 2.5 seconds, which is competitive with or faster than other multi-stage search systems.
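For reference, recall@k as it is commonly defined can be computed as below. This is a generic sketch of the metric, not the paper's evaluation script; the query IDs, paper IDs, and the 0-100 scaling are assumptions chosen to match how the scores above are quoted.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant papers that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: Dict[str, List[str]],
                     gold: Dict[str, Set[str]], k: int) -> float:
    """Average recall@k over all queries, on a 0-100 scale."""
    values = [recall_at_k(runs[q], gold[q], k) for q in gold]
    return 100.0 * sum(values) / len(values)


# Hypothetical ranked results and gold labels for two queries.
runs = {"q1": ["p3", "p1", "p9"], "q2": ["p4", "p5", "p6"]}
gold = {"q1": {"p1"}, "q2": {"p7", "p4"}}
result = mean_recall_at_k(runs, gold, k=2)
print(result)
```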

4. Real-World Applications

PaperRegister’s hierarchical, multi-level approach enables a number of use cases impractical for prior systems:

Tailored Literature Search: Researchers can precisely target desired details, such as “joint optimization of encoder and generator via negative log-likelihood,” and retrieve papers annotated at the corresponding sub-node.
Technical Replication and Comparison: Detailed configuration and operational settings required for replication or benchmarking can be retrieved directly, facilitating reproducibility and method comparison.
Integration with Complex Search Frameworks: Experiments using the PaSa academic search stack show PaperRegister can seamlessly replace standard retrieval modules, improving accuracy in reranking chains and iterative search.
Interactive, Low-Latency Environments: The modular, fast online pipeline supports interactive user-facing search APIs, essential for recommendation engines, academic browsers, and research assistants.

5. Codebase and Implementation

The PaperRegister codebase is available at https://github.com/Li-Z-Q/PaperRegister. Its architecture comprises:

Offline Hierarchical Indexer: Uses LLM extraction and aggregation to build the hierarchical tree.
View Recognizer: A model trained with both supervised fine-tuning and hierarchical-reward GRPO, maintaining a lightweight footprint for rapid online inference.
Matching Modules: Supporting both BM25 (rank-bm25) and dense passage retrieval (gte-Qwen2-7B-instruct) for flexible adaptation.
Integration Tools: Restricted decoding and modular API interfaces facilitate incorporation into broader search platforms or domain-specific applications.

The system design supports extensibility for new domains, schema evolution, and provides control over the extraction pipeline (e.g., modifying the LLM prompt schema or aggregation strategy).

6. Significance for Flexible-Grained Search

PaperRegister’s core innovation is the transition from an abstract-only index to a multi-layered register tree that lets retrieval target whatever level of granularity a query requires. This enables:

Unified Retrieval Model: Supporting both topic-level and detail-level search over the same index structure.
Consistency of Relevance: Adaptation of both the index and retrieval logic ensures high recall and ranking quality, even for technical queries whose semantics are not present in conventional paper abstracts.
Empirical and Practical Validation: The methodology yields state-of-the-art performance over a spectrum of granularity settings and query types, and the documented ablation and runtime results support both the need for the hierarchical design and its practicality.

7. Future Directions and Extensibility

While PaperRegister is optimized for academic paper search, its methodological components may apply across other domains requiring hierarchical flexible-grained indexing (e.g., technical documentation, code search, patents). The presence of a modular pipeline suggests straightforward extension to new information types and adaptive schema modifications as research documentation practices evolve.
