Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models

Published 2 Apr 2026 in cs.IR, cs.AI, cs.CL, and cs.DL | (2604.01965v1)

Abstract: Scientific knowledge discovery increasingly relies on LLMs, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned LLMs to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.



Summary

The paper shows that hybrid retrieval lets small open models approach the performance of larger LLMs on evidence-grounded scholarly QA, while model capacity still matters for complex reasoning.
The methodology integrates dense retrieval from unstructured texts with knowledge graph queries to enhance citation transparency and computational efficiency.
Evaluation reveals a trade-off between longer retrieval-augmented prompts and output fluency, and shows that retrieval coverage limits domain robustness while small models still trail specialized systems on scientific text compression.

Task-Aware Retrieval with Small LLMs for Scholarly Applications

Introduction

This work addresses the reproducibility and accessibility concerns inherent in current scientific language model (LM) applications, which largely depend on proprietary, large-scale LLMs. The authors pose an explicit question: to what extent can advances in retrieval design enable competitive scientific assistants built on small, open-weight instruction-tuned models? They introduce a modular retrieval-augmented pipeline with task-aware routing that integrates hybrid evidence from unstructured scientific texts and structured knowledge graphs, targeting research question answering (QA), factual lookups, and scientific text compression. The framework prioritizes reproducibility, citation transparency, and computational efficiency.

The system builds on the Retrieval-Augmented Generation (RAG) paradigm (Lewis et al., 2021), which has demonstrated robust factual grounding but is typically coupled with large parameterized backbones. Prior approaches, such as OpenScholar (Asai et al., 2024), apply iterative reranking and feedback mechanisms on top of large models for scholarly QA. The authors diverge by systematically analyzing retrieval and task routing with a strong emphasis on lightweight architectures, integrated with hybrid evidence sources: unarXive for full texts (Saier et al., 2023) and SemOpenAlex for metadata (Färber et al., 2023). Hybrid and KG-based access have received attention but are rarely unified in a way that prioritizes interpretability and resource efficiency. The contribution thus questions the dominant narrative that scientific complexity is best handled by scaling models up.

System Design

The pipeline is composed of three interlocking modules: task-aware query routing, flexible hybrid context retrieval, and generation with a compact, instruction-tuned Llama 3.2 3B model [llama3_2]. Task routing predicts the query’s informational intent with a dedicated classifier and directs it to one of three categories: general QA, summarization/simplification, or structured KG-based fact lookup.
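The routing step can be pictured with a minimal sketch. The summary does not describe the classifier's architecture, so the keyword rules below are a hypothetical stand-in for it, and the names `Route`, `KG_CUES`, `SUMMARY_CUES`, and `route_query` are illustrative assumptions rather than the authors' code.

```python
from enum import Enum


class Route(Enum):
    GENERAL_QA = "general_qa"
    SUMMARIZATION = "summarization"
    KG_FACT = "kg_fact"


# Hypothetical cue phrases; a trained classifier would replace these rules.
KG_CUES = ("how many citations", "citation count", "doi of", "affiliation")
SUMMARY_CUES = ("summarize", "summarise", "simplify", "tl;dr")


def route_query(query: str) -> Route:
    """Map a query to one of the three categories named in the text."""
    text = query.lower()
    if any(cue in text for cue in KG_CUES):
        return Route.KG_FACT
    if any(cue in text for cue in SUMMARY_CUES):
        return Route.SUMMARIZATION
    # Everything else falls through to retrieval-backed general QA.
    return Route.GENERAL_QA
```

Checking the structured-fact cues first is a design choice of this sketch: a metadata lookup is cheap and precise, so a query that looks like one is sent there before the broader categories are considered.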

General QA tasks use dense retrieval via FAISS over unarXive embeddings to surface passages relevant to multi-paper questions, supplementing with metadata lookups.
Summarization/Simplification tasks employ title detection and NER to ground requests in specific papers, defaulting to generic summarization if no robust match is found.
KG-fact queries are routed through template-driven SPARQL over SemOpenAlex, enabling high-precision retrieval of structured metadata (e.g., citation counts, DOIs, affiliations).
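The dense retrieval used on the general QA path reduces to nearest-neighbor search over passage embeddings. The sketch below is a pure-Python brute-force version of that search, used here as a stand-in for a FAISS index so the example has no dependencies. The toy three-dimensional vectors and the names `normalize` and `top_k` are assumptions for illustration; real embeddings would come from an encoder model.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k(query_vec: Sequence[float],
          passage_vecs: Sequence[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (passage index, cosine score) for the k best-matching passages."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(passage_vecs):
        p = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, p))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy "embeddings" for three passages (illustrative values only).
passages = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
hits = top_k([1.0, 0.1, 0.0], passages, k=2)
```

An exact inner-product index over normalized vectors would produce the same ranking; the brute-force loop only makes the scoring explicit.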

For all paths, the retrieved evidence is concatenated with the original query using task-specific templates, creating explicit prompts that encourage the generator to cite sources and avoid relying exclusively on parametric knowledge. All outputs include explicit citations.
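A minimal sketch of this prompt assembly follows. The template wording, the `Evidence` type, the `TEMPLATES` table, and the `build_prompt` helper are all hypothetical; the summary does not give the actual templates, only that they are task-specific and ask for citations.

```python
from typing import List, NamedTuple


class Evidence(NamedTuple):
    source_id: str  # identifier of the source paper (format is an assumption)
    text: str       # retrieved passage or metadata rendered as text


# Hypothetical templates keyed by task name.
TEMPLATES = {
    "general_qa": (
        "Answer the question using only the numbered evidence. "
        "Cite sources as [n] after each claim.\n\n"
        "Evidence:\n{evidence}\n\nQuestion: {query}\nAnswer:"
    ),
    "summarization": (
        "Summarize the excerpt below in plain language. "
        "Cite the source as [n].\n\n"
        "Evidence:\n{evidence}\n\nRequest: {query}\nSummary:"
    ),
}


def build_prompt(task: str, query: str, evidence: List[Evidence]) -> str:
    """Number each evidence item and fill the template for the given task."""
    numbered = "\n".join(
        f"[{i}] ({ev.source_id}) {ev.text}"
        for i, ev in enumerate(evidence, start=1)
    )
    return TEMPLATES[task].format(evidence=numbered, query=query)


prompt = build_prompt(
    "general_qa",
    "What limits cross-domain retrieval?",
    [Evidence("paper-A", "Coverage of the index is restricted to a few fields.")],
)
```

Numbering the evidence gives the generator a fixed handle for each source, which is what makes the citations in the output traceable back to a specific passage.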

Evaluation

Multi-Paper QA

On multi-document QA (ScholarQABench-Multi), compact models (3B, 8B) with retrieval and fine-tuning approach the scores of large proprietary or open models. For example, fully fine-tuned 3B models achieve a mean LLM-judge score of 4.00 on a 1–5 scale, a narrow gap relative to the 8B baselines (4.04). Retrieval is essential for citation-supported responses, with citation F1 peaking near 25% for small models. However, extended prompts with multi-passage retrieval can degrade fluency and organization, particularly under LoRA adaptation or as prompt length increases.
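To make the citation F1 figure concrete, the sketch below computes precision, recall, and F1 for one answer. It assumes a simple set-overlap definition against a known set of supporting sources; the benchmark's exact scoring procedure is not described here and may differ. The function name `citation_prf` and the example sets are hypothetical.

```python
from typing import Set, Tuple


def citation_prf(cited: Set[str], supporting: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of cited sources against supporting sources."""
    correct = len(cited & supporting)
    precision = correct / len(cited) if cited else 0.0
    recall = correct / len(supporting) if supporting else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# The answer cites four papers; only "A" is among the two that support it.
p, r, f1 = citation_prf({"A", "B", "C", "D"}, {"A", "E"})
```

In this toy case one of four citations is correct (precision 0.25) and one of two supporting sources is found (recall 0.5), so F1 is one third.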

Single-Paper QA and Domain Robustness

In transfer settings using PubMedQA (biomedical QA), even lightweight models with task-aware retrieval achieve accuracy within 10% of BioBERT and comparable large LLMs on the original setup. Performance drops in retrieval-based variants are attributed to a domain shift (unarXive primarily covers CS/physics, not biomedicine), thus highlighting retrieval coverage as a bottleneck for cross-domain robustness. In zero-context settings, base models sometimes outperform adapted ones, indicating a trade-off between specialization and general parametric recall.

Scientific Text Compression

For extreme summarization (SciTLDR), the system’s compressive summaries and readability metrics indicate that specialized models (e.g., CATTS) retain a clear advantage. Fine-tuned Llama 3.2 3B improves over the base model, but its summaries remain longer and less readable than expert TLDRs. Readability (measured by SMOG) and semantic overlap with the references (measured by BERTScore) improve only moderately, further exposing the limits of small models in high-compression scenarios.
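For reference, SMOG, the readability measure named above, is a closed-form grade estimate based on the density of polysyllabic words (three or more syllables). The symbols N_poly (polysyllabic word count) and N_sent (sentence count) are notation chosen for this note.

```latex
\[
\text{SMOG grade} = 1.0430 \sqrt{N_{\text{poly}} \times \frac{30}{N_{\text{sent}}}} + 3.1291
\]
```

As a worked case, a sample of 30 sentences containing 30 polysyllabic words gives 1.0430 × √30 + 3.1291 ≈ 8.84. Lower grades indicate text that is easier to read.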

Discussion and Implications

The findings demonstrate that retrieval and model scale are complementary rather than interchangeable. Careful retrieval design, particularly when paired with targeted fine-tuning, allows small open-weight models to operate near the performance envelope of much larger counterparts on evidence-grounded QA. However, model capacity becomes critical as the complexity of synthesis or the need for implicit (parametric) reasoning increases. Furthermore, hybrid retrieval pipelines introduce new sensitivities: longer, denser prompts can induce organizational errors in smaller models, and reliance on dense (vector) retrieval may yield spurious passages, resulting in noisy grounding.

From a resource perspective, the work underscores the viability of compact, open LMs for reproducible, transparent scholarly assistants, provided that retrieval is robust and the index covers the target domain. Practically, this opens the door to more accessible, environmentally sustainable deployment in academic and scientific institutions, reducing the dependency on centralized proprietary systems. The approach also enables explicit source tracking, aligning with the rigorous citation standards required in scientific communication.

Theoretically, these results motivate a continued research agenda targeting improved retrieval coverage (especially for cross-domain QA), adaptive prompt compression, lightweight dynamic reranking for improved context quality, and post-generation verification for reliability.

Future Directions

Anticipated avenues include:

Enhanced knowledge graph evaluation resources, enabling systematic benchmarks for structured scholarly fact queries.
Iterative retrieval with reranking adapted for small-LM context limits.
Automated source verification pipelines, closing the loop on citation correctness.
More adaptive, hierarchical task routing for dynamic scholarly needs.
Exploration of knowledge distillation (Sanh et al., 2019; Moslemi et al., 2024) to further compress domain reasoning into efficient student models.

Conclusion

This work advances the discussion on efficient, transparent scientific assistants by demonstrating that small instruction-tuned LMs, when augmented with robust, task-aware retrieval (integrating structured and unstructured evidence), can achieve competitive performance on diverse scholarly tasks. The results articulate a nuanced view: retrieval augmentation can partially offset the need for extreme model scaling, but does not supplant the need for sufficient model capacity when faced with complex reasoning, noisy retrievals, or substantial domain shift. Ongoing improvements in retrieval, routing, and adaptive context handling, combined with open implementation practices, are critical for practical and reliable scholarly AI systems.

Reference:

"Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models" (2604.01965)
