Multi-Hop LLM Pipeline
- Multi-Hop LLM Pipeline is a modular system that divides complex tasks into discrete reasoning hops using tailored prompts and strategic model assignments.
- It employs an initial low-cost binary filter followed by a fine-grained classifier, balancing simple and nuanced tasks to optimize performance and cost.
- Empirical results reveal up to an 18.4% gain in agreement metrics at significantly reduced costs, demonstrating the pipeline’s efficient and scalable design.
A multi-hop LLM pipeline is a modular, staged system that decomposes a complex task—such as document relevance assessment, fact verification, or multi-step reasoning—into a sequence of processing "hops" or stages. Each stage applies a specific LLM, potentially of different sizes or capabilities, with customized prompt templates and objectives. This architecture enables improved accuracy and efficiency by allocating the simplest subtasks to smaller, cost-effective models while reserving larger or more capable models for tasks requiring nuanced judgment or finer granularity.
1. Pipeline Decomposition and Architecture
A prototypical multi-hop LLM pipeline, as described in (Schnabel et al., 24 Jan 2025), divides document relevance assessment into two sequential hops:
- Coarse Binary Classification: The first hop applies a small, inexpensive LLM to filter passages as “irrelevant” or “potentially relevant.” This model executes a prompt that requires decomposition into explicit reasoning steps:
- Infer user intent
- Match passage content to that intent
- Assess trustworthiness
- Output a binary label (##final score: 0 or 1)
- Fine-Grained Classification: Passages labeled as potentially relevant proceed to a second hop, where a custom prompt elicits a multi-class label indicating the degree of relevance (e.g., related, highly relevant, or perfectly relevant). This stage may use the same model or a larger, more capable LLM.
A general pseudocode abstraction is:
```python
def multi_hop_assess(query, passages):
    for doc in passages:
        # Hop 1: cheap binary filter
        bin_score = Model1.run(prompt_binary(query, doc))
        if bin_score == 0:
            yield 0  # irrelevant; no further work
        else:
            # Hop 2: fine-grained relevance grading
            fin_score = Model2.run(prompt_three_level(query, doc))
            yield fin_score  # 1, 2, or 3
```
Model assignments:
- Hop 1 (binary filter): GPT-4o mini ($0.15 per M tokens)
- Hop 2 (three-level): GPT-4o mini or the flagship GPT-4o ($5.00 per M tokens)
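A minimal sketch of how these assignments could be wired up is shown below, assuming the OpenAI Python client and a model response that ends with the ##final score line requested by the prompt; the LLMStage class, its parsing logic, and the client usage are illustrative rather than the paper's implementation:

```python
from openai import OpenAI

client = OpenAI()

class LLMStage:
    """Wraps one hop's model; parses the '##final score: <n>' line the prompt asks for."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def run(self, prompt: str) -> int:
        resp = client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        text = resp.choices[0].message.content
        # Take whatever follows the last '##final score:' marker and read the digit.
        return int(text.rsplit("##final score:", 1)[-1].strip()[0])

Model1 = LLMStage("gpt-4o-mini")  # hop 1: cheap binary filter
Model2 = LLMStage("gpt-4o-mini")  # hop 2: fine-grained classifier (or "gpt-4o")
```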
Prompt specialization: Prompts at each hop are optimized for their subtask—explicit, stepwise, and concise.
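As an illustration of such stage-specific prompts, the prompt_binary and prompt_three_level helpers referenced in the pseudocode might look as follows; the wording paraphrases the stepwise structure described above and is not the paper's exact prompt text:

```python
def prompt_binary(query: str, doc: str) -> str:
    """Hop-1 prompt: explicit reasoning steps, then a binary '##final score' line."""
    return (
        "Reason step by step:\n"
        "1. Infer the user's intent behind the query.\n"
        "2. Check whether the passage content matches that intent.\n"
        "3. Assess whether the passage appears trustworthy.\n"
        "Finish with exactly one line, '##final score: 0' (irrelevant) "
        "or '##final score: 1' (potentially relevant).\n\n"
        f"Query: {query}\nPassage: {doc}"
    )

def prompt_three_level(query: str, doc: str) -> str:
    """Hop-2 prompt: grade a passage that already passed the binary filter."""
    return (
        "The passage below passed an initial relevance filter. Grade its relevance "
        "and finish with exactly one line, '##final score: 1' (related), "
        "'##final score: 2' (highly relevant), or '##final score: 3' (perfectly relevant).\n\n"
        f"Query: {query}\nPassage: {doc}"
    )
```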
2. Workflow, Methodology, and Cost Analysis
The chaining logic is as follows:
- Each query–passage pair is first adjudicated by the binary filter (Model1).
- Passages failing this filter are labeled irrelevant, terminating further work and maximizing token savings.
- Only those passing the filter are escalated to the fine-grained classifier (Model2).
- Optionally, additional hops may be inserted (e.g., to resolve classifier disagreements).
The cost model is

$$C = c_1 + (1 - f)\,c_2,$$

where $c_1$ and $c_2$ are the per-million-token prices of the hop-1 and hop-2 models and $f$ is the fraction of passages filtered as irrelevant at hop 1. For TREC-DL this yields an effective cost of about $0.21 per M tokens for the mini → mini configuration. Agreement is measured with Krippendorff's alpha ($\alpha = 1 - D_o/D_e$, where $D_o$ is observed and $D_e$ is expected disagreement) for multi-class labeling.
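A small sketch of this calculation, treating the hop-1 filter rate f as a free parameter (the value 0.6 below is illustrative, not the rate measured on TREC-DL):

```python
def blended_cost(c_hop1: float, c_hop2: float, f: float) -> float:
    """Expected cost per million tokens of a two-hop pipeline:
    every passage pays for hop 1; only the (1 - f) fraction that
    survives the binary filter also pays for hop 2."""
    return c_hop1 + (1.0 - f) * c_hop2

# Illustrative filter rate f = 0.6 (not the paper's measured figure).
print(blended_cost(0.15, 0.15, 0.6))  # mini -> mini      -> 0.21
print(blended_cost(0.15, 5.00, 0.6))  # mini -> flagship  -> 2.15 with this f (the table reports 2.05)
```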
3. Empirical Results and Comparative Analysis
| Method | α (Krippendorff) | Cost per M tokens (USD) |
|---|---|---|
| GPT-4o flagship | 0.408 | 5.00 |
| GPT-4o mini | 0.359 | 0.15 |
| Two-stage mini→mini | 0.425 (+18.4%) | 0.21 |
| Two-stage mini→flagship | 0.446 (+9.7%) | 2.05 |
Key findings:
- Two-stage pipelines (mini→mini) increase α by 18.4% over the GPT-4o-mini baseline at just 40% additional cost, while remaining ∼25× cheaper than flagship-only inference.
- A mini→flagship pipeline surpasses even the flagship alone, with a 9.7% gain in α at less than half the cost.
- The staged architecture delivers equal or better performance at dramatically lower cost, Pareto-dominating single-stage baselines.
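For reference, a minimal from-scratch sketch of Krippendorff's α in its nominal form, assuming two raters (e.g., pipeline labels versus gold labels) and no missing data; the paper's evaluation may use a different variant, such as ordinal weighting:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(labels_a, labels_b):
    """Krippendorff's alpha (nominal) for two raters over the same units."""
    # Coincidence matrix: each unit contributes both ordered pairs (a, b) and (b, a).
    coincidences = Counter()
    for a, b in zip(labels_a, labels_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    marginals = Counter()
    for (c, _k), count in coincidences.items():
        marginals[c] += count
    n = sum(marginals.values())  # = 2 * number of units
    # Observed disagreement: share of coincidences whose labels differ.
    d_o = sum(count for (c, k), count in coincidences.items() if c != k) / n
    # Expected disagreement under chance pairing with the same label marginals.
    d_e = sum(marginals[c] * marginals[k] for c, k in permutations(marginals, 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e

# Example: pipeline labels vs. gold labels on five query-passage pairs.
print(krippendorff_alpha_nominal([0, 1, 2, 3, 0], [0, 1, 2, 2, 0]))
```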
4. Design Principles and Implementation Best Practices
Central guidelines for multi-hop LLM pipeline construction:
- Divide-and-Conquer: Allocate inexpensive models to resolve simple, high-volume tasks (e.g., label-0 filtering), preserving costly model resources for ambiguous or nuanced inputs.
- Prompt Specialization: Develop stage-specific, explicit prompts aligned to the subtasks' reasoning requirements.
- Model Ordering: Place the fastest filtering model at the initial hop. Validate that its recall on relevant items remains high (e.g., >95%) to avoid discarding true positives.
- Scalability and Modularity: Each hop operates as an independent module (see the sketch after this list). This allows for:
- Easy integration of additional filters (e.g., spam, topic, or style classifiers)
- Plug-and-play replacement of model variants (as new models supersede old)
- Minimal impact on pipeline orchestration when individual hops are updated
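A minimal sketch of such plug-and-play hop modules, reusing the Model1/Model2 wrappers and prompt helpers sketched earlier; the Hop interface and early-exit convention are illustrative, not taken from the paper:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Hop:
    name: str
    run: Callable[[str, str], int]                    # (query, doc) -> label
    stop_if: Optional[Callable[[int], bool]] = None   # early-exit check on the label

def run_pipeline(hops, query, doc):
    """Run hops in order; a hop whose stop_if condition fires terminates further work."""
    label = 0
    for hop in hops:
        label = hop.run(query, doc)
        if hop.stop_if is not None and hop.stop_if(label):
            break
    return label

# Swapping models or inserting an extra filter only touches this list.
pipeline = [
    Hop("binary_filter", lambda q, d: Model1.run(prompt_binary(q, d)),
        stop_if=lambda label: label == 0),
    Hop("fine_grained", lambda q, d: Model2.run(prompt_three_level(q, d))),
]
```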
5. Generalization to Broader Multi-Hop Reasoning
The multi-hop pipeline abstraction extends to a range of LLM use cases:
- Multi-turn QA (see the sketch after this list):
  1. Identify relevant documents
  2. Extract candidate answer spans
  3. Aggregate or synthesize the final answer
- Stepwise Multi-hop Synthesis:
  1. Generate a high-level plan (chain-of-thought or outline)
  2. Elaborate on each step
  3. Verify and consolidate the response
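As a sketch of the multi-turn QA decomposition above, with the retriever and the two LLM calls left as placeholder callables (none of these names come from the source):

```python
def multi_hop_qa(question, corpus, retrieve, extract_llm, synthesize_llm, top_k=5):
    """Three-hop QA sketch: retrieve -> extract spans -> synthesize an answer."""
    # Hop 1: identify candidate documents with a cheap retriever or filter model.
    docs = retrieve(question, corpus)[:top_k]
    # Hop 2: pull a candidate answer span out of each surviving document.
    spans = [
        extract_llm(f"Question: {question}\nPassage: {doc}\nQuote the answer span:")
        for doc in docs
    ]
    # Hop 3: aggregate the spans into one final answer.
    evidence = "\n".join(spans)
    return synthesize_llm(f"Question: {question}\nEvidence:\n{evidence}\nAnswer concisely:")
```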
Adaptation best practices:
- Formulate hop-specific objectives and structure prompts accordingly.
- Allocate model classes by computational cost and hop complexity.
- Explicitly estimate token savings versus performance from early stopping and additional hops.
6. Impact, Limitations, and Future Directions
The multi-hop pipeline paradigm achieves significant performance and cost benefits by exploiting modularity, prompt specialization, and staged inference. In the experimental context of TREC Deep Learning, this approach matched or exceeded the strongest single-stage system, the flagship GPT-4o: the mini→flagship configuration gained over 9% in agreement at less than half the cost, while the mini→mini configuration still improved on the flagship at roughly 1/25th of the cost (Schnabel et al., 24 Jan 2025).
A prominent limitation is the need to verify that early filtering hops maintain high recall; otherwise, error propagation is possible. Additional hops may be inserted to resolve ambiguous cases, but with diminishing returns and increased complexity. As model variants proliferate and tasks diversify, the modular pipeline strategy offers a principled, practical blueprint for scalable and adaptive LLM deployment across a variety of multi-step reasoning, relevance assessment, and answer synthesis tasks.