
Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models

Published 21 Feb 2025 in cs.LG, cs.AI, cs.CL, and cs.DC | (2502.15964v1)

Abstract: We investigate an emerging setup in which a small, on-device language model (LM) with access to local data communicates with a frontier, cloud-hosted LM to solve real-world tasks involving financial, medical, and scientific reasoning over long documents. Can a local-remote collaboration reduce cloud inference costs while preserving quality? First, we consider a naive collaboration protocol where the local and remote models simply chat back and forth. Because only the local model reads the full context, this protocol achieves a 30.4x reduction in remote costs, but recovers only 87% of the performance of the frontier model. We identify two key limitations of this protocol: the local model struggles to (1) follow the remote model's multi-step instructions and (2) reason over long contexts. Motivated by these observations, we study an extension of this protocol, coined MinionS, in which the remote model decomposes the task into easier subtasks over shorter chunks of the document, that are executed locally in parallel. MinionS reduces costs by 5.7x on average while recovering 97.9% of the performance of the remote model alone. Our analysis reveals several key design choices that influence the trade-off between cost and performance in local-remote systems.

Summary

  • The paper proposes a hybrid protocol where a cloud 'Manager' decomposes tasks for multiple on-device 'Minions', significantly reducing inference cost.
  • It mitigates instruction-following and long-context reasoning challenges by assigning simpler, context-specific subtasks to on-device language models.
  • Experimental results demonstrate 97.9% performance retention with a 5.7x token reduction compared to full-context cloud inference.

The paper "Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models" (2502.15964) investigates strategies for enabling a small, resource-constrained on-device language model (LM) to leverage the capabilities of a large, powerful cloud-based LM for complex tasks involving long documents, while minimizing the costs associated with cloud inference. The core problem addressed is the trade-off between inference cost (primarily the token count sent to the cloud model) and task performance in scenarios requiring access to local, potentially private, data.

Problem Formulation and Baseline Approach

The scenario involves a user query requiring reasoning over a long document (e.g., financial reports, medical records, scientific papers) stored locally on a device. The device hosts a small LM (SLM) capable of accessing this local context, while a large LM (LLM) resides in the cloud. The goal is to answer the query accurately without sending the entire lengthy document to the cloud LLM, thereby reducing API costs and potentially improving privacy.

A naive baseline protocol involves a simple conversational exchange:

  1. The SLM receives the user query and the local document.
  2. The SLM sends the query (but not the full document) to the LLM.
  3. The LLM, lacking full context, might ask clarifying questions or provide instructions for the SLM to execute locally (e.g., "Find the section discussing financial projections").
  4. The SLM executes these instructions using its access to the local document and sends the results back to the LLM.
  5. This back-and-forth continues until the LLM can synthesize a final answer.
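The naive chat loop above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: `local_lm` and `remote_lm` are hypothetical callables (prompt in, text out), and the `FINAL:` termination convention is an assumption made for the example.

```python
def naive_protocol(query, document, local_lm, remote_lm, max_rounds=5):
    """Naive back-and-forth: only the local model ever sees the document."""
    remote_msg = remote_lm(
        f"User query: {query}. You cannot see the document; reply with an "
        "instruction for a local assistant, or prefix your answer with 'FINAL:'."
    )
    for _ in range(max_rounds):
        if remote_msg.startswith("FINAL:"):
            return remote_msg[len("FINAL:"):].strip()
        # The local model executes the remote model's instruction over the full document
        local_reply = local_lm(f"Document: {document}\n\nInstruction: {remote_msg}")
        remote_msg = remote_lm(
            f"Local assistant replied: {local_reply}. "
            "Give a further instruction or 'FINAL: <answer>'."
        )
    return remote_msg  # fall back to the last remote message
```

Note that the remote cost here scales only with the chat messages, never with the document length, which is the source of the 30.4x reduction.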

While this naive protocol achieves significant cost reduction (reported as 30.4x fewer tokens sent to the remote LLM compared to sending the full context), it suffers from performance degradation, achieving only 87% of the standalone LLM's performance on the evaluated tasks. The paper identifies two primary failure modes for the SLM in this setup:

  1. Instruction Following: The SLM struggles to accurately follow complex, multi-step instructions issued by the LLM.
  2. Long-Context Reasoning: Despite having access, the SLM's limited capacity prevents it from effectively reasoning over or extracting information from the entire long document when required by the LLM's instructions.

The Minions Collaboration Protocol

To address the limitations of the naive approach, the paper proposes the Minions protocol. This protocol reframes the interaction: the remote LLM acts as a "Manager," decomposing the main task into simpler, independent subtasks that can be executed in parallel by multiple instances of the SLM, termed "Minions," each operating on a distinct, shorter chunk of the original document.

The workflow of the Minions protocol is as follows:

  1. Initialization: The user query and the local document (context $C$) are provided. The document is partitioned into $k$ chunks: $C = \{c_1, c_2, \ldots, c_k\}$.
  2. Task Decomposition: The Manager LLM receives the user query $Q$ but not the full context $C$. Based on $Q$, it formulates a plan, breaking the original task into $m$ independent subtasks $\{S_1, S_2, \ldots, S_m\}$. Each subtask $S_j$ is designed to be answerable by referencing only one or a small number of context chunks $c_i$. The decomposition prompt instructs the Manager LLM to generate sub-questions or instructions suitable for parallel execution by workers (Minions) that will only see specific parts of the document.
  3. Subtask Assignment and Parallel Execution:
    • The decomposition plan specifies which context chunk(s) are relevant for each subtask $S_j$.
    • Multiple instances of the SLM (Minions) are invoked locally. Each Minion $j$ is assigned a subtask $S_j$ and the corresponding relevant context chunk(s) $c_i$.
    • All Minions execute their assigned subtasks in parallel on the local device. Let $R_j$ denote the result generated by Minion $j$ for subtask $S_j$.
  4. Result Aggregation: The results $\{R_1, R_2, \ldots, R_m\}$ from all Minions are collected.
  5. Final Synthesis: The original query $Q$ and the aggregated results $\{R_1, \ldots, R_m\}$ are sent to the Manager LLM, which synthesizes the final answer from the query and the Minions' partial results.

This approach mitigates the naive protocol's weaknesses:

  • Instruction Following: Subtasks are designed to be simpler and more self-contained, making them easier for the SLM to handle.
  • Long-Context Reasoning: Each Minion only processes a short chunk, avoiding the need for the SLM to handle the entire long document simultaneously. The Manager LLM implicitly handles the long-context reasoning during the decomposition and final synthesis phases.

Implementation Details and Considerations

Implementing the Minions protocol requires careful consideration of several components:

  • Model Selection: The setup pairs a small on-device SLM (on the order of 1-3B parameters) as the Minion with a frontier cloud model as the Manager. The choice depends on device capabilities and the desired performance/cost trade-off; the SLM needs sufficient capability to execute the decomposed subtasks accurately.
  • Context Chunking: The document partitioning strategy is crucial. Fixed-size chunks with potential overlap are common. The chunk size impacts the granularity of subtasks and the context window requirements for the SLM. Optimal chunk size may vary depending on the task and document structure. The paper experimented with chunk sizes around 1000 tokens.
  • Task Decomposition Prompting: The quality of the Manager LLM's task decomposition is critical. The prompt engineering needs to guide the LLM to generate:
    • Subtasks that are truly independent or have minimal dependencies.
    • Subtasks answerable from single chunks.
    • A comprehensive set of subtasks covering the original query.
    • Clear specification of relevant chunks for each subtask.
    • Example decomposition instructions might guide the LLM to "Break down the query into sub-questions that can be answered independently by looking at small, localized parts of the document."
  • Parallel Execution Framework: On the device, a mechanism is needed to manage the parallel execution of Minion instances. This could involve multithreading or multiprocessing, depending on the platform. Resource management (memory, compute) is essential to avoid overwhelming the device. The number of parallel Minions ($m$) is a tunable parameter influencing latency and resource usage.
  • Communication: The protocol involves two main communication points with the Manager LLM: one for task decomposition (sending the query) and one for final synthesis (sending the query and aggregated Minion results). The bulk of the context (the document chunks) remains local, processed only by the Minions.
  • Cost Calculation: The primary cost driver is the number of tokens processed by the Manager LLM. In Minions, this includes tokens for the initial query, the generated decomposition plan (if returned explicitly, though often implicitly used), the aggregated results from Minions, and the final synthesized answer. This is significantly less than sending the entire document.
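The chunking step can be sketched as a fixed-size splitter with overlap. This is a minimal sketch, not the paper's exact scheme: tokens are approximated by whitespace-separated words, and the `overlap` parameter is an illustrative assumption.

```python
def chunk_document(document, chunk_size=1000, overlap=100):
    """Split a document into chunks of roughly `chunk_size` tokens.

    Tokens are approximated by whitespace-separated words; a real
    implementation would use the SLM's tokenizer. Consecutive chunks share
    `overlap` tokens so facts spanning a boundary are not lost.
    """
    words = document.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached; avoid a trailing overlap-only chunk
    return chunks
```

Smaller `chunk_size` yields more, easier subtasks at the cost of fragmenting information; the paper's roughly 1000-token chunks sit in the middle of this trade-off.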

A simplified pseudocode representation:

import asyncio

async def run_minions_protocol(query, document, manager_llm, minion_slm, chunk_size):
    # 1. Chunk the document locally
    chunks = chunk_document(document, chunk_size)  # list of strings

    # 2. Task decomposition (remote call 1): only the query is sent to the cloud
    decomposition_prompt = (
        f"Given the query '{query}', break it down into independent sub-tasks. "
        f"For each sub-task, specify which chunk index (0 to {len(chunks) - 1}) "
        "is likely relevant. Output a list of "
        "{'sub_task': '...', 'chunk_indices': [...]} entries."
    )
    # In practice, the LLM might be prompted to generate sub-questions directly
    # or to formulate a plan requiring information extraction.
    sub_tasks_description = manager_llm.generate(decomposition_prompt)
    sub_tasks = parse_decomposition(sub_tasks_description)  # list of {'sub_task': str, 'chunk_indices': list[int]}

    # 3. Parallel subtask execution (local)
    minion_results = {}  # {sub_task_id: result}
    active_minions = []
    for i, task_info in enumerate(sub_tasks):
        relevant_chunks = [chunks[idx] for idx in task_info['chunk_indices']]
        context_for_minion = " ".join(relevant_chunks)
        minion_prompt = (
            f"Context: {context_for_minion}\n\n"
            f"Sub-task: {task_info['sub_task']}\n\nAnswer:"
        )
        # Spawn a Minion instance asynchronously (run_minion_async wraps a local SLM call)
        active_minions.append(
            run_minion_async(minion_slm, minion_prompt, sub_task_id=i)
        )

    # Wait for the Minions and collect their results
    for future in asyncio.as_completed(active_minions):
        sub_task_id, result = await future
        minion_results[sub_task_id] = result

    # 4. Format the aggregated results
    aggregated_results_text = ""
    for i, task_info in enumerate(sub_tasks):
        result = minion_results.get(i, "Error executing subtask.")
        aggregated_results_text += f"Sub-task {i} ({task_info['sub_task']}): {result}\n"

    # 5. Final synthesis (remote call 2): query plus aggregated results, not the document
    synthesis_prompt = (
        f"Original Query: {query}\n\n"
        f"Results from sub-tasks:\n{aggregated_results_text}\n"
        "Synthesize the final answer:"
    )
    final_answer = manager_llm.generate(synthesis_prompt)

    return final_answer
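The `parse_decomposition` helper is left abstract in the pseudocode. Assuming the Manager is prompted to emit a JSON array (a common convention, but an assumption here), it might look like the following sketch:

```python
import json

def parse_decomposition(text):
    """Parse the Manager's plan into [{'sub_task': str, 'chunk_indices': [int]}].

    Assumes the Manager was asked to emit a JSON array; the substring
    extraction tolerates surrounding prose such as "Here is the plan: [...]".
    """
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("No JSON list found in decomposition output")
    plan = json.loads(text[start:end + 1])
    return [
        {"sub_task": item["sub_task"],
         "chunk_indices": [int(i) for i in item["chunk_indices"]]}
        for item in plan
    ]
```

Robust parsing matters in practice: a malformed plan fails the whole protocol before any Minion runs, so retrying the decomposition call on a `ValueError` is a sensible guard.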

Experimental Results and Analysis

The Minions protocol was evaluated on question-answering tasks requiring reasoning over long documents from domains like finance (earnings calls), medicine (PubMed abstracts), and science (scientific paper reviews).

  • Performance: Minions recovered 97.9% of the performance (measured by metrics like ROUGE or accuracy, depending on the task) achieved by the Manager LLM (GPT-4) when provided with the full context directly. This significantly outperforms the naive protocol's 87%.
  • Cost Reduction: Minions reduced the number of tokens sent to the remote Manager LLM by an average factor of 5.7x compared to the full-context baseline. While less drastic than the naive protocol's 30.4x reduction, it comes with substantially higher quality.
  • Trade-offs: The paper analyzes the impact of design choices:
    • Chunk Size: Smaller chunks increase parallelism but might fragment information needed for some subtasks. Larger chunks reduce parallelism but provide more local context per Minion.
    • Number of Minions: More minions allow higher parallelism but increase local resource consumption.
    • Decomposition Strategy: The effectiveness of the Manager LLM's decomposition heavily influences overall performance.
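As a back-of-envelope illustration of what these reduction factors mean, consider the remote token budget for a long document. Only the 30.4x and 5.7x factors come from the paper; the 50,000-token document size is an assumed example value.

```python
def remote_tokens(full_context_tokens, reduction_factor):
    """Approximate tokens sent to the cloud model under a given reduction factor."""
    return full_context_tokens / reduction_factor

full = 50_000  # assumed token count of the full local document
naive_tokens = remote_tokens(full, 30.4)    # naive chat protocol
minions_tokens = remote_tokens(full, 5.7)   # MinionS protocol
```

MinionS thus sends roughly five times more remote tokens than the naive protocol for this document, but in exchange lifts quality from 87% to 97.9% of the full-context baseline.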

The results demonstrate that task decomposition and parallel local execution provide a favorable trade-off point, significantly reducing cloud interaction costs while largely preserving the quality benefits of using a powerful cloud LLM.

Practical Implications and Deployment

The Minions architecture offers a practical approach for deploying advanced LM capabilities in hybrid edge-cloud environments, particularly where data locality or privacy is a concern, or where network bandwidth/cost is constrained.

  • Applications: Financial analysis on local reports, medical summarization of patient records on physician devices, interactive querying of large local codebases or documentation.
  • Deployment Considerations:
    • Requires sufficient computational resources on the local device to run multiple SLM instances concurrently.
    • Latency includes two calls to the remote LLM plus the parallel execution time of the slowest Minion. While parallel execution helps, the overhead of decomposition and synthesis adds latency compared to a single remote call (if feasible).
    • Error handling is important: How are failures in Minion execution or poor subtask results handled during the final synthesis? The Manager LLM might need robustness mechanisms.
    • The complexity lies significantly in the Manager LLM's ability to perform effective task decomposition via prompting or fine-tuning.
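The error-handling concern above can be addressed with a simple per-Minion guard: rather than letting one failed or hung Minion stall the protocol, each call returns a sentinel result that the Manager can recognize during synthesis. The timeout value and fallback messages below are illustrative assumptions.

```python
import asyncio

async def run_minion_guarded(minion_call, prompt, sub_task_id, timeout_s=30.0):
    """Run one Minion with a timeout and a safe fallback result.

    `minion_call` is any async callable (prompt -> answer string). On failure,
    the sub-task yields a sentinel string instead of crashing the protocol.
    """
    try:
        result = await asyncio.wait_for(minion_call(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        result = "[NO ANSWER: sub-task timed out]"
    except Exception as exc:
        result = f"[NO ANSWER: sub-task failed ({type(exc).__name__})]"
    return sub_task_id, result
```

Pairing this with a synthesis prompt that tells the Manager to ignore `[NO ANSWER: ...]` entries keeps the final answer robust to individual Minion failures.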

The framework presents a structured method for leveraging the respective strengths of local and cloud models – local access and parallelizability for the SLM, and complex reasoning/planning for the LLM – in a cost-effective manner.

Conclusion

The Minions protocol (2502.15964) presents a compelling strategy for hybrid local-remote LM collaboration. By employing a Manager LLM to decompose tasks for parallel execution by local Minion SLMs operating on document chunks, it achieves substantial remote inference cost savings (5.7x reduction) while maintaining high task performance (97.9% of the remote-only baseline). This structured decomposition approach overcomes key limitations of simpler chat-based protocols, offering a practical blueprint for deploying sophisticated LM applications in resource-constrained or privacy-sensitive environments. Key implementation aspects involve effective task decomposition prompting, efficient local parallel execution, and managing the inherent cost-performance trade-offs.
