LongPiBench: LLM Bias & Edge Deployment
- LongPiBench is a benchmarking framework that systematically examines positional biases in LLMs and evaluates on-device inference performance on single-board computers.
- It employs controlled experiments to isolate absolute and relative positional effects, highlighting the challenges in aggregating dispersed contextual information.
- The framework also assesses key hardware metrics such as throughput, RAM, and power consumption for quantized LLMs, informing optimal edge deployment strategies.
LongPiBench is a benchmarking framework designed to systematically assess two complementary frontiers in LLM research: (1) the positional biases encountered by transformer LLMs when aggregating multiple relevant information pieces in lengthy contexts (Tian et al., 2024), and (2) the deployment performance of quantized LLMs on single-board computers (SBCs), with a focus on low-cost, resource-constrained edge inference (Tung et al., 20 Oct 2025). By providing rigorously controlled tasks and unified evaluation pipelines, LongPiBench enables in-depth analysis of both model-level capabilities and practical on-device deployment limitations.
1. Motivation and Scope
LongPiBench was conceived to fill critical gaps left by previous LLM benchmarks:
- Positional Aggregation Bias: The "lost in the middle" bias (LLMs ignoring a single relevant fact placed near the middle of the context) has been substantially mitigated in modern long-context models. Tasks that require retrieving and integrating several relevant items at variable pairwise spacing, however, remain a prominent source of failure. Existing benchmarks such as Needle-in-a-Haystack and LooGLE do not probe this regime, since they focus on single-item retrieval, absolute position, or search-localization tasks (Tian et al., 2024).
- Edge Deployment on SBCs: The surging need for privacy-preserving, affordable on-device inference requires rigorous evaluation of LLM throughput, memory, and energy under SBC resource constraints. Prior systematic evaluations on platforms like Raspberry Pi and Orange Pi were lacking (Tung et al., 20 Oct 2025).
LongPiBench thus constitutes a dual-purpose suite: one targeting systematic measurement of context-length-related biases in LLM architecture and another establishing empirical constraints for lightweight inference on commodity hardware.
2. Benchmark Construction and Experimental Design
2.1 Positional Bias Assessment
Each LongPiBench instance for positional bias consists of multiple relevant elements within a context of configurable length (up to 256K tokens), embedded among distractor "noise." Two orthogonal experimental axes are controlled:
- Absolute Position: The context is divided into equal segments, and all relevant elements are placed within a single chosen segment, which sets their mean absolute location. Varying the chosen segment realizes the absolute-position profiles.
- Relative Position: The location of the first relevant element is fixed, and the distance between consecutive relevant elements is controlled by a spacing parameter. The smallest spacing makes all relevant elements adjacent, while the largest distributes them uniformly through the context. This explicitly probes the model's capacity to aggregate distant facts, disentangled from absolute context location.
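The two placement axes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the choice to count positions in item slots rather than tokens, and the even spreading of elements inside a segment are not taken from the benchmark's released code.

```python
from typing import List


def absolute_positions(context_len: int, num_segments: int,
                       segment: int, k: int) -> List[int]:
    """Place k relevant elements inside one segment (1-indexed).

    Assumption: elements are spread evenly within the chosen segment.
    """
    if not 1 <= segment <= num_segments:
        raise ValueError("segment out of range")
    seg_len = context_len // num_segments
    step = seg_len // k
    if step == 0:
        raise ValueError("segment too small for k elements")
    start = (segment - 1) * seg_len
    return [start + j * step for j in range(k)]


def max_gap(context_len: int, first: int, k: int) -> int:
    """Largest gap that still fits k elements starting at `first`."""
    if k < 2:
        raise ValueError("need at least two elements to define a gap")
    return (context_len - 1 - first) // (k - 1)


def relative_positions(context_len: int, first: int,
                       k: int, gap: int) -> List[int]:
    """Fix the first element at `first`; place the rest `gap` slots apart."""
    positions = [first + j * gap for j in range(k)]
    if positions[-1] >= context_len:
        raise ValueError("elements do not fit in the context")
    return positions
```

With a gap of 1 the elements are adjacent; with `max_gap(...)` they span the whole context from the fixed starting slot, which corresponds to the two extremes of the relative-position axis.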
Multiple tasks of varying complexity are instantiated: Table SQL retrieval, Timeline Reordering, and Equation Solving. To prevent spurious performance from memorization, data undergoes knowledge masking and manual correction.
2.2 On-Device Inference Evaluation
The hardware benchmarking arm assesses 25 q4_k_m-quantized open-source LLMs (up to 7B parameters) across three SBCs—Raspberry Pi 4, Raspberry Pi 5, and Orange Pi 5 Pro—using the Ollama and Llamafile inference runtimes. Experiments employ three prompt lengths, four execution trials per setting, and metric logging of:
- Tokens per second (TPS) throughput
- Peak RAM usage (MB)
- Mean power consumption (W)
CPU scaling is explored by varying the number of cores from 4 to 8, and the model/board/runtime configurations span practical deployment scenarios.
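The experimental grid described above can be enumerated with a short sketch. The prompt-length labels and the model names passed in are placeholders, since the source does not list them here, and the sketch assumes every combination is attempted, which overstates the real run count because some models do not fit on the smaller boards.

```python
from itertools import product

BOARDS = ["Raspberry Pi 4", "Raspberry Pi 5", "Orange Pi 5 Pro"]
RUNTIMES = ["Ollama", "Llamafile"]
PROMPT_LENGTHS = ["short", "medium", "long"]  # placeholder labels (assumption)
TRIALS = 4  # four execution trials per setting


def run_plan(models):
    """Yield one dict per (model, board, runtime, prompt length, trial)."""
    for model, board, runtime, prompt_len in product(
            models, BOARDS, RUNTIMES, PROMPT_LENGTHS):
        for trial in range(1, TRIALS + 1):
            yield {
                "model": model,
                "board": board,
                "runtime": runtime,
                "prompt_length": prompt_len,
                "trial": trial,
            }
```

For two hypothetical models this yields 2 x 3 x 2 x 3 x 4 = 144 planned runs, an upper bound before infeasible model/board pairs are removed.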
3. Metrics and Evaluation Protocols
3.1 Positional Bias: Task Recall
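The primary metric for positional bias is recall:
$$\mathrm{Recall} = \frac{|G \cap P|}{|G|}$$
where $G$ is the ground-truth set and $P$ the model output. Performance is analyzed over absolute and relative placement levels, isolating the drop in average recall along each axis:
- Absolute: the change in average recall as the segment holding the relevant elements varies.
- Relative: the change in average recall as the distance between relevant elements increases.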
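A minimal sketch of the recall computation follows. The `recall` function is the standard set-based definition; `recall_drop` is a hypothetical helper that measures the drop as the best level minus the worst level, which is one plausible reading and not necessarily the exact formula used by the benchmark.

```python
from typing import Dict, Hashable, Iterable


def recall(ground_truth: Iterable[Hashable],
           predicted: Iterable[Hashable]) -> float:
    """Fraction of ground-truth items that appear in the model output."""
    gt = set(ground_truth)
    if not gt:
        raise ValueError("ground-truth set must not be empty")
    return len(gt & set(predicted)) / len(gt)


def recall_drop(recall_by_level: Dict[Hashable, float]) -> float:
    """Assumed drop measure: best average recall minus worst, across levels."""
    values = list(recall_by_level.values())
    if not values:
        raise ValueError("no levels given")
    return max(values) - min(values)
```

The same `recall_drop` helper can be applied to averages keyed by segment index (absolute axis) or by spacing level (relative axis).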
3.2 Device Performance: Throughput, RAM, and Power
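Throughput is defined as:
$$\mathrm{TPS} = \frac{N_{\text{tokens}}}{t_{\text{inference}}}$$
where $N_{\text{tokens}}$ is the number of tokens generated and $t_{\text{inference}}$ the elapsed inference time in seconds.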
Power is reported as average consumption during inference, and RAM as peak usage.
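As a rough illustration, the three device metrics could be summarized across repeated trials as below. The tuple layout and the aggregation choices (mean throughput, maximum of per-trial peak RAM, mean of per-trial mean power) are assumptions, not the benchmark's documented procedure.

```python
from statistics import mean
from typing import Dict, List, Tuple

# (generated_tokens, elapsed_seconds, peak_ram_mb, mean_power_w) per trial
Trial = Tuple[int, float, float, float]


def tokens_per_second(generated_tokens: int, elapsed_seconds: float) -> float:
    """Throughput: generated tokens divided by elapsed inference time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return generated_tokens / elapsed_seconds


def summarize_trials(trials: List[Trial]) -> Dict[str, float]:
    """Aggregate repeated trials of one configuration (assumed scheme)."""
    return {
        "tps": mean([tokens_per_second(t, s) for t, s, _, _ in trials]),
        "peak_ram_mb": max([ram for _, _, ram, _ in trials]),
        "power_w": mean([p for _, _, _, p in trials]),
    }
```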
Performance is reported both aggregated and per-board/runtime/model size for reproducibility.
| Model | Parameters | Board | Runtime | Throughput (TPS) | Peak RAM (MB) | Power (W) |
|---|---|---|---|---|---|---|
| TinyLlama | 1.1B | RPi 5 | Ollama | 12.7 | 3,200 | 10.0 |
| TinyLlama | 1.1B | RPi 5 | Llamafile | 24.5 | 2,400 | 6.5 |
| LLaMA2-3B | 3B | Orange Pi 5 Pro | Ollama | 3.8 | 4,800 | 14.0 |
| LLaMA2-3B | 3B | Orange Pi 5 Pro | Llamafile | 12.1 | 3,300 | 8.4 |
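The table rows can be compared directly with a little arithmetic. The sketch below computes the Llamafile-over-Ollama speedup, the relative power and RAM savings, and tokens per joule (throughput divided by power); the last is a derived efficiency figure introduced here for illustration, not a metric reported in the source.

```python
# (model, board, runtime, tps, peak_ram_mb, power_w), copied from the table
ROWS = [
    ("TinyLlama", "RPi 5", "Ollama", 12.7, 3200, 10.0),
    ("TinyLlama", "RPi 5", "Llamafile", 24.5, 2400, 6.5),
    ("LLaMA2-3B", "Orange Pi 5 Pro", "Ollama", 3.8, 4800, 14.0),
    ("LLaMA2-3B", "Orange Pi 5 Pro", "Llamafile", 12.1, 3300, 8.4),
]


def compare_runtimes(model: str) -> dict:
    """Compare Llamafile against Ollama for one model in ROWS."""
    by_runtime = {row[2]: row for row in ROWS if row[0] == model}
    ollama, llamafile = by_runtime["Ollama"], by_runtime["Llamafile"]
    return {
        "speedup": llamafile[3] / ollama[3],
        "power_saving": 1 - llamafile[5] / ollama[5],
        "ram_saving": 1 - llamafile[4] / ollama[4],
        # tokens/s divided by joules/s gives tokens per joule
        "tokens_per_joule": {
            "Ollama": ollama[3] / ollama[5],
            "Llamafile": llamafile[3] / llamafile[5],
        },
    }
```

For these four rows the speedup is roughly 1.9x for TinyLlama and 3.2x for LLaMA2-3B, with power savings of about 35% and 40% respectively.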
4. Key Observations and Analysis
4.1 Absolute Versus Relative Positional Effects
- Absolute position sensitivity ("lost in the middle") is largely absent in state-of-the-art and scaled-up LLMs (recall ≳ 95% across segments 1–16), but persists in smaller open models such as Qwen-2.5-7B and WizardLM-2 (manifesting as U-shaped performance dips) (Tian et al., 2024).
- Relative position bias is universal: as the mean inter-element distance increases, from adjacent to spread across half the context, recall drops 20–35 percentage points, even in leading commercial models (e.g., GPT-4o-mini, Claude-3-Haiku). This effect does not diminish as rapidly with parameter scaling, indicating that robustness to widely separated evidence does not emerge through scale alone.
4.2 Impact of Query Placement and Model Scaling
Placing the query tokens at the head of the context, rather than the tail, reduces both absolute and relative positional biases for decoder-only architectures. Scaling Qwen models from 7B to 72B dramatically mitigates absolute positional bias, but only marginally reduces the severity of relative positional bias.
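Query placement is easy to vary when assembling prompts. The helper below is a hypothetical sketch: the function name, the placement labels, and the blank-line separator are illustrative choices, and the "both" option simply repeats the query at each end of the context.

```python
def build_prompt(context: str, query: str, placement: str = "head") -> str:
    """Assemble a prompt with the query before, after, or around the context."""
    if placement == "head":
        parts = [query, context]
    elif placement == "tail":
        parts = [context, query]
    elif placement == "both":
        parts = [query, context, query]
    else:
        raise ValueError(f"unknown placement: {placement!r}")
    return "\n\n".join(parts)
```

Running the same instances under each placement and comparing recall gives a direct check of how much the query position matters for a given model.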
4.3 SBC Inference Trade-Offs and Bottlenecks
- Llamafile delivers up to 4x higher throughput and 30–40% lower power draw compared to Ollama but scales poorly past 4 cores.
- Ollama scales throughput linearly with core count but at the expense of higher RAM and energy.
- On-device practical upper limits: Raspberry Pi 4 is restricted to ≤360M parameter models; Raspberry Pi 5 supports up to 1.5B parameters; Orange Pi 5 Pro can accommodate models up to 7B with reduced speed.
- Observed bottlenecks include memory bandwidth saturation and cache contention on high-core SMP SBCs, as well as RAM capacity limiting feasible model sizes.
5. Practical Recommendations
- Bias Mitigation: Architectural or algorithmic innovations (e.g., calibrated positional attention, explicit co-attention objectives, or memory-based retrieval layers) are likely required to address relative positional bias. Prompt engineering—such as placing queries at both ends, or using chunked context windows—offers partial amelioration.
- On-device Deployment:
  - Use q4_k_m quantization for maximal efficiency.
  - Restrict Llamafile to four high-performance cores; use all available CPU with Ollama if throughput is paramount.
  - Monitor thermal and RAM constraints; employ active cooling for long or batch prompt sessions.
  - Select board-model-runtime combinations keyed to operational goals (e.g., Raspberry Pi 5 for small/medium LLMs, Orange Pi 5 Pro for larger models).
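The deployment guidance can be condensed into a small selection helper. This is a sketch under stated assumptions: the function names are hypothetical, the size thresholds are the practical upper limits reported in the observations (360M, 1.5B, and 7B parameters), and the helper returns the smallest listed board rather than the fastest option.

```python
def recommend_board(params_billions: float) -> str:
    """Smallest listed board reported to host a q4_k_m model of this size."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    if params_billions <= 0.36:
        return "Raspberry Pi 4"
    if params_billions <= 1.5:
        return "Raspberry Pi 5"
    if params_billions <= 7.0:
        return "Orange Pi 5 Pro"
    raise ValueError("models above 7B are outside the evaluated range")


def recommend_threads(runtime: str, available_cores: int) -> int:
    """Cap Llamafile at four cores; let Ollama use every available core."""
    name = runtime.lower()
    if name == "llamafile":
        return min(4, available_cores)
    if name == "ollama":
        return available_cores
    raise ValueError(f"unknown runtime: {runtime!r}")
```

For example, a 1.1B model maps to the Raspberry Pi 5, and Llamafile on an eight-core board would be limited to four threads.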
6. Implications and Future Work
LongPiBench identifies that, although large-context LLMs have remedied "lost in the middle" failures for isolated fact retrieval, a more consequential bias persists when aggregating multiple, widely spaced facts—a key requirement in information synthesis and multi-step reasoning tasks. The lack of emergent mitigation solely through scale, and the pronounced drop in recall with increasing inter-element distance, indicate a need for new long-range attention mechanisms and optimization objectives tailored to span aggregation.
On the hardware front, LongPiBench demonstrates that cost-effective edge inference is viable for a large subset of use cases, contingent on appropriate quantization, runtime, and hardware choices. This suggests a rising practical role for local LLM deployment where throughput, privacy, or latency are crucial and massive models are not mandatory.
Open avenues include broader benchmarking of memory-augmented or retrieval-augmented transformer architectures, analysis of more diverse real-world corpora and prompt types, and further integration of bias-calibrated metrics into model selection and evaluation pipelines.
7. Public Accessibility and Community Impact
LongPiBench provides publicly released code, data, and detailed results, facilitating reproducible research and enabling the broader community to probe, compare, and mitigate context-related biases in LLMs. By decoupling absolute from relative positional effects, quantifying each, and mapping realistic deployment limits on affordable hardware, it establishes a critical yardstick for future LLM development and evaluation (Tian et al., 2024; Tung et al., 20 Oct 2025).