PALM-Bench: Multi-Domain AI Benchmark
- PALM-Bench is a suite of benchmarks evaluating cloud server microarchitecture using cycle-accurate simulations to assess cache performance and security mechanisms.
- It includes PalmBench, which benchmarks quantized LLM performance on mobile devices, highlighting trade-offs in memory efficiency, throughput, and generative quality.
- The framework also targets personalized audio-language models by evaluating multi-speaker recognition, selective captioning, and profile-based reasoning for contextualized outputs.
PALM-Bench denotes a series of distinct yet influential benchmarks and frameworks emerging between 2016 and 2026, each targeting a different aspect of modern machine learning system evaluation and personalization. This entry covers the three principal and non-overlapping usages of the term: (1) PALMScloud/PALM-Bench for cloud server and microarchitecture benchmarking (Wu et al., 2016), (2) PalmBench for benchmarking compressed LLMs on mobile hardware (Li et al., 2024), and (3) PALM-Bench for evaluating Personalized Audio-LLMs (Wang et al., 7 Jan 2026). Each instance reflects evolving demands in system characterization, model compression, and personalized AI.
1. PALMScloud / PALM-Bench: Cloud Server Microarchitecture Benchmark
PALMScloud, also termed PALM-Bench, is a suite of purpose-built workloads for evaluating new hardware features—particularly cache architectures and security mechanisms—on cycle-accurate simulators (notably gem5) and on real or dual-node hardware (Wu et al., 2016). The core design tenets include representativeness (real-world cloud workloads), simulatability (parameterizable, open-source, rapid-boot server/client binaries), and extensibility.
Benchmark Architecture and Workload Suite
PALMScloud models a dual-node networked environment with each workload instantiated in a Linux VM, driven by a complementary client process, and communicating over full TCP/IP Ethernet. A Python configuration script provisions PCI networking, configures system topology, and ensures sim-hardware fidelity. The workload suite comprises:
- Web Server (Apache httpd + ab), stressing CPU and I/O.
- Database Server (MySQL OLTP + SysBench), exerting pressure on memory and locks through random R/W.
- Mail Server (Postfix SMTP + Postal), mixing network and small-payload I/O.
- File Server (Samba smbd + Dbench), exercising memory-mapped disk I/O.
- Streaming Server (ffserver + openRTSP), for sustained streaming via network and memory.
- Application Server (Tomcat JSP/Servlets + ab), targeting JVM and dynamic content.
- Compute Server (LIBSVM + UCI Adult dataset), stress-testing floating-point/branching.
- Idle Server, as a baseline for system noise.
Parameterization enables scaling input/concurrency to target specific hierarchy bottlenecks.
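As a minimal sketch of this parameterization idea, a workload registry might map each server/client pair to a scalable concurrency knob. The names and fields below are illustrative, not the benchmark's actual configuration format:

```python
# Hypothetical PALMScloud-style workload registry; entries are illustrative.
WORKLOADS = {
    "web":  {"server": "httpd",   "client": "ab",       "concurrency": 64},
    "db":   {"server": "mysqld",  "client": "sysbench", "concurrency": 16},
    "mail": {"server": "postfix", "client": "postal",   "concurrency": 8},
    "idle": {"server": None,      "client": None,       "concurrency": 0},
}

def scale_concurrency(name: str, factor: int) -> dict:
    """Return a copy of a workload config with client concurrency scaled,
    e.g. to push a chosen level of the memory hierarchy harder."""
    cfg = dict(WORKLOADS[name])
    cfg["concurrency"] *= factor
    return cfg
```

Scaling the client side rather than the server binary keeps the VM image unchanged between experiments, which matches the suite's emphasis on rapid-boot, reusable server images.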
Metrics and Integration
PALMScloud collects granular metrics (all extractable from gem5 statistics or client logs):
- Throughput: completed client requests per second, T = N_req / t_wall.
- Average Latency: mean per-request response time, (sum of per-request latencies) / N_req.
- Cache Miss Rate: misses over accesses per cache level, m = N_miss / N_access.
- MPKI (misses per kilo-instruction, 1000 · N_miss / N_inst), IPC (N_inst / N_cycles), and design speedup (IPC_design / IPC_baseline).
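The derived metrics above reduce to simple ratios over simulator counters. A dependency-free sketch, using made-up gem5-style counter values:

```python
def mpki(misses: int, instructions: int) -> float:
    """Cache misses per thousand committed instructions."""
    return 1000.0 * misses / instructions

def ipc(instructions: int, cycles: int) -> float:
    """Committed instructions per cycle."""
    return instructions / cycles

def speedup(ipc_design: float, ipc_baseline: float) -> float:
    """Relative performance of a design point versus the baseline."""
    return ipc_design / ipc_baseline

# Example counters (illustrative values, not real gem5 output):
stats = {"instructions": 2_000_000, "cycles": 2_500_000, "l1d_misses": 40_000}
l1d_mpki = mpki(stats["l1d_misses"], stats["instructions"])  # 20.0
core_ipc = ipc(stats["instructions"], stats["cycles"])       # 0.8
```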
Security-Performance Case Study: Newcache
A notable application is the Newcache secure cache study, introducing randomized index mappings (parameterized by k index bits) to resist cache side channels. Experiments reveal L1 D-cache miss rate and IPC vary negligibly (<1–2%) across k=0–6, showing throughput and client-observed latency remain statistically unchanged (speedup ≈ 0.99–1.01), confirming negligible security overhead in cloud settings.
Best Practices for Extension
- Maintain clean VM/server images.
- Use consistent network and service orchestration.
- Match client/server addressing and network link parameters in simulations.
- Validate new workloads via full-stack functional testing prior to instrumentation.
- Containerize for real hardware to ensure workload isolation.
- Always include an idle baseline to measure OS-induced noise, which can exceed 5–10% on lightweight benchmarks.
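The idle-baseline practice can be expressed as a small helper that flags runs where background OS activity exceeds a noise budget. This is a sketch under the assumption that the same event counter (e.g. instructions or cache accesses) is collected for both the idle and loaded runs:

```python
def noise_pct(idle_events: int, workload_events: int) -> float:
    """Share (%) of a workload's event count attributable to background
    OS activity, estimated from the idle-server baseline run."""
    return 100.0 * idle_events / workload_events

def too_noisy(idle_events: int, workload_events: int,
              threshold: float = 5.0) -> bool:
    """Flag lightweight runs whose OS-induced noise exceeds the budget."""
    return noise_pct(idle_events, workload_events) > threshold
```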
PALMScloud thus achieves a representative, extensible, and highly simulatable benchmarking platform for the rapid co-evaluation of hardware innovations and cloud service stacks (Wu et al., 2016).
2. PalmBench: Quantized LLM Benchmarking on Mobile and Edge Hardware
PalmBench is an automated, device-centric framework for evaluating the resource usage and output quality of compressed LLM inference on mobile and edge devices (Li et al., 2024). Its central focus is the real-world trade-off among generative accuracy, execution efficiency, and harmful output, compared across multiple quantization schemes and hardware configurations.
Core Evaluation Metrics
- Memory Footprint (M, MB): peak resident memory during inference.
- GPU/Accelerator Execution Time (t): wall-clock time spent in accelerator kernels per query.
- Throughput (T, tokens/s): generated tokens divided by decode time.
- Energy Consumption (E, Joules or mAh): battery or rail energy drawn over the run.
- Generative Quality (Q): EM/F1 (QA), BLEU, or task-specific metrics.
- Harmful Output Rate (H): Fraction of hallucinated/toxic responses as scored by Perspective API/TET and LLM-judge.
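Each of these metrics is a per-run ratio; a minimal sketch of how they might be aggregated from raw measurements (the function names are illustrative, not PalmBench's API):

```python
def throughput(tokens: int, decode_seconds: float) -> float:
    """T in tokens/s over the decode phase."""
    return tokens / decode_seconds

def energy_per_token(joules: float, tokens: int) -> float:
    """E normalized per generated token (Joules/token)."""
    return joules / tokens

def harmful_rate(harmful: int, total: int) -> float:
    """H: fraction of responses judged hallucinated or toxic."""
    return harmful / total
```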
Automated Benchmarking Pipeline
PalmBench orchestrates model deployment via MLC-LLM or llama.cpp to a standardized suite of mobile (Pixel, iPhone, Orange Pi) and edge (Jetson Nano) devices, instrumenting the measurement of latency, throughput, power, and memory using platform-native profilers and external hardware tools. Hallucination and toxicity rates are computed using external LLMs (GPT-4o, Claude-3.5) and test sets like HaluEval and TruthfulQA.
Quantization Schemes and Implementation
- MLC: Supports q0f16 (fp16), q3f16, q4f16, q4f16_awq using TVM for kernel codegen.
- llama.cpp: Implements K-Quant (2–6 bits), GPTQ-3/4 bits in GGUF format; vectorizes low-bit weight unpacking.
- Quantization scales weight storage roughly linearly with bit-width (about n_params × bits / 8 bytes) and reduces memory-bandwidth demand, at the cost of scheme-specific compute overhead for fine-grained bit-unpacking.
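The size scaling can be made concrete with a back-of-envelope estimate; the optional metadata term for per-group scales/zero-points is an assumption, since its size is scheme-dependent:

```python
def weight_bytes(n_params: int, bits: int, metadata_overhead: float = 0.0) -> int:
    """Approximate weight storage: n_params * bits / 8 bytes, plus an
    optional fraction for quantization metadata (assumed, scheme-dependent)."""
    return int(n_params * bits / 8 * (1.0 + metadata_overhead))

# A 7B-parameter model at fp16 vs. 4-bit:
fp16_size = weight_bytes(7_000_000_000, 16)  # 14 GB of weights
q4_size   = weight_bytes(7_000_000_000, 4)   # 3.5 GB, i.e. a 75% reduction
```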
Empirical Findings
| Bit-width | EM Loss | F1 Loss | Hallucination Rate | Toxicity Score |
|---|---|---|---|---|
| 16 | 0% | 0% | 7.5% | 20.7 |
| 4 (AWQ) | 3% | 2% | 8.9% | 30.1 |
| 3 (GPTQ) | 8% | 6% | 27.5% | 64.1 |
| 2 (ggml) | 15% | 12% | 34.7% | 46.2 |
- 4-bit quantization yields ~75% memory reduction, 40% throughput gain, ≤3% generative quality drop, and retains harmful output rates <10%.
- Sub-4-bit quantization increases hallucination and toxicity sharply.
- iOS Metal consistently exhibits higher throughput and energy efficiency than Android OpenCL (by ≈15–20%).
- Device profiling must be repeated after every codegen or driver update; harmful output must be explicitly filtered for ≤3-bit models.
Recommended practice: use 4-bit PTQ on devices with ≥6 GB RAM, fall back to ≤3-bit only when strictly necessary (humanities/QA tasks), and profile all deployments with the PalmBench pipeline (Li et al., 2024).
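The recommendation above can be sketched as a deployment policy. The decision rule itself is an illustration of the stated guidance, not logic shipped by PalmBench; the scheme names follow the MLC/llama.cpp conventions cited earlier:

```python
def choose_quantization(device_ram_gb: float, strictly_constrained: bool) -> str:
    """Illustrative policy: default to 4-bit PTQ on >=6 GB devices,
    drop below 4 bits only as a last resort (and filter harmful output)."""
    if device_ram_gb >= 6:
        return "q4f16_awq"   # default: 4-bit post-training quantization
    if strictly_constrained:
        return "q3_k"        # sub-4-bit fallback; requires output filtering
    return "q4f16"           # otherwise stay at 4 bits despite tight memory
```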
3. PALM-Bench: Personalized Audio-LLM Benchmark
PALM-Bench provides the first large-scale, task-structured benchmark for evaluating large audio-LLMs (LALMs) on personal context recognition, multi-speaker selective understanding, and reasoning anchored in user profiles (Wang et al., 7 Jan 2026). The motivation is the observed failure of generic LALMs to exhibit true personalized behavior, especially across multi-speaker and profile-dependent scenarios.
Task Formulation
Personalized audio-language modeling is formalized as structured sequence generation:
- Inputs: audio a, text query q, user profile p.
- Targets: the set S of enrolled target speakers present in the audio, with output y sampled from p(y | a, q, p).
- Three subtasks:
- Concept Activation (recognition, binary per speaker).
- Selective Captioning (conditional summarization/refusal).
- Personalized Reasoning (recommendation, profile integration).
Losses are multitask-weighted: L = λ1·L_activation + λ2·L_captioning + λ3·L_reasoning.
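A minimal sketch of the multitask weighting, written as a plain weighted sum over the three subtask losses; the weight values are assumptions here, not values from the paper:

```python
def multitask_loss(l_act: float, l_cap: float, l_reason: float,
                   weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum over the activation, captioning, and reasoning
    subtask losses; weights are illustrative defaults."""
    return sum(w * l for w, l in zip(weights, (l_act, l_cap, l_reason)))
```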
Dataset Curation and Statistics
Construction involves two pipelines:
- Single-speaker: NCSSD corpus (27 speakers; Chinese/English), with profiles and QA pairs machine-generated and then human-verified.
- Multi-speaker: Mixed clips (2–4 speakers), with adversarial negatives using speaker similarity, template generation for robust evaluation.
Dataset scale: 2.6M samples, 5,626 h total, 227k unique clips, language split 56.5% Chinese, 43.5% English; strict train/test disjoint on speakers.
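The strict speaker-disjoint split can be checked with a one-line invariant; a sketch assuming speakers are identified by string IDs:

```python
def speakers_disjoint(train_speakers, test_speakers) -> bool:
    """True iff no speaker identity appears in both splits, the
    invariant required for a leakage-free personalization benchmark."""
    return not set(train_speakers) & set(test_speakers)
```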
Task Suite and Evaluation Metrics
- Recognition: Precision, Recall, F1, and LLMScore (LLM-judge).
- Captioning: BLEU-4, BERTScore, LLMScore.
- Reasoning: Captioning metrics + human scoring.
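For the recognition task, precision/recall/F1 are computed over predicted versus reference speaker sets; a minimal sketch:

```python
def set_prf(pred: set, gold: set) -> tuple:
    """Precision, recall, and F1 over a predicted speaker set vs. the
    reference set, with empty-set edge cases scored as zero."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```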
Baseline Models and Adaptation Strategies
Baselines span Kimi-Audio, Qwen2-Audio (7B), Qwen3-Omni (30B), MiDashengLM, Step-Audio 2. Training-free prompting (base, human/acoustic description, CoT) is contrasted with supervised adaptation:
- Full-parameter fine-tuning (Full-FT).
- LoRA (on FFN, attention, audio tower, all linear).
- Prompt tuning (soft tokens).
Parameter selection: LoRA rank=8, LR≈1e-5, prompt tokens=16–32, epochs=5–15.
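The LoRA update at the listed rank amounts to a scaled low-rank product added to each adapted weight matrix; a dependency-free sketch of the standard formulation (the document does not give its exact implementation):

```python
def lora_delta(B, A, alpha: float):
    """Standard LoRA weight update: delta_W = (alpha / r) * B @ A,
    with B of shape (d, r) and A of shape (r, k); plain-Python matmul
    keeps the sketch dependency-free."""
    r, k = len(A), len(A[0])
    scale = alpha / r
    return [[scale * sum(B[i][t] * A[t][j] for t in range(r))
             for j in range(k)]
            for i in range(len(B))]
```

At rank 8 (as used above) only the small A and B matrices are trained, which is why LoRA on attention/FFN can adapt the model while leaving the frozen base weights, and hence general ASR/QA ability, largely intact.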
Experimental Results
- Training-free prompting underperforms as multi-speaker complexity rises; explicit speaker/acoustic cues, especially acoustic descriptions (AD), improve recognition but not always captioning.
- Full-FT and LoRA (Attn+FFN) excel in multi-speaker settings (BLEU-4 > 70 even with four speakers), whereas prompt tuning lags.
- Single-to-multi transfer collapses; explicit multi-speaker supervision is essential.
- LoRA variants on attention/FFN or all-linear best avoid catastrophic forgetting in general ASR/QA tasks.
Limitations and Future Work
- Profiles remain static and simulated.
- Only audio modality; no multimodal or dialogue integration.
- Current methods, including LoRA and prompting, are insufficient for granular personalized concept transfer.
Research directions include dynamic memory-augmented architectures, multimodal fusion, retrieval-enhanced generation, and end-to-end fusion of speaker traits and semantic reasoning (Wang et al., 7 Jan 2026).
4. Comparative Overview
| Benchmark | Target Domain | Core Focus | Notable Metrics/Tasks |
|---|---|---|---|
| PALMScloud/PALM-Bench | Cloud server hardware | Hardware (cache, security), IaaS | Throughput, latency, miss rate, IPC |
| PalmBench | Mobile quantized LLMs | Resource/quality/safety trade-offs | Memory, energy, EM/F1, hallucination |
| PALM-Bench (LALM) | Personalized audio-language | Contextualization & speaker tracking | F1/BLEU/LLMScore/reasoning |
Each PALM-Bench reflects a benchmark-driven approach to measuring underexplored stress points in system design (hardware, on-device ML, or context-aware LALMs), shaping best practices and research advances.
5. Significance and Influence
The PALM-Bench family, encompassing PALMScloud, PalmBench, and PALM-Bench for LALMs, documents a trajectory where benchmarking evolves from system/hardware realism (Wu et al., 2016), through compression/efficiency in edge AI (Li et al., 2024), to deep personalized modeling under heterogeneous, multi-agent settings (Wang et al., 7 Jan 2026). By establishing principled standards for representativeness, extensibility, and metric fidelity, these benchmarks structure rigorous experimentation—fortifying reproducibility and guiding model/hardware co-design. As the field progresses toward more personalized, distributed, and resource-constrained AI, such benchmarks are essential for transparent assessment and innovation scaffolding.