
PALM-Bench: Multi-Domain AI Benchmark

Updated 14 January 2026
  • PALM-Bench is a suite of benchmarks evaluating cloud server microarchitecture using cycle-accurate simulations to assess cache performance and security mechanisms.
  • It includes PalmBench, which benchmarks quantized LLM performance on mobile devices, highlighting trade-offs in memory efficiency, throughput, and generative quality.
  • The framework also targets personalized audio-language models by evaluating multi-speaker recognition, selective captioning, and profile-based reasoning for contextualized outputs.

PALM-Bench denotes a series of distinct yet influential benchmarks and frameworks emerging between 2016 and 2026, each targeting a different aspect of modern machine learning system evaluation and personalization. This entry covers the three principal and non-overlapping usages of the term: (1) PALMScloud/PALM-Bench for cloud server and microarchitecture benchmarking (Wu et al., 2016), (2) PalmBench for benchmarking compressed LLMs on mobile hardware (Li et al., 2024), and (3) PALM-Bench for evaluating Personalized Audio-LLMs (Wang et al., 7 Jan 2026). Each instance reflects evolving demands in system characterization, model compression, and personalized AI.

1. PALMScloud / PALM-Bench: Cloud Server Microarchitecture Benchmark

PALMScloud, also termed PALM-Bench, is a suite of purpose-built workloads for evaluating new hardware features—particularly cache architectures and security mechanisms—on cycle-accurate simulators (notably gem5) and on real or dual-node hardware (Wu et al., 2016). The core design tenets include representativeness (real-world cloud workloads), simulatability (parameterizable, open-source, rapid-boot server/client binaries), and extensibility.

Benchmark Architecture and Workload Suite

PALMScloud models a dual-node networked environment with each workload instantiated in a Linux VM, driven by a complementary client process, and communicating over full TCP/IP Ethernet. A Python configuration script provisions PCI networking, configures system topology, and ensures sim-hardware fidelity. The workload suite comprises:

  • Web Server (Apache httpd + ab), stressing CPU and I/O.
  • Database Server (MySQL OLTP + SysBench), exerting pressure on memory and locks through random R/W.
  • Mail Server (Postfix SMTP + Postal), mixing network and small-payload I/O.
  • File Server (Samba smbd + Dbench), exercising memory-mapped disk I/O.
  • Streaming Server (ffserver + openRTSP), for sustained streaming via network and memory.
  • Application Server (Tomcat JSP/Servlets + ab), targeting JVM and dynamic content.
  • Compute Server (LIBSVM + UCI Adult dataset), stress-testing floating-point/branching.
  • Idle Server, as a baseline for system noise.

Parameterization enables scaling input/concurrency to target specific hierarchy bottlenecks.

Metrics and Integration

PALMScloud collects granular metrics (all extractable from gem5 or client logs):

  • Throughput: $\mathrm{Throughput} = \frac{N_{\rm req}}{T_{\rm run}}$
  • Average Latency: $\overline{L} = \frac{1}{N_{\rm req}} \sum_{i=1}^{N_{\rm req}} L_i$
  • Cache Miss Rate: $\mathrm{MissRate} = \frac{N_{\rm misses}}{N_{\rm accesses}}$
  • MPKI (misses per kilo-instruction), IPC, and design speedup.
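The metrics above follow directly from raw counters in gem5 statistics or client logs. A minimal sketch (the function and variable names are illustrative, not part of the PALMScloud tooling):

```python
def throughput(n_req, t_run):
    """Completed requests per unit wall-clock time: N_req / T_run."""
    return n_req / t_run

def avg_latency(latencies):
    """Mean per-request latency: (1/N_req) * sum(L_i)."""
    return sum(latencies) / len(latencies)

def miss_rate(n_misses, n_accesses):
    """Cache miss rate: N_misses / N_accesses."""
    return n_misses / n_accesses

def mpki(n_misses, n_instructions):
    """Cache misses per kilo-instruction."""
    return 1000.0 * n_misses / n_instructions

def ipc(n_instructions, n_cycles):
    """Instructions committed per cycle."""
    return n_instructions / n_cycles

def speedup(t_baseline, t_design):
    """Runtime ratio of the baseline to the evaluated design."""
    return t_baseline / t_design
```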

Security-Performance Case Study: Newcache

A notable application is the Newcache secure cache study, which introduces randomized index mappings (parameterized by k index bits) to resist cache side channels. Experiments show that the L1 D-cache miss rate and IPC vary negligibly (<1–2%) across k=0–6, and that throughput and client-observed latency remain statistically unchanged (speedup ≈ 0.99–1.01), confirming negligible security overhead in cloud settings.

Best Practices for Extension

  • Maintain clean VM/server images.
  • Use consistent network and service orchestration.
  • Match client/server addressing and network link parameters in simulations.
  • Validate new workloads via full-stack functional testing prior to instrumentation.
  • Containerize for real hardware to ensure workload isolation.
  • Always include an idle baseline to measure OS-induced noise, which can exceed 5–10% on lightweight benchmarks.

PALMScloud thus provides a representative, extensible, and highly simulatable benchmarking platform for the rapid co-evaluation of hardware innovations and cloud service stacks (Wu et al., 2016).

2. PalmBench: Quantized LLM Benchmarking on Mobile and Edge Hardware

PalmBench is an automated, device-centric benchmark framework for resource- and quality-centric evaluation of compressed LLM inference on mobile devices and edge environments (Li et al., 2024). Its central focus is the real-world tradeoff between generative accuracy, execution efficiency, and harmful output, comparing multiple quantization schemes and hardware configurations.

Core Evaluation Metrics

  • Memory Footprint ($M$, MB): $M = \frac{\text{Total model bytes}}{10^6}$
  • GPU/Accelerator Execution Time ($t$): $t = t_{\mathrm{prefill}} + t_{\mathrm{decode}}$
  • Throughput ($T$, tokens/s): $T = \frac{N_{\rm tokens}}{t}$
  • Energy Consumption ($E$, Joules or mAh): $E = \int_0^t P(\tau)\,d\tau$
  • Generative Quality (Q): EM/F1 (QA), BLEU, or task-specific metrics.
  • Harmful Output Rate (H): Fraction of hallucinated/toxic responses as scored by Perspective API/TET and LLM-judge.
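The resource metrics above reduce to a few arithmetic operations; the energy integral can be approximated with the trapezoidal rule over sampled power readings. A sketch under the assumption of a fixed sampling interval (the function names are illustrative, not PalmBench's API):

```python
def memory_footprint_mb(total_model_bytes):
    """M = total model bytes / 10^6, in MB."""
    return total_model_bytes / 1e6

def throughput_tokens_per_s(n_tokens, t_prefill, t_decode):
    """T = N_tokens / (t_prefill + t_decode)."""
    return n_tokens / (t_prefill + t_decode)

def energy_joules(power_samples_w, dt_s):
    """Approximate E = integral of P(tau) d tau via the trapezoidal
    rule over power readings (watts) taken every dt_s seconds."""
    return sum(0.5 * (p0 + p1) * dt_s
               for p0, p1 in zip(power_samples_w, power_samples_w[1:]))
```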

Automated Benchmarking Pipeline

PalmBench orchestrates model deployment via MLC-LLM or llama.cpp to a standardized suite of mobile (Pixel, iPhone, Orange Pi) and edge (Jetson Nano) devices, instrumenting the measurement of latency, throughput, power, and memory using platform-native profilers and external hardware tools. Hallucination and toxicity rates are computed using external LLMs (GPT-4o, Claude-3.5) and test sets like HaluEval and TruthfulQA.

Quantization Schemes and Implementation

  • MLC: Supports q0f16 (fp16), q3f16, q4f16, q4f16_awq using TVM for kernel codegen.
  • llama.cpp: Implements K-Quant (2–6 bits), GPTQ-3/4 bits in GGUF format; vectorizes low-bit weight unpacking.
  • Quantization directly scales model size ($\text{size}_b = \frac{b}{16}\,\text{size}_{16\text{bit}}$) and reduces memory bandwidth, with scheme-specific compute overhead for fine-grained bit-packing.
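The size scaling above can be made concrete. The metadata overhead parameter below is a hypothetical knob standing in for the per-group scales and zero-points that real packed formats store alongside the weights; it is not a PalmBench quantity:

```python
def quantized_size_mb(size_fp16_mb, bits, metadata_overhead=0.0):
    """size_b = (b / 16) * size_16bit, optionally inflated by a
    fractional overhead for quantization metadata (scales, zero-points)."""
    return size_fp16_mb * (bits / 16.0) * (1.0 + metadata_overhead)
```

For example, a 14,000 MB fp16 model at 4 bits comes to 3,500 MB before metadata, consistent with the ~75% memory reduction reported below.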

Empirical Findings

Bit-width    EM Loss    F1 Loss    Hallucination    Toxicity
16           0%         0%         7.5%             20.7
4 (AWQ)      3%         2%         8.9%             30.1
3 (GPTQ)     8%         6%         27.5%            64.1
2 (ggml)     15%        12%        34.7%            46.2
  • 4-bit quantization yields ~75% memory reduction, 40% throughput gain, ≤3% generative quality drop, and retains harmful output rates <10%.
  • Sub-4-bit quantization increases hallucination and toxicity sharply.
  • iOS Metal consistently exhibits higher throughput and energy efficiency than Android OpenCL (by ≈15–20%).
  • Device profiles must be re-measured after any codegen or driver update; harmful output must be explicitly filtered for ≤3-bit models.

Recommended practice: use 4-bit PTQ on devices with ≥6 GB RAM, fall back to ≤3-bit only when strictly necessary (humanities/QA tasks), and profile all deployments using the PalmBench-provided pipeline (Li et al., 2024).

3. PALM-Bench: Personalized Audio-LLM Benchmark

PALM-Bench provides the first large-scale, task-structured benchmark for evaluating large audio-LLMs (LALMs) on personal context recognition, multi-speaker selective understanding, and reasoning anchored in user profiles (Wang et al., 7 Jan 2026). The motivation is the observed failure of generic LALMs to exhibit true personalized behavior, especially across multi-speaker and profile-dependent scenarios.

Task Formulation

Personalized audio-language modeling is formalized as structured sequence generation:

  • Inputs: Audio ($\mathcal{A}$), Text Query ($\mathcal{Q}$), Profile ($\mathcal{P}$).
  • Targets: speaker sets $(\mathcal{S}_{audio}, \mathcal{S}_{target})$; output $Y$ sampled from $P(Y \mid \mathcal{A}, \mathcal{Q}, \mathcal{P})$.
  • Three subtasks:

    1. Concept Activation (recognition, binary per speaker).
    2. Selective Captioning (conditional summarization/refusal).
    3. Personalized Reasoning (recommendation, profile integration).

Losses are multitask-weighted: $\mathcal{L} = \lambda_{rec} \mathcal{L}_{rec} + \lambda_{cap} \mathcal{L}_{cap} + \lambda_{pr} \mathcal{L}_{pr}$.
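The weighted combination is a plain linear sum; a minimal sketch, with the caveat that the weight values are tuning knobs and the paper's settings are not assumed here:

```python
def multitask_loss(l_rec, l_cap, l_pr,
                   lam_rec=1.0, lam_cap=1.0, lam_pr=1.0):
    """L = lam_rec * L_rec + lam_cap * L_cap + lam_pr * L_pr,
    combining recognition, captioning, and personalized-reasoning terms."""
    return lam_rec * l_rec + lam_cap * l_cap + lam_pr * l_pr
```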

Dataset Curation and Statistics

Construction involves two pipelines:

  • Single-speaker: NCSSD corpus (27 speakers; Chinese/English), with profiles and QA pairs machine-generated and then human-verified.

  • Multi-speaker: mixed clips (2–4 speakers), with adversarial negatives selected by speaker similarity and template-based generation for robust evaluation.

Dataset scale: 2.6M samples, 5,626 h of audio, 227k unique clips; language split 56.5% Chinese, 43.5% English; train/test splits are strictly speaker-disjoint.

Task Suite and Evaluation Metrics

  • Recognition: Precision, Recall, F1, and LLMScore (LLM-judge).
  • Captioning: BLEU-4, BERTScore, LLMScore.
  • Reasoning: Captioning metrics + human scoring.
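For the recognition subtask, set-based precision/recall/F1 over predicted versus reference speaker sets is one plausible reading of the metric (the benchmark may instead score per-speaker binary decisions); a sketch:

```python
def recognition_prf1(predicted_speakers, reference_speakers):
    """Set-based precision, recall, and F1 over speaker identities."""
    pred, ref = set(predicted_speakers), set(reference_speakers)
    tp = len(pred & ref)  # speakers both predicted and present
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1
```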

Baseline Models and Adaptation Strategies

Baselines span Kimi-Audio, Qwen2-Audio (7B), Qwen3-Omni (30B), MiDashengLM, Step-Audio 2. Training-free prompting (base, human/acoustic description, CoT) is contrasted with supervised adaptation:

  • Full-parameter fine-tuning (Full-FT).
  • LoRA (on FFN, attention, audio tower, all linear).
  • Prompt tuning (soft tokens).

Parameter selection: LoRA rank=8, LR≈1e-5, prompt tokens=16–32, epochs=5–15.
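The LoRA adaptation amounts to adding a scaled low-rank term to each frozen weight. A minimal numpy sketch using the rank-8 setting above (the alpha scaling value is an illustrative assumption, not taken from the paper):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0, r=8):
    """y = x @ W + (alpha / r) * x @ A @ B, where W (d_in x d_out) is
    frozen and only A (d_in x r) and B (r x d_out) are trained."""
    return x @ W + (alpha / r) * (x @ A @ B)
```

With B zero-initialized, as is standard for LoRA, the adapted model starts out exactly matching the base model and diverges only as A and B are trained.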

Experimental Results

  • Training-free prompting underperforms as multi-speaker complexity rises; explicit speaker/acoustic cues, especially acoustic descriptions (AD), improve recognition but not always captioning.
  • Full-FT and LoRA (Attn+FFN) excel in multi-speaker settings (BLEU > 70 with four speakers), whereas prompt tuning lags.
  • Single-to-multi transfer collapses; explicit multi-speaker supervision is essential.
  • LoRA variants applied to attention/FFN or all linear layers best preserve general ASR/QA ability, avoiding catastrophic forgetting.

Limitations and Future Work

  • Profiles remain static and simulated.
  • Only audio modality; no multimodal or dialogue integration.
  • Current methods, including LoRA and prompting, are insufficient for granular personalized concept transfer.

Research directions include dynamic memory-augmented architectures, multimodal fusion, retrieval-enhanced generation, and end-to-end fusion of speaker traits and semantic reasoning (Wang et al., 7 Jan 2026).

4. Comparative Overview

Benchmark               Target Domain                 Core Focus                             Notable Metrics/Tasks
PALMScloud/PALM-Bench   Cloud server hardware         Hardware (cache, security), IaaS       Throughput, latency, miss rate, IPC
PalmBench               Mobile quantized LLMs         Resource/safety/quality trade-offs     Memory, energy, EM/F1, hallucination
PALM-Bench (LALM)       Personalized audio-language   Contextualization & speaker tracking   F1, BLEU, LLMScore, reasoning

Each PALM-Bench reflects a benchmark-driven approach to measuring underexplored stress points in system design (hardware, on-device ML, or context-aware LALMs), shaping best practices and research advances.

5. Significance and Influence

The PALM-Bench family, encompassing PALMScloud, PalmBench, and PALM-Bench for LALMs, documents a trajectory where benchmarking evolves from system/hardware realism (Wu et al., 2016), through compression/efficiency in edge AI (Li et al., 2024), to deep personalized modeling under heterogeneous, multi-agent settings (Wang et al., 7 Jan 2026). By establishing principled standards for representativeness, extensibility, and metric fidelity, these benchmarks structure rigorous experimentation—fortifying reproducibility and guiding model/hardware co-design. As the field progresses toward more personalized, distributed, and resource-constrained AI, such benchmarks are essential for transparent assessment and innovation scaffolding.
