
Intelligence per Watt: Measuring Intelligence Efficiency of Local AI (2511.07885v1)

Published 11 Nov 2025 in cs.DC, cs.AI, cs.CL, and cs.LG

Abstract: LLM queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (<=20B active parameters) now achieve competitive performance to frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering this requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (i.e., laptops). We propose intelligence per watt (IPW), task accuracy divided by unit of power, as a metric for assessing capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals 3 findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries with accuracy varying by domain. Second, from 2023-2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our IPW profiling harness for systematic intelligence-per-watt benchmarking.

Summary

  • The paper introduces intelligence per watt (IPW) as a unified metric to assess the energy-accuracy trade-offs of local AI models.
  • The paper employs hardware-software co-design, automated telemetry, and extensive real-world benchmarks to compare local model-accelerator pairs against cloud deployments.
  • The paper demonstrates significant improvements, including 5.3x efficiency gains and notable resource savings through intelligent routing and quantization.

Intelligence per Watt: Metricization and Analysis of Local AI Efficiency

The paper "Intelligence per Watt: Measuring Intelligence Efficiency of Local AI" (2511.07885) provides a rigorous, system-level framework for quantifying the efficiency of local LLM inference using the unified metric of intelligence per watt (IPW). It delivers comprehensive empirical answers to the viability of small local LLMs (≤20B parameters) running on local accelerators as a scalable complement or alternative to frontier cloud inference, especially under constrained power and cost scenarios. The work is grounded in hardware-software co-design methodology, automatic telemetry instrumentation, and naturalistic benchmarking against 1M real-world LLM queries.

Motivation and Metric Formalization

The exponential growth of LLM inference workloads imposes acute stress on centralized cloud infrastructure, with datacenter energy, hardware, and capital expenditures forecast to reach unprecedented levels by 2030. Inference workloads, not just training, increasingly dominate resource allocation. The practical shift toward local AI is enabled by two convergent trends: open-access small LMs now approach frontier-model performance on core benchmarks, and consumer/edge hardware (Apple M4 Max, AMD Ryzen AI, etc.) delivers sufficient memory and compute for interactive local inference at moderate power draw.

IPW is formally defined as

$$\mathrm{IPW}(m, h) = \frac{\mathbb{E}_{q \sim Q}[\mathrm{acc}(m, q)]}{\mathbb{E}_{q \sim Q}[P(m, h, q)]}$$

where m is the model, h the hardware accelerator, q a query drawn from the workload distribution Q, and P the power consumption measured during inference. Companion metrics, accuracy per joule (APJ) and perplexity per joule (PPJ), extend the accounting to per-query energy and perplexity-based efficiency.
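
As a concrete illustration, the following is a minimal sketch of how IPW and APJ could be computed from per-query measurements; the class and function names are illustrative and do not reflect the API of the paper's released harness.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class QueryRecord:
    correct: bool        # accuracy label for this query (e.g., from an LLM judge)
    avg_power_w: float   # mean power draw while answering, in watts
    latency_s: float     # end-to-end latency, in seconds

def intelligence_per_watt(records: list[QueryRecord]) -> float:
    """IPW = E[acc] / E[P]: expected accuracy over expected power draw."""
    expected_acc = mean(1.0 if r.correct else 0.0 for r in records)
    expected_power_w = mean(r.avg_power_w for r in records)
    return expected_acc / expected_power_w

def accuracy_per_joule(records: list[QueryRecord]) -> float:
    """APJ normalizes accuracy by per-query energy (power x time) instead of power."""
    expected_acc = mean(1.0 if r.correct else 0.0 for r in records)
    expected_energy_j = mean(r.avg_power_w * r.latency_s for r in records)
    return expected_acc / expected_energy_j

# Example with made-up measurements for one model-accelerator pair:
records = [QueryRecord(True, 38.0, 2.1), QueryRecord(False, 42.5, 3.4), QueryRecord(True, 36.2, 1.8)]
print(f"IPW = {intelligence_per_watt(records):.4f} per watt")
print(f"APJ = {accuracy_per_joule(records):.5f} per joule")
```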

Large-Scale Benchmarking and Profiling Harness

Experimental Setup

  • Models: 20+ open local LMs, including QWEN3, GPT-OSS, GEMMA3, and IBM GRANITE, alongside SOTA closed models (GPT-5, GEMINI 2.5 PRO, CLAUDE SONNET 4.5) used as frontier reference baselines.
  • Accelerators: 8+ hardware backends, local (Apple, AMD consumer, NVIDIA RTX) and cloud/enterprise (NVIDIA B200, SambaNova SN40L), spanning 40–768 GB memory and 145–1000 W TDP.
  • Tasks: 1M queries (WILDCHAT, NATURALREASONING, MMLU PRO, SUPERGPQA), covering chat, reasoning, expert knowledge, and economic breadth.
  • Instrumentation: Cross-platform harness sampling power, energy, latency, memory, and throughput at high temporal resolution (50 ms); NVML, ROCm SMI, and powermetrics provide vendor-native telemetry (a simplified sampling-loop sketch follows this list).
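
The sketch below illustrates such a polling loop in simplified form. It reads GPU power via the standard `nvidia-smi` query interface; the command-line approach, class structure, and the commented usage are assumptions for illustration rather than the released harness, which relies on NVML, ROCm SMI, or powermetrics directly depending on platform.

```python
import subprocess
import threading
import time

class PowerSampler:
    """Polls accelerator power draw at a fixed interval while a query runs."""

    def __init__(self, interval_s: float = 0.05):
        self.interval_s = interval_s          # ~50 ms resolution, as in the paper
        self.samples_w: list[float] = []      # collected power samples in watts
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _read_power_w(self) -> float:
        # NVIDIA GPUs expose instantaneous board power through nvidia-smi.
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        return float(out.stdout.strip().splitlines()[0])

    def _run(self) -> None:
        while not self._stop.is_set():
            self.samples_w.append(self._read_power_w())
            time.sleep(self.interval_s)

    def __enter__(self) -> "PowerSampler":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> None:
        self._stop.set()
        self._thread.join()

# Hypothetical usage around a single inference call:
#   start = time.time()
#   with PowerSampler() as sampler:
#       answer = run_query(model, prompt)     # placeholder inference call
#   avg_power_w = sum(sampler.samples_w) / max(len(sampler.samples_w), 1)
#   energy_j = avg_power_w * (time.time() - start)
```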

Labeling and Evaluation

  • Query domain: Annotated using GPT-4O-MINI against the Anthropic Economic Index taxonomy (22 labor categories).
  • Accuracy: LLM-as-a-judge rubric (multi-criterion for subjective/chat, correctness-only for technical).
  • Model outputs: Standardized decoding (T=0.6, top-p=0.95, top-k=20), up to 32768 tokens.
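
For reference, these decoding settings map directly onto common open-source inference APIs; the sketch below uses vLLM-style sampling parameters (the model identifier is a placeholder, and the exact API surface may differ across inference stacks).

```python
# Sketch of the standardized decoding configuration; assumes a vLLM-style API.
from vllm import LLM, SamplingParams

sampling = SamplingParams(
    temperature=0.6,    # T = 0.6, per the evaluation setup
    top_p=0.95,
    top_k=20,
    max_tokens=32768,   # generation cap used in the study
)

llm = LLM(model="openai/gpt-oss-20b")   # placeholder local model identifier
outputs = llm.generate(["Explain FP8 quantization trade-offs in two sentences."], sampling)
print(outputs[0].outputs[0].text)
```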

Key Results and Empirical Findings

Task Coverage and Model Viability

  • Coverage: Local LMs can accurately respond to 88.7% of single-turn chat and reasoning queries. Domain variance is strong (≥90% for creative-media queries, ≈68% for technical domains like Engineering).
  • Ensembling and routing: Best-of-local ensemble routing lifts coverage further by capitalizing on model complementarity across domains; the best single local LM (GPT-OSS-120B) achieves ≈71.4% coverage on its own.
  • Longitudinal trend: Local LMs' ability to match frontier models on queries rose from 23.2% (2023) to 48.7% (2024) and 71.3% (2025), a 3.1x increase over two years.

Intelligence Efficiency Over Time

  • IPW Gain: Compound efficiency gains from model and hardware improvements total 5.3x in accuracy per watt over 2023–2025, roughly the product of the two factors below (3.1 × 1.7 ≈ 5.3):
    • Model improvement (MIXTRAL-8x7B→GPT-OSS-120B): 3.1x
    • Accelerator improvement (NVIDIA H100→Blackwell): 1.7x
  • Hardware gap: Cloud accelerators maintain a substantial efficiency advantage in both instantaneous and end-to-end terms; running identical models, the Apple M4 Max shows 1.4x lower IPW and 1.6–2.3x lower intelligence per joule than an NVIDIA B200.
  • Quantization: FP8/FP4 quantization yields minor accuracy loss (2–3% per precision step) but over 3x energy savings, supporting low-bit deployment for most chat and reasoning applications.

Resource Savings via Hybrid Local-Cloud Routing

  • Oracle routing: Perfect query-to-model assignment achieves 80.4% energy, 77.3% compute, and 73.8% cost savings over cloud-only deployment.
  • Practical routers: Even at 80% routing accuracy, 64.3% energy, 61.8% compute, and 59.0% cost reductions are maintained, with answer quality preserved by falling back to the cloud on misassignments (a sketch of this savings arithmetic follows the list).
  • Domains: Local handling is dominant for creative/social tasks (>90% coverage), but challenges remain in technical domains (40–68% solvability for Architecture, Science).
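
To make the savings arithmetic concrete, the sketch below estimates hybrid-deployment energy as a fraction of a cloud-only baseline as a function of router accuracy. It assumes, conservatively, that misrouted locally answerable queries pay both the local attempt and the cloud fallback; the per-query energy values are placeholders, not numbers from the paper.

```python
def hybrid_energy_fraction(
    local_coverage: float,   # share of queries a local LM can answer (e.g., 0.887)
    router_accuracy: float,  # probability the router assigns a query correctly
    local_energy_j: float,   # assumed per-query energy on the local device
    cloud_energy_j: float,   # assumed per-query energy in the cloud
) -> float:
    """Hybrid energy relative to sending every query to the cloud."""
    # Correctly routed, locally answerable queries cost only local energy.
    local_hit = local_coverage * router_accuracy * local_energy_j
    # Misrouted locally answerable queries pay the local attempt plus cloud fallback.
    local_miss = local_coverage * (1 - router_accuracy) * (local_energy_j + cloud_energy_j)
    # Queries beyond local capability always go to the cloud.
    cloud_only = (1 - local_coverage) * cloud_energy_j
    return (local_hit + local_miss + cloud_only) / cloud_energy_j

# Illustrative numbers only: high local coverage, a reasonably accurate router, and
# much cheaper local inference yield roughly 60% energy savings versus cloud-only.
print(hybrid_energy_fraction(0.887, 0.80, local_energy_j=50.0, cloud_energy_j=400.0))
```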

Comparison: Open-Source vs. Closed-Source Frontier Models

  • Open-source frontier-scale models (QWEN3-235B-A22B) approach closed-source performance: 71.8% vs. 77.9% average accuracy. On expert benchmarks the gap narrows to 3.4–5.1%, but it widens to >12% on naturalistic reasoning tasks. Under the ≤20B active-parameter constraint typical of local deployment, the accuracy penalty relative to SOTA closed models grows to ≈11–13%.

Economic Implications

  • GDP-weighted accuracy shows that gains in local LM capability translate directly into economic relevance: GPT-OSS-120B covers queries spanning 69.6% of US GDP for chat tasks, while QWEN3-235B reaches 23.3–31.9% GDP coverage for advanced reasoning and science. Improving model capability on technical queries would unlock substantial further automation opportunities.

Implementation Tradeoffs and System Considerations

  • Model selection: Deploying multiple complementary local LMs and intelligent routing is markedly superior to static assignment and tolerates moderate routing errors.
  • Accelerator selection: Unified memory consumer hardware enables competitive performance for most tasks at moderate energy cost, but high-throughput, highly parallel cloud accelerators remain essential for latency- and energy-critical workloads. The efficiency gap justifies ongoing hardware optimization (HBM3e, edge TPUs, tensor cores).
  • Quantization: FP8/FP4 is optimal for energy-constrained deployment when minor accuracy loss is acceptable.
  • Telemetry/monitoring: High-resolution, vendor-native, multi-hardware profiling is necessary for reproducible energy benchmarking.
  • Routing: Query-level, domain-aware routers trained on per-query preference outcomes provide the best trade-off in hybrid systems (a minimal routing-policy sketch follows this list).
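
The snippet below sketches one simple form such a domain-aware policy could take; the domain labels, coverage estimates, and threshold are illustrative assumptions rather than values or logic taken from the paper.

```python
# Illustrative domain-aware routing policy: prefer a local model in domains where
# estimated local coverage is high, otherwise escalate to the cloud.
LOCAL_COVERAGE_BY_DOMAIN = {   # assumed estimates, in the spirit of the paper's findings
    "creative_media": 0.92,
    "office_admin": 0.88,
    "engineering": 0.62,
    "science": 0.55,
}

def route(domain: str, coverage_threshold: float = 0.80) -> str:
    """Return 'local' when the domain's estimated coverage clears the threshold."""
    coverage = LOCAL_COVERAGE_BY_DOMAIN.get(domain, 0.0)
    return "local" if coverage >= coverage_threshold else "cloud"

print(route("creative_media"))  # -> local
print(route("engineering"))     # -> cloud
```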

Theoretical and Practical Implications

The IPW metric offers a unified efficiency criterion supporting direct cross-comparison of model-hardware configurations and longitudinal progress mapping. These results empirically substantiate the continuous improvement trajectory of local inference systems, indicating that local LMs and consumer accelerators can address a substantial, and rapidly expanding, subset of real-world LLM queries. This points to accelerated adoption of distributed, hybrid inference architectures that mitigate the long-term resource intensity of centralized AI serving.

From a systems research point of view, further optimization of local inference involves:

  • Tight co-design of model architectures and accelerators for higher FLOPs/W
  • Advanced quantization and low-rank adaptation for memory/power reductions
  • Improved real-time query routing and model ensembling to maximize query coverage
  • Standardized, open telemetry aggregation for continual benchmarking as hardware and models evolve
  • Exploration of collaborative token-wise protocols to further blur the boundary between local and cloud compute

Conclusions

"Intelligence per Watt: Measuring Intelligence Efficiency of Local AI" establishes intelligence per watt as the critical metric for tracking the transition from cloud-centric toward hybrid and eventually predominantly local LLM inference. Through systematic benchmarking, it demonstrates that local small LMs, when intelligently routed and deployed on contemporary consumer hardware, already deliver highly competitive accuracy and pronounced resource savings for the majority of single-turn chat and reasoning queries—albeit with remaining limitations in technically demanding domains. The efficiency gap to cloud accelerators justifies continued hardware specialization for local AI workloads. As both accelerator hardware and LM architectures evolve, routing strategies and quantization methods offer substantial leverage in closing the gap, with direct implications for AI energy economics and large-scale system deployment. The profiling harness released enables ongoing progress tracking and reproducibility within the field. Future work should further investigate dynamic, context-adaptive routing, model augmentation protocols, and optimization at the memory-controller and logic-gate level for maximal intelligence-per-watt efficiency.


Explain it Like I'm 14

What is this paper about?

This paper asks a simple question: can small AI models running on your own device (like a laptop or phone) handle lots of everyday questions without needing huge cloud servers? To answer that, the authors introduce a new yardstick called “intelligence per watt” (IPW). It’s like miles-per-gallon for AI: how much useful, correct work an AI does for each unit of power it uses.

Their goal is to see if local AI (small models on local chips) is both smart enough and efficient enough to take on a big chunk of today’s AI requests, which are usually sent to giant models in data centers.

The big questions the researchers asked

The paper focuses on three easy-to-understand questions:

  • How many real questions today can be correctly answered by small, local AI models running on local chips?
  • Is “intelligence per watt” getting better over time, and how much is that due to smarter models vs. better hardware?
  • If we split work between local devices and cloud servers smartly, how much energy, compute, and cost can we save?

How did they study it?

The authors ran a huge test using more than 1 million real questions and problems. They tried over 20 modern small AI models on 8 kinds of computer chips (some local, like Apple’s M4 Max, and some cloud, like NVIDIA’s B200). For each question, they measured:

  • Accuracy: did the model give the right answer or at least match a top “frontier” model?
  • Power and energy: how much electricity did it use (watts and joules)?
  • Speed: how long did it take to respond (latency)?
  • Other stats: like memory and compute used.

Here are the key terms in everyday language:

  • LLM: a computer program that reads and writes text and answers questions.
  • Local model: a smaller LLM that can run on your device (usually up to about 20 billion “active parameters,” which you can think of as how many “knobs” the model has to make decisions).
  • Cloud model: a very big LLM running on powerful data-center servers.
  • Accelerator: a special chip (like a GPU or neural engine) that makes AI run fast.
  • Watt: a measure of power (how fast you use energy), like how bright a bulb is.
  • Joule: a measure of total energy used over time (power × time).
  • Intelligence per watt (IPW): accuracy divided by average power used. Higher IPW means you get more correct answers for the same electricity (a worked example follows this list).
  • Routing: deciding which model should answer each question—send easy questions to small local models, and only send hard ones to big cloud models.
  • Single-turn query: one question and one answer (not an ongoing multi-step conversation).
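
A quick worked example, using made-up numbers rather than figures from the paper: if a model answers 80 out of 100 questions correctly while drawing 40 watts on average, its IPW is 0.80 / 40 W = 0.02 correct answers per watt. A model that is just as accurate but averages only 20 watts doubles that to 0.04, delivering the same intelligence for half the electricity.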

What did they find?

In short, local AI is already capable and keeps getting more efficient.

Here are the main results:

  • Local models can answer most questions: by October 2025, small local models could correctly handle 88.7% of single-turn chat and reasoning queries. Performance depends on topic: creative tasks exceed 90%, while highly technical areas (like engineering) are lower (around 68%).
  • Big progress over time: intelligence per watt improved about 5.3× from 2023 to 2025. Over the same period, the share of queries where the best local model matched big cloud models jumped from 23.2% to 71.3%.
  • Both models and hardware matter: the efficiency gains come from smarter model designs (about 3.1× better) and improved accelerators (about 1.7× better).
  • Local chips vs. cloud chips: today’s cloud accelerators are still more power- and energy-efficient than local ones when running the same model (often 1.4× to 2.3× better for “per watt” or “per joule” metrics, and even more in some cases). That means there’s room to improve local hardware.
  • Smart splitting saves a lot: if you route questions so that local models handle what they can, and only send tough ones to the cloud, you can cut energy, compute, and cost dramatically:
    • With perfect routing, energy drops ~80%, compute ~77%, cost ~74%.
    • Even with a realistic router that’s right 80% of the time, you still save ~64% energy, ~62% compute, and ~59% cost—without hurting answer quality (because misrouted questions fall back to a big cloud model).

Why this matters: data centers are under pressure from rapidly growing AI demand. If local devices take on a big chunk of work efficiently, we can reduce strain on power grids, lower costs, and still get fast, good answers.

Why does this matter?

This paper suggests a practical shift in how we use AI:

  • Less dependence on giant data centers: local AI can handle a large share of everyday questions, especially creative and conversational ones.
  • Greener and cheaper AI: better “intelligence per watt” means more useful answers for less energy and cost, which helps companies and the environment.
  • Better user experience: local models can run with interactive speed on modern consumer chips, improving privacy and reducing delays.
  • Clear metric for progress: IPW gives a simple, meaningful way to track how much smarter and more efficient AI becomes over time, across different models and chips.

Bottom line: local AI is becoming both capable and efficient enough to take over many tasks. By using “intelligence per watt” to measure progress and by smartly routing queries between local and cloud, we can make AI faster, cheaper, and more sustainable—without sacrificing quality.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains uncertain or unexplored in the paper that future work could address.

  • Metric definition clarity: The formal definition of accuracy per watt (APW) appears inverted in the text (Eq[P] * Eq[acc] vs Eq[acc]/Eq[P]); the aggregation scheme across tasks (weights by dataset size vs equal weighting) and across queries (macro vs micro averaging) is not specified, hindering reproducibility and cross-paper comparability.
  • Normalization for prompt/response length: IPW/APJ results are not normalized by input/output token lengths or broken down per-token, making it hard to disentangle model capability from length-driven energy and latency effects.
  • Phase-level energy breakdown: The paper does not separate prefill vs decode energy/latency, leaving open how different architectures, sequence lengths, and cache behaviors affect energy per token and IPW.
  • Variance and statistical uncertainty: No confidence intervals, bootstrapping, or repeat-run variance are reported; the stability of IPW estimates across runs, seeds, and hardware samples is unknown.
  • LLM-as-a-judge dependence: Heavy reliance on LLM judges without human evaluation leaves open questions about judge bias, calibration, and robustness across judges; inter-judge agreement and sensitivity to prompt templates are not reported.
  • Reference-answer bias in WildChat: Using Qwen3-235B outputs as “ground truth” introduces model-family bias and could inflate agreement-based accuracy; robustness to alternative references (human gold, other frontier models) is not explored.
  • Task and modality coverage: Only single-turn, text-only chat and reasoning are evaluated; multi-turn dialogue, tool use/agents, code execution, RAG, long-context tasks, and multimodal inputs remain unassessed for local IPW.
  • Language and domain generalization: Non-English queries, low-resource languages, and domain-specific corpora (e.g., law, medicine, code-heavy workloads) are not covered; the extent to which findings transfer across languages and domains is unknown.
  • Failure-mode analysis: The paper identifies weaker coverage in technical fields but lacks qualitative/quantitative error analyses to pinpoint failure modes or guide targeted model/route improvements.
  • Data contamination controls: There is no decontamination analysis to assess potential training-test leakage (especially for public benchmarks and real-world prompts), which could inflate apparent “local capability.”
  • Category labeling noise: Economic category annotations are produced by GPT-4o-mini; their accuracy and impact on domain-level conclusions are unvalidated.
  • Batch size and serving realism: All results use batch size 1; IPW under realistic serving (batching, multi-tenant contention, speculative decoding, paged attention, KV cache sharing) is not measured, potentially misrepresenting cloud vs local advantages.
  • Thermal throttling and sustained load: The paper does not analyze thermal behavior, throttling, or performance drift during prolonged local usage (e.g., laptop thermals, battery vs plugged-in states).
  • Energy measurement scope comparability: Device-level power excludes datacenter overheads (PUE, cooling, networking, storage), and local measurements may omit CPU, DRAM, display, and system-level overhead; apples-to-apples, end-to-end energy accounting remains unresolved.
  • Network and routing overheads: The energy/latency costs of routing (feature extraction, model scoring, verification steps, fallback logic, network egress) are not included, yet are critical for real deployments.
  • Practical routing feasibility: The “80% accurate router” is simulated, not implemented; how to build such a router (predictors, signals, training data, online evaluation) and its real performance and overhead remain open.
  • “Default to cloud” assumption: Simulations assume misrouted queries are detected and seamlessly rerouted to cloud without quality loss; mechanisms to detect failure and the extra energy/latency of second-pass reruns are unspecified.
  • “Active parameters ≤ 20B” criterion: The definition and verification of “active parameters” (especially for MoE models like GPT-OSS-120B) are not operationalized; per-token expert activation counts and variability are unreported.
  • Software stack and kernel parity: Differences in inference stacks (CUDA/cuDNN vs Metal vs ROCm), kernels, attention implementations, quantization, and parallelism strategies are not standardized or ablated, confounding hardware-versus-model attributions.
  • Quantization and precision trade-offs: While mentioned in the appendix, the main results don’t systematically map accuracy/energy trade-offs across dtypes/quantization levels, leaving practical guidance incomplete.
  • Cost modeling transparency: Cost reductions are reported without detailed assumptions (hardware amortization, electricity prices, utilization rates, cloud pricing models, maintenance), limiting transferability to other settings.
  • Carbon impact and lifecycle: Results focus on energy but not emissions (grid carbon intensity, temporal marginal emissions) or embodied carbon of hardware; environmental conclusions cannot be drawn.
  • Cloud measurement reproducibility: How cloud-side device power and energy were measured (and whether site-level overheads were included) is not fully specified; third-party replication may be difficult.
  • Multi-device and edge heterogeneity: Local results are centered on a single flagship SoC (Apple M4 Max); heterogeneity across consumer devices (NPUs, mid-range GPUs, mobile SoCs) and their IPW is unexplored.
  • Memory pressure and capacity constraints: The impact of memory limits (KV cache growth, long-context eviction, unified memory contention) on IPW and coverage, especially on lower-memory local devices, is not analyzed.
  • Longitudinal attribution rigor: The decomposition of IPW gains into “model” vs “hardware” contributions mixes different years, stacks, and devices; controlled ablations (same software stack, same prompts, same tokenization) are needed to isolate causal contributions.
  • Security, safety, and policy effects: Safety filtering, toxicity mitigation, and privacy-preserving mechanisms can alter latency and energy; their impact on IPW is not measured.
  • Dataset release and replication: Although the harness is released, cleaned datasets, outputs, and raw telemetry necessary to replicate aggregate IPW numbers are not clearly stated as available.
  • User-centric metrics: There is no direct linkage between IPW and user experience (preference, satisfaction, perceived latency thresholds); how IPW correlates with real-world QoE remains open.
  • Cross-workload routing policies: The paper assumes “smallest capable model” routing; alternative objectives (minimize latency, cost ceilings, privacy constraints, battery conservation) and their trade-offs for IPW are unstudied.
  • Per-domain routing policies: Given domain-specific coverage gaps, whether domain-aware routers can close technical-discipline deficits (and at what energy cost) is not evaluated.
  • Robustness to prompt distribution shifts: The representativeness of one-month WildChat prompts and the generalization of routing/IPW conclusions to evolving enterprise and agentic workloads are untested.
  • Impact of speculative/assisted decoding: Techniques like speculative decoding, grammar-constrained decoding, or verifier models may change IPW substantially; their effects on local vs cloud parity are unquantified.
  • Privacy and governance constraints: How data residency and privacy requirements (favoring local inference) interact with IPW-optimized routing is not modeled.

Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging the paper’s IPW metric, empirical findings (e.g., 88.7% local coverage on single-turn queries), and the released cross-platform profiling harness.

  • Hybrid local–cloud routing middleware for LLM workloads
    • Sector: software/SaaS, customer support, education, productivity platforms
    • Tools/products/workflows: IPW-aware router that assigns queries to the smallest capable local model first and falls back to a frontier cloud model; SDKs that integrate the paper’s profiling harness; dashboards tracking APW/APJ
    • Assumptions/dependencies: Router reaches ~80% assignment accuracy; local devices host ≤20B-parameter models with interactive latency; quality preserved via cloud fallback
    • Potential impact: 40–65% reductions in energy, compute, and cost at 80% routing accuracy; up to ~80% savings under oracle routing
  • On-device assistants for single-turn chat, writing, and information tasks
    • Sector: consumer software, productivity apps, enterprise knowledge workers
    • Tools/products/workflows: Local-first assistants on Apple M4 Max/AMD Ryzen AI; “fallback-to-cloud” workflows for harder queries; battery/thermal-aware inference modes
    • Assumptions/dependencies: Local model coverage remains high for conversational/creative tasks (≥90% in creative fields); interactive latency on device
    • Potential impact: 88.7% of single-turn chat and reasoning queries handled locally; improved responsiveness and privacy
  • Developer copilots with local-first inference
    • Sector: software engineering
    • Tools/products/workflows: IDE copilots that attempt local code search/explanations, unit-test writing, doc drafting, with automatic cloud escalation for complex reasoning
    • Assumptions/dependencies: Technical tasks have lower local coverage (e.g., Architecture & Engineering at ~60%); LLM-as-a-judge or test-based verification ensures quality
    • Potential impact: Reduced compute and cost for routine developer tasks, better availability in low-connectivity environments
  • Privacy-preserving clinical documentation and admin support
    • Sector: healthcare
    • Tools/products/workflows: Local scribing, template generation, and note summarization on clinician laptops; default cloud fallback for complex medical reasoning
    • Assumptions/dependencies: HIPAA compliance, robust PHI handling; domain-specific evaluation for safety; human-in-the-loop validation
    • Potential impact: Privacy benefits by keeping routine text generation on-device; lower operational costs
  • Offline-first study aids and tutoring for schools
    • Sector: education
    • Tools/products/workflows: Local Q&A and writing feedback on student laptops; classroom routers that prefer local models; content safety filters
    • Assumptions/dependencies: Stronger performance in humanities/creative tasks than STEM reasoning; pedagogical validation and equity considerations
    • Potential impact: Wider access to AI support without constant internet; cost-effective deployments in bandwidth-constrained settings
  • MLOps efficiency dashboards and CI/CD gates using IPW
    • Sector: platform engineering, ML operations
    • Tools/products/workflows: Integrate the profiling harness to track APW/APJ across releases; set deployment gates for energy/latency; detect regressions
    • Assumptions/dependencies: Consistent telemetry across hardware; reliable LLM-as-a-judge configurations
    • Potential impact: Predictable efficiency improvements and cost control; actionable benchmarking across model-accelerator pairs
  • Hardware procurement and capacity planning via IPW
    • Sector: IT operations, cloud/edge infrastructure
    • Tools/products/workflows: Run IPW benchmarks to choose accelerators and model sizes per workload; plan mixed fleets (local devices + cloud)
    • Assumptions/dependencies: Benchmarks represent real workloads; clear TCO models; awareness of efficiency gaps (local vs. cloud)
    • Potential impact: Better price-performance; informed capex/opex decisions; reduced emissions
  • Energy/ESG reporting and carbon accounting for AI inference
    • Sector: sustainability, corporate reporting
    • Tools/products/workflows: Convert APJ telemetry into energy and CO2 estimates; track reductions from local-first routing; integrate into ESG dashboards
    • Assumptions/dependencies: Accurate power metering; standardized carbon factors; audited measurement processes
    • Potential impact: Documented 40–80% energy savings depending on routing quality and workload mix
  • Adaptive mobile/edge apps that optimize for battery and cost
    • Sector: mobile, consumer electronics
    • Tools/products/workflows: Battery-aware inference modes; dynamic switching between local and cloud; transparent user controls
    • Assumptions/dependencies: Device support for small LMs; acceptable latency; connectivity for fallbacks
    • Potential impact: Longer battery life and lower data costs while maintaining quality
  • Datacenter load shedding and peak smoothing via client offload
    • Sector: cloud infrastructure, energy management
    • Tools/products/workflows: Orchestration that pushes eligible workloads to local devices during peaks; SLA-bound fallbacks to cloud
    • Assumptions/dependencies: Client acceptance, security of edge execution; predictable quality via routing
    • Potential impact: Reduced peak power demand and cooling costs; improved resilience
  • Academic reproducible benchmarking using the released harness
    • Sector: academia, research labs
    • Tools/products/workflows: Cross-platform harness to measure latency, energy, TTFT, APW/APJ; compare models and accelerators over time
    • Assumptions/dependencies: Community adoption; shared datasets and evaluation protocols
    • Potential impact: Standardized efficiency tracking and faster iteration on model/hardware advances

Long-Term Applications

The following applications will require further research, scaling, standardization, and/or development before broad deployment.

  • IPW-driven standards and certification for AI efficiency
    • Sector: policy, compliance, cloud procurement
    • Tools/products/workflows: Industry-wide APW/APJ baselines; “IPW-certified” labels for devices/apps; SLAs that include intelligence efficiency
    • Assumptions/dependencies: Multi-stakeholder consensus on metrics and audits; regulatory uptake
    • Potential impact: Transparency in AI energy use; market incentives for efficient models/hardware
  • Next-generation local accelerators that close the efficiency gap
    • Sector: semiconductors, device OEMs
    • Tools/products/workflows: Specialized on-device AI components (e.g., HBM3e-class memory, tensor cores, improved memory hierarchies)
    • Assumptions/dependencies: Design cycles and fabrication capacity; thermal/power constraints on consumer devices
    • Potential impact: Narrowing 1.4–7.4x APJ gap vs. cloud accelerators; higher local IPW and broader on-device model sizes
  • Advanced routers with >90% accuracy and cost/carbon-aware policies
    • Sector: software platforms, observability
    • Tools/products/workflows: Semantic/task classifiers, risk-aware decision-making, learning-to-route with feedback loops; privacy-preserving local classification
    • Assumptions/dependencies: High-quality labeled datasets; robust evaluation (including LLM-as-a-judge reliability); guardrails for misrouting
    • Potential impact: Near-oracle savings with maintained quality; fine-grained control over cost/latency/emissions
  • Carbon-aware scheduling and demand-response for AI inference
    • Sector: energy, cloud operations
    • Tools/products/workflows: Grid signal integration; shift flexible inference to renewable-heavy windows and local devices; carbon budgets for AI services
    • Assumptions/dependencies: Access to real-time grid carbon intensity; SLAs tolerant to scheduling; standardized carbon accounting
    • Potential impact: Significant emission reductions at scale; alignment of AI demand with clean supply
  • Expansion to multi-turn, multi-agent, and multimodal (vision/speech) workloads
    • Sector: healthcare, robotics, media, education
    • Tools/products/workflows: Local-first pipelines for dialog, tool use, and multimodal tasks; early-exit and collaboration protocols across edge/cloud
    • Assumptions/dependencies: Larger context windows on device; better local multimodal models; robust evaluation beyond single-turn
    • Potential impact: Broader offloadable workload share; improved responsiveness for complex interactions
  • Domain-specific small models for technical fields
    • Sector: engineering, finance, life sciences
    • Tools/products/workflows: Curated pretraining and post-training to lift local coverage in Architecture & Engineering, math, and specialized sciences
    • Assumptions/dependencies: Access to domain datasets; expert evaluation; safety checks
    • Potential impact: Raise local coverage (currently ~60% in technical domains) toward parity with creative/humanities fields
  • Commercialization of edge–cloud collaboration protocols
    • Sector: telecom, cloud providers, systems software
    • Tools/products/workflows: Token/layer partitioning (speculative decoding, early exit) in production; standardized protocols (e.g., Minions-like)
    • Assumptions/dependencies: Bandwidth/latency guarantees; developer tooling; security of intermediate states
    • Potential impact: Lower latency/cost while preserving quality; efficient use of heterogeneous resources
  • Privacy-by-design frameworks for sensitive sectors
    • Sector: healthcare, finance, government
    • Tools/products/workflows: Local-first inference policies, audit trails, formal privacy guarantees; continuous evaluation on sensitive tasks
    • Assumptions/dependencies: Legal/regulatory alignment; standardized audits; red-teaming for safety
    • Potential impact: Wider adoption of local AI for sensitive data; reduced compliance risk
  • National and municipal policies incentivizing local inference
    • Sector: public policy, economic development
    • Tools/products/workflows: Tax credits or procurement rules favoring high-IPW deployments; datacenter permitting tied to hybrid offload strategies
    • Assumptions/dependencies: Evidence base (e.g., 60–80% savings with realistic routers); stakeholder engagement; grid capacity constraints
    • Potential impact: Reduced infrastructure strain and capex; more equitable access to AI
  • IPW labeling for consumer ecosystems
    • Sector: consumer electronics, app marketplaces
    • Tools/products/workflows: “IPW-certified” device/app labels; user-facing energy/latency scores; settings for efficiency vs. speed
    • Assumptions/dependencies: Trusted measurement pipelines; UX norms for energy transparency
    • Potential impact: Consumer choice aligned with sustainability and performance
  • Autonomous MLOps autopilot
    • Sector: devops, platform engineering
    • Tools/products/workflows: Continuous IPW benchmarking, automatic model selection, routing, rollback; policy-driven deployment governance
    • Assumptions/dependencies: Integration across toolchains; guardrails to prevent regressions; explainability for ops teams
    • Potential impact: Faster iteration with controlled energy/cost envelopes
  • Resilience and emergency deployments (offline AI)
    • Sector: public safety, disaster response
    • Tools/products/workflows: Local-only inference kits for field teams; robust routing policies without connectivity
    • Assumptions/dependencies: Suitable models and hardware; training for responders; content safety in critical contexts
    • Potential impact: Continuity of AI support during outages; reduced dependency on fragile infrastructure
  • Education efficacy research and scaled adoption
    • Sector: education
    • Tools/products/workflows: RCTs evaluating offline tutors; curricular integration; teacher tooling
    • Assumptions/dependencies: Ethical frameworks; measurement of learning outcomes; equitable deployment
    • Potential impact: Evidence-backed rollouts of local AI tutoring and feedback at scale
  • Finance and TCO modeling that internalizes IPW
    • Sector: finance, corporate strategy
    • Tools/products/workflows: Incorporate APW/APJ into TCO, pricing, and budgeting; scenario planning for hybrid fleets
    • Assumptions/dependencies: Adoption of standardized metrics; cross-functional alignment
    • Potential impact: Better capital allocation, predictable inference costs, and emissions targets

Glossary

  • Accuracy per joule: A metric quantifying task accuracy normalized by total energy consumed per query. "including accuracy per joule and perplexity-based measurements presented in Figure 1 (App. C.2)"
  • Accuracy per watt: A metric quantifying task accuracy per unit of instantaneous power draw. "Accuracy per watt has improved over 5x in two years, driven by advances in both model architectures (from MIXTRAL-8x7B to GPT-OSS-120B) and accelerator hardware (from NVIDIA Quadro RTX 6000 to Apple M4 Max)."
  • Anthropic Economic Index: A taxonomy mapping AI queries to occupational categories to assess economic relevance. "we use GPT-4O-MINI to annotate each query with a category from the Anthropic Economic Index (Handa et al., 2025)"
  • Best-of-local ensemble: A routing strategy that assigns each query to the most capable local model for that specific query. "across individual local LMs, the best-of-local ensemble (routing to the best local LM for each query), and the best-of-cloud baseline (routing to the best frontier model)."
  • Cloud accelerators: Enterprise-grade hardware in data centers specialized for high-efficiency AI inference. "Similarly, let H_cloud represent cloud accelerators (e.g., NVIDIA H200, AMD MI300X)."
  • Decoding: The generation phase of LLM inference where tokens are produced after context setup. "including both prefill and decoding phases."
  • Edge-cloud partitioning: Splitting model computation between local (edge) devices and cloud servers. "systems like SLED, HAT, and CE-CoLLM introduce edge-cloud partitioning with early-exit mechanisms (Li"
  • Early-exit mechanisms: Techniques that stop processing early when adequate confidence is reached, often in edge-cloud setups. "systems like SLED, HAT, and CE-CoLLM introduce edge-cloud partitioning with early-exit mechanisms (Li"
  • Frontier models: The largest, most capable state-of-the-art LLMs typically run in the cloud. "LLM queries are predominantly processed by frontier models deployed in centralized cloud infrastructure (OpenAI, 2025; Alvarez & Marsal, 2025)."
  • High-bandwidth memory (HBM3e): Advanced memory technology providing very high throughput for accelerators. "cloud accelerators like the B200 and SN40L employ purpose-built components-high-bandwidth memory (HBM3e), dedicated tensor processing units"
  • Intelligence per joule: End-to-end energy efficiency metric expressing accuracy per unit energy. "intelligence per joule (end-to-end energy efficiency per query)."
  • Intelligence per watt (IPW): Unified efficiency metric defined as task accuracy per unit of power consumption. "We propose intelligence per watt (IPW), task accuracy divided by unit of power, as a unified metric"
  • LLM-as-a-judge: Using an LLM to evaluate and score model outputs against references. "We use LLM-as-a-judge (see App. B.1 for the respective prompts) to score generated responses against reference answers."
  • Mixture-of-Experts (MoE): A model architecture that routes inputs to a sparse subset of expert sub-networks to improve efficiency. "mixture-of-experts (MoE) architectures (Shazeer et al., 2017; DeepSeek-AI, 2024)"
  • NVML: NVIDIA Management Library for querying GPU telemetry such as power and energy counters. "querying NVML's energy counter for NVIDIA GPUs"
  • Oracle routing: An ideal router that perfectly assigns each query to the smallest capable model. "Oracle routing (perfect assignment of each query to the smallest capable model) could reduce energy consumption by 80.4%"
  • Perplexity per joule (PPJ): Perplexity-based efficiency metric normalized by total energy per query. "Perplexity per joule: PPJ(m,h) = E_{q~Q}[ppl(m,q)] · E_{q~Q}[P(m,h,q) · T(m,h,q)]"
  • Perplexity per watt (PPW): Perplexity-based efficiency metric normalized by instantaneous power draw. "Perplexity per watt: PPW(m,h) = E_{q~Q}[ppl(m,q)] · E_{q~Q}[P(m,h,q)]"
  • Prefill: The initial inference phase that processes the prompt/context before token generation. "including both prefill and decoding phases."
  • ROCm SMI: AMD’s System Management Interface for accessing GPU telemetry (power, temperature, VRAM). "On AMD systems, we query ROCm SMI for power, temperature, and VRAM usage"
  • Routing function: A formal function that assigns each query to a local or cloud model. "A routing function r: Q → M_local ∪ M_cloud determines the assignment of each query to either a local or cloud model."
  • Speculative decoding: An acceleration technique where a smaller “draft” model’s outputs are verified by a larger model. "speculative decoding employs draft model verification (Miao et al., 2023; Xu et al., 2025)"
  • Telemetry: System-level measurements collected during inference (e.g., power, energy, latency). "records detailed telemetry-latency, throughput, time-to-first-token (TTFT), energy consumption, and more"
  • Time-to-first-token (TTFT): The latency from request submission to the first generated token. "latency, throughput, time-to-first-token (TTFT), energy consumption, and more"
  • Unified memory architectures: Hardware designs that share memory across compute components to increase capacity and ease deployment. "Local accelerators that offered 10-20 GB in 2020 now provide 128-512 GB through unified memory architectures like Apple Silicon"
  • Win/tie rate: The fraction of cases where a local model’s answer wins or ties against a frontier model. "win/tie rate versus frontier models increases from 23.2% (2023) to 71.3% (2025)"

Open Problems

We found no open problems mentioned in this paper.

