
Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, Christopher Ré

2025-11-12

Summary

This paper investigates whether we can shift some of the work done by huge, cloud-based AI models to smaller models running directly on our personal devices like laptops, and if doing so is actually practical and efficient.

What's the problem?

Currently, almost all requests to large language models, like those powering chatbots, are handled by massive computer systems in data centers owned by big companies. This system is struggling to keep up with the increasing demand, and it's hard for these companies to quickly build enough new infrastructure. The question is whether smaller, but still capable, AI models running on your own computer could take some of the load off these overwhelmed systems.

What's the solution?

The researchers measured how well these smaller AI models perform on real-world questions, and how much energy they use while doing so. They tested over 20 different small AI models on 8 different types of computer chips found in laptops and other devices, using a million actual chat and reasoning questions. They created a new measurement called 'intelligence per watt' – basically, how accurate the AI is for each unit of power it consumes – to compare different combinations of models and hardware. They then tracked how this 'intelligence per watt' improved over time.
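The metric itself is simple: task accuracy divided by the power consumed while producing the answers. Here is a minimal sketch in Python of how such a score could be computed; the function name and example numbers are illustrative assumptions, not the paper's actual profiling harness.

```python
def intelligence_per_watt(accuracy: float, avg_power_watts: float) -> float:
    """Intelligence per watt (IPW): task accuracy divided by average power draw.

    accuracy: fraction of queries answered correctly (0.0 to 1.0).
    avg_power_watts: average power consumed during inference, in watts.
    """
    if avg_power_watts <= 0:
        raise ValueError("average power must be positive")
    return accuracy / avg_power_watts


# Hypothetical example: a local model answering 88.7% of queries correctly
# while drawing 30 W on average (made-up numbers for illustration).
local_ipw = intelligence_per_watt(0.887, 30.0)
print(f"IPW: {local_ipw:.4f} accuracy/watt")
```

A higher IPW means more correct answers per unit of power, which is what lets the authors compare model-accelerator pairs on a common scale and track efficiency improvements over time.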

Why it matters?

The study found that smaller AI models can accurately answer a large percentage of real-world questions (88.7%), and that the efficiency of running these models locally improved 5.3x between 2023 and 2025. Notably, local devices are still at least 1.4x less power-efficient than cloud accelerators running the same models, which means there is substantial headroom for optimization. Taken together, these results suggest that we *can* start shifting some AI processing to our own devices, reducing the strain on large data centers and potentially making AI more accessible and responsive.

Abstract

Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (<=20B active parameters) now achieve competitive performance to frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering this requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (i.e., laptops). We propose intelligence per watt (IPW), task accuracy divided by unit of power, as a metric for assessing capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals 3 findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries with accuracy varying by domain. Second, from 2023-2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our IPW profiling harness for systematic intelligence-per-watt benchmarking.