LocalBench: County-Level LLM Benchmark

Updated 3 July 2026

LocalBench is a benchmark evaluating LLMs on county-level local knowledge, capturing detailed community statistics, cultural narratives, and governance insights.
It leverages 14,782 validated QA pairs drawn from census data, local subreddits, and regional news across 526 U.S. counties to assess both numerical and narrative reasoning.
The benchmark employs a multi-stage validation pipeline and a Localness Conceptual Framework to analyze models’ performance on physical, cognitive, and relational dimensions of locality.

LocalBench is a benchmark for evaluating LLMs on county-level local knowledge and reasoning across the United States. Its authors present it as the first benchmark designed to do this systematically, targeting hyper-local phenomena that macro-scale geographic evaluations capture poorly: county- and neighborhood-specific statistics, cultural narratives, local governance, and local vernacular. The benchmark is grounded in the Localness Conceptual Framework and contains 14,782 validated question-answer pairs spanning 526 U.S. counties in 49 states, enabling comparison of closed-book and web-augmented models across physical, cognitive, and relational dimensions of locality (Gao et al., 13 Nov 2025).

1. Motivation and evaluative scope

LocalBench was introduced to address a specific gap in geographic evaluation of LLMs. Existing evaluations had emphasized macro-scale tasks such as global factual recall, event summarization, and regional reasoning, while leaving hyper-local competence poorly understood. This gap is consequential because applications such as civic platforms and community journalism require systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance (Gao et al., 13 Nov 2025).

The benchmark’s notion of locality is explicitly county-level and community-centered. It is not restricted to structured facts such as official statistics, nor to isolated cultural references. Instead, it treats local knowledge as a compound object comprising quantitative indicators, discourse-level community knowledge, and place-specific interpretive material. In this sense, LocalBench operationalizes “local knowledge” as a mixture of factual recall, numerical reasoning, comparison, ranking, and narrative interpretation.

A common misconception is that geographic knowledge benchmarks already test this capability adequately. LocalBench rejects that assumption by arguing that benchmarks built from coarse-grained data or isolated references do not capture the fine-grained realities required for place-aware AI. Its central contribution is therefore not merely more data, but a change in evaluation granularity: from regional or national abstractions to county-level questions tied to concrete communities.

2. Localness Conceptual Framework

LocalBench is grounded in the Localness Conceptual Framework, which decomposes local knowledge into three interwoven domains and seven dimensions (Gao et al., 13 Nov 2025).

Physical Localness: Place Interaction and Temporal Presence.
Cognitive Localness: Cultural Understanding, Environmental Cognition, and Local Knowledge.
Relational Localness: Emotional Connection and Social/Community Engagement.

These dimensions are further resolved into 88 subcomponents. Examples given in the benchmark description include comfort navigating local built and natural spaces for Place Interaction, length of residence and home ownership rates for Temporal Presence, local slang, festivals, and cuisine for Cultural Understanding, land use and ecological familiarity for Environmental Cognition, insider tips and historical changes for Local Knowledge, affective bonds to place for Emotional Connection, and voting statistics and civic participation for Social/Community Engagement.

This framework serves two functions. First, it provides a coverage model for dataset construction, ensuring that the benchmark is not dominated by one subtype of locality such as census numerics or regional news. Second, it allows each QA pair to be mapped to one or more dimensions, making it possible to analyze model behavior not just by source or task type, but by the conceptual form of localness being tested. A plausible implication is that LocalBench is intended as both a benchmark and a representational schema for hyper-local AI evaluation.
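The mapping from QA pairs to dimensions and domains can be pictured with a small data structure. The following is a minimal sketch, assuming a record layout that the benchmark description does not specify: the `QAPair` class, its field names, and the lookup table are hypothetical and serve only to illustrate the three-domain, seven-dimension hierarchy listed above.

```python
from dataclasses import dataclass, field

# Domains and dimensions exactly as listed in the framework description.
LOCALNESS_FRAMEWORK = {
    "Physical": ["Place Interaction", "Temporal Presence"],
    "Cognitive": [
        "Cultural Understanding",
        "Environmental Cognition",
        "Local Knowledge",
    ],
    "Relational": ["Emotional Connection", "Social/Community Engagement"],
}

# Reverse lookup: dimension name -> domain name.
DIMENSION_TO_DOMAIN = {
    dim: domain
    for domain, dims in LOCALNESS_FRAMEWORK.items()
    for dim in dims
}


@dataclass
class QAPair:
    """Hypothetical record layout; the real dataset schema may differ."""

    question: str
    answer: str
    dimensions: list[str] = field(default_factory=list)

    def domains(self) -> set[str]:
        """Return the domains covered by this pair's dimension labels."""
        return {DIMENSION_TO_DOMAIN[d] for d in self.dimensions}
```

A pair labeled with both Temporal Presence and Local Knowledge would then report the Physical and Cognitive domains, which is the kind of grouping needed to analyze results by conceptual form of localness rather than by source.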

3. Dataset composition and validation pipeline

The benchmark contains 14,782 validated QA pairs drawn from 526 unique U.S. counties in 49 states. County selection is balanced by Rural-Urban Continuum Code, with RUCC 1–3 treated as urban, 4–6 as suburban, and 7–9 as rural, and the sample is stratified across all three RUCC groups (Gao et al., 13 Nov 2025).
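The RUCC grouping described above amounts to a three-way bucketing of the codes 1 through 9. A minimal sketch follows; the function name `rucc_group` is an illustrative assumption, while the cut points are the ones stated in the text.

```python
def rucc_group(code: int) -> str:
    """Map a Rural-Urban Continuum Code (1-9) to the grouping used above."""
    if not 1 <= code <= 9:
        raise ValueError(f"RUCC must be between 1 and 9, got {code}")
    if code <= 3:
        return "urban"
    if code <= 6:
        return "suburban"
    return "rural"
```

Grouping a list of counties with this function and counting the results per group is one simple way to check that a sample covers all three strata.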

Source	QA pairs	Coverage note
Structured Census Indicators	6,120	34 metrics from U.S. Census Bureau, USDA, National Register of Historic Places, CDC, etc.
Local Subreddit Discourse	4,000	Posts and top-50 comments from county-level subreddits, Jan 2024–Mar 2025
Regional News	4,662	Articles from the NELA-Local corpus, Apr 2020–Dec 2021

The construction pipeline has four stages. In raw generation, the OpenAI o3 model with temp=0.7 and top-p=0.9 produces one to three QA pairs per document. In the multi-rule filter, a DPO-tuned GPT-4o-mini model applies nine criteria: single factual answer, geographic grounding, subjectivity filtering, privacy, safety, temporal consistency, difficulty, clarity, and completeness. In feedback-driven regeneration, failed pairs undergo up to three iterative refinements. The final stage is human verification.
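The filter and regeneration stages can be sketched as a bounded retry loop. This is an illustration under stated assumptions, not the authors' implementation: `check` and `regenerate` are hypothetical callables standing in for the filter model and the generator, and the convention that `check` returns the list of failed criteria is assumed for the example.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")

# The nine filter criteria named in the pipeline description.
CRITERIA = [
    "single factual answer",
    "geographic grounding",
    "subjectivity filtering",
    "privacy",
    "safety",
    "temporal consistency",
    "difficulty",
    "clarity",
    "completeness",
]

MAX_REFINEMENTS = 3  # "up to three iterative refinements"


def curate(
    pair: T,
    check: Callable[[T], List[str]],
    regenerate: Callable[[T, List[str]], T],
) -> Optional[T]:
    """Return a pair that passes every criterion, or None if it never does.

    check(pair) returns the names of the criteria the pair fails (hypothetical).
    regenerate(pair, failures) returns a revised pair (hypothetical).
    """
    failures = check(pair)
    attempts = 0
    while failures and attempts < MAX_REFINEMENTS:
        pair = regenerate(pair, failures)
        failures = check(pair)
        attempts += 1
    return None if failures else pair
```

Pairs returned by `curate` would then proceed to human verification; pairs for which it returns `None` are dropped after the third failed refinement.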

Human verification is reported quantitatively. For quality filters against 500 human-annotated samples, the benchmark reports $F_1 = 0.94$ and $\kappa = 0.84$. For classification of Localness dimensions against 200 human labels, it reports precision $94.2\%$, domain-level precision $98.5\%$, and $\kappa = 0.87$. These values are used to support the claim that the automated curation and labeling pipeline aligns closely with human judgment.
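For readers unfamiliar with the two agreement statistics, the sketch below shows the standard definitions of binary F1 and Cohen's kappa. It assumes paired label lists of equal length; the function names are illustrative, and nothing here reproduces the benchmark's own annotation data.

```python
from collections import Counter
from typing import Hashable, Sequence


def f1_binary(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 for binary labels, with 1 as the positive class."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (observed - expected) / (1 - expected)
```

Kappa discounts the agreement two raters would reach by chance given their label frequencies, which is why it is reported alongside raw precision or F1.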

The inclusion of subreddit discourse and regional news is methodologically significant. Census-derived questions test structured factual retrieval and quantitative comparison, whereas local discourse and news supply narrative material that requires models to recover community norms and locally situated interpretation. This broadens evaluation beyond official data toward the forms of locality that often matter in actual civic and cultural use cases.

4. Task taxonomy and evaluation protocol

LocalBench organizes its QA pairs into three high-level task types: factual recall or numerical reasoning, comparison and ranking, and narrative-style or interpretive QA (Gao et al., 13 Nov 2025). The numerical tasks include prompts such as median household income or unemployment comparisons. Narrative-style questions draw on subreddit or news discourse and require understanding of local traditions, norms, or community-specific meanings.

Evaluation is performed in two settings. In the closed-book setting, models answer from pre-trained knowledge only, with no retrieval. In the web-augmented setting, models use integrated search APIs, with examples including GPT-4.1+Web and Gemini-2.5-Pro+Grounding. Thirteen state-of-the-art LLMs are evaluated, including proprietary models such as GPT-4o, GPT-o3, GPT-4.1, Gemini-2.5-Pro, Gemini-2.5-Flash, Claude-4-Sonnet, and Claude-3.7-Sonnet, as well as open-source Qwen3 variants.

The protocol fixes the temperature at 0.0 and max_tokens at 256, and uses three runs per model. Multiple metrics are applied. Factual correctness is scored with Exact Match, ROUGE-1 F1, and embedding-based Semantic Match. Numerical accuracy is judged correct when

$\frac{|pred - gold|}{gold} < 2\%$

with exact zero required if $gold = 0$. The benchmark also uses GPT Judge Accuracy, implemented with GPT-4o-mini as a binary “Correct/Incorrect” judge with 96% human alignment. Answer Rate is defined as the proportion of non-empty, non-“I don’t know” responses.
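The numerical criterion can be written as a short predicate. This is a minimal sketch: the function name and the `tol` parameter are illustrative, and taking the absolute value of the denominator is an assumption added here so that negative gold values behave sensibly (the stated formula divides by gold directly).

```python
def numeric_correct(pred: float, gold: float, tol: float = 0.02) -> bool:
    """True if pred is within a strict 2% relative error of gold."""
    if gold == 0:
        # A zero gold value requires an exact zero prediction.
        return pred == 0
    # abs(gold) is an assumption; the source formula uses gold as written.
    return abs(pred - gold) / abs(gold) < tol
```

Because the inequality is strict, a prediction of 102 against a gold value of 100 is counted as incorrect, while 101 is accepted.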

For aggregate performance, the benchmark states:

$\mathrm{Accuracy} = \frac{\#\ \mathrm{correct\ responses}}{\mathrm{Total\ questions}}$
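Both aggregate quantities are simple proportions. The sketch below assumes per-question correctness flags and raw response strings as inputs; the refusal matching (case-insensitive, trailing period ignored) is an assumption, since the exact normalization is not described.

```python
from typing import Sequence

# Assumed refusal strings, covering straight and curly apostrophes.
REFUSALS = {"i don't know", "i don’t know"}


def accuracy(correct_flags: Sequence[bool]) -> float:
    """Number of correct responses divided by total questions."""
    if not correct_flags:
        raise ValueError("no questions to score")
    return sum(correct_flags) / len(correct_flags)


def answer_rate(responses: Sequence[str]) -> float:
    """Share of responses that are neither empty nor an "I don't know"."""
    if not responses:
        raise ValueError("no responses to score")
    answered = 0
    for response in responses:
        text = response.strip().rstrip(".").lower()
        if text and text not in REFUSALS:
            answered += 1
    return answered / len(responses)
```

Note that unanswered questions still count in the accuracy denominator, so a model that frequently declines to answer is penalized on accuracy even if its attempted answers are correct.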

This design reflects a hybrid evaluation philosophy. Some items admit exact or near-exact scoring, particularly structured numerics; others require semantic or judgment-based evaluation because the answer space is locally grounded yet lexically variable.
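The two lexical metrics on the exact-scoring side can be sketched in a few lines. This is a simplified illustration: the lowercase word tokenization in `normalize` is an assumption, and reference ROUGE implementations may additionally apply stemming or other preprocessing.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split into word tokens (assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def exact_match(pred: str, gold: str) -> bool:
    """True if the normalized token sequences are identical."""
    return normalize(pred) == normalize(gold)


def rouge1_f1(pred: str, gold: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference answer."""
    pred_counts = Counter(normalize(pred))
    gold_counts = Counter(normalize(gold))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & gold_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)
```

An answer such as "the county fair" scored against the reference "county fair" fails Exact Match but still earns a ROUGE-1 F1 of 0.8, which illustrates why lexically variable answers call for softer or judgment-based metrics.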

5. Empirical results and observed failure modes

The benchmark reports substantial limitations even for top-performing models. On numerical reasoning over Census QA pairs, the closed-book baseline GPT-4.1 reaches 6.2% accuracy, and the best web-augmented result, GPT-4.1+Web, reaches 15.5%. Across other models, numerical accuracy ranges from 2.2% to 12.8%, and no model exceeds 15.5% (Gao et al., 13 Nov 2025).

On narrative-style non-numerical QA, GPT-4.1 in the closed-book setting reaches 47.0% according to the GPT judge, Gemini-2.5-Pro+Grounding achieves the highest result at 56.8%, and Claude-4-Sonnet in the closed-book setting reaches 39.7%. These figures are central to the benchmark’s argument: current LLMs remain far from reliable on hyper-local interpretive tasks, even when augmented with retrieval.

One of the most emphasized findings is the inconsistent, model-dependent effect of web augmentation. Gemini-2.5-Pro gains 13.6 percentage points on narrative accuracy, moving from 43.2% to 56.8%. GPT-4.1, by contrast, loses 11.4 percentage points, moving from 47.0% to 35.6%. The benchmark therefore rejects the assumption that search augmentation necessarily improves local QA. Instead, retrieval can either help or hurt, depending on how well a model reconciles external local documents with its parametric knowledge.

The reported error patterns are specific and recurrent. Models hallucinate local “facts,” including invented landmarks and made-up statistics. They exhibit temporal mismatches by citing outdated or future data. They also show cultural flattening, producing generic descriptions for small or rural communities. These failure modes indicate that local knowledge errors are not merely instances of ordinary factual inaccuracy; they often involve misrepresentation of place-specific social and temporal context.

The benchmark offers several analyses of why performance is weak. For numerical reasoning, it argues that transformer next-token objectives struggle with precise quantitative local data, and that models often decline to answer or hallucinate estimates. For retrieval, it describes a “retrieval integration paradox”: Gemini appears to benefit from web grounding through robust filtering of external noise, whereas GPT-4.1’s decline suggests difficulty reconciling retrieved local documents with parametric knowledge. It also reports that larger or MoE models such as Qwen3-235B do not outperform smaller dense models on these tasks, and concludes that scaling laws that hold for broad language tasks break down for hyper-local reasoning (Gao et al., 13 Nov 2025).

These findings situate LocalBench within a broader landscape of “local” benchmarks, while also distinguishing its object of study. LocalValueBench, for example, evaluates adherence to Australian local norms, laws, and ethical precepts through a three-layer typology of neutral, debate, and interrogation prompts (Meadows et al., 2024); LocalBench instead evaluates county-level local knowledge and reasoning across physical, cognitive, and relational dimensions of place. The shared terminology of “local” therefore spans distinct evaluation targets: normative alignment in one case and geographic-community knowledge in the other.

The future directions proposed for LocalBench are architectural, retrieval-oriented, and participatory. The benchmark suggests place-aware architectures, including explicit spatial reasoning modules such as geo-embeddings and graph-based locality maps, as well as enhanced numerical computation sub-modules or integration with structured data APIs. It also proposes specialized retrieval and indexing, including dedicated local corpora and search indices built from sources such as county tax records and local forums. A further proposal is participatory data curation, in which local communities contribute to QA creation and validation to capture cultural nuance, alongside support for multilingual and Indigenous-language sources. Finally, the benchmark proposes extensions beyond U.S. counties, adaptation of the Localness framework to other governance and data contexts, and dynamic QA reflecting evolving local events such as municipal decisions and neighborhood development (Gao et al., 13 Nov 2025).

A recurring misconception in discussions of benchmark progress is that higher general LLM capability should transfer automatically to finer-grained local tasks. LocalBench presents evidence against that view. Its results suggest that county-level reasoning is not simply a harder version of broad factual QA, but a distinct regime in which temporality, fragmented evidence, community-specific language, and structured quantitative data interact in ways that current models handle inconsistently.


References

Gao et al. (2025). LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning.

Meadows et al. (2024). LocalValueBench: A Collaboratively Built and Extensible Benchmark for Evaluating Localized Value Alignment and Ethical Safety in Large Language Models.
