Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

Published 20 Apr 2026 in cs.AI, cs.HC, and cs.LG | (2604.18566v2)

Abstract: We present a systematic evaluation of LLM families, spanning both proprietary cloud APIs and locally-hosted open-source models, on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77–89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50–100% on model building steps and 47–75% on feedback explanation, but only 0–50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of model type effects on performance: we compare reasoning vs. instruction-tuned architectures, GGUF (llama.cpp) vs. MLX (mlx_lm) backends, and quantization levels (Q3 / Q4_K_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep ($t$, $p$, $k$) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B–123B parameter models on Apple Silicon.

Authors (1)

Terry Leitch

Summary

The paper establishes that architecture type, rather than model size or quantization, is the key determinant of performance for structured CLD extraction and discussion tasks.
The evaluation employs a rigorous benchmark of 53 tests across tasks such as conformance, causal reasoning, iterative updates, and translation to compare cloud and local LLMs.
The study offers practical deployment insights, showing that edge AI appliances can achieve competitive accuracy with improved energy efficiency and data governance.

Systematic Benchmarking of System Dynamics AI Assistants: Cloud Versus Local LLMs on Causal Loop Diagram Extraction and Discussion

Introduction

This paper conducts a rigorous evaluation of proprietary cloud LLM APIs and locally-hosted open-source models in the domain of System Dynamics, specifically for the extraction of Causal Loop Diagrams (CLDs) and interactive discussion tasks. The study introduces two distinct benchmarks: the CLD Leaderboard for structured CLD extraction and the Discussion Leaderboard for model analysis, coaching, and error-fixing. The evaluation is grounded in strict schema conformity and exact structured matching, enabling granular differentiation of model capability across architecture class (reasoning vs. instruction-tuned), inference backend (llama.cpp vs. MLX), and quantization tier.

Benchmark Structure and Evaluation Methodology

The CLD Leaderboard comprises 53 tests spanning conformance, qualitative causal reasoning, iterative model building, and translation from domain-specific passages. Discussion tasks further stress the models with interactive coaching, feedback explanation, and error-fixing, demanding genuine domain understanding and long-context processing. Scoring is exact: a test passes only if the output JSON matches the predefined schema after normalization.
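The exact-match scoring described above can be illustrated with a minimal Python sketch. The JSON layout used here (a `relationships` list with `from`, `to`, and `polarity` fields) and the normalization rules (case folding, whitespace collapsing, order-insensitive comparison) are assumptions for illustration; the benchmark's actual schema and normalization are not given in this summary.

```python
import json


def normalize(cld: dict) -> frozenset:
    """Reduce a CLD to an order-insensitive set of (from, to, polarity) triples.

    Field names are hypothetical; variable names are lower-cased and
    internal whitespace is collapsed before comparison.
    """
    triples = set()
    for rel in cld["relationships"]:
        src = " ".join(rel["from"].lower().split())
        dst = " ".join(rel["to"].lower().split())
        triples.add((src, dst, rel["polarity"].strip()))
    return frozenset(triples)


def passes(raw_output: str, expected: dict) -> bool:
    """Return True only if the model output parses and matches exactly."""
    try:
        candidate = json.loads(raw_output)
        return normalize(candidate) == normalize(expected)
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Malformed JSON or a missing/ill-typed field counts as a failure.
        return False
```

Under this kind of scoring there is no partial credit: one wrong polarity or one missing link fails the whole test, which is why small formatting or reasoning slips show up directly in the pass rate.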

Prompting strategies are carefully controlled, with zero-shot and few-shot variants for CLD extraction and a unified mentoring engine for Discussion tasks. All runs use deterministic configurations (seed 4242), and local models are deployed on Apple Silicon with standardized toolchains, supporting reproducibility and consistent cross-model comparisons.

Models and Deployment Tiers

Cloud APIs evaluated include recent OpenAI GPT, Anthropic Claude, and Google Gemini series, all instruction-tuned. The local open-source model pool consists of frontier-scale architectures (Kimi K2.5, DeepSeek V3.2, Qwen 3.5) and smaller task-specialized models (GLM-5, Llama 4 Maverick), tested at multiple quantization levels and backend configurations (GGUF and MLX). Dense architectures and MoE variants are distinguished due to known inference and grammar template behavior differences.

Architecture Class Effects and Prompting Interactions

A critical finding is that architecture class, rather than parameter count or quantization, is the primary determinant of task-specific performance. Reasoning models (e.g., Kimi K2.5, GLM-5) display strong conformance and causal reasoning, but degrade sharply at non-zero temperature and prefer zero-shot prompting. Instruction-tuned models exhibit greater robustness to prompt styles but lower maximal scores on tasks requiring domain reasoning or iterative update. Backend effects are substantial: llama.cpp secures strict JSON compliance through grammar-constrained sampling, but is prone to hangs on dense models with long-context prompts. MLX lacks built-in schema enforcement and requires explicit prompt-level JSON instructions.

Zero-shot preference is empirically verified for reasoning models, likely due to susceptibility to anchoring from in-context examples. Top-$k$ sampling combined with top-$p$ shows variable efficacy, benefitting select reasoning models but not instruction-tuned ones.
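For readers unfamiliar with the two sampling parameters, the following Python sketch shows one common way of combining them: truncate to the $k$ most probable tokens, renormalize, then keep the smallest prefix whose cumulative probability reaches $p$. The ordering and the function name are assumptions for illustration; the inference backends used in the paper may apply these filters differently.

```python
def filter_top_k_top_p(probs: dict[str, float], k: int, p: float) -> dict[str, float]:
    """Apply top-k, then top-p (nucleus) filtering to a token distribution.

    Returns a renormalized distribution over the surviving tokens.
    """
    # Top-k: keep the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    mass = sum(prob for _, prob in ranked)
    ranked = [(token, prob / mass) for token, prob in ranked]

    # Top-p: keep the smallest prefix whose cumulative mass reaches p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break

    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}
```

With $k = 1$ the filter leaves only the most probable token, which is equivalent to greedy decoding regardless of $p$.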

Quantization and Inference Backend Insights

At scale (397B–671B), Q3–Q4 and MLX 4-bit to 6-bit quantization do not materially degrade extraction quality, corroborating prior quantization literature (Frantar et al., 2022). Backend distinctions affect category profiles: MLX excels at conformance, llama.cpp at iteration and translation. Latency and context window limitations are backend-driven and must be considered in deployment.

CLD Leaderboard and Discussion Leaderboard Results

Cloud models reach 77–89% pass rates on CLD extraction. Kimi K2.5 GGUF Q3 (local, zero-shot, $t = 0$) achieves 77%, competitive with mid-tier cloud performance. Discussion tasks show the strongest local performance on model building (up to 100%), reasonable feedback explanation rates (47–75%), but lower error-fixing accuracy (0–50%) due to context window constraints in local deployments.

Iterative model building remains a persistent challenge: only GLM-5 (9B) approaches cloud-level iteration (6/8), with all other local models failing beyond basic updates. Structured translation is competitive—local models match or nearly match the cloud ceiling. Causal reasoning remains difficult for all but reasoning architectures, and conformance is reliably achieved locally with proper backend selection.

Practical Implications and Deployment Guidance

Inference backend selection and prompt engineering are essential for practical deployment. MLX-based systems require explicit JSON output instructions; llama.cpp requires disabling grammar sampling for dense architectures in long-context tasks. Quantization at Q4 or MLX 4-bit and above is feasible with negligible accuracy loss for structured tasks.
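When a backend does not enforce a schema, the JSON requirement has to be stated in the prompt and the reply has to be parsed defensively. The Python sketch below shows one way to do both, using only the standard library. The instruction wording and the helper names `build_prompt` and `extract_json_object` are hypothetical and are not taken from the paper.

```python
import json

# Hypothetical instruction text appended to every task prompt.
JSON_INSTRUCTION = (
    "Respond with a single JSON object and nothing else. "
    "Do not wrap it in Markdown fences or add commentary."
)


def build_prompt(task: str) -> str:
    """Append the explicit JSON instruction to a task description."""
    return f"{task}\n\n{JSON_INSTRUCTION}"


def extract_json_object(text: str):
    """Return the first parseable JSON object found in text, or None.

    Tolerates surrounding prose or code fences that an unconstrained
    model may emit despite the instruction.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # Not valid JSON from this brace; try the next one.
        start = text.find("{", start + 1)
    return None
```

A reply for which `extract_json_object` returns `None` can be treated as a failed generation and retried or scored as a miss, depending on the evaluation policy.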

Energy scenario analysis indicates that, under realistic enterprise utilisation rates (15–40%), Mac Studio clusters are $2\times$ more energy-efficient per query than dedicated H100 GPU server deployments, with vastly reduced cooling and infrastructure requirements. Shared cloud APIs remain optimal for maximal batching efficiency, but edge AI appliances offer substantial sustainability and data sovereignty advantages for domain- and institution-specific deployments.
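The reason utilisation matters for per-query energy can be seen in a simple model, sketched below in Python. It assumes a linear power model in which a machine draws a fixed idle power plus an active component proportional to utilisation; this model, the function name, and all parameters are illustrative assumptions and do not reproduce the paper's energy analysis or its measurements.

```python
def energy_per_query_wh(
    idle_w: float,
    peak_w: float,
    peak_queries_per_hour: float,
    utilisation: float,
) -> float:
    """Estimate watt-hours per query under an assumed linear power model.

    idle_w and peak_w are power draw at 0% and 100% utilisation;
    peak_queries_per_hour is throughput at 100% utilisation.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    # Average power over an hour that is only partly spent serving queries.
    average_power_w = idle_w + utilisation * (peak_w - idle_w)
    queries_per_hour = utilisation * peak_queries_per_hour
    # Watts sustained for one hour, divided by queries served in that hour.
    return average_power_w / queries_per_hour
```

Under this model, idle power is amortised over fewer queries as utilisation falls, so hardware with a high idle draw is penalised most at the low utilisation rates typical of dedicated deployments.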

Task Routing and Edge AI Appliance Concept

Analysis confirms the viability of category-routed deployment: different models excel in different task categories. Routing iterative tasks to GLM-5, translation and conformance to Kimi K2.5, and conformance checks to DeepSeek V3.2 MLX-4 yields a post hoc aggregate pass rate of 91%, exceeding the best cloud API (89%). The SD AI appliance concept emerges as practically viable: dedicated edge-device deployments, requiring only standard office infrastructure, match cloud performance across most workloads and provide critical data governance in regulated sectors (healthcare, defense, private industry).
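The post hoc aggregation can be sketched in Python as follows: for each task category, pick the model with the most passes, then sum those passes over all tests. The function name and the data layout are assumptions for illustration, and any numbers passed in would need to come from the paper's per-category results, which are not reproduced in this summary.

```python
def routed_score(
    passes_by_model: dict[str, dict[str, int]],
    tests_per_category: dict[str, int],
) -> tuple[dict[str, str], float]:
    """Choose the best model per category and compute the combined pass rate.

    passes_by_model maps model name -> {category: tests passed}.
    tests_per_category maps category -> number of tests in that category.
    """
    routing: dict[str, str] = {}
    total_passed = 0
    for category in tests_per_category:
        # Route the category to whichever model passed the most tests in it.
        best_model = max(
            passes_by_model,
            key=lambda model: passes_by_model[model].get(category, 0),
        )
        routing[category] = best_model
        total_passed += passes_by_model[best_model].get(category, 0)
    return routing, total_passed / sum(tests_per_category.values())
```

Because the routing is chosen on the same tests it is scored on, a figure obtained this way is an optimistic estimate; a held-out test set would be needed to confirm that the routing generalises.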

Limitations and Future Prospects

Results are deterministic (single seed) and hardware-specific (Apple Silicon, Mac Studio 512GB); extension to broader hardware must be empirically verified. The architecture classification is operational and behavioral, not strictly tied to internal model design. Context window constraints currently block full Discussion coverage; advances in inference frameworks may alleviate these limitations.

Prospective developments include more robust local model architectures for iterative tasks, further model size optimization for mid-tier edge devices, improved task-routing middleware, and general advances in open-source domain-specific LLMs. The SD AI appliance paradigm offers a direction for sustainable, privacy-preserving, and performant AI assistance in System Dynamics and related fields.

Conclusion

This study demonstrates that local, task-routed open-source models are competitive with, and in select categories can exceed, proprietary cloud LLMs for System Dynamics CLD extraction and discussion tasks. Architecture class and inference backend dominate performance outcomes, not parameter count or quantization. Iterative structured editing remains an unsolved challenge for most local models, but can be addressed by targeted architectural or training interventions. Energy and infrastructure analysis supports the deployment of edge AI appliances for sustainable and compliant domain-specific AI workflows. The findings advocate for a nuanced, task-driven approach to AI deployment, emphasizing practical infrastructure selection and wider accessibility of high-performance local inference (2604.18566).
