- The paper presents an innovative DSL that streamlines mapper generation by abstracting complex C++ APIs and reducing low-level code complexity.
- LLM-driven optimization iteratively refines mapping strategies, achieving speedups of up to 1.34x on scientific applications and up to 1.31x on parallel matrix multiplication.
- Experimental results confirm that the automated mappers consistently match or surpass expert-designed solutions, underscoring the potential for broader AI-driven system design.
The paper presents a novel approach to optimizing parallel program performance by automating the generation of mappers through a domain-specific language (DSL) and LLM optimizers. This work addresses the complexities associated with task-based programming, where efficient mapping of computations to processors and data to memory is crucial yet labor-intensive.
Key Contributions
- Domain-Specific Language (DSL) for Mapper Generation: The paper introduces a DSL that abstracts away intricate low-level C++ APIs, simplifying code generation for LLMs. The DSL substantially reduces code complexity, enables high-level specification of mapping decisions, and structures the search space that the LLM explores.
- LLM-Driven Optimization: By framing mapper generation as a discrete optimization problem, the authors leverage LLMs to explore the vast search space of potential mappings. The generated mappers are optimized iteratively using feedback from system performance evaluations.
- Performance Improvements: Experimental results demonstrate that LLM-generated mappers can outperform expert-designed mappers, achieving up to 1.34x speedup in scientific applications and up to 1.31x in parallel matrix multiplication algorithms.
Methodology
The paper highlights two main challenges: generating syntactically correct mapper code and finding optimal mapping strategies. The DSL is designed to enable LLMs to effectively generate mapping code by encapsulating critical mapping decisions. These include processor selection, memory placement, data layout, and index mapping.
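To make the four decision kinds concrete, the following is a hypothetical, minimal sketch of what a declarative mapper specification might look like. The class and field names (`TaskMapping`, `render`, the task names, and the processor/memory labels) are illustrative assumptions, not the paper's actual DSL syntax:

```python
from dataclasses import dataclass

# Hypothetical mini-DSL covering the four decision kinds a mapper
# encapsulates: processor selection, memory placement, data layout,
# and index mapping. Names are illustrative, not the paper's syntax.
@dataclass
class TaskMapping:
    task: str
    processor: str            # processor selection, e.g. "GPU" or "CPU"
    memory: str               # memory placement, e.g. "fbmem" or "sysmem"
    layout: str = "SOA"       # data layout: struct-of-arrays vs array-of-structs
    index_map: str = "block"  # how index-space points are distributed

def render(mappings):
    """Flatten a list of mappings into a compact per-task policy table,
    the kind of tens-of-lines artifact a generated mapper expresses."""
    return {m.task: (m.processor, m.memory, m.layout, m.index_map)
            for m in mappings}

policy = render([
    TaskMapping("calc_new_currents", processor="GPU", memory="fbmem"),
    TaskMapping("update_voltages", processor="GPU", memory="fbmem",
                layout="AOS"),
])
```

A declarative form like this is far easier for an LLM to emit correctly than the equivalent C++ mapper callbacks, since each decision is a constrained field rather than free-form imperative code.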
The paper uses an agent-based system implemented with Trace, allowing feedback-driven iterative refinement of DSL mappers. This system feeds performance metrics back to the LLM optimizer, enabling the refinement of mappers under a structured search space defined by the DSL.
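The refinement loop can be sketched as follows. This is a hedged stand-in, not the paper's Trace-based implementation: `propose` substitutes a random mutation for the LLM optimizer, and `measure` substitutes a toy cost model for a real profiling run; all names are assumptions:

```python
import random

# Search space over two of the DSL's mapping decisions (illustrative).
SEARCH_SPACE = {"processor": ["CPU", "GPU"], "memory": ["sysmem", "fbmem"]}

def measure(candidate):
    # Stand-in for executing the program and collecting performance
    # feedback; here, GPU + framebuffer memory is pretended fastest.
    cost = 1.0
    if candidate["processor"] == "GPU":
        cost -= 0.3
    if candidate["memory"] == "fbmem":
        cost -= 0.2
    return cost

def propose(best, feedback):
    # An LLM optimizer would read `feedback` (runtimes, profiles) and
    # emit a revised DSL mapper; here we mutate one decision at random.
    cand = dict(best)
    key = random.choice(list(SEARCH_SPACE))
    cand[key] = random.choice(SEARCH_SPACE[key])
    return cand

best = {"processor": "CPU", "memory": "sysmem"}
best_cost = measure(best)
for _ in range(50):
    cand = propose(best, feedback=best_cost)
    cost = measure(cand)
    if cost < best_cost:  # keep only mappers the feedback says improved
        best, best_cost = cand, cost
```

The structure mirrors the paper's setup: the DSL bounds what `propose` may emit, and performance feedback steers the search toward better mappers.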
Experimental Validation
Experiments involved multiple applications, including circuit simulation and matrix multiplication algorithms. The DSL reduced the lines of code required from hundreds to tens, proving advantageous over traditional C++ implementations. The generated mappers consistently matched or surpassed the performance of expert mappers.
The authors also conducted an ablation study on feedback types, reinforcing that high-quality feedback significantly guides LLM optimizers toward efficient mappers.
Implications and Future Work
The paper suggests that LLM-based optimization using a DSL can significantly reduce the workload of performance engineers while achieving substantial performance gains. This opens avenues for broader application of LLM optimizers in complex system design challenges beyond parallel programming, such as automatic tuning in various software systems.
Future directions could include refining the DSL for broader applicability, exploring more optimization techniques within LLM frameworks, and extending the approach to other domains that require software optimization.
Conclusion
This work advances the field by integrating DSL and LLM technologies to automate and optimize parallel program performance, demonstrating practical effectiveness on complex computational tasks. The approach not only alleviates the tedium of manual mapper coding but also demonstrates the potential of LLMs to automate sophisticated system-level code generation, setting the stage for further innovations in AI-driven system design.