Lingma SWE-GPT 72B: Process-Centric LLM
- Lingma SWE-GPT 72B is a process-centric large language model with 72B parameters designed to automate real-world software maintenance tasks.
- It integrates static code and dynamic interaction traces, achieving a 30.2% GitHub issue resolution rate and outperforming larger code-only baselines.
- The model uses a Transformer decoder with added process tokens for iterative reasoning, enhancing fault localization accuracy and patch quality.
Lingma SWE-GPT 72B is an open, process-centric LLM designed for automated software improvement. Built on a 72 billion parameter Transformer backbone, it explicitly models the dynamic, tool-mediated, and iterative workflows characteristic of real-world software maintenance tasks. In benchmark evaluations, it approaches the performance of state-of-the-art closed-source systems in automated GitHub issue resolution, establishing new standards for openness and process fidelity in automated software engineering (Ma et al., 2024).
1. Architectural Characteristics
Lingma SWE-GPT 72B is structured on a standard Transformer decoder stack, employing an autoregressive generative approach. Its architectural backbone is inherited from Qwen2.5-Coder, with the following prototypical configuration:
- Number of Transformer layers
- Hidden size
- Feed-forward inner dimension
- Number of attention heads and attention head dimension
The aggregate parameter count is dominated by the self-attention and feed-forward matrices, which together account for most of the nominal 72B parameters.
The core attention mechanism preserves standard multi-head attention:
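- Single-head: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^{\top} / \sqrt{d_k}\right) V$
- Multi-head: $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$, where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$ uses projections $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$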
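The standard attention mechanism referred to here can be illustrated with a minimal pure-Python sketch. It is not the model's implementation: the helper names (`softmax`, `matmul`, `attention`, `multi_head`) and the list-of-rows matrix representation are assumptions made for illustration, and the causal mask reflects the decoder-only, autoregressive setting described above.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v, causal=True):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = []
    for i, row in enumerate(scores):
        scaled = [s / math.sqrt(d_k) for s in row]
        if causal:
            # Decoder-style mask: position i may only attend to positions <= i.
            scaled = [s if j <= i else float("-inf") for j, s in enumerate(scaled)]
        weights.append(softmax(scaled))
    return matmul(weights, v)

def multi_head(x, heads, w_o):
    """Concatenate per-head outputs and apply the output projection W^O.

    `heads` is a list of (W^Q, W^K, W^V) triples, one per head (hypothetical layout).
    """
    outputs = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
               for wq, wk, wv in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)
```

With identity projections for every head, each head reduces to plain self-attention over the input, which makes the sketch easy to check by hand on a two-token example.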
Process-centric modifications are architectural overlays: prompt streams are extended with “process tokens” and explicit API-invocation instructions, enabling the model to emit Chain-of-Thought (CoT) reasoning steps and tool calls (e.g., project structure listing, AST-based queries) as output. No lower-level network changes are introduced (Ma et al., 2024).
2. Process-Centric Pretraining and Dataset Design
Training data for Lingma SWE-GPT 72B systematically integrates both static and dynamic elements of software development:
- Static code: 290,000 merged pull requests from 4,000 prominent GitHub repositories, each representing 1–5 file modifications and excluding test-only patches.
- Dynamic interaction traces: For every issue/PR pair, the dataset encodes the natural language issue, a hierarchical repository tree, developer CoT (commit messages), and code-level diffs.
- Synthesized process trajectories: The SWESynInfer pipeline generates synthetic “thought+action+observation” trajectories via a three-stage sequence: Repository Understanding, Fault Localization, and Patch Generation.
To ensure quality and process fidelity, data traces are filtered using:
- Fault-localization similarity: Jaccard index ≥ 0.6
- Patch similarity: normalized CodeBLEU or n-gram overlap ≥ 0.5
Only process traces meeting these thresholds are retained via rejection sampling, enforcing high-fidelity inputs for pretraining (Ma et al., 2024).
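The rejection-sampling filter can be sketched as follows. This is an illustration under stated assumptions rather than the SWESynInfer code: the function names are hypothetical, the Jaccard index is assumed to compare the locations a trace points at against the locations changed by the reference patch, the patch similarity score (CodeBLEU or n-gram overlap) is assumed to be computed elsewhere and passed in, and the default thresholds of 0.6 and 0.5 are the values quoted later in this article.

```python
def jaccard(predicted, reference):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of locations."""
    a, b = set(predicted), set(reference)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)

def keep_trace(pred_locations, gold_locations, patch_similarity,
               loc_threshold=0.6, patch_threshold=0.5):
    """Return True only when both similarity scores clear their thresholds."""
    return (jaccard(pred_locations, gold_locations) >= loc_threshold
            and patch_similarity >= patch_threshold)

# Hypothetical traces: (predicted files, reference files, patch similarity).
traces = [
    ({"a.py", "b.py"}, {"a.py", "b.py", "c.py"}, 0.55),  # Jaccard 2/3: kept
    ({"a.py"}, {"a.py", "b.py"}, 0.90),                  # Jaccard 1/2: rejected
    ({"a.py"}, {"a.py"}, 0.30),                          # patch too dissimilar: rejected
]
kept = [t for t in traces if keep_trace(*t)]
```

Requiring both conditions at once means a trace that localizes the fault correctly but produces an unrelated patch is still discarded.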
3. Instruction Tuning and Curriculum Learning
Fine-tuning leverages SWESynInfer-synthesized data, with each example presented as a sequence of observation and CoT+action pairs. The learning objective maximizes the conditional likelihood of each CoT+action pair given the observations and steps that precede it.
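One way to write such an objective is sketched below. The notation is assumed for illustration and is not taken from the source: $o_t$, $c_t$, and $a_t$ denote the observation, CoT, and action at step $t$, $T$ is the number of steps in a trajectory, and $\theta$ stands for the model parameters.

```latex
\max_{\theta} \; \sum_{t=1}^{T}
  \log P_{\theta}\left(c_t, a_t \,\middle|\, o_1, c_1, a_1, \ldots, o_{t-1}, c_{t-1}, a_{t-1}, o_t\right)
```

Under this reading, observations appear only as conditioning context, so the loss is taken over the tokens the model itself must produce (reasoning and actions).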
A curriculum-based regime spans 90 total iterations, with the initial ten iterations seeded by GPT-4o traces to bootstrap process fidelity. After every ten iterations, the training batches are refreshed and unsolved examples are carried forward, so that later rounds concentrate on progressively harder problems. This incremental approach aligns with realistic developer workflows and enhances the model’s ability to generalize across varied tasks (Ma et al., 2024).
The inference procedure for SWE-bench Verified applies pass@3, sampling three completions per input at a set temperature, with a maximum length of 1024 tokens. No additional environment or tool execution occurs at inference; all prompt templates already encode the necessary tool logic.
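The pass@3 metric can be sketched as below, assuming the common reading that an issue counts as resolved when at least one of its sampled completions resolves it. The function name and the `results` layout (issue id mapped to one boolean per sample) are hypothetical.

```python
def pass_at_k(results, k=3):
    """Percentage of issues with at least one successful completion among the first k samples.

    `results` maps an issue id to a list of booleans, one per sampled completion.
    """
    solved = sum(1 for outcomes in results.values() if any(outcomes[:k]))
    return 100.0 * solved / len(results)

# Hypothetical outcomes for four issues, three samples each.
results = {
    "issue-1": [False, True, False],
    "issue-2": [False, False, False],
    "issue-3": [True, True, True],
    "issue-4": [False, False, True],
}
score_at_3 = pass_at_k(results, k=3)
score_at_1 = pass_at_k(results, k=1)
```

Because any single success counts, pass@3 is always at least as high as the single-sample resolution rate on the same outputs.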
4. Benchmark Evaluation and Comparative Metrics
Evaluation utilizes the SWE-bench Verified benchmark, quantifying the rate of successfully resolved GitHub issues (out of 500 total).
| Model | Resolution (%) | Comments |
|---|---|---|
| Llama 3.1 405B | 24.62 | Open-source baseline |
| Lingma SWE-GPT 72B | 30.20 | +5.58 points over Llama 3.1 405B |
| GPT-4o | 31.80 | Closed-source oracle |
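- Relative improvement over Llama 3.1 405B: (30.20 − 24.62) / 24.62 ≈ 22.7%
- Consistency: Three independent runs yield (30.20%, 29.00%, 30.20%), with mean 29.80% and sample standard deviation ≈ 0.69 percentage points
- 95% confidence interval: [28.08%, 31.52%] (t-distribution, 2 degrees of freedom)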
- pass@3 performance: 39.80%, exceeding Claude 3.5 Sonnet (35.40%)
- Smaller variant Lingma SWE-GPT 7B: 18.20% (versus Llama 3.1 70B: 17.20%)
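The summary statistics in this list can be recomputed from the reported figures with a short sketch. It assumes the sample (n − 1) standard deviation and a two-sided 95% interval from the t-distribution with 2 degrees of freedom, whose critical value (about 4.303) is hard-coded; the function names are hypothetical.

```python
import math
import statistics

def summarize_runs(runs, t_crit=4.303):
    """Return (mean, sample stdev, 95% CI) for a small set of run scores.

    t_crit defaults to the two-sided 95% critical value for 2 degrees of
    freedom, which matches exactly three runs.
    """
    mean = statistics.mean(runs)
    stdev = statistics.stdev(runs)  # sample standard deviation (n - 1)
    margin = t_crit * stdev / math.sqrt(len(runs))
    return mean, stdev, (mean - margin, mean + margin)

def relative_gain(score, baseline):
    """Relative improvement of `score` over `baseline`, in percent."""
    return 100.0 * (score - baseline) / baseline

mean, stdev, ci = summarize_runs([30.20, 29.00, 30.20])
gain = relative_gain(30.20, 24.62)
print(f"mean={mean:.2f} stdev={stdev:.2f} ci=({ci[0]:.2f}, {ci[1]:.2f})")
print(f"relative gain over Llama 3.1 405B={gain:.2f}%")
```

With only three runs the interval is wide relative to the spread, since the critical value for 2 degrees of freedom is more than twice the familiar 1.96 used for large samples.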
These results demonstrate that process-centric augmentation not only closes much of the gap with closed-source oracles (GPT-4o: 31.80%) but also delivers strong, reproducible consistency (Ma et al., 2024).
5. Process Fidelity and Algorithmic Innovations
Central to Lingma SWE-GPT 72B’s approach is process-centric training—explicit modeling of developer workflows, including repository tree examination, AST query invocation, and iterative “think–act–observe” loops. This fosters improved fault localization accuracy (exceeding 72% at file level) and patch quality relative to pretraining on static code alone.
The SWESynInfer data synthesis pipeline applies a rigorous filtering step: a synthesized trajectory is kept only when its fault-localization Jaccard index is at least 0.6 and its patch CodeBLEU score is at least 0.5. Restricting training to samples above these thresholds focuses it on high-fidelity, process-faithful trajectories.
A significant observation is the outsized impact of process-oriented data over model scale: even the 7B-parameter Lingma variant surpasses much larger code-only baselines on software engineering tasks, indicating that modeling the process sequence is more beneficial than parameter scaling alone for complex software improvement tasks (Ma et al., 2024).
6. Implications and Context in Automated Software Engineering
Lingma SWE-GPT 72B demonstrates that open-source, process-modeled LLMs can rival or approach proprietary models in automating software maintenance and evolution tasks. Its design addresses two major challenges:
- Accessibility: by approaching closed-source performance with fully open weights.
- Process understanding: by modeling the iterative, interactive workflows inherent to real development, not only static code artifacts.
A plausible implication is the increasing role of process-centric modeling for complex, tool-mediated domains beyond software engineering. Additionally, the demonstrated performance of smaller process-trained models supports the utility of dataset curation and data-centric approaches over brute-force scaling, particularly in domains with rich, structured task workflows (Ma et al., 2024).