Lingma SWE-GPT 72B: Process-Centric LLM

Updated 7 June 2026
  • Lingma SWE-GPT 72B is a process-centric large language model with 72B parameters designed to automate real-world software maintenance tasks.
  • It integrates static code and dynamic interaction traces, achieving a 30.2% GitHub issue resolution rate and outperforming larger code-only baselines.
  • The model uses a Transformer decoder with added process tokens for iterative reasoning, enhancing fault localization accuracy and patch quality.

Lingma SWE-GPT 72B is an open, process-centric LLM designed for automated software improvement. Built on a 72 billion parameter Transformer backbone, it explicitly models the dynamic, tool-mediated, and iterative workflows characteristic of real-world software maintenance tasks. In benchmark evaluations, it approaches the performance of state-of-the-art closed-source systems in automated GitHub issue resolution, establishing new standards for openness and process fidelity in automated software engineering (Ma et al., 2024).

1. Architectural Characteristics

Lingma SWE-GPT 72B is structured on a standard Transformer decoder stack, employing an autoregressive generative approach. Its architectural backbone is inherited from Qwen2.5-Coder, with the following prototypical configuration:

  • Number of Transformer layers: $L \approx 80$
  • Hidden size: $d_\text{model} \approx 12\,800$
  • Feed-forward inner dimension: $d_{ff} = 4 \times d_\text{model} \approx 51\,200$
  • Number of attention heads: $h \approx 64$; attention head dimension: $d_k = d_\text{model}/h \approx 200$

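The aggregate parameter count is dominated by the self-attention and feed-forward matrices and is commonly approximated as $P \approx 12 \cdot L \cdot d_\text{model}^2$. With the values listed above this evaluates to roughly $1.6 \times 10^{11}$, well above the nominal 72B, so the listed dimensions should be read as loose approximations rather than the exact configuration.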
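The approximation $P \approx 12 \cdot L \cdot d_\text{model}^2$ can be checked numerically. The sketch below evaluates it for the listed values and, purely for comparison, for a narrower hidden size of 8192. That second width is an assumption introduced here and is not stated in the text, and the function name is illustrative.

```python
def approx_params(num_layers: int, d_model: int) -> int:
    """Rough decoder-only parameter count: 12 * L * d_model^2.

    Per layer: 4*d^2 for the Q, K, V and output projections, plus
    8*d^2 for a two-matrix feed-forward block with d_ff = 4*d.
    Embeddings, biases and normalization weights are ignored.
    """
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * (4 * d_model)
    return num_layers * (attention + feed_forward)


# Values as listed in the configuration above.
listed = approx_params(num_layers=80, d_model=12_800)
# Assumed alternative width, used only for comparison.
narrower = approx_params(num_layers=80, d_model=8_192)

print(f"L=80, d_model=12800: {listed / 1e9:.1f}B")
print(f"L=80, d_model=8192:  {narrower / 1e9:.1f}B")
```

Because the estimate grows quadratically in the hidden size, modest changes in $d_\text{model}$ move the total far more than changes in depth.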

The core attention mechanism preserves standard multi-head attention:

  • Single-head: $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}(QK^\top/\sqrt{d_k})V$
  • Multi-head: $\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}_i[\mathrm{head}_i] W^O$, where $\mathrm{head}_i$ uses projections $QW_i^Q$, $KW_i^K$, $VW_i^V$
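The two attention formulas above can be sketched in dependency-free Python. This is a minimal illustration on nested lists, not the model's implementation: it omits the causal mask a decoder applies, and the helper names (`matmul`, `multi_head`) are chosen here for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, one output row per query row."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head.

    Each head attends over its own projections of X; the head outputs
    are concatenated along the feature axis and projected by W_O.
    """
    head_outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)


# Two identical keys receive equal weight, so the output averages the values.
single = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0], [4.0]])

identity = [[1.0, 0.0], [0.0, 1.0]]
multi = multi_head(identity, [(identity, identity, identity)], identity)
```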

Process-centric modifications are architectural overlays: prompt streams are extended with “process tokens” and explicit API-invocation instructions, enabling the model to emit Chain-of-Thought (CoT) reasoning steps and tool calls (e.g., project structure listing, AST-based queries) as output. No lower-level network changes are introduced (Ma et al., 2024).

2. Process-Centric Pretraining and Dataset Design

Training data for Lingma SWE-GPT 72B systematically integrates both static and dynamic elements of software development:

  • Static code: approximately 90,000 merged pull requests from 4,000 prominent GitHub repositories, each representing 1–5 file modifications and excluding test-only patches.
  • Dynamic interaction traces: For every issue/PR pair, the dataset encodes the natural language issue, a hierarchical repository tree, developer CoT (commit messages), and code-level diffs.
  • Synthesized process trajectories: The SWESynInfer pipeline generates synthetic “thought+action+observation” trajectories via a three-stage sequence: Repository Understanding, Fault Localization, and Patch Generation.
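One way to picture a "thought+action+observation" trajectory is as a small record type. The sketch below is illustrative only: the class and field names, the action strings, and the example issue are hypothetical and are not taken from SWESynInfer.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Stage(Enum):
    REPOSITORY_UNDERSTANDING = 1
    FAULT_LOCALIZATION = 2
    PATCH_GENERATION = 3


@dataclass
class Step:
    stage: Stage
    thought: str      # free-text reasoning
    action: str       # tool call or edit proposed by the model
    observation: str  # result returned for that action


@dataclass
class Trajectory:
    issue: str
    steps: List[Step] = field(default_factory=list)

    def stages_in_order(self) -> bool:
        """True if the steps never move back to an earlier stage."""
        values = [s.stage.value for s in self.steps]
        return values == sorted(values)


# Hypothetical three-step trajectory, one step per stage.
example = Trajectory(
    issue="Parser crashes on empty input",
    steps=[
        Step(Stage.REPOSITORY_UNDERSTANDING, "Find the parser module.",
             "list_tree()", "src/parser.py, tests/test_parser.py"),
        Step(Stage.FAULT_LOCALIZATION, "The crash is likely in parse().",
             "search_function('parse')", "src/parser.py: def parse(text)"),
        Step(Stage.PATCH_GENERATION, "Guard against empty text.",
             "edit src/parser.py", "patch applied"),
    ],
)
```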

To ensure quality and process fidelity, data traces are filtered using:

  • Fault-localization similarity: Jaccard index $J(A, B) = |A \cap B| / |A \cup B|$ (threshold $J \geq 0.6$)
  • Patch similarity: normalized CodeBLEU or n-gram overlap $S$ (threshold $S \geq 0.5$)

Only process traces meeting these thresholds are retained via rejection sampling, enforcing high-fidelity inputs for pretraining (Ma et al., 2024).
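The filtering step can be sketched as a pair of similarity checks. In the sketch below, the default thresholds 0.6 and 0.5 are the values the text gives later for the Jaccard and CodeBLEU criteria. The whitespace-token bigram overlap is a simple stand-in for CodeBLEU, used here only to keep the example self-contained, and all function names are illustrative.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ngram_overlap(pred: str, ref: str, n: int = 2) -> float:
    """Fraction of the reference's token n-grams that also occur in pred."""
    def grams(text: str) -> set:
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    ref_grams = grams(ref)
    if not ref_grams:
        return 0.0
    return len(grams(pred) & ref_grams) / len(ref_grams)


def keep_trace(pred_files, gold_files, pred_patch, gold_patch,
               loc_threshold: float = 0.6, patch_threshold: float = 0.5) -> bool:
    """Accept a synthesized trace only if both similarity checks pass."""
    located = jaccard(set(pred_files), set(gold_files)) >= loc_threshold
    patched = ngram_overlap(pred_patch, gold_patch) >= patch_threshold
    return located and patched


accepted = keep_trace(["a.py"], ["a.py"], "x = 1 + 2", "x = 1 + 2")
rejected = keep_trace(["a.py", "b.py", "c.py"], ["a.py"], "x = 1 + 2", "x = 1 + 2")
```

Rejection sampling then amounts to generating candidate traces and discarding those for which `keep_trace` returns `False`.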

3. Instruction Tuning and Curriculum Learning

Fine-tuning leverages SWESynInfer-synthesized data, with each example presented as a sequence of observation and CoT+action pairs. The learning objective maximizes the conditional likelihood:

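$\max_\theta \sum_{t=1}^{T} \log p_\theta(c_t, a_t \mid o_{\leq t}, c_{<t}, a_{<t})$, where $o_t$, $c_t$, and $a_t$ denote the observation, CoT reasoning, and action at step $t$.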
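A conditional-likelihood objective of this kind is usually computed as a negative log-likelihood summed over the tokens the model is asked to produce. The sketch below assumes that observation tokens serve only as conditioning context and are excluded from the loss. That masking choice is an assumption made here, not something the text specifies.

```python
import math
from typing import List, Tuple

# Each token is (probability the model assigned to it, is_model_output).
# is_model_output is True for CoT/action tokens and False for observation
# tokens, which are treated here as context only (an assumption).
Token = Tuple[float, bool]


def negative_log_likelihood(tokens: List[Token]) -> float:
    """Sum of -log p over the tokens the model is trained to produce."""
    return -sum(math.log(p) for p, is_output in tokens if is_output)


sequence = [(0.5, True), (0.1, False), (0.25, True)]
loss = negative_log_likelihood(sequence)
```

Minimizing this quantity is equivalent to maximizing the conditional likelihood of the CoT and action tokens given everything that precedes them.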

A curriculum-based regime spans 90 total iterations, with the initial ten iterations seeded by GPT-4o traces to bootstrap process fidelity. After every ten iterations the training batch is refreshed: examples the model still fails to solve are carried forward, so the curriculum grows progressively more complex. This incremental approach aligns with realistic developer workflows and improves the model's ability to generalize across varied tasks (Ma et al., 2024).
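The batch-refresh schedule can be sketched as a loop over rounds of ten iterations. Everything beyond the 90/10 split is assumed for illustration: the difficulty scores, the number of new examples per round, and the `is_solved` callback are hypothetical, and the source of the seed traces is not modelled.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (example id, difficulty score); both hypothetical


def curriculum_batches(
    pool: List[Example],
    is_solved: Callable[[str, int], bool],
    total_iterations: int = 90,
    refresh_every: int = 10,
    new_per_round: int = 2,
) -> List[List[str]]:
    """Return the example ids trained on in each round of iterations."""
    remaining = sorted(pool, key=lambda ex: ex[1])  # easiest first
    carried: List[Example] = []
    batches: List[List[str]] = []
    for round_index in range(total_iterations // refresh_every):
        fresh, remaining = remaining[:new_per_round], remaining[new_per_round:]
        batch = carried + fresh
        batches.append([ex_id for ex_id, _ in batch])
        # Examples that are still unsolved are carried into the next round.
        carried = [ex for ex in batch if not is_solved(ex[0], round_index)]
    return batches


pool = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
# Pretend example "b" is never solved, so it reappears in every round.
schedule = curriculum_batches(pool, lambda ex_id, _round: ex_id != "b")
```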

The inference procedure for SWE-bench Verified applies pass@3 (three sampled completions per input) with a fixed sampling temperature and a maximum length of 1024 tokens. No additional environment or tool execution occurs at inference; all prompt templates already encode the necessary tool logic.
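A pass@3 score can be illustrated with a simple any-of-k count: an issue is counted as resolved if at least one of its sampled completions resolves it. This is a minimal sketch of that empirical definition, not the benchmark's evaluation harness, and the issue ids and outcomes are made up.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of issues with at least one success among the first k samples."""
    resolved = sum(1 for outcomes in results.values() if any(outcomes[:k]))
    return resolved / len(results)


# Hypothetical outcomes for four issues, three samples each.
outcomes = {
    "issue-1": [False, True, False],
    "issue-2": [False, False, False],
    "issue-3": [True, True, True],
    "issue-4": [False, False, True],
}
score_at_3 = pass_at_k(outcomes, k=3)
score_at_1 = pass_at_k(outcomes, k=1)
```

Because any single success counts, the any-of-k score can never be lower than the single-sample score on the same outcomes.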

4. Benchmark Evaluation and Comparative Metrics

Evaluation uses the SWE-bench Verified benchmark and reports the resolution rate $R = N_\text{resolved} / N_\text{total}$, the fraction of GitHub issues successfully resolved out of $N_\text{total} = 500$.

Model                 Resolution (%)   Comments
Llama 3.1 405B        24.62            Open-source baseline
Lingma SWE-GPT 72B    30.20            +5.58 percentage points over Llama 3.1 405B
GPT-4o                31.80            Closed-source oracle
  • Relative improvement over Llama 3.1 405B: $(30.20 - 24.62)/24.62 \approx 22.7\%$
  • Consistency: Three independent runs yield (30.20%, 29.00%, 30.20%), with mean $\mu = 29.80\%$ and sample standard deviation $\sigma \approx 0.69\%$
  • 95% confidence interval: $29.80\% \pm 1.72\%$ (t-test, $df = 2$)
  • pass@3 performance: 39.80%, exceeding Claude 3.5 Sonnet (35.40%)
  • Smaller variant Lingma SWE-GPT 7B: 18.20% (versus Llama 3.1 70B: 17.20%)
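The summary statistics for the three runs can be recomputed from the reported figures. The sketch below assumes a sample (n − 1) standard deviation and a two-sided Student's t interval with 2 degrees of freedom. The critical value 4.303 is the standard table value for that case, and the variable names are illustrative.

```python
import math
import statistics

runs = [30.20, 29.00, 30.20]  # resolution rates (%) from the three runs

mean = statistics.mean(runs)
sample_std = statistics.stdev(runs)  # n - 1 in the denominator

# Two-sided 95% critical value of Student's t with 2 degrees of freedom.
T_CRIT_DF2 = 4.303
half_width = T_CRIT_DF2 * sample_std / math.sqrt(len(runs))

# Relative improvement of 30.20% over the 24.62% baseline.
relative_gain = (30.20 - 24.62) / 24.62

print(f"mean = {mean:.2f}%, std = {sample_std:.2f}")
print(f"95% CI = {mean:.2f}% +/- {half_width:.2f}")
print(f"relative gain = {relative_gain:.1%}")
```

With only three runs the t critical value is large, so the interval is wide relative to the observed spread.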

These results indicate that process-centric augmentation closes much of the gap to closed-source oracles (GPT-4o: 31.80%) and yields consistent results across independent runs (Ma et al., 2024).

5. Process Fidelity and Algorithmic Innovations

Central to Lingma SWE-GPT 72B’s approach is process-centric training—explicit modeling of developer workflows, including repository tree examination, AST query invocation, and iterative “think–act–observe” loops. This fosters improved fault localization accuracy (exceeding 72% at file level) and patch quality relative to pretraining on static code alone.

The SWESynInfer data synthesis pipeline applies a rigorous filtering algorithm:

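$\mathrm{keep}(\tau) \iff J(F_\tau, F^{*}) \geq 0.6 \ \wedge\ S(p_\tau, p^{*}) \geq 0.5$, where $F_\tau$ and $p_\tau$ are the fault locations and patch produced in trajectory $\tau$, $F^{*}$ and $p^{*}$ are the developer's ground-truth counterparts, $J$ is the Jaccard index, and $S$ is the patch similarity.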

By retaining only samples that meet the Jaccard (≥ 0.6) and CodeBLEU (≥ 0.5) thresholds, training focuses on high-fidelity, process-faithful trajectories.

A significant observation is the outsized impact of process-oriented data relative to model scale: even the 7B-parameter Lingma variant (18.20%) edges out the ten-times-larger Llama 3.1 70B (17.20%), indicating that modeling the process sequence can be more beneficial than parameter scaling alone for complex software improvement tasks (Ma et al., 2024).

6. Implications and Context in Automated Software Engineering

Lingma SWE-GPT 72B demonstrates that open-source, process-modeled LLMs can rival or approach proprietary models in automating software maintenance and evolution tasks. Its design addresses two major challenges:

  • Accessibility: approaching closed-source performance with fully open weights.
  • Process understanding: modeling the iterative, interactive workflows inherent to real development, not only static code artifacts.

A plausible implication is the increasing role of process-centric modeling for complex, tool-mediated domains beyond software engineering. Additionally, the demonstrated performance of smaller process-trained models supports the utility of dataset curation and data-centric approaches over brute-force scaling, particularly in domains with rich, structured task workflows (Ma et al., 2024).
