Starjob Dataset for Job Shop Scheduling

Updated 3 September 2025
  • Starjob is a large-scale dataset of 130,000 instances designed for LLM-driven job shop scheduling, converting canonical matrix-based JSSP tasks into richly described natural language scenarios.
  • Each instance encodes the scheduling task in natural language, including explicit operation sequences and makespan calculations that support detailed arithmetic reasoning.
  • Fine-tuned on Starjob, LLMs outperform traditional heuristics and the prior neural method L2D, achieving improvements of up to 15.36 percentage points on benchmark datasets.

The Starjob dataset is the first large-scale, supervised natural language dataset constructed specifically for LLM-driven job shop scheduling problem (JSSP) research. Comprising 130,000 meticulously formatted instances, Starjob translates canonical matrix-based JSSP tasks into richly described natural language scenarios, facilitating direct end-to-end training and evaluation of LLMs on this classic combinatorial optimization problem. By pairing each problem instance with an explicit natural language solution, including a sequence of scheduled operations, start and end times, and an arithmetic makespan summation, Starjob enables LLMs to perform complex scheduling reasoning and explicit arithmetic with human-interpretable outputs. This approach departs from traditional combinatorial optimization datasets by emphasizing the integration of linguistic reasoning with structured optimization targets.

1. Dataset Design and Structure

Starjob consists of 130,000 JSSP instances, each representing a discrete scheduling task characterized by a set of jobs, machines, and sequential operations with specified processing times. The dataset covers a broad spectrum of problem sizes, ranging from small examples (e.g., $2 \times 2$ and $3 \times 2$) to large, industrially relevant cases (up to $20 \times 20$), and includes additional larger or asymmetric configurations (e.g., $30 \times 15$, $50 \times 20$) to support compositional generalization.

Each data instance encodes the problem in natural language, specifying:

  • The number of jobs ($J$) and machines ($M$)
  • For each job, an ordered list of operations, each denoted by a machine identifier (e.g., "M0", "M1") and a processing time (e.g., "105", "29")
  • The global objective as a natural language statement ("minimize makespan")

The solution component provides a complete, natural language-encoded schedule with:

  • The ordered assignment of jobs to machines
  • The start and end times for each operation
  • A makespan calculation expressed in summation format (e.g., "J2-M0: 0+78 -> 78, …, Maximum end completion time or Makespan: 488") to guide the LLM in performing chained arithmetic

This dual representation simultaneously demands and supports both high-level reasoning about sequencing and low-level numeric computation.
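
To make the format concrete, the sketch below serializes a tiny 2×2 instance into a Starjob-style problem description and emits summation-style schedule lines in the spirit of the example above. The helper names and exact phrasing are illustrative assumptions, not the dataset's verbatim template.

```python
# Illustrative sketch of a Starjob-style encoding. The phrasing and helper names
# are assumptions for demonstration; the dataset's exact template may differ.

def encode_problem(jobs):
    """jobs: one ordered list of (machine_id, processing_time) pairs per job."""
    n_jobs = len(jobs)
    n_machines = len({m for ops in jobs for m, _ in ops})
    lines = [
        f"There are {n_jobs} jobs and {n_machines} machines.",
        "Objective: minimize makespan.",
    ]
    for j, ops in enumerate(jobs):
        op_text = ", then ".join(f"M{m} for {t} time units" for m, t in ops)
        lines.append(f"Job J{j} runs on {op_text}.")
    return "\n".join(lines)

def encode_solution(schedule):
    """schedule: (job, machine, start, duration) tuples in execution order."""
    lines, makespan = [], 0
    for j, m, start, dur in schedule:
        end = start + dur
        makespan = max(makespan, end)
        lines.append(f"J{j}-M{m}: {start}+{dur} -> {end}")
    lines.append(f"Maximum end completion time or Makespan: {makespan}")
    return "\n".join(lines)

# A tiny 2x2 instance: each job is an ordered list of (machine, processing time).
jobs = [
    [(0, 78), (1, 30)],  # Job 0: M0 for 78, then M1 for 30
    [(1, 20), (0, 55)],  # Job 1: M1 for 20, then M0 for 55
]

# One feasible (not necessarily optimal) schedule for the instance above.
schedule = [(0, 0, 0, 78), (1, 1, 0, 20), (0, 1, 78, 30), (1, 0, 78, 55)]

print(encode_problem(jobs))
print(encode_solution(schedule))
```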

2. Fine-tuning Methodology for LLMs

The primary model utilized in the Starjob research is Meta’s LLaMA-3.1 8B, quantized to 4 bits for memory efficiency. The model undergoes supervised fine-tuning on the problem-solution pairs, with a focus on transfer and adaptation to the combinatorial scheduling domain.

Fine-tuning employs Rank-Stabilized Low-Rank Adaptation (rsLoRA), a modification of standard LoRA that facilitates higher-rank adaptation without loss of training stability. The scaling factor is given by $\gamma_r = \alpha / \sqrt{r}$, where $\alpha$ is a fixed scaling coefficient and $r$ is the adapter rank. In practice, $r = 64$ and $\alpha = 64$ yield $\gamma_{64} = 64/\sqrt{64} = 8$. This modification targets variance stability as the rank increases:

  • Only the adapter parameters (matrices $U$, $V$ in $\Delta\phi = U V^\top$) are updated; the main model remains frozen.
  • The objective is the standard sequence-level negative log-likelihood (NLL):

$$\mathcal{L}_{\mathrm{NLL}}(\theta; L_p, s) = -\sum_t \log p\left(w_t \mid w_{<t}, L_p; \theta\right)$$

where $w_t$ denotes each token in the concatenated problem ($L_p$) and solution ($s$) sequence.
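
In code, this objective is the standard next-token cross-entropy summed over the concatenated sequence; the sketch below is a minimal illustration, not the paper's training script.

```python
import torch
import torch.nn.functional as F

def sequence_nll(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Sequence-level NLL: sum of -log p(w_t | w_<t) over the concatenated
    problem + solution tokens, matching the objective above."""
    # Shift so that the logits at position t predict the token at position t+1
    # (causal next-token prediction).
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="sum",
    )

# Usage with a causal LM, where input_ids holds the tokenized problem + solution:
#   outputs = model(input_ids)
#   loss = sequence_nll(outputs.logits, input_ids)
```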

Resource usage during fine-tuning is modest: roughly 70 hours of training in total, using approximately 30 GB of GPU memory on a single NVIDIA A6000.
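
A comparable setup can be sketched with the Hugging Face transformers, bitsandbytes, and peft libraries, as below; the checkpoint identifier, target modules, and everything beyond $r = 64$ and $\alpha = 64$ are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of 4-bit quantized fine-tuning with rank-stabilized LoRA (rsLoRA),
# assuming the Hugging Face transformers/peft/bitsandbytes stack. Hyperparameters
# other than r=64 and alpha=64 are illustrative, not the paper's exact recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.1-8B"  # assumed checkpoint identifier

# 4-bit quantization of the base model for memory efficiency.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

# rsLoRA: use_rslora=True switches the adapter scaling from alpha/r to alpha/sqrt(r),
# so r=64 and alpha=64 give gamma_64 = 8, as described above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    use_rslora=True,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed target set
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # base weights stay frozen; only adapters train

# Training then minimizes the token-level NLL over the concatenated problem + solution
# text, e.g. with transformers.Trainer or an SFT-style trainer.
```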

3. Evaluation Metrics and Benchmarks

Performance is evaluated using the Percentage Gap (PG):

$$PG = 100 \times \frac{M_{alg} - M_{ub}}{M_{ub}}$$

where $M_{alg}$ is the makespan produced by the algorithm, and $M_{ub}$ is the best known or upper-bound makespan for the instance. Lower PG indicates higher solution quality.
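
In code, the metric is a one-liner; the sketch below uses hypothetical makespan values for illustration.

```python
def percentage_gap(makespan_alg: float, makespan_ub: float) -> float:
    """Percentage Gap (PG): relative excess of the produced makespan over the
    best known / upper-bound makespan, in percent. Lower is better."""
    return 100.0 * (makespan_alg - makespan_ub) / makespan_ub

# Example with hypothetical values: a schedule with makespan 1350 against a
# best known makespan of 1231 gives a PG of roughly 9.67%.
print(percentage_gap(1350, 1231))
```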

The LLM-based system is compared against classic Priority Dispatching Rules (PDRs), including Shortest Processing Time (SPT), Most Work Remaining (MWKR), Most Operations Remaining (MOPNR), and the ratio of Flow Due Date to Work Remaining (FDD/WKR), as well as the prior neural method L2D.

4. Empirical Results and Performance Gains

Evaluations on the Taillard (Tai) and DMU benchmarks show consistent, substantial performance gains for LLMs fine-tuned on Starjob:

  • On Taillard’s JSSP test set, the Starjob LLM approach achieves an average PG of 21.69%, compared to L2D’s 29.54%, an improvement of 7.85 percentage points.
  • On the DMU benchmark, the LLM method achieves 22.14% versus 37.50% for L2D, corresponding to a 15.36 percentage point improvement.

These gains hold across multiple configurations (see the detailed tables “average_gap_tai” and “average_gap_dmu” in the original report), with the Starjob-trained LLM consistently outperforming both PDRs and the state-of-the-art neural scheduler. The method also demonstrates robust generalization across problem sizes and configurations.

Benchmark   LLM (Starjob) PG   L2D PG   Improvement (percentage points)
Taillard    21.69%             29.54%   7.85
DMU         22.14%             37.50%   15.36

5. Scientific and Methodological Implications

Starjob establishes natural language as a viable modality for encoding and solving complex NP-hard scheduling problems using LLMs. It demonstrates that:

  • LLMs, when fine-tuned on structured optimization tasks recast as natural language, can learn feasible sequencing, constraint satisfaction, and arithmetic, bridging discrete optimization and human-interpretable output.
  • The summation-based solution format enables models to natively handle multistep arithmetic, reducing typical hallucination and consistency issues observed in earlier generative solvers.
  • The integration of problem and solution in tokenized form, combined with LoRA-based adaptation, enables efficient training even on large, quantized models using moderate compute resources.

This work marks a departure from purely algorithmic or neural combinatorial optimization methods, introducing transparency, flexibility, and potential for user interaction in scheduling tasks. The approach facilitates explainability and opens the possibility for user-in-the-loop improvements or interactive constraint negotiation.

6. Future Directions and Extensions

Several pathways for future research are highlighted:

  • Exploration of enhanced sampling methods, such as Monte Carlo-based techniques, to improve solution diversity and reduce mode collapse in LLM-generated schedules.
  • Investigation of alternative or larger LLM architectures for improved scalability and complex problem handling.
  • Potential integration with reinforcement learning or Graph Neural Network (GNN) technologies to further refine policy and incorporate explicit graph-structural priors.
  • Systematic use of prompt engineering and arithmetic reasoning strategies to further mitigate output inconsistencies and hallucinations.

A plausible implication is that such approaches may generalize to other combinatorial optimization domains—extending LLM-driven optimization from JSSP toward a broader class of NP-hard problems where structure and language co-occur.

7. Significance and Outlook

Starjob represents a pioneering step in leveraging natural language and LLM reasoning for industrial-scale combinatorial optimization. Its contributions include a large, diverse, and precisely formatted dataset, an efficient fine-tuning framework, explicit arithmetic-friendly solutions, and demonstrated performance exceeding both traditional and neural scheduling methods. The dataset and methodology set a foundation for subsequent exploration of language-based approaches to scheduling, interactive optimization, and AI system transparency within highly structured operational research contexts (Abgaryan et al., 26 Feb 2025).
