ProcessTransformer: Predictive Process Monitoring

Updated 5 January 2026
  • ProcessTransformer is a Transformer-based model for predictive business process monitoring, forecasting next activities, event timestamps, and remaining times from event logs.
  • It employs self-attention and positional encoding to capture long-range dependencies, overcoming the limitations of recurrent models and enhancing computational efficiency.
  • Evaluations on real-world datasets demonstrate its state-of-the-art performance with higher accuracy in next-event prediction and remaining time estimation compared to traditional baselines.

ProcessTransformer is a Transformer-based architecture designed for predictive business process monitoring, specifically targeting the tasks of next-activity prediction, next-event timestamp prediction, and remaining-time estimation from event logs. Leveraging self-attention and long-range context modeling, ProcessTransformer overcomes key limitations of recurrent and shallow deep learning approaches, particularly in handling lengthy process traces and capturing dependencies across distant events. ProcessTransformer achieves state-of-the-art results on a collection of real-world process mining benchmarks, demonstrating both superior predictive accuracy and computational efficiency (Bukhsh et al., 2021).

1. Model Architecture

ProcessTransformer adapts the standard Transformer encoder for event log data. The input to the model consists of sequences of process events, each carrying categorical activity labels and timestamp-derived temporal features. The primary architectural elements include embedding layers, sinusoidal positional encodings, stacked self-attention blocks, position-wise feed-forward layers, and task-specific output heads.

Input Representation

  • Activity Label Embedding: Each event's categorical label is represented as a one-hot vector $x \in \{0,1\}^V$ ($V$ is the vocabulary size), mapped to a dense vector $v_{\text{act}} = E^{act\,T} x \in \mathbb{R}^{d_{\text{emb}}}$ using $E^{act} \in \mathbb{R}^{V \times d_{\text{emb}}}$, with $d_{\text{emb}} = 36$.
  • Temporal Features: Three real-valued time deltas are computed for each prefix $\sigma'$, formally $fv_{t1} = \pi_T(e_k) - \pi_T(e_{k-1})$, $fv_{t2} = \pi_T(e_k) - \pi_T(e_{k-2})$, $fv_{t3} = \pi_T(e_k) - \pi_T(e_1)$, concatenated into $f_t \in \mathbb{R}^{d_t}$.
  • Positional Encoding: A sinusoidal encoding $PE \in \mathbb{R}^{L_{\max} \times d_{\text{emb}}}$ is added to $v_{\text{act}}$ to preserve order information, with $PE(pos, 2i) = \sin\!\left(pos/10000^{2i/d_{\text{emb}}}\right)$ and $PE(pos, 2i+1) = \cos\!\left(pos/10000^{2i/d_{\text{emb}}}\right)$ for $i \in [1, d_{\text{emb}}/2]$.
  • Final Input Vector: $x_i = \left[\,E^{act\,T} x_i^{act} + PE(i)\,;\, f_t(i)\,\right] \in \mathbb{R}^{d_{\text{model}}}$ (see the construction sketch after this list).
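
As a concrete illustration, the following sketch shows one way such an input vector could be assembled in PyTorch. The vocabulary size, maximum prefix length, zero-padding of the earliest time deltas, and all names are assumptions for the example, not details taken from the paper.

```python
# Illustrative sketch only (not the authors' implementation): assembling a
# ProcessTransformer-style input from an activity prefix and its timestamps.
import math
import torch
import torch.nn as nn

D_EMB = 36      # activity embedding size reported in the paper
VOCAB = 14      # number of distinct activities (assumed, e.g. Helpdesk)
MAX_LEN = 50    # maximum prefix length (assumed)

activity_embedding = nn.Embedding(VOCAB, D_EMB)  # equivalent to E^act applied to one-hot x

def sinusoidal_pe(max_len: int, d_emb: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_emb)), PE[pos, 2i+1] = cos(...)."""
    pe = torch.zeros(max_len, d_emb)
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_emb, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_emb))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def build_input(activity_ids: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
    """activity_ids: (k,) int64; timestamps: (k,) float seconds. Returns (k, D_EMB + 3)."""
    k = activity_ids.size(0)
    emb = activity_embedding(activity_ids) + sinusoidal_pe(MAX_LEN, D_EMB)[:k]
    # Per-event time deltas mirroring fv_t1..fv_t3; early positions are zero-padded (assumption).
    fv1, fv2 = torch.zeros(k), torch.zeros(k)
    if k > 1:
        fv1[1:] = timestamps[1:] - timestamps[:-1]
    if k > 2:
        fv2[2:] = timestamps[2:] - timestamps[:-2]
    fv3 = timestamps - timestamps[0]
    f_t = torch.stack([fv1, fv2, fv3], dim=-1)        # (k, 3)
    return torch.cat([emb, f_t], dim=-1)              # d_model = D_EMB + 3 in this sketch
```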

Self-Attention and Feed-Forward Layers

The core of ProcessTransformer is the multi-head self-attention mechanism, summarized below (an encoder-block sketch in code follows the list):

  • Projection: For a sequence $X \in \mathbb{R}^{n \times d_{\text{model}}}$, compute queries, keys, and values: $Q = XW^Q$, $K = XW^K$, $V = XW^V$.
  • Scaled Attention: $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.
  • Multi-Head: $\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$.
  • Residuals & Normalization: $Y = \mathrm{LayerNorm}(X + \mathrm{MultiHead}(X,X,X))$, $Z = \mathrm{LayerNorm}(Y + \mathrm{FFN}(Y))$.
  • Feed-Forward (per position): $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$.
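
The following is a minimal PyTorch sketch of one such encoder block; the head count, feed-forward width, and dropout placement are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of one encoder block implementing the equations above
# (hyperparameters are illustrative, not the authors' exact configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # W^Q
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def attention(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        # Project and split into heads: (b, h, n, d_k)
        q = self.w_q(x).view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # scaled dot-product
        out = F.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(b, n, -1)          # concatenate heads
        return self.w_o(out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x + self.drop(self.attention(x)))     # residual + LayerNorm
        return self.norm2(x + self.drop(self.ffn(x)))        # position-wise FFN

# Example: two stacked blocks over inputs of shape (batch, k, 39)
# encoder = nn.Sequential(EncoderBlock(39, 3, 64), EncoderBlock(39, 3, 64))
```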

Output Heads

After global max pooling of the sequence outputs, three task-specific heads are used:

Head                      | Output             | Loss function
--------------------------|--------------------|----------------------------
Next-activity prediction  | Categorical logits | Categorical cross-entropy
Next-event timestamp      | Scalar             | Regression (MSE / Log-cosh)
Remaining-time prediction | Scalar             | Regression (MSE)

Each head is trained independently. In the multi-task setting, joint minimization of $\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{act}} + \lambda_2 \mathcal{L}_{\text{time}} + \lambda_3 \mathcal{L}_{\text{rem}}$ is supported but not employed in the reported experiments.
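
The sketch below shows one plausible way to place these heads on top of the pooled encoder output; the layer sizes and the commented joint objective are assumptions for illustration.

```python
# Sketch of the three task-specific heads over the pooled encoder output
# (layer sizes and the commented joint objective are assumptions).
import torch
import torch.nn as nn

class TaskHeads(nn.Module):
    def __init__(self, d_model: int, n_activities: int):
        super().__init__()
        self.next_activity = nn.Linear(d_model, n_activities)  # categorical logits
        self.next_time = nn.Linear(d_model, 1)                  # scalar regression
        self.remaining_time = nn.Linear(d_model, 1)             # scalar regression

    def forward(self, z: torch.Tensor):
        # z: (batch, seq_len, d_model) encoder output; global max pooling over time.
        pooled = z.max(dim=1).values
        return self.next_activity(pooled), self.next_time(pooled), self.remaining_time(pooled)

# Single-task losses, as in the reported experiments:
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
# Optional joint objective (supported but not used in the reported experiments):
# loss = l1 * ce(logits, y_act) + l2 * mse(t_hat, y_time) + l3 * mse(r_hat, y_rem)
```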

2. Training Procedure

For all tasks, ProcessTransformer is trained independently for 100 epochs with the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$) and a learning rate of $10^{-2}$. Dropout (0.1) is applied after each attention and feed-forward sub-layer to reduce overfitting. Batch size and additional hyperparameters follow standard practice but are not explicitly reported. Training is parallelizable and converges faster than LSTM-based models owing to the non-sequential computation enabled by self-attention.
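
A minimal training-loop sketch for the next-activity head, using the reported optimizer settings, might look as follows; the model, data loader, and padding scheme are placeholders rather than the authors' pipeline.

```python
# Training-loop sketch for the next-activity head with the reported settings:
# Adam (beta1=0.9, beta2=0.999), learning rate 1e-2, 100 epochs. The model and
# DataLoader (padded prefixes, next-activity labels) are placeholders.
import torch

def train_next_activity(model, loader, epochs: int = 100, lr: float = 1e-2, device: str = "cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for prefixes, targets in loader:
            prefixes, targets = prefixes.to(device), targets.to(device)
            optimizer.zero_grad()
            logits = model(prefixes)            # (batch, n_activities)
            loss = criterion(logits, targets)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: mean loss {running / max(len(loader), 1):.4f}")
```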

3. Datasets and Baselines

ProcessTransformer is evaluated on nine real-world event logs from the 4TU Repository, including the following (a log-statistics sketch appears after the list):

  • Helpdesk (Italian help-desk log): 4,580 cases, 14 activities, avg. trace length 4.66
  • BPI Challenge 2012 (loan application): 13,087 cases, 24 activities, avg. trace length 20.0
  • BPI12w / BPI12cw (variants), BPI Challenge 2013 (incident management), BPI Challenge 2020 (declaration processes)
  • Hospital billing: 100,000 cases, 18 activities, avg. trace length 4.5
  • Traffic fines: 150,370 cases, 11 activities, avg. trace length 3.7
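
For orientation, a small sketch of how such log statistics (cases, distinct activities, average trace length) can be computed from a CSV export is given below; the column names "case_id" and "activity" are assumptions about the export format.

```python
# Sketch: computing the log statistics listed above (cases, distinct activities,
# average trace length) from an event log in CSV form. The column names
# "case_id" and "activity" are assumptions about the export format.
import pandas as pd

def log_statistics(path: str) -> dict:
    log = pd.read_csv(path)
    trace_lengths = log.groupby("case_id").size()
    return {
        "cases": log["case_id"].nunique(),
        "activities": log["activity"].nunique(),
        "avg_trace_length": float(trace_lengths.mean()),
    }

# e.g. log_statistics("helpdesk.csv") should roughly reproduce the Helpdesk row above.
```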

Baseline models include a shallow LSTM (Evermann et al.), an LSTM with one-hot encodings (Tax et al.), memory-augmented neural networks (Khan et al.), an LSTM with attributes (Camargo et al.), an inception CNN (Di Mauro et al.), and a standard CNN (Pasquadibisceglie et al.) for next-activity prediction; LSTM and CNN variants serve as baselines for timestamp and remaining-time prediction.

4. Evaluation Metrics and Results

Task performance is assessed using the following metrics (a computation sketch follows the list):

  • Next-activity: Weighted accuracy and weighted $F_1$ (to address class imbalance), $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$.
  • Next-event time / remaining time: Mean Absolute Error (MAE), $\mathrm{MAE} = \frac{1}{N} \sum_i |y_i - \hat{y}_i|$.
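
A straightforward way to compute these metrics with scikit-learn and NumPy is sketched below; the seconds-to-days conversion is an assumption about the unit of the raw targets.

```python
# Sketch of the evaluation metrics: weighted accuracy/F1 for next-activity
# prediction and MAE for the regression tasks. The seconds-to-days conversion
# assumes the raw targets are in seconds.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def next_activity_metrics(y_true, y_pred) -> dict:
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
    }

def mae_days(y_true_seconds, y_pred_seconds) -> float:
    y_true = np.asarray(y_true_seconds, dtype=float)
    y_pred = np.asarray(y_pred_seconds, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)) / 86400.0)  # seconds -> days
```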

Key quantitative results are:

  • Next-activity: an average accuracy of 80.6% across prefixes (range: 62.1%–91.5%), versus 76–84% for the best baselines; on the Helpdesk dataset, 85.6% versus 75–77%.
  • Next-event time: MAE ≈ 1.08 days (best baseline ≈ 1.65 days).
  • Remaining time: MAE ≈ 5.33 days (best baseline ≈ 6.5 days).

A qualitative analysis indicates that self-attention allows the model to capture long-range dependencies without the gradient decay issues associated with recurrent architectures. Attention weights dynamically focus on task-relevant portions of the trace, facilitating accurate predictions even in long or complex cases.

5. Architectural and Methodological Insights

ProcessTransformer replaces recurrence with self-attention, which enables the learning of dependencies among events at any distance in the trace. Sinusoidal positional encoding injects the order information essential for sequential reasoning; omitting it would, in principle, degrade performance by discarding the order structure of the trace. The use of multiple attention heads allows diverse dependency types to be captured in parallel; reducing the number of heads or blocks would plausibly limit the model's capacity to represent complex trace structures.

Parallel computation at both training and inference stages confers efficiency gains over LSTM-based architectures. The model architecture is general and task-agnostic; only the output heads are task-specific, with possible extension to multi-task joint training though this is not evaluated.

6. Context and Related Approaches

ProcessTransformer is situated within the broader domain of process mining, where the objective is to predict ongoing process dynamics from historical event logs for applications such as resource allocation and process optimization. Preceding neural approaches (LSTMs, CNNs, and memory-augmented architectures) were limited in their ability to handle both long-range dependencies and long traces due to vanishing gradients or fixed-size feature extraction.

By leveraging the Transformer paradigm, ProcessTransformer addresses these deficits and experimentally demonstrates superior performance and scalability on standard benchmarks. A plausible implication is that attention-based methods will become the dominant approach for process monitoring tasks that require both deep history integration and high throughput.

No formal ablation studies are presented; however, analogous to other Transformer-based models, order signals (positional encoding) and sufficient self-attention capacity are crucial to maintaining performance.

7. Significance and Prospective Directions

ProcessTransformer substantiates the efficacy of attention-based models for predictive business process monitoring, outperforming prior state-of-the-art on a range of key datasets and tasks. Its modular design enables extension to other event sequence prediction problems where long-range context is relevant. Potential future directions include joint multi-task training, integration of richer temporal or attribute information, and domain adaptation to other sequential process domains. Enhanced interpretability of attention weights and scaling to larger, more heterogeneous logs are further areas for investigation (Bukhsh et al., 2021).

References

  • Bukhsh, Z. A., Saeed, A., & Dijkman, R. M. (2021). ProcessTransformer: Predictive Business Process Monitoring with Transformer Network. arXiv preprint.
