ProcessTransformer: Predictive Process Monitoring

Updated 5 January 2026
  • ProcessTransformer is a Transformer-based model for predictive business process monitoring, forecasting next activities, event timestamps, and remaining times from event logs.
  • It employs self-attention and positional encoding to capture long-range dependencies, overcoming the limitations of recurrent models and enhancing computational efficiency.
  • Evaluations on real-world datasets demonstrate its state-of-the-art performance with higher accuracy in next-event prediction and remaining time estimation compared to traditional baselines.

ProcessTransformer is a Transformer-based architecture designed for predictive business process monitoring, specifically targeting the tasks of next-activity prediction, next-event timestamp prediction, and remaining-time estimation from event logs. Leveraging self-attention and long-range context modeling, ProcessTransformer overcomes key limitations of recurrent and shallow deep learning approaches, particularly in handling lengthy process traces and capturing dependencies across distant events. ProcessTransformer achieves state-of-the-art results on a collection of real-world process mining benchmarks, demonstrating both superior predictive accuracy and computational efficiency (Bukhsh et al., 2021).

1. Model Architecture

ProcessTransformer adapts the standard Transformer encoder for event log data. The input to the model consists of sequences of process events, each carrying categorical activity labels and timestamp-derived temporal features. The primary architectural elements include embedding layers, sinusoidal positional encodings, stacked self-attention blocks, position-wise feed-forward layers, and task-specific output heads.

Input Representation

  • Activity Label Embedding: Each event's categorical label is represented as a one-hot vector $x \in \{0,1\}^V$ ($V$ is the vocabulary size), mapped to a dense vector $v_{\text{act}} = E^{act\,T} x \in \mathbb{R}^{d_{\text{emb}}}$ using $E^{act} \in \mathbb{R}^{V \times d_{\text{emb}}}$, with $d_{\text{emb}} = 36$.
  • Temporal Features: Three real-valued time deltas are computed for each prefix $\sigma'$, formally $fv_{t1} = \pi_T(e_k) - \pi_T(e_{k-1})$, $fv_{t2} = \pi_T(e_k) - \pi_T(e_{k-2})$, $fv_{t3} = \pi_T(e_k) - \pi_T(e_1)$, concatenated into $f_t \in \mathbb{R}^{d_t}$.
  • Positional Encoding: A sinusoidal encoding $PE \in \mathbb{R}^{L_{\max} \times d_{\text{emb}}}$ is added to $v_{\text{act}}$ to preserve order information, with $PE(pos, 2i) = \sin\!\left(pos/10000^{2i/d_{\text{emb}}}\right)$ and $PE(pos, 2i+1) = \cos\!\left(pos/10000^{2i/d_{\text{emb}}}\right)$ for $i \in [1, d_{\text{emb}}/2]$.
  • Final Input Vector: $x_i = \left[\,E^{act\,T} x_i^{act} + PE(i)\,;\, f_t(i)\,\right] \in \mathbb{R}^{d_{\text{model}}}$ (see the construction sketch after this list).
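
As a concrete illustration, the following sketch shows one way such an input vector could be assembled in PyTorch. The vocabulary size, maximum prefix length, zero-padding of the earliest time deltas, and all names are assumptions for the example, not details taken from the paper.

```python
# Illustrative sketch only (not the authors' implementation): assembling a
# ProcessTransformer-style input from an activity prefix and its timestamps.
import math
import torch
import torch.nn as nn

D_EMB = 36      # activity embedding size reported in the paper
VOCAB = 14      # number of distinct activities (assumed, e.g. Helpdesk)
MAX_LEN = 50    # maximum prefix length (assumed)

activity_embedding = nn.Embedding(VOCAB, D_EMB)  # equivalent to E^act applied to one-hot x

def sinusoidal_pe(max_len: int, d_emb: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_emb)), PE[pos, 2i+1] = cos(...)."""
    pe = torch.zeros(max_len, d_emb)
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_emb, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_emb))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def build_input(activity_ids: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
    """activity_ids: (k,) int64; timestamps: (k,) float seconds. Returns (k, D_EMB + 3)."""
    k = activity_ids.size(0)
    emb = activity_embedding(activity_ids) + sinusoidal_pe(MAX_LEN, D_EMB)[:k]
    # Per-event time deltas mirroring fv_t1..fv_t3; early positions are zero-padded (assumption).
    fv1, fv2 = torch.zeros(k), torch.zeros(k)
    if k > 1:
        fv1[1:] = timestamps[1:] - timestamps[:-1]
    if k > 2:
        fv2[2:] = timestamps[2:] - timestamps[:-2]
    fv3 = timestamps - timestamps[0]
    f_t = torch.stack([fv1, fv2, fv3], dim=-1)        # (k, 3)
    return torch.cat([emb, f_t], dim=-1)              # d_model = D_EMB + 3 in this sketch
```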

Self-Attention and Feed-Forward Layers

The core of ProcessTransformer is the multi-head self-attention mechanism, summarized below (an encoder-block sketch in code follows the list):

  • Projection: For a sequence $X \in \mathbb{R}^{n \times d_{\text{model}}}$, compute queries, keys, and values: $Q = XW^Q$, $K = XW^K$, $V = XW^V$.
  • Scaled Attention: $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.
  • Multi-Head: $\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$.
  • Residuals & Normalization: $Y = \mathrm{LayerNorm}(X + \mathrm{MultiHead}(X,X,X))$, $Z = \mathrm{LayerNorm}(Y + \mathrm{FFN}(Y))$.
  • Feed-Forward (per position): $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$.
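
The following is a minimal PyTorch sketch of one such encoder block; the head count, feed-forward width, and dropout placement are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of one encoder block implementing the equations above
# (hyperparameters are illustrative, not the authors' exact configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # W^Q
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def attention(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        # Project and split into heads: (b, h, n, d_k)
        q = self.w_q(x).view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # scaled dot-product
        out = F.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(b, n, -1)          # concatenate heads
        return self.w_o(out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x + self.drop(self.attention(x)))     # residual + LayerNorm
        return self.norm2(x + self.drop(self.ffn(x)))        # position-wise FFN

# Example: two stacked blocks over inputs of shape (batch, k, 39)
# encoder = nn.Sequential(EncoderBlock(39, 3, 64), EncoderBlock(39, 3, 64))
```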

Output Heads

After global max pooling of the sequence outputs, three task-specific heads are used:

Head                      | Output             | Loss function
--------------------------|--------------------|----------------------------
Next-activity prediction  | Categorical logits | Categorical cross-entropy
Next-event timestamp      | Scalar             | Regression (MSE / Log-cosh)
Remaining-time prediction | Scalar             | Regression (MSE)

Each head is trained independently. In the multi-task setting, joint minimization of $\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{act}} + \lambda_2 \mathcal{L}_{\text{time}} + \lambda_3 \mathcal{L}_{\text{rem}}$ is supported but not employed in the reported experiments.
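
The sketch below shows one plausible way to place these heads on top of the pooled encoder output; the layer sizes and the commented joint objective are assumptions for illustration.

```python
# Sketch of the three task-specific heads over the pooled encoder output
# (layer sizes and the commented joint objective are assumptions).
import torch
import torch.nn as nn

class TaskHeads(nn.Module):
    def __init__(self, d_model: int, n_activities: int):
        super().__init__()
        self.next_activity = nn.Linear(d_model, n_activities)  # categorical logits
        self.next_time = nn.Linear(d_model, 1)                  # scalar regression
        self.remaining_time = nn.Linear(d_model, 1)             # scalar regression

    def forward(self, z: torch.Tensor):
        # z: (batch, seq_len, d_model) encoder output; global max pooling over time.
        pooled = z.max(dim=1).values
        return self.next_activity(pooled), self.next_time(pooled), self.remaining_time(pooled)

# Single-task losses, as in the reported experiments:
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
# Optional joint objective (supported but not used in the reported experiments):
# loss = l1 * ce(logits, y_act) + l2 * mse(t_hat, y_time) + l3 * mse(r_hat, y_rem)
```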

2. Training Procedure

For all tasks, ProcessTransformer is trained independently for 100 epochs with the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$) and a learning rate of $10^{-2}$. Dropout (0.1) is applied after each attention and feed-forward sub-layer to reduce overfitting. Batch size and additional hyperparameters follow standard practice but are not explicitly reported. Training is parallelizable and converges faster than LSTM-based models owing to the non-sequential computation enabled by self-attention.
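
A minimal training-loop sketch for the next-activity head, using the reported optimizer settings, might look as follows; the model, data loader, and padding scheme are placeholders rather than the authors' pipeline.

```python
# Training-loop sketch for the next-activity head with the reported settings:
# Adam (beta1=0.9, beta2=0.999), learning rate 1e-2, 100 epochs. The model and
# DataLoader (padded prefixes, next-activity labels) are placeholders.
import torch

def train_next_activity(model, loader, epochs: int = 100, lr: float = 1e-2, device: str = "cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for prefixes, targets in loader:
            prefixes, targets = prefixes.to(device), targets.to(device)
            optimizer.zero_grad()
            logits = model(prefixes)            # (batch, n_activities)
            loss = criterion(logits, targets)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: mean loss {running / max(len(loader), 1):.4f}")
```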

3. Datasets and Baselines

ProcessTransformer is evaluated on nine real-world event logs from the 4TU Repository, including the following (a log-statistics sketch appears after the list):

  • Helpdesk (Italian help-desk log): 4,580 cases, 14 activities, avg. trace length 4.66
  • BPI Challenge 2012 (loan application): 13,087 cases, 24 activities, avg. trace length 20.0
  • BPI12w / BPI12cw (variants), BPI Challenge 2013 (incident management), BPI Challenge 2020 (declaration processes)
  • Hospital billing: 100,000 cases, 18 activities, avg. trace length 4.5
  • Traffic fines: 150,370 cases, 11 activities, avg. trace length 3.7
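
For orientation, a small sketch of how such log statistics (cases, distinct activities, average trace length) can be computed from a CSV export is given below; the column names "case_id" and "activity" are assumptions about the export format.

```python
# Sketch: computing the log statistics listed above (cases, distinct activities,
# average trace length) from an event log in CSV form. The column names
# "case_id" and "activity" are assumptions about the export format.
import pandas as pd

def log_statistics(path: str) -> dict:
    log = pd.read_csv(path)
    trace_lengths = log.groupby("case_id").size()
    return {
        "cases": log["case_id"].nunique(),
        "activities": log["activity"].nunique(),
        "avg_trace_length": float(trace_lengths.mean()),
    }

# e.g. log_statistics("helpdesk.csv") should roughly reproduce the Helpdesk row above.
```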

Baseline models include a shallow LSTM (Evermann et al.), an LSTM with one-hot encodings (Tax et al.), memory-augmented neural networks (Khan et al.), an LSTM with attributes (Camargo et al.), an inception CNN (Di Mauro et al.), and a standard CNN (Pasquadibisceglie et al.) for next-activity prediction; LSTM and CNN variants serve as baselines for timestamp and remaining-time prediction.

4. Evaluation Metrics and Results

Task performance is assessed using the following metrics (a computation sketch follows the list):

  • Next-activity: Weighted accuracy and weighted $F_1$ (to address class imbalance), $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$.
  • Next-event time / remaining time: Mean Absolute Error (MAE), $\mathrm{MAE} = \frac{1}{N} \sum_i |y_i - \hat{y}_i|$.
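
A straightforward way to compute these metrics with scikit-learn and NumPy is sketched below; the seconds-to-days conversion is an assumption about the unit of the raw targets.

```python
# Sketch of the evaluation metrics: weighted accuracy/F1 for next-activity
# prediction and MAE for the regression tasks. The seconds-to-days conversion
# assumes the raw targets are in seconds.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def next_activity_metrics(y_true, y_pred) -> dict:
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
    }

def mae_days(y_true_seconds, y_pred_seconds) -> float:
    y_true = np.asarray(y_true_seconds, dtype=float)
    y_pred = np.asarray(y_pred_seconds, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)) / 86400.0)  # seconds -> days
```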

Key quantitative results are:

  • Next-activity: an average accuracy of 80.6% across prefixes (range: 62.1%–91.5%), versus 76–84% for the best baselines; on the Helpdesk dataset, 85.6% versus 75–77%.
  • Next-event time: MAE ≈ 1.08 days (best baseline ≈ 1.65 days).
  • Remaining time: MAE ≈ 5.33 days (best baseline ≈ 6.5 days).

A qualitative analysis indicates that self-attention allows the model to capture long-range dependencies without the gradient decay issues associated with recurrent architectures. Attention weights dynamically focus on task-relevant portions of the trace, facilitating accurate predictions even in long or complex cases.

5. Architectural and Methodological Insights

ProcessTransformer replaces recurrence with self-attention, which enables the learning of dependencies among events at any distance in the trace. Sinusoidal positional encoding injects the order information essential for sequential reasoning; omitting it would, in principle, degrade performance by discarding the order structure of the trace. The use of multiple attention heads allows diverse dependency types to be captured in parallel; reducing the number of heads or blocks would plausibly limit the model's capacity to represent complex trace structures.

Parallel computation at both training and inference stages confers efficiency gains over LSTM-based architectures. The model architecture is general and task-agnostic; only the output heads are task-specific, with possible extension to multi-task joint training though this is not evaluated.

6. Context and Related Approaches

ProcessTransformer is situated within the broader domain of process mining, where the objective is to predict ongoing process dynamics from historical event logs for applications such as resource allocation and process optimization. Preceding neural approaches (LSTMs, CNNs, and memory-augmented architectures) were limited in their ability to handle both long-range dependencies and long traces due to vanishing gradients or fixed-size feature extraction.

By leveraging the Transformer paradigm, ProcessTransformer addresses these deficits and experimentally demonstrates superior performance and scalability on standard benchmarks. A plausible implication is that attention-based methods will become the dominant approach for process monitoring tasks that require both deep history integration and high throughput.

No formal ablation studies are presented; however, analogous to other Transformer-based models, order signals (positional encoding) and sufficient self-attention capacity are crucial to maintaining performance.

7. Significance and Prospective Directions

ProcessTransformer substantiates the efficacy of attention-based models for predictive business process monitoring, outperforming prior state-of-the-art on a range of key datasets and tasks. Its modular design enables extension to other event sequence prediction problems where long-range context is relevant. Potential future directions include joint multi-task training, integration of richer temporal or attribute information, and domain adaptation to other sequential process domains. Enhanced interpretability of attention weights and scaling to larger, more heterogeneous logs are further areas for investigation (Bukhsh et al., 2021).

References

  • Bukhsh, Z. A., Saeed, A., & Dijkman, R. M. (2021). ProcessTransformer: Predictive Business Process Monitoring with Transformer Network. arXiv preprint.
