Distributed Seasonal Temporal Pattern Mining
- DSTPM is a distributed framework that efficiently mines frequent seasonal temporal patterns from large time series datasets characterized by periodic or bursty events.
- It introduces novel seasonality-sensitive support measures and memory-efficient data structures to overcome the limitations of traditional frequent pattern mining approaches.
- Empirical evaluations demonstrate 4–5× runtime reductions and near-linear scalability across clusters, highlighting its practical impact on high-volume time series analytics.
Distributed Seasonal Temporal Pattern Mining (DSTPM) is the first distributed framework for mining frequent seasonal temporal patterns (STPs) from massive time series datasets. STPs are temporally ordered patterns characterized by periodic or bursty re-occurrence, as seen across domains such as IoT sensor flows and epidemiological surveillance. Classic frequent pattern mining approaches are unsuitable for STPs, as measures like support and confidence cannot distinguish uniform from seasonal clustering, and anti-monotonicity does not hold. DSTPM introduces new formal definitions, memory-efficient distributed data structures, and theoretically sound pruning routines, providing significant efficiency and scalability gains over sequential baselines (Ho-Long et al., 15 Nov 2025).
1. Formal Framework and Problem Definitions
A time series over a finite alphabet $\Sigma$ with a fixed time granularity is modeled as a symbolic sequence $X = \langle x_1, \dots, x_n \rangle$, where each $x_i \in \Sigma$. Each symbol $\omega \in \Sigma$ forms a temporal event marking the instances where $X$ equals $\omega$ over an interval.
Temporal patterns are formalized using Allen's interval relations (follows, contains, overlaps). A pattern $P$ of length $k$ is a temporally ordered sequence of events $\langle E_1, \dots, E_k \rangle$ together with the Allen relation holding between each pair of events.
Classical support, $\mathrm{supp}(P)$, quantifies occurrence frequency, but fails to distinguish concentrated, periodic (seasonal) occurrences from uniform ones. Additionally, anti-monotonicity (if $P$ is infrequent, then all super-patterns of $P$ are infrequent) does not hold for seasonal counts, since a sub-pattern may have fewer detected seasons than its super-patterns.
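A toy illustration (data and helper names are mine, not the paper's) makes the failure of classical support concrete: two events with identical support can have completely different seasonal structure.

```python
# Hypothetical illustration of why plain support cannot distinguish
# uniform from seasonal (bursty) occurrence patterns.

def supp(occurrences):
    """Classical support: the number of occurrences."""
    return len(occurrences)

def count_seasons(occurrences, max_period, min_density):
    """Split sorted occurrence granules into maximal runs whose inter-gap
    is <= max_period, then keep runs with at least min_density granules."""
    runs, current = [], [occurrences[0]]
    for t in occurrences[1:]:
        if t - current[-1] <= max_period:
            current.append(t)
        else:
            runs.append(current)
            current = [t]
    runs.append(current)
    return sum(1 for r in runs if len(r) >= min_density)

uniform  = list(range(0, 100, 10))               # one hit every 10 granules
seasonal = [0, 1, 2, 3, 4, 50, 51, 52, 53, 54]   # two dense bursts

print(supp(uniform), supp(seasonal))    # 10 10  -- identical support
print(count_seasons(uniform, 2, 3))     # 0 seasons
print(count_seasons(seasonal, 2, 3))    # 2 seasons
```

Both sequences have support 10, yet only the bursty one contains seasons under a gap threshold of 2 and a density threshold of 3.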
DSTPM defines a novel, seasonality-sensitive support:
- The support set $\mathrm{SUP}^P$ is the set of granular timestamps (granules) containing occurrences of $P$.
- A near-support set is a maximal contiguous subsequence of granules in $\mathrm{SUP}^P$ whose inter-granule gap does not exceed a threshold $\mathrm{maxPeriod}$.
- A season is a near-support set with density (number of granules) at least $\mathrm{minDensity}$.
- For $P$ to be frequent seasonal: $\mathrm{seasons}(P) \ge \mathrm{minSeason}$, and inter-season intervals respect $\mathrm{distInterval}$.
To permit effective pruning, DSTPM proposes the anti-monotonic proxy $\mathrm{maxSeason}(P) = \frac{|\mathrm{SUP}^P|}{\mathrm{minDensity}}$, ensuring $\mathrm{maxSeason}(P') \ge \mathrm{maxSeason}(P)$ for $P' \subseteq P$, so one can safely prune $P$ and all its super-patterns if $\mathrm{maxSeason}(P) < \mathrm{minSeason}$.
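A minimal sketch of the proxy (function names are mine; integer floor division is used since season sizes are integral, which only tightens the paper's ratio bound):

```python
# maxSeason upper-bounds seasons(P): every season needs at least
# min_density granules, so |SUP^P| granules can form at most
# |SUP^P| / min_density seasons.

def max_season(support_set, min_density):
    return len(support_set) // min_density  # upper bound on seasons(P)

def can_prune(support_set, min_density, min_season):
    # Safe to discard P and all super-patterns of P: super-patterns
    # have no larger support sets, hence no larger bound.
    return max_season(support_set, min_density) < min_season

sup_p = {3, 4, 5, 17, 18}
print(max_season(sup_p, min_density=3))               # 1
print(can_prune(sup_p, min_density=3, min_season=2))  # True
```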
2. Distributed Architecture and Data Partitioning
DSTPM operates atop Spark or other MapReduce engines over a cluster of worker nodes. The input temporal sequence database is partitioned by time-granule or event-symbol so that each worker stores a disjoint fragment of data. This enables linear scalability as each node processes only its local partition for candidate generation, support calculation, and pattern verification (Ho-Long et al., 15 Nov 2025).
The core distributed data structure is the Distributed Hierarchical Lookup Hash (DHLH), composed for each pattern size $k$ of the following tables:
| Table | Maps From | To |
|---|---|---|
| $EH_k$ | $k$-event group | Instances of $P$ |
| $PH_k$ | Pattern $P$ | Support set $\mathrm{SUP}^P$ |
| $GH_k$ | Granule | Relation-supporting instances |
For $k = 1$, the structure reduces to event ($EH_1$) and granule-instance ($GH_1$) tables. For $k > 1$, three-level indirection supports both efficient candidate assembly and support-set computation.
Each worker maintains only its assigned hash table fragments, reducing overhead and enabling parallel candidate support lookups.
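The partitioning scheme can be sketched as follows (a minimal local simulation with assumed names, not the paper's API): each worker receives only the records whose event symbol hashes to its partition, so a given event's support lookups never cross workers.

```python
from collections import defaultdict

def partition_by_symbol(records, num_workers):
    """records: iterable of (symbol, granule) pairs.
    Returns one hash-table fragment (symbol -> granule list) per worker."""
    parts = [defaultdict(list) for _ in range(num_workers)]
    for symbol, granule in records:
        # All occurrences of a symbol land on exactly one worker.
        parts[hash(symbol) % num_workers][symbol].append(granule)
    return parts  # parts[w] is worker w's local EH_1-style fragment

records = [("A", 1), ("B", 2), ("A", 5), ("C", 7), ("B", 9)]
parts = partition_by_symbol(records, num_workers=2)

# Each of the 3 distinct symbols lives in exactly one fragment.
assert sum(len(p) for p in parts) == 3
```

Partitioning by time-granule works symmetrically, with granule ranges rather than symbols as partition keys.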
3. Core Algorithms and Pruning Strategies
The DSTPM process operates as follows:
```
Function DSTPM(D_SEQ, maxPeriod, minDensity, distInterval, minSeason)
    (EH_1, GH_1) = MineSingleEvents(D_SEQ)
    For k = 2 to k_max
        (EH_k, PH_k, GH_k) = MineKPatterns(EH_{k-1}, EH_1)
    Return all { P | seasons(P) >= minSeason }
End
```
Single-Event Mining: Each record emits its event key; a distributed ReduceByKey operation aggregates all instances and calculates $\mathrm{maxSeason}(\omega)$. Events passing the $\mathrm{maxSeason}$ threshold are stored. Survivors are then post-filtered on the seasonality criteria: a second pass builds near-support sets and filters by density and recurrence.
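A hedged sketch of this pass, simulated locally in plain Python (the grouping step stands in for the distributed ReduceByKey; function names are mine):

```python
from collections import defaultdict

def count_seasons(granules, max_period, min_density):
    """Count maximal runs with inter-gap <= max_period and size >= min_density."""
    seasons, run = 0, 1
    for prev, cur in zip(granules, granules[1:]):
        if cur - prev <= max_period:
            run += 1
        else:
            seasons += run >= min_density
            run = 1
    return seasons + (run >= min_density)

def mine_single_events(records, max_period, min_density, min_season):
    grouped = defaultdict(list)          # simulated distributed ReduceByKey
    for event, granule in records:
        grouped[event].append(granule)
    frequent = {}
    for event, granules in grouped.items():
        granules.sort()
        if len(granules) // min_density < min_season:
            continue                     # maxSeason pruning: cannot qualify
        if count_seasons(granules, max_period, min_density) >= min_season:
            frequent[event] = granules   # passed the exact seasonality check
    return frequent

records = [("hot", t) for t in (1, 2, 3, 20, 21, 22)] + [("rare", 5)]
print(sorted(mine_single_events(records, 2, 3, 2)))  # ['hot']
```

"hot" has two dense bursts and survives; "rare" is pruned by the $\mathrm{maxSeason}$ bound before any season construction.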
Pattern Mining (k > 1):
- Candidates are built via the Cartesian product $EH_{k-1} \times EH_1$, but are only retained if their $\mathrm{maxSeason}$ proxy meets the minimum.
- Pattern assembly requires, for each event group, assembling valid relation-sets by joining size-$(k-1)$ frequent patterns with the new event, relying on the size-2 tables for 2-event relation supports.
- The support set for each candidate is computed as the intersection of all constituent 2-event support sets.
- Pruning occurs as soon as the intermediate $\mathrm{maxSeason}$ bound drops below $\mathrm{minSeason}$, eliminating infeasible super-patterns early.
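The intersection-with-early-termination step above can be sketched as follows (an illustrative reconstruction, not the paper's code; processing the smallest support set first shrinks the accumulator fastest):

```python
def candidate_support(pair_supports, min_density, min_season):
    """pair_supports: sets of granules, one per constituent 2-event relation.
    Returns the candidate's support set, or None if pruned early."""
    acc = None
    for sup in sorted(pair_supports, key=len):    # smallest first
        acc = set(sup) if acc is None else acc & set(sup)
        if len(acc) // min_density < min_season:  # intermediate maxSeason bound
            return None                           # prune: cannot reach minSeason
    return acc

s_ab = {1, 2, 3, 10, 11, 12}   # support of the (A, B) relation
s_bc = {2, 3, 10, 11}          # support of the (B, C) relation
print(sorted(candidate_support([s_ab, s_bc], min_density=2, min_season=2)))
# [2, 3, 10, 11]
```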
Insertion, lookups, and pruning within the DHLH are designed for constant or near-constant amortized time.
4. Theoretical Foundations and Complexity Analysis
Let $n$ be the total number of granules, $|\Sigma|$ the number of unique event symbols, $C_k$ the count of candidate $k$-event groups, $F_k$ the count of frequent $k$-event patterns, $s$ the typical support-set size, and $W$ the number of workers.
- Time Complexity:
  - Single-event mining takes $O(n / W)$ per worker (one scan plus a distributed reduce).
  - For $k$-patterns, each of the $C_k$ candidates requires:
    - Set-intersection: $O(s)$ per constituent support set
    - $O(1)$ lookups in the DHLH
    - Support evaluation: $O(s)$
  - Final complexity across all workers: $O\big(\sum_k C_k \cdot s / W\big)$
- Space Complexity:
  - Per worker: $O\big((n + \sum_k F_k \cdot s) / W\big)$
- Cluster-wide memory is linear in total candidate/support set sizes.
- Anti-Monotonicity (Pruning):
- By using the $\mathrm{maxSeason}$ proxy, downward closure is restored: $P' \subseteq P \implies \mathrm{maxSeason}(P') \ge \mathrm{maxSeason}(P)$, so candidates can be eliminated safely.
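The containment argument behind this bound is short; a one-line derivation in the section's notation (every occurrence of $P$ contains an occurrence of each sub-pattern $P'$, so support sets only grow under sub-patterns):

```latex
P' \subseteq P
\;\Rightarrow\; \mathrm{SUP}^{P} \subseteq \mathrm{SUP}^{P'}
\;\Rightarrow\; |\mathrm{SUP}^{P'}| \ge |\mathrm{SUP}^{P}|
\;\Rightarrow\; \mathrm{maxSeason}(P')
  = \frac{|\mathrm{SUP}^{P'}|}{\mathrm{minDensity}}
  \ge \frac{|\mathrm{SUP}^{P}|}{\mathrm{minDensity}}
  = \mathrm{maxSeason}(P).
```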
5. Empirical Evaluation and Scalability
DSTPM has been tested on varied real-world datasets:
- RE (Renewable Energy): 1,460 granules, 21 sensors, 102 symbols, monthly seasonality.
- SC (Smart City Traffic): granules, 30 streams, 150 symbols, daily/weekly seasonality.
- INF (Influenza): 2,628 daily granules, 6 variables, 32 symbols.
Parameter sweeps covered the seasonality thresholds maxPeriod, minDensity, distInterval, and minSeason.
DSTPM was evaluated against an adapted sequential PS-growth (APS) baseline that mines itemsets and then assembles relations. Metrics included runtime (s), memory (MB), and speedup (APS runtime divided by DSTPM runtime):
| Dataset | DSTPM (time, mem) | APS (time, mem) | Speedup |
|---|---|---|---|
| RE | 1,526 s, 5,500 MB | 6,059 s, 11,595 MB | 3.97× |
| SC | 1,332 s, 3,832 MB | 5,501 s, 8,183 MB | 4.13× |
| INF | 1,114 s, 3,210 MB | 4,754 s, 7,102 MB | 4.27× |
DSTPM demonstrates a 4–5× reduction in runtime and a 2.3× reduction in peak memory on average.
Scalability experiments on synthetic datasets show nearly linear speedup up to at least 15–20 cluster nodes, with 20 partitions effectively utilizing a 16-node cluster and providing a 12× runtime reduction over single-node execution.
6. Significance and Impact
DSTPM resolves central bottlenecks in large-scale seasonal temporal pattern mining by introducing a distributed, partitioned, and hash-based infrastructure, along with a mathematically justified, anti-monotonic, prunable seasonality proxy. This allows previously intractable problem sizes to be handled efficiently, both in memory and computation time. The framework supports flexible deployment on commodity MapReduce platforms and can handle the exponential combinatorial explosion typical in temporal pattern mining as dataset size and event vocabulary grow. The empirical demonstration of nearly linear scaling and significant resource reduction suggests broad applicability for domains reliant on high-bandwidth, seasonality-driven time series analytics (Ho-Long et al., 15 Nov 2025).