MIDAS: Microcluster Anomaly Detection
- MIDAS is a microcluster-based anomaly detection method that uses count–min sketches to monitor massive, evolving graph edge streams.
- It applies an online two-bin chi-squared test and temporal decay to compute anomaly scores with provable upper bounds on false positives.
- Empirical results on datasets like DARPA and TwitterWorldCup demonstrate up to 644× speedup and improved detection accuracy (AUC up to 0.95).
MIDAS
MIDAS is an acronym employed across diverse disciplines to name distinct tools, models, and experiments in data science, machine learning, robotics, astronomy, computer vision, and sensing. Below, major MIDAS variants are organized as independent research contributions, each defined by its core technical achievements, methodologies, and empirical impacts, as found in primary arXiv sources.
1. Microcluster-Based Anomaly Detection in Edge Streams
MIDAS refers to the “Microcluster-Based Detector of Anomalies in Edge Streams”—an online algorithm for detecting anomalous groups of edges (microclusters) in massive, temporally evolving graphs. Unlike methods that solely identify individually rare edges, MIDAS focuses on sudden bursts of similar edge activity, as in lockstep or denial-of-service attacks (Bhatia et al., 2019).
Problem Definition and Microcluster Anomalies
A microcluster is defined as a “suddenly arriving group of suspiciously similar edges,” for example, a burst of the same or related node pairs within a short time tick. The detection objective is not the surprise of a single edge, but the collective deviation from historical behavior: for each edge $(u, v)$ (or around node $u$ or $v$), MIDAS tests whether the edge count at the current tick greatly exceeds the expectation under the historical per-tick rate.
Algorithmic Structure
- Data Structures: MIDAS maintains two Count–Min Sketches (CMS): one for the total count $s_{uv}$ of each edge over all ticks, and one for its current-tick count $a_{uv}$ at time $t$. Each CMS update and query takes $O(1)$ time per edge, and because the sketch width and depth are fixed, memory use stays constant regardless of stream length.
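- Online Anomaly Score: The system executes a streaming, two-bin chi-squared test. At time $t$, for edge $(u, v)$, the expected count is $\hat{s}_{uv}/t$. The score is
$$\text{score}(u, v, t) = \left(\hat{a}_{uv} - \frac{\hat{s}_{uv}}{t}\right)^2 \frac{t^2}{\hat{s}_{uv}\,(t - 1)}$$
where $\hat{a}_{uv}$ and $\hat{s}_{uv}$ are the CMS estimates of the edge's current-tick count and total count, respectively.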
- Extension – MIDAS-R: Adds temporal decay ($\alpha$-discounting of counts between ticks) and maintains per-node CMSs and scores to capture spatial relations, assigning to each edge the maximum anomaly score over the edge itself and its two endpoint nodes.
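The basic (non-decayed) scoring loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class names `CountMinSketch` and `Midas`, the default width and depth, and the salted use of Python's built-in `hash` are all assumptions made for the example. Only the score formula $(a - s/t)^2\, t^2 / (s\,(t-1))$ comes from Bhatia et al. MIDAS-R's decay and per-node sketches are not covered here.

```python
import random


class CountMinSketch:
    """Fixed-size approximate counter: `depth` rows of `width` buckets."""

    def __init__(self, width=1024, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0.0] * width for _ in range(depth)]

    def _buckets(self, key):
        return [hash((salt, key)) % self.width for salt in self.salts]

    def add(self, key, amount=1.0):
        for row, b in zip(self.rows, self._buckets(key)):
            row[b] += amount

    def estimate(self, key):
        # Collisions can only inflate a bucket, so take the minimum over rows.
        return min(row[b] for row, b in zip(self.rows, self._buckets(key)))

    def clear(self):
        for row in self.rows:
            for i in range(self.width):
                row[i] = 0.0


class Midas:
    """Hypothetical wrapper: two sketches plus the two-bin chi-squared score."""

    def __init__(self, width=1024, depth=4):
        self.total = CountMinSketch(width, depth)    # s_uv: all ticks so far
        self.current = CountMinSketch(width, depth)  # a_uv: current tick only
        self.tick = 1

    def score(self, u, v, t):
        """Process edge (u, v) arriving at tick t >= 1 and return its score."""
        if t > self.tick:
            self.current.clear()  # a new tick starts with empty current counts
            self.tick = t
        self.current.add((u, v))
        self.total.add((u, v))
        a = self.current.estimate((u, v))
        s = self.total.estimate((u, v))
        if t == 1 or s == 0:
            return 0.0  # no history yet, so the test is undefined
        return (a - s / t) ** 2 * t ** 2 / (s * (t - 1))


# Usage: one edge per tick for nine ticks, then a burst of ten at tick 10.
detector = Midas()
scores = [detector.score(1, 2, t) for t in range(1, 10)]
burst_scores = [detector.score(1, 2, 10) for _ in range(10)]
burst_score = burst_scores[-1]
```

During the steady phase the current-tick count equals the historical mean, so every score is zero. By the last edge of the burst the current count is 10 against an expected 19/10, and the score rises to 6561/171, roughly 38.4.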
Theoretical Guarantees
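MIDAS provides a provable upper bound $\epsilon$ on the probability of false positives. For the bias-corrected count $\tilde{a}_{uv}$, the adjusted test statistic $\tilde{\chi}^2$ satisfies
$$P\left(\tilde{\chi}^2 > \chi^2_{1-\epsilon/2}(1)\right) < \epsilon$$
where $\chi^2_{1-\epsilon/2}(1)$ is the $1 - \epsilon/2$ quantile of the chi-squared distribution with one degree of freedom.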
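As a rough illustration of how such a bound translates into a decision threshold, the sketch below computes a chi-squared quantile using only the Python standard library. It assumes the threshold is the $1 - \epsilon/2$ quantile of a chi-squared distribution with one degree of freedom, which is how the bound is stated in Bhatia et al. The function names `chi2_1dof_quantile` and `midas_threshold` are hypothetical, and the guarantee itself applies to the adjusted statistic, not to the raw score.

```python
from statistics import NormalDist


def chi2_1dof_quantile(p):
    """Quantile of chi-squared with 1 degree of freedom, using X = Z**2."""
    # P(Z**2 <= q) = p  is equivalent to  Phi(sqrt(q)) = (1 + p) / 2.
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return z * z


def midas_threshold(epsilon):
    """Assumed threshold: the (1 - epsilon/2) quantile with 1 dof."""
    return chi2_1dof_quantile(1.0 - epsilon / 2.0)


# Usage: a target false positive probability of 0.1.
threshold = midas_threshold(0.1)
```

For $\epsilon = 0.1$ this is the 0.95 quantile, approximately 3.84, so under these assumptions an adjusted statistic above that value would be flagged.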
Empirical Results
Evaluations on real network and event-stream datasets, including DARPA Intrusion, TwitterSecurity, and TwitterWorldCup, demonstrate the following:
| Dataset | AUC (SedanSpot) | AUC (MIDAS) | Time (SedanSpot, s) | Time (MIDAS, s) | AUC Gain | Speedup |
|---|---|---|---|---|---|---|
| DARPA | 0.64 | 0.91 | 84 | 0.13 | +42% | 644× |
| DARPA (MIDAS-R) | - | 0.95 | - | 0.39 | +48% | 215× |
| TwitterWorldCup | - | - | 27.58 | 0.06 | - | 460× |
Average precision improvements closely match AUC gains. MIDAS-R’s anomaly