- The paper presents a novel approach that detects emerging topics by analyzing anomalies in user mention behavior across social media streams.
- The methodology uses Bayesian predictive models for mention frequency and Chinese Restaurant Process principles to handle new users, combined with SDNML-based change-point detection.
- The approach can detect events earlier than traditional keyword-based methods, particularly when relevant keywords are ambiguous or slow to emerge.
This paper introduces a method for detecting emerging topics in social media streams, such as Twitter, by analyzing anomalies in user mentioning behavior rather than relying solely on textual content. The core idea is that the emergence of a new topic often causes users to change who they mention, reply to, or retweet, creating detectable deviations from their normal patterns. This link-based approach offers advantages over traditional term-frequency methods, particularly when topics involve non-textual content (images, videos, URLs) or when relevant keywords are ambiguous or slow to emerge.
Methodology
The proposed system operates in several stages:
- User Mention Modeling: For each user, a probability model is trained based on their recent history (e.g., posts within the last T=30 days). This model captures the user's typical mentioning behavior. It models two aspects jointly:
- Number of Mentions ($k$): The probability $P(k \mid \theta)$ of including $k$ mentions in a post is modeled with a geometric distribution, $P(k \mid \theta) = (1-\theta)^k \theta$. A Bayesian predictive distribution $P(k \mid T_u(t))$ is derived using a Beta prior on $\theta$, allowing prediction based on user $u$'s past $n$ posts in the training window $T_u(t)$.
$$P(k \mid T_u(t)) = \frac{B(n+1+\alpha,\, m+k+\beta)}{B(n+\alpha,\, m+\beta)},$$ where $B$ is the Beta function, $m$ is the total number of mentions in the training set, and $\alpha, \beta$ are the Beta prior parameters.
- Mentioned Users ($V$): The probability $P(v \mid T_u(t))$ of mentioning a specific user $v$ is estimated using a method inspired by the Chinese Restaurant Process (CRP), which gracefully handles users not seen in the training data. It assigns known users a probability proportional to their past mention frequency ($m_v$) and reserves a probability mass ($\propto \gamma$) for mentioning new users.
- Known user $v$: $P(v \mid T_u(t)) = \frac{m_v}{m+\gamma}$
- Any new user: $P(\text{new user} \mid T_u(t)) = \frac{\gamma}{m+\gamma}$
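The two model components above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the dictionary of mention counts are hypothetical, the default γ=0.5 is an arbitrary illustrative value, and α=β=1 follows the assumption made in this summary rather than a value confirmed from the paper.

```python
import math


def log_beta(a, b):
    """Natural log of the Beta function B(a, b), via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def prob_num_mentions(k, n, m, alpha=1.0, beta=1.0):
    """Predictive P(k | T_u(t)) given n posts and m total mentions in the window.

    alpha and beta are the Beta prior parameters (defaults are assumptions).
    """
    log_p = log_beta(n + 1 + alpha, m + k + beta) - log_beta(n + alpha, m + beta)
    return math.exp(log_p)


def prob_mentionee(v, mention_counts, gamma=0.5):
    """CRP-style P(v | T_u(t)).

    mention_counts maps each previously mentioned user to the count m_v.
    A user absent from the mapping is treated as new and gets the reserved
    mass gamma / (m + gamma); gamma=0.5 is an illustrative assumption.
    """
    m = sum(mention_counts.values())
    m_v = mention_counts.get(v, 0)
    numerator = m_v if m_v > 0 else gamma
    return numerator / (m + gamma)
```

Working with log-gamma rather than the Beta function directly keeps the ratio numerically stable when a user has a long history (large n and m).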
- Link-Anomaly Score Calculation: For each new post $x=(t,u,k,V)$ by user $u$ at time $t$, a link-anomaly score $s(x)$ is calculated based on the user's model trained on $T_u(t)$. The score is the negative log-likelihood of the post under the model:
$$s(x) = -\log P(k \mid T_u(t)) - \sum_{v \in V} \log P(v \mid T_u(t)).$$
A higher score indicates the post's mention pattern (number of mentions and specific users mentioned) is less likely, hence more anomalous, according to the user's history.
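To make the score concrete, the following self-contained sketch computes it for a single post from a user's recent history. The function name, the representation of history as a list of mention lists, and the parameter defaults (α=β=1, γ=0.5) are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def link_anomaly_score(mentioned, history, alpha=1.0, beta=1.0, gamma=0.5):
    """Negative log-likelihood of a post's mentions under the user's history.

    mentioned: user IDs mentioned in the new post (k = len(mentioned)).
    history:   the user's past posts in the training window, each given as a
               list of mentioned user IDs (hypothetical data layout).
    """
    k = len(mentioned)
    n = len(history)                                     # number of past posts
    counts = Counter(v for post in history for v in post)
    m = sum(counts.values())                             # total past mentions

    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    # -log P(k | T_u(t)) from the Beta-geometric predictive distribution
    score = -(log_beta(n + 1 + alpha, m + k + beta) - log_beta(n + alpha, m + beta))

    # -sum over v in V of log P(v | T_u(t)), with gamma mass for unseen users
    for v in mentioned:
        m_v = counts.get(v, 0)
        score -= math.log((m_v if m_v > 0 else gamma) / (m + gamma))
    return score
```

For a user who usually mentions one familiar account, a post mentioning two previously unseen accounts should score noticeably higher than a post mentioning the familiar one.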
- Score Aggregation: Anomaly scores from all posts across different users are aggregated into discrete time bins of size $\tau$ (e.g., 1 minute) to create a single time series $s'_j$:
$$s'_j = \frac{1}{\tau} \sum_{t_i \in [\tau(j-1),\, \tau j]} s(x_i).$$
This reflects the overall level of mention anomaly in the network at time j.
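A minimal sketch of the aggregation step is shown here. The function name and the input layout (timestamp and score pairs) are hypothetical, and the bins are treated as half-open intervals so that each post falls into exactly one bin, which is an assumption about how the boundary case is handled.

```python
from collections import defaultdict


def aggregate_scores(scored_posts, tau=60.0):
    """Aggregate per-post anomaly scores into bins of width tau (seconds).

    scored_posts: iterable of (timestamp_seconds, score) pairs.
    Returns {j: s'_j}, where bin j covers [tau * (j - 1), tau * j).
    Bins that contain no posts are simply absent from the result.
    """
    totals = defaultdict(float)
    for t, s in scored_posts:
        j = int(t // tau) + 1          # 1-based bin index
        totals[j] += s
    return {j: total / tau for j, total in sorted(totals.items())}
```

Before fitting a time-series model, empty bins would likely need to be filled with zeros so that the series is evenly spaced.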
- Change-Point Detection: Two methods are applied to the aggregated anomaly time series $s'_j$ to detect significant shifts indicating topic emergence:
- SDNML-based Change-Point Detection: This method uses Sequentially Discounting Normalized Maximum Likelihood (SDNML) coding in a two-layer scoring process built on autoregressive (AR) models: the first layer is fitted to $s'_j$ and the second to the smoothed first-layer scores. It detects changes in the statistical dependency structure of the anomaly scores, aiming to identify non-random shifts, and its discounting gives more weight to recent data. Parameters include the smoothing window $\kappa=15$ and the AR model order $p=30$.
- Kleinberg's Burst Detection: A simpler model treating the anomaly scores (potentially thresholded) as events in a stream, detecting periods ("bursts") where the event rate significantly increases, modeled via a hidden state modulating a Poisson process.
- Alerting (with SDNML): Dynamic Threshold Optimization (DTO) is used with the final SDNML scores. DTO adaptively sets a threshold based on the recent distribution of scores, aiming to maintain a constant false alarm rate specified by a parameter ρ (e.g., ρ=0.05). An alarm is raised when the score exceeds the threshold.
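The idea behind an adaptive threshold of this kind can be illustrated with a simplified discounted histogram. This is a sketch in the spirit of DTO, not the paper's exact procedure: the class name, the update rule, and the rough mapping of `n_bins`, `smooth`, and `discount` onto NH, λH, and rH are assumptions, as are all default values other than ρ=0.05.

```python
class DiscountedHistogramThreshold:
    """Adaptive threshold from a discounted histogram of scores on [lo, hi]."""

    def __init__(self, lo, hi, n_bins=20, rho=0.05, discount=0.05, smooth=1e-4):
        self.lo, self.hi = lo, hi
        self.n_bins = n_bins
        self.rho = rho                      # target false alarm rate
        self.discount = discount            # weight given to the newest score
        self.smooth = smooth                # keeps every bin's mass above zero
        self.q = [1.0 / n_bins] * n_bins    # start from a uniform histogram

    def _bin(self, score):
        width = (self.hi - self.lo) / self.n_bins
        idx = int((score - self.lo) / width)
        return min(max(idx, 0), self.n_bins - 1)   # clamp out-of-range scores

    def threshold(self):
        """Upper edge of the first bin where cumulative mass reaches 1 - rho."""
        width = (self.hi - self.lo) / self.n_bins
        cumulative = 0.0
        for h, mass in enumerate(self.q):
            cumulative += mass
            if cumulative >= 1.0 - self.rho:
                return self.lo + (h + 1) * width
        return self.hi

    def update(self, score):
        """Compare the score with the current threshold, then learn from it."""
        alarm = score > self.threshold()
        hit = self._bin(score)
        q = [(1.0 - self.discount) * mass for mass in self.q]
        q[hit] += self.discount
        total = sum(q) + self.n_bins * self.smooth
        self.q = [(mass + self.smooth) / total for mass in q]
        return alarm
```

Because old observations are discounted, the threshold tracks the recent score distribution: after a long run of low scores it settles just above them, so a sudden large score raises an alarm.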
Implementation Considerations
- Data Acquisition: Requires access to a social media stream API providing post timestamps, user IDs, and mention information.
- User History: Requires maintaining a rolling window of recent posts (e.g., 30 days) for each active user in order to train their mention models. This implies both storage and efficient retrieval.
- Model Parameters: Requires setting hyperparameters such as the Beta prior parameters (α,β) for the geometric distribution, the CRP parameter (γ), the aggregation window (τ), SDNML parameters (r, p, κ), and DTO parameters (ρ, NH, λH, rH). The paper provides the values used in its experiments (τ=1 min, κ=15, p=30, ρ=0.05, etc.); α=β=1 appears to be implicitly assumed, and γ is not specified but is likely small.
- Computational Cost: Calculating scores involves per-post lookups and model calculations. Aggregation is straightforward. SDNML involves sequential updates of AR models and their associated statistics (covariance matrices), which can be done efficiently using matrix update formulas (Sherman-Morrison-Woodbury mentioned in the Appendix). Real-time application requires efficient implementation of these steps.
- Scalability: Processing large streams requires distributed processing or efficient single-node implementation, particularly for maintaining user histories and calculating scores rapidly.
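To illustrate why rank-one update formulas keep the sequential AR fitting cheap, here is a generic Sherman-Morrison sketch in pure Python. The function name is hypothetical, and the discounted form of the statistic, (1 − r)·V + r·x·xᵀ, is an assumption chosen for illustration; the paper's Appendix should be consulted for the exact quantities it updates.

```python
def discounted_inverse_update(v_inv, x, r):
    """Return the inverse of (1 - r) * V + r * x x^T, given v_inv = V^{-1}.

    Uses the Sherman-Morrison identity, so the cost is O(p^2) per step
    instead of the O(p^3) needed to invert the updated matrix from scratch.
    v_inv is a p-by-p list of lists, x a length-p list, and 0 < r < 1.
    """
    p = len(x)
    vx = [sum(v_inv[i][j] * x[j] for j in range(p)) for i in range(p)]  # V^{-1} x
    xv = [sum(x[i] * v_inv[i][j] for i in range(p)) for j in range(p)]  # x^T V^{-1}
    denom = (1.0 - r) + r * sum(x[i] * vx[i] for i in range(p))
    return [
        [(v_inv[i][j] - r * vx[i] * xv[j] / denom) / (1.0 - r) for j in range(p)]
        for i in range(p)
    ]
```

With an AR order of p=30, each update touches a 30-by-30 matrix, so the quadratic cost per time step is small compared with repeated full inversions.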
Experiments and Results
The method was tested on four real-world Twitter datasets related to specific events ("Job hunting" controversy, "Youtube" video leak, "NASA" arsenic life rumor, "BBC" controversial show). It was compared against keyword-frequency-based methods (using DTO or Kleinberg's burst detection), where the "best" keyword was chosen manually after the event.
- The link-anomaly methods (SDNML and Kleinberg) performed comparably to the keyword methods on datasets where a clear keyword defined the topic early on ("Job hunting", "Youtube").
- Crucially, the link-anomaly methods detected the emerging topic significantly earlier than keyword methods on datasets where the defining keywords were ambiguous or emerged later ("NASA", "BBC"). For the "NASA" dataset, link anomalies spiked due to user discussions before the official announcement and widespread use of the keyword "arsenic".
- This demonstrates the practical advantage of the link-anomaly approach: it doesn't require knowing the topic keyword beforehand and can react to behavioral changes that precede widespread keyword adoption.
Practical Implications
This research provides a practical, keyword-independent method for early detection of emerging topics or events in social media.
- Applications: Real-time event detection, trend spotting, monitoring public reactions, identifying breaking news even if related keywords aren't initially known or are ambiguous.
- Advantages: Works with non-textual content, robust to keyword ambiguity, potentially earlier detection than keyword methods.
- Potential Enhancements: Combining link-anomaly scores with content analysis could improve accuracy and reduce false alarms. Scaling the approach for real-time processing of large social streams is mentioned as a key future direction.