Discovering Emerging Topics in Social Streams via Link Anomaly Detection (1110.2899v1)

Published 13 Oct 2011 in stat.ML, cs.LG, cs.SI, and physics.soc-ph

Abstract: Detection of emerging topics is now receiving renewed interest, motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged includes not only text but also images, URLs, and videos. We focus on the social aspects of these networks, that is, the links between users that are generated dynamically, intentionally or unintentionally, through replies, mentions, and retweets. We propose a probability model of the mentioning behaviour of a social network user, and propose to detect the emergence of a new topic from the anomaly measured through the model. We combine the proposed mention anomaly score with a recently proposed change-point detection technique based on the Sequentially Discounting Normalized Maximum Likelihood (SDNML), or with Kleinberg's burst model. Aggregating anomaly scores from hundreds of users, we show that we can detect emerging topics based only on the reply/mention relationships in social network posts. We demonstrate our technique on a number of real data sets we gathered from Twitter. The experiments show that the proposed mention-anomaly-based approaches can detect new topics at least as early as the conventional term-frequency-based approach, and sometimes much earlier when the keyword is ill-defined.



Summary

The paper presents a novel approach that detects emerging topics by analyzing anomalies in user mention behavior across social media streams.
The methodology uses Bayesian predictive models for mention frequency and Chinese Restaurant Process principles to handle new users, combined with SDNML-based change-point detection.
The approach outperforms traditional keyword methods by enabling earlier detection of events, particularly when relevant keywords are ambiguous or slow to emerge.

This paper introduces a method for detecting emerging topics in social media streams, like Twitter, by analyzing anomalies in user mentioning behavior rather than relying solely on textual content. The core idea is that the emergence of a new topic often causes users to change who they mention, reply to, or retweet, creating detectable deviations from their normal patterns. This link-based approach offers advantages over traditional term-frequency methods, particularly when topics involve non-textual content (images, videos, URLs) or when relevant keywords are ambiguous or slow to emerge.

Methodology

The proposed system operates in several stages:

User Mention Modeling: For each user, a probability model is trained based on their recent history (e.g., posts within the last $T = 30$ days). This model captures the user's typical mentioning behavior. It models two aspects jointly:
- Number of Mentions ($k$): The probability $P(k|\theta)$ of including $k$ mentions in a post is modeled using a Geometric distribution. A Bayesian predictive distribution $P(k|T_u^{(t)})$ is derived using a Beta prior, allowing prediction based on user $u$'s past $n$ posts in the training window $T_u^{(t)}$: $P(k|T_u^{(t)}) = \frac{B(n+1+\alpha, m+k+\beta)}{B(n+\alpha, m+\beta)}$, where $B$ is the Beta function, $m$ is the total number of mentions in the training set, and $\alpha, \beta$ are the Beta prior parameters.
- Mentioned Users ($V$): The probability $P(v|T_u^{(t)})$ of mentioning a specific user $v$ is estimated using a method inspired by the Chinese Restaurant Process (CRP), which gracefully handles users not seen in the training data. It assigns probability proportional to past mention frequency ($m_v$) for known users and reserves a probability mass ($\propto \gamma$) for mentioning new users.
  - Known user $v$: $P(v|T_u^{(t)}) = \frac{m_v}{m+\gamma}$
  - Any new user: $P(\text{new user}|T_u^{(t)}) = \frac{\gamma}{m+\gamma}$
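The Beta-function ratio for the number of mentions can be recovered in a few lines. The derivation below is a sketch that assumes the geometric distribution is parameterized as $P(k|\theta) = \theta(1-\theta)^k$ with prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$; this parameterization is an assumption chosen because it reproduces the predictive formula stated above.

$$
\begin{aligned}
P(\theta \mid T_u^{(t)}) &\propto \theta^{n}(1-\theta)^{m} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1}
  \;\Rightarrow\; \theta \mid T_u^{(t)} \sim \mathrm{Beta}(n+\alpha,\; m+\beta), \\
P(k \mid T_u^{(t)}) &= \int_0^1 \theta(1-\theta)^{k}\,
  \frac{\theta^{n+\alpha-1}(1-\theta)^{m+\beta-1}}{B(n+\alpha,\; m+\beta)}\, d\theta
  = \frac{B(n+1+\alpha,\; m+k+\beta)}{B(n+\alpha,\; m+\beta)}.
\end{aligned}
$$

The first line uses the fact that $n$ posts containing $m$ mentions in total have likelihood $\theta^{n}(1-\theta)^{m}$ under this parameterization; the second integrates the geometric likelihood of the new post against the resulting posterior.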
Link-Anomaly Score Calculation: For each new post $x=(t, u, k, V)$ by user $u$ at time $t$, a link-anomaly score $s(x)$ is calculated based on the user's model trained on $T_u^{(t)}$. The score is the negative log-likelihood of the post under the model: $s(x) = -\log P(k|T_u^{(t)}) - \sum_{v \in V} \log P(v|T_u^{(t)})$. A higher score indicates the post's mention pattern (number of mentions and specific users mentioned) is less likely, hence more anomalous, according to the user's history.
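As a rough illustration, the score can be computed directly from the two predictive probabilities. The following is a minimal sketch, not the authors' code: the function name, argument layout, and the default values of `alpha`, `beta`, and `gamma` are assumptions made for the example.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def link_anomaly_score(k, mentioned, n, mention_counts,
                       alpha=1.0, beta=1.0, gamma=0.5):
    """Negative log-likelihood of one post under a user's mention model.

    k              -- number of mentions in the new post
    mentioned      -- iterable of user IDs mentioned in the new post
    n              -- number of posts in the user's training window
    mention_counts -- dict: user ID -> times mentioned in the window (m_v)
    alpha, beta, gamma -- hyperparameters; the defaults here are assumptions
    """
    m = sum(mention_counts.values())  # total mentions in the window

    # Predictive probability of k mentions: ratio of Beta functions.
    log_p_k = log_beta(n + 1 + alpha, m + k + beta) - log_beta(n + alpha, m + beta)
    score = -log_p_k

    # CRP-style probability for each mentioned user.
    for v in mentioned:
        m_v = mention_counts.get(v, 0)
        numerator = m_v if m_v > 0 else gamma  # unseen users get mass gamma
        score -= math.log(numerator / (m + gamma))
    return score


history = {"alice": 8, "bob": 2}
familiar = link_anomaly_score(1, ["alice"], n=10, mention_counts=history)
novel = link_anomaly_score(1, ["stranger"], n=10, mention_counts=history)
```

In this toy history, a post mentioning a previously unseen user scores higher than one mentioning a frequent contact, which is the behavior the score is meant to capture.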
Score Aggregation: Anomaly scores from all posts across different users are aggregated into discrete time bins of size $\tau$ (e.g., 1 minute) to create a single time series $s'_j$: $s'_j = \frac{1}{\tau} \sum_{t_i \in [\tau(j-1), \tau j]} s(x_i)$. This reflects the overall level of mention anomaly in the network at time $j$.
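The aggregation step amounts to binning and summing. The sketch below assumes timestamps are given in seconds from the start of the stream and treats each bin as the half-open interval $[\tau(j-1), \tau j)$ so that no post is counted twice; both choices, and the function name, are assumptions for illustration.

```python
from collections import defaultdict


def aggregate_scores(scored_posts, tau=60.0):
    """Bin per-post anomaly scores into a time series s'_1, s'_2, ...

    scored_posts -- iterable of (timestamp_seconds, score), timestamps >= 0
    tau          -- bin width in seconds (60 corresponds to 1 minute)
    Returns a list whose entry j-1 holds s'_j; empty bins contribute 0.
    """
    totals = defaultdict(float)
    for t, score in scored_posts:
        j = int(t // tau) + 1  # bin j covers [tau*(j-1), tau*j)
        totals[j] += score
    if not totals:
        return []
    last_bin = max(totals)
    return [totals.get(j, 0.0) / tau for j in range(1, last_bin + 1)]


series = aggregate_scores([(0, 3.0), (30, 3.0), (61, 6.0), (185, 12.0)], tau=60.0)
```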
Change-Point Detection: Two methods are applied to the aggregated anomaly time series $s'_j$ to detect significant shifts indicating topic emergence:
- SDNML-based Change-Point Detection: A sophisticated method using Sequentially Discounting Normalized Maximum Likelihood (SDNML) coding. It employs a two-layer scoring process based on autoregressive (AR) models fitted to the time series ($s'_j$ in the first layer, smoothed scores in the second). It detects changes in the statistical dependency structure of the anomaly scores, aiming to identify non-random shifts. Discounting emphasizes recent data. Parameters include smoothing window $\kappa=15$ and AR model order $p=30$.
- Kleinberg's Burst Detection: A simpler model treating the anomaly scores (potentially thresholded) as events in a stream, detecting periods ("bursts") where the event rate significantly increases, modeled via a hidden state modulating a Poisson process.
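To make the burst model more concrete, here is a minimal two-state sketch in the spirit of Kleinberg's automaton, decoded with the Viterbi algorithm. It assumes the input is a list of positive inter-arrival gaps between events (for example, between posts whose anomaly score exceeds some cutoff); how events are derived from scores, and the parameter values `s` and `gamma`, are assumptions for illustration rather than the paper's settings.

```python
import math


def two_state_bursts(gaps, s=2.0, gamma=1.0):
    """Label each inter-arrival gap as base (0) or burst (1).

    State 0 emits gaps at the average event rate n/T; state 1 at s times
    that rate. Moving up to the burst state costs gamma * ln(n); moving
    back down is free. Returns the minimum-cost state sequence.
    """
    n = len(gaps)
    total = sum(gaps)  # assumed positive
    rates = [n / total, s * n / total]
    up_cost = gamma * math.log(n)

    def emit_cost(state, gap):
        # Negative log of the exponential density rate * exp(-rate * gap).
        return rates[state] * gap - math.log(rates[state])

    cost = [0.0, math.inf]  # the automaton starts in the base state
    back_pointers = []
    for gap in gaps:
        new_cost, pointers = [], []
        for state in (0, 1):
            candidates = [
                cost[prev] + (up_cost if state > prev else 0.0)
                for prev in (0, 1)
            ]
            best_prev = 0 if candidates[0] <= candidates[1] else 1
            new_cost.append(candidates[best_prev] + emit_cost(state, gap))
            pointers.append(best_prev)
        cost = new_cost
        back_pointers.append(pointers)

    # Trace back the cheapest path.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back_pointers):
        states.append(state)
        state = pointers[state]
    states.reverse()
    return states


# Slow arrivals, then a run of fast arrivals, then slow again.
labels = two_state_bursts([10.0] * 20 + [1.0] * 20 + [10.0] * 20)
```

The transition cost is what keeps the detector from flagging isolated short gaps: a burst is reported only when enough consecutive short gaps accumulate to pay for entering the high-rate state.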
Alerting (with SDNML): Dynamic Threshold Optimization (DTO) is used with the final SDNML scores. DTO adaptively sets a threshold based on the recent distribution of scores, aiming to maintain a constant false alarm rate specified by a parameter $\rho$ (e.g., $\rho=0.05$). An alarm is raised when the score exceeds the threshold.
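One way to picture an adaptive threshold of this kind is a histogram of recent scores that is updated with exponential discounting, with the threshold placed at the upper $\rho$ tail. The sketch below is a simplified illustration and not the paper's exact DTO procedure: the class name, the fixed score range `lo`/`hi`, and all default values other than $\rho = 0.05$ are assumptions.

```python
class DiscountedHistogramThreshold:
    """Adaptive threshold from a discounted histogram of past scores."""

    def __init__(self, lo, hi, n_bins=20, rho=0.05, r_h=0.005, lambda_h=0.01):
        self.lo, self.hi, self.n_bins = lo, hi, n_bins
        self.rho = rho            # target upper-tail mass (false alarm rate)
        self.r_h = r_h            # discounting rate: larger forgets faster
        self.lambda_h = lambda_h  # additive smoothing per bin
        self.mass = [1.0 / n_bins] * n_bins  # start from a uniform histogram

    def _bin(self, score):
        frac = (score - self.lo) / (self.hi - self.lo)
        return min(self.n_bins - 1, max(0, int(frac * self.n_bins)))

    def threshold(self):
        """Upper edge of the first bin where cumulative mass reaches 1 - rho."""
        total = sum(self.mass) + self.n_bins * self.lambda_h
        width = (self.hi - self.lo) / self.n_bins
        cumulative = 0.0
        for i, q in enumerate(self.mass):
            cumulative += (q + self.lambda_h) / total
            if cumulative >= 1.0 - self.rho:
                return self.lo + (i + 1) * width
        return self.hi

    def update(self, score):
        """Test the score against the current threshold, then learn from it."""
        alarm = score > self.threshold()
        hit = self._bin(score)
        self.mass = [
            (1.0 - self.r_h) * q + (self.r_h if i == hit else 0.0)
            for i, q in enumerate(self.mass)
        ]
        return alarm


detector = DiscountedHistogramThreshold(lo=0.0, hi=10.0, n_bins=10, r_h=0.05)
warmup_alarms = [detector.update(1.5) for _ in range(200)]
spike_alarm = detector.update(9.0)
```

After the warm-up on steady scores, the histogram concentrates in one bin, the threshold drops well below the top of the range, and the spike raises an alarm.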

Implementation Considerations

Data Acquisition: Requires access to a social media stream API providing post timestamps, user IDs, and mention information.
User History: Need to maintain a rolling window of recent posts (e.g., 30 days) for each active user to train their mention models. This implies storage and efficient retrieval.
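A per-user rolling window can be kept with a queue of posts plus running mention counts, so that the quantities the mention model needs ($n$, $m$, and each $m_v$) are available without rescanning the history. The class below is a minimal in-memory sketch; its name, fields, and the use of timestamps in seconds are assumptions.

```python
from collections import Counter, deque


class UserHistory:
    """Rolling window of one user's posts with running mention counts."""

    def __init__(self, window_seconds=30 * 24 * 3600):
        self.window = window_seconds
        self.posts = deque()             # (timestamp, mentioned users), oldest first
        self.mention_counts = Counter()  # m_v for each user in the window

    def add(self, timestamp, mentioned):
        """Record a post, then drop anything older than the window."""
        self.posts.append((timestamp, tuple(mentioned)))
        self.mention_counts.update(mentioned)
        self.expire(timestamp)

    def expire(self, now):
        while self.posts and self.posts[0][0] <= now - self.window:
            _, old_mentions = self.posts.popleft()
            for v in old_mentions:
                self.mention_counts[v] -= 1
                if self.mention_counts[v] == 0:
                    del self.mention_counts[v]

    @property
    def n(self):
        """Number of posts currently in the window."""
        return len(self.posts)

    @property
    def m(self):
        """Total number of mentions currently in the window."""
        return sum(self.mention_counts.values())


h = UserHistory(window_seconds=100)
h.add(0, ["a", "b"])
h.add(50, ["a"])
h.add(120, [])  # the post at t=0 falls out of the window here
```

In a streaming setting, a new post would be scored against the history before `add` is called, so that the post does not influence its own score.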
Model Parameters: Requires setting hyperparameters like the Beta prior parameters ($\alpha, \beta$) for the geometric distribution, the CRP parameter ($\gamma$), the aggregation window ($\tau$), SDNML parameters ($r$, $p$, $\kappa$), and DTO parameters ($\rho$, $N_H$, $\lambda_H$, $r_H$). The paper provides values used in experiments ($\alpha=\beta=1$ implicitly assumed, $\gamma$ not specified but likely small, $\tau=1$ min, $\kappa=15$, $p=30$, $\rho=0.05$, etc.).
Computational Cost: Calculating scores involves per-post lookups and model calculations. Aggregation is straightforward. SDNML involves sequential updates of AR models and their associated statistics (covariance matrices), which can be done efficiently using matrix update formulas (Sherman-Morrison-Woodbury mentioned in the Appendix). Real-time application requires efficient implementation of these steps.
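The rank-one case of the matrix update formula illustrates why sequential updates are cheap: given $A^{-1}$, the inverse of $A + uv^\top$ follows in $O(d^2)$ operations instead of a fresh $O(d^3)$ inversion. The pure-Python sketch below shows the generic Sherman-Morrison identity only; how the paper applies it inside SDNML is not reproduced here, and the function name is an assumption.

```python
def sherman_morrison(a_inv, u, v):
    """Return the inverse of (A + u v^T) given a_inv = A^{-1}.

    Uses (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u).
    a_inv is a list of rows; u and v are lists of the same dimension.
    """
    d = len(u)
    a_inv_u = [sum(a_inv[i][j] * u[j] for j in range(d)) for i in range(d)]
    v_a_inv = [sum(v[i] * a_inv[i][j] for i in range(d)) for j in range(d)]
    denom = 1.0 + sum(v[i] * a_inv_u[i] for i in range(d))
    if abs(denom) < 1e-12:
        raise ValueError("update makes the matrix singular")
    return [
        [a_inv[i][j] - a_inv_u[i] * v_a_inv[j] / denom for j in range(d)]
        for i in range(d)
    ]


identity = [[1.0, 0.0], [0.0, 1.0]]
updated_inv = sherman_morrison(identity, [1.0, 2.0], [3.0, 4.0])
```

Here $A + uv^\top = \begin{pmatrix} 4 & 4 \\ 6 & 9 \end{pmatrix}$, whose inverse can be checked by hand against the result.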
Scalability: Processing large streams requires distributed processing or efficient single-node implementation, particularly for maintaining user histories and calculating scores rapidly.

Experiments and Results

The method was tested on four real-world Twitter datasets related to specific events ("Job hunting" controversy, "Youtube" video leak, "NASA" arsenic life rumor, "BBC" controversial show). It was compared against keyword-frequency-based methods (using DTO or Kleinberg's burst detection), where the "best" keyword was chosen manually after the event.

The link-anomaly methods (SDNML and Kleinberg) performed comparably to the keyword methods on datasets where a clear keyword defined the topic early on ("Job hunting", "Youtube").
Crucially, the link-anomaly methods detected the emerging topic significantly earlier than keyword methods on datasets where the defining keywords were ambiguous or emerged later ("NASA", "BBC"). For the "NASA" dataset, link anomalies spiked due to user discussions before the official announcement and widespread use of the keyword "arsenic".
This demonstrates the practical advantage of the link-anomaly approach: it doesn't require knowing the topic keyword beforehand and can react to behavioral changes that precede widespread keyword adoption.

Practical Implications

This research provides a practical, keyword-independent method for early detection of emerging topics or events in social media.

Applications: Real-time event detection, trend spotting, monitoring public reactions, identifying breaking news even if related keywords aren't initially known or are ambiguous.
Advantages: Works with non-textual content, robust to keyword ambiguity, potentially earlier detection than keyword methods.
Potential Enhancements: Combining link-anomaly scores with content analysis could potentially improve accuracy and reduce false alarms. Scaling the approach for real-time processing of large social streams is a key future direction mentioned.
