Can LLMs Forecast Internet Traffic from Social Media? (2509.20123v1)

Published 24 Sep 2025 in cs.NI

Abstract: Societal events shape the Internet's behavior. The death of a prominent public figure, a software launch, or a major sports match can trigger sudden demand surges that overwhelm peering points and content delivery networks. Although these events fall outside regular traffic patterns, forecasting systems still rely solely on those patterns and therefore miss these critical anomalies. Thus, we argue for socio-technical systems that supplement technical measurements with an active understanding of the underlying drivers, including how events and collective behavior shape digital demands. We propose traffic forecasting using signals from public discourse, such as headlines, forums, and social media, as early demand indicators. To validate our intuition, we present a proof-of-concept system that autonomously scrapes online discussions, infers real-world events, clusters and enriches them semantically, and correlates them with traffic measurements at a major Internet Exchange Point. This prototype predicted between 56-92% of society-driven traffic spikes after scraping a moderate amount of online discussions. We believe this approach opens new research opportunities in cross-domain forecasting, scheduling, demand anticipation, and society-informed decision making.

Summary

  • The paper demonstrates an LLM-driven approach that uses online discourse to predict major traffic spikes.
  • It combines time-series baselines with LLM-based semantic clustering of inferred events; the prototype predicted 56–92% of society-driven traffic spikes across three CDNs.
  • The study highlights the potential for socio-technical systems to enhance proactive network management and event-driven forecasting.

Introduction

The paper "Can LLMs Forecast Internet Traffic from Social Media?" explores whether large language models (LLMs) can help predict Internet traffic by analyzing social media and other online discussions. Traditional forecasting systems rely on historical traffic patterns, so they miss spikes caused by societal events that break those patterns (Figure 1). The paper proposes socio-technical forecasting systems that use signals from public discourse as early indicators of demand surges.

Figure 1: Traffic forecasters cannot predict pattern-breaking event-driven spikes. Each plot shows measured traffic (solid) versus statistical forecasts (dotted) for major CDNs. Z-score color bands indicate the magnitude of deviation.

Motivation and Background

The paper first outlines the challenge of predicting traffic surges linked to real-world events, such as the death of a public figure, a software launch, or a major sports match. These event-driven spikes can overwhelm peering points and content delivery networks (CDNs). Traditional forecasting models rely on historical traffic data alone and lack the real-world context needed to anticipate non-recurring societal events.

Significant traffic spikes are frequent (Figure 2), yet they often go unpredicted because forecasters rely on calendar-based models that lack cultural context. To address this, the paper suggests using online discussions as indicators of upcoming traffic-driving events and applying LLMs to interpret their meaning.

Figure 2: Significant traffic spikes are common, most of which are driven by real-world events. We show the spike frequency for a set of major (anonymized) CDNs.

Proposed Approach

System Overview

The proposed system uses a modular pipeline to parse online discourse, infer real-world events, and predict their digital footprint (Figure 3). It autonomously scrapes online content and uses LLMs to cluster and semantically enrich the inferred events, producing structured abstractions intended for spike prediction.

Figure 3: An overview of our approach.

Key Components

  1. Forecasting Baseline: Uses standard time-series models to establish baseline network traffic patterns, then isolates unexpected spikes by subtracting the predicted baseline from the observed traffic.
  2. Online Data Collection: Gathers public discourse from platforms like Reddit, automatically cleaning and compiling content into event-driven records for further analysis.
  3. Event Inference and Enrichment: Uses LLMs to extract events and enrich them with metadata, then applies semantic clustering to merge duplicate mentions of the same event and to categorize events.
  4. Spike-Event Correlation: Connects event abstractions with contextual forecasting models, facilitating spike prediction through multi-level semantic analysis.
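The forecasting-baseline step can be illustrated with a minimal sketch. It assumes a seasonal-naive forecast as a stand-in for the "standard time-series models" (the summary does not say which models the authors use) and flags a spike when the z-score of the residual exceeds a threshold. The function names, the `period` parameter, and the threshold of 3.0 are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def seasonal_naive_baseline(traffic, period):
    """Forecast each point as the value one period earlier.

    Hypothetical stand-in baseline; returns forecasts for traffic[period:].
    """
    return [traffic[t - period] for t in range(period, len(traffic))]


def detect_spikes(observed, baseline, z_threshold=3.0):
    """Return indices where observed traffic exceeds the baseline unusually.

    The residual is observed minus predicted; a point is flagged when the
    z-score of its residual is above z_threshold (assumed value).
    """
    residuals = [o - b for o, b in zip(observed, baseline)]
    mu = mean(residuals)
    sigma = pstdev(residuals)
    if sigma == 0:
        return []  # perfectly predicted series: nothing to flag
    return [i for i, r in enumerate(residuals) if (r - mu) / sigma > z_threshold]


# Usage: a repeating pattern with one injected surge.
traffic = [10, 20, 30, 20] * 10
traffic[14] += 100
baseline = seasonal_naive_baseline(traffic, period=4)
spike_indices = detect_spikes(traffic[4:], baseline)
```

Only positive deviations are flagged here. With a seasonal-naive baseline, a surge also produces a negative residual one period later, which the one-sided test ignores.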

Proof of Concept

The paper presents a proof-of-concept implementation to test whether social media and online discussions can support traffic forecasting. The prototype predicted 56–92% of the manually investigated traffic spikes across three CDNs, based on events inferred from unstructured online chatter (Figure 4). This result supports the claim that online discourse carries usable signals for forecasting traffic surges driven by societal events.

Figure 4: Fraction of manually investigated traffic spikes that were successfully predicted in our prototype, across three CDNs.
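A metric of this kind could be computed as in the sketch below. The matching rule is an assumption made for illustration: a spike counts as predicted if some inferred event falls within a tolerance window of the spike day and was first seen in online discussion before that day. The `InferredEvent` type, its fields, and `tolerance_days` are hypothetical; the paper's spikes were investigated manually.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class InferredEvent:
    """Hypothetical record for an event inferred from online discussion."""
    name: str
    event_date: date   # when the event is expected to happen
    first_seen: date   # when it was first mentioned online


def is_predicted(spike_day, events, tolerance_days=1):
    """True if an event near spike_day was already known before that day."""
    window = timedelta(days=tolerance_days)
    return any(
        abs(spike_day - ev.event_date) <= window and ev.first_seen < spike_day
        for ev in events
    )


def fraction_predicted(spike_days, events, tolerance_days=1):
    """Share of spikes that match at least one previously seen event."""
    if not spike_days:
        return 0.0
    hits = sum(is_predicted(day, events, tolerance_days) for day in spike_days)
    return hits / len(spike_days)
```

Requiring `first_seen` to precede the spike keeps an event that only surfaced on the day itself from counting as a prediction.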

Evaluation

A systematic evaluation shows that many traffic-driving events are discussed well in advance (Figure 5), which leaves lead time for forecasting the resulting spikes. The analysis also indicates that general-purpose online platforms offer a rich and diverse set of signals that could support proactive network management.

Figure 5: Cumulative fraction of event mentions by time-to-event, grouped by category. Many events are discussed well in advance, while context and engagement often intensify closer to the event date.
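A curve like the one in Figure 5 could be derived from per-mention lead times, as in this sketch. It assumes each mention is reduced to a `(category, days_before_event)` pair and reports, for each horizon, the fraction of mentions posted at least that many days ahead. The input shape, the category labels, and the horizons are illustrative assumptions rather than the paper's data format.

```python
from collections import defaultdict


def lead_time_fractions(mentions, horizons):
    """Per category, fraction of mentions posted >= h days before the event.

    mentions: iterable of (category, days_before_event) pairs (assumed shape).
    horizons: list of lead times in days to evaluate.
    """
    leads_by_category = defaultdict(list)
    for category, lead in mentions:
        leads_by_category[category].append(lead)
    return {
        category: [sum(lead >= h for lead in leads) / len(leads) for h in horizons]
        for category, leads in leads_by_category.items()
    }


# Usage with made-up lead times (in days) for two hypothetical categories.
mentions = [
    ("sports", 30), ("sports", 7), ("sports", 1), ("sports", 0),
    ("release", 14), ("release", 2),
]
curves = lead_time_fractions(mentions, horizons=[1, 7, 30])
```

Each list in `curves` is non-increasing as the horizon grows, since fewer mentions appear further ahead of an event.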

Discussion

Building socially aware systems brings both challenges and opportunities. The paper discusses the ethical question of how far monitoring of public discourse should extend. It also notes that these methods apply beyond traffic forecasting, for example to proactive CDN management, misinformation detection, and trust-aware content ranking.

Conclusion

The paper highlights the intertwined nature of societal dynamics and Internet traffic, advocating for socio-technical forecasting systems informed by public discourse. While challenges remain in mapping events to their digital impact, the prototype demonstrates the feasibility of predictive systems that anticipate user behavior from online chatter.

Anticipatory infrastructure systems, which proactively adapt to societal events, could improve network resource management and open new avenues for research in cross-domain forecasting and event impact assessment.

The paper thus offers a starting point for further work on LLMs in socio-technical systems and argues for deeper integration of societal context into digital infrastructure management.
