
Bluesky Dataset Overview

Updated 28 November 2025
  • The Bluesky Dataset is a collection of decentralized social network data featuring detailed multilayer graphs and hypergraph representations for nuanced network analysis.
  • It employs diverse data models including pairwise directed networks, temporal multi-networks, and hypergraphs to capture user behavior and dynamic interactions.
  • Publicly accessible and FAIR-compliant, the dataset supports research in network analysis, machine learning, and cross-platform comparative studies.

The Bluesky Dataset refers collectively to a set of publicly accessible, large-scale, and multi-faceted social network data resources generated from the decentralized microblogging platform Bluesky. Enabled by Bluesky’s open AT Protocol and public data policies, these datasets support computational social science, network analysis, and machine learning, offering structured, longitudinal, and granular representations of user behavior, social ties, content creation, and higher-order interactions in a decentralized context. The most significant, high-coverage Bluesky datasets include BlueTempNet, “A Blue Start”, Bluesky scholarly communication corpora, cross-platform comparative datasets, and population-scale interaction or content archives.

1. Dataset Architectures and Data Models

Bluesky datasets employ both standard graph and advanced multi-network/hypergraph representations, leveraging the platform’s feature set:

  • Pairwise directed networks: The canonical representation is $G = (V, F)$, where $V$ is the set of users (nodes) and $F \subseteq V \times V$ is the set of directed “follow” relations. Edge attributes (e.g., timestamps for temporal networks, sign for follow/block, etc.) are integrated as needed (Smith et al., 16 May 2025, Jeong et al., 24 Jul 2024).
  • Higher-order interactions: Bluesky enables group-mediated relations through “starter packs,” which are user-curated collections of accounts (and sometimes feeds). These are formalized as a hypergraph $H = (V, E)$, with $V$ as users/feeds and $E$ a set of hyperedges, each hyperedge corresponding to a starter pack (Smith et al., 16 May 2025).
  • Temporal multi-networks: Datasets like BlueTempNet provide a temporal, multilayer graph,

$$G = (V, E_1, E_2, E_3)$$

where $E_1$ (follows), $E_2$ (blocks), and $E_3$ (user–feed actions) are time-stamped edge sets, analyzed jointly or per layer (Jeong et al., 24 Jul 2024).
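
The three data models above can be illustrated with plain Python containers. This is a minimal sketch rather than the storage format of any cited dataset: the account names, pack names, timestamps, and helper functions (`out_degree`, `packs_containing`, `edges_in_window`) are all hypothetical and chosen only to show how the structures differ.

```python
from collections import defaultdict

# Pairwise directed network G = (V, F): each edge is (follower, followee).
follows = {("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("alice", "carol")}

def out_degree(edges):
    """Count outgoing edges per source node."""
    deg = defaultdict(int)
    for src, _dst in edges:
        deg[src] += 1
    return dict(deg)

# Hypergraph H = (V, E): each starter pack is one hyperedge (a set of members).
starter_packs = {
    "pack_science": {"alice", "bob", "carol"},
    "pack_news": {"bob", "dave"},
}

def packs_containing(hyperedges, user):
    """Return the sorted names of all hyperedges that include the user."""
    return sorted(name for name, members in hyperedges.items() if user in members)

# Temporal multilayer network G = (V, E_1, E_2, E_3): one time-stamped
# edge list per layer, with timestamps in milliseconds (toy values).
layers = {
    "follow": [("alice", "bob", 1000), ("bob", "carol", 2500)],
    "block": [("carol", "dave", 1800)],
    "feed": [("alice", "feed_science", 2100)],
}

def edges_in_window(layers, layer, start_ms, end_ms):
    """Return (source, target) pairs of one layer with start_ms <= t < end_ms."""
    return [(u, v) for u, v, t in layers[layer] if start_ms <= t < end_ms]
```

Keeping each layer as its own edge list makes per-layer analysis a simple lookup, while joint analysis can iterate over all layers with a shared node set.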

2. Data Acquisition, Anonymization, and Metadata

Bluesky datasets are built by directly querying official APIs or streaming Firehose endpoints:

| Dataset | Users (n) | Edges/Interactions (m) | Groupings | Content Volume | Distinctive Features |
|---|---|---|---|---|---|
| BlueTempNet (Jeong et al., 24 Jul 2024) | 147k members | 5.7M follows, 0.5M blocks | 39,968 feeds | N/A | Temporal, multi-network, ms-level |
| A Blue Start (Smith et al., 16 May 2025) | 26.7M users | 1.6B follows | 301k starter packs | N/A | Hypergraphs for higher-order ties |
| "I'm in the Bluesky…" (Failla et al., 29 Apr 2024) | 4.1M users | 145M follows, 23M replies, 63M reposts | 11 feeds | 237M posts | Population-level, complete posts |
| PolitiSky24 (Rostami et al., 9 Jun 2025) | 8.5k users | 16k (user-target stance pairs) | N/A | 18M posts | User-level stance with rationale |
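
API-based collection of a follow graph typically means walking paginated responses until no cursor is returned. The sketch below assumes a response shape with a `follows` list (each entry carrying a `did`) and an optional `cursor`, which is how the `app.bsky.graph.getFollows` endpoint is understood to respond; check this against the current API documentation before relying on it. The `fetch_page` callable, the stub `fake_fetch`, and the example DIDs are hypothetical, and the code does not reproduce the pipeline of any dataset in the table.

```python
from typing import Callable, Iterator, Optional, Tuple

def iter_follow_edges(
    actor: str,
    fetch_page: Callable[[str, Optional[str]], dict],
) -> Iterator[Tuple[str, str]]:
    """Yield (actor, followed_did) edges, following cursors until exhausted."""
    cursor = None
    while True:
        page = fetch_page(actor, cursor)
        for profile in page.get("follows", []):
            yield (actor, profile["did"])
        cursor = page.get("cursor")
        if not cursor:  # no cursor means this was the last page
            break

# Offline stand-in for a real HTTP call (assumed response shape, toy data).
_FAKE_PAGES = {
    None: {"follows": [{"did": "did:plc:aaa"}, {"did": "did:plc:bbb"}], "cursor": "p2"},
    "p2": {"follows": [{"did": "did:plc:ccc"}]},
}

def fake_fetch(actor: str, cursor: Optional[str]) -> dict:
    """Return a canned page keyed by cursor; ignores the actor argument."""
    return _FAKE_PAGES[cursor]

edges = list(iter_follow_edges("did:plc:seed", fake_fetch))
```

Injecting the fetch function keeps the pagination logic testable offline; a real collector would replace `fake_fetch` with an HTTP client that also handles rate limits and retries.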

3. Temporal, Structural, and Higher-Order Data Properties

The Bluesky datasets collectively cover a wide range of user interaction scenarios and address both pairwise and community/group-level network science:

  • Degree distributions: All major interaction networks display heavy-tailed degree distributions, often characterized by fitted power-law exponents (e.g., $\alpha_{\text{out}} \approx 1.44$ for follows) (Quelle et al., 27 May 2024).
  • Clustering and connected components: Normalized clustering coefficients are high (10 to 200 times the configuration-model baseline) in the follow, reply, and repost layers; giant strongly connected components (SCCs) dominate, and smaller SCCs exist in both user and starter-pack hypergraphs (Smith et al., 16 May 2025, Failla et al., 29 Apr 2024).
  • Higher-order metrics: Starter-pack hypergraphs support s-line-graph densities, hypergraph k-cores, user-pair co-occurrence rates across lists, and community entropy quantification (Smith et al., 16 May 2025).
  • Temporal coverage: Datasets range from Feb 2023 (platform launch) to May 2025, with rolling updates and sub-daily event precision in some corpora, supporting fine-grained longitudinal research (Jeong et al., 24 Jul 2024, Failla et al., 29 Apr 2024).
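
Two of the quantities listed above can be sketched in a few lines. The exponent estimator below is the standard continuous maximum-likelihood formula, alpha = 1 + n / sum(ln(x_i / x_min)), which is only an approximation for integer degrees and is not necessarily the fitting procedure used in the cited work. The s-line graph connects two hyperedges when they share at least `s` members. Function names and the toy inputs are hypothetical.

```python
import math
from itertools import combinations

def powerlaw_alpha_mle(degrees, x_min=1.0):
    """Continuous MLE of a power-law exponent over the tail degrees >= x_min."""
    tail = [d for d in degrees if d >= x_min]
    log_sum = sum(math.log(d / x_min) for d in tail)
    if log_sum == 0:
        raise ValueError("need at least one degree strictly above x_min")
    return 1.0 + len(tail) / log_sum

def s_line_graph(hyperedges, s=1):
    """Return pairs of hyperedge names that share at least s members."""
    return {
        (a, b)
        for a, b in combinations(sorted(hyperedges), 2)
        if len(hyperedges[a] & hyperedges[b]) >= s
    }

def density(n_nodes, n_edges):
    """Fraction of possible undirected node pairs that are connected."""
    if n_nodes < 2:
        return 0.0
    return n_edges / (n_nodes * (n_nodes - 1) / 2)

# Toy starter packs: a and b overlap in two users, b and c in one.
packs = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {4, 5}}
line_1 = s_line_graph(packs, s=1)
line_2 = s_line_graph(packs, s=2)
```

Raising `s` keeps only strongly overlapping packs, so the density of the s-line graph falls as `s` grows; comparing densities across `s` gives a rough picture of how tightly starter packs overlap.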

4. Specialized Content, Behavior, and Population Subsets

Several datasets capture distinctive behavioral, topical, or population subsets within Bluesky:

  • Scholarly dissemination: The scholarly communication dataset identifies 87,470 posts referencing DOIs, joined with OpenAlex for bibliometric content/discipline analysis, with derived originality scores based on post/title cosine similarity (Zheng et al., 24 Jul 2025).
  • Political collections and stance: PolitiSky24 and “Politics and polarization on Bluesky” provide both post-level and user-level stance detection for U.S. politics, including millions of stance-annotated posts, interaction graphs, and full pseudonymized posting histories (Rostami et al., 9 Jun 2025, Salloum et al., 3 Jun 2025).
  • Cross-platform records: MADOC integrates standardized Bluesky subsets with data from Reddit, Koo, and Voat across 12 communities, and is designed for research on toxic behavior and moderation (Dankulov et al., 22 Jan 2025).
  • Persona/thread-level behavior prediction: The SocialSim challenge dataset includes 6.4M Bluesky conversation threads, 12 action classes (with rare-action focus), and 25 persona clusters (White et al., 21 Nov 2025).
  • News reliability: The MurkySky dataset resolves news-sharing posts to NewsGuard-rated domains, labeling reliability and extracting hashtag/topic/audience segmentation networks, with observed unreliable content prevalence ≈2% (Reddy et al., 17 Jan 2025).
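
The originality score mentioned for the scholarly dataset is described only as being derived from post/title cosine similarity. One plausible reading, offered here as an assumption and not as the authors' definition, is originality = 1 - cosine similarity between bag-of-words vectors of the post and the paper title. The tokenizer, the choice of raw term counts, and the function names are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between term-count vectors of two strings."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

def originality(post, title):
    """Assumed definition: 1 minus the post/title cosine similarity."""
    return 1.0 - cosine_similarity(post, title)

# A post that merely repeats the title scores near 0; unrelated text scores 1.
score_copy = originality("Graph learning on hypergraphs", "Graph learning on hypergraphs")
score_new = originality("Surprising result, worth a read", "Graph learning on hypergraphs")
```

Under this reading, a low score flags posts that add little beyond the title, while a high score indicates commentary in the poster's own words. An embedding-based similarity would change the values but not the idea.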

5. Data Access, File Formats, and Tooling

Consistent with Bluesky’s open-data orientation, most core datasets are distributed under academic or open licenses (e.g., CC-BY 4.0) (Failla et al., 29 Apr 2024, Smith et al., 16 May 2025, Jeong et al., 24 Jul 2024). Key distribution and access conventions:

6. Research Applications and Limitations

Bluesky datasets serve as testbeds for:

7. Significance and Prospects

The Bluesky Dataset family constitutes a unique resource for the network science, computational social science, and machine learning communities:

  • Open architecture: Bluesky’s separation of identity, hosting, indexing, content feeds, and moderation exposes rich, composable data structures, facilitating novel research in decentralized systems (Kleppmann et al., 5 Feb 2024, Balduf et al., 22 Aug 2024).
  • Granularity, scale, and diversity: Millisecond-resolved multilayer graphs, higher-order hypergraph structures, population-scale activity streams, and integration with external knowledge bases make possible analyses previously infeasible on traditional, siloed platforms.
  • Use in benchmarking: These datasets allow development and evaluation of new methods in signed-temporal network analysis, hypergraph learning, stance detection (user- and post-level), rare-action prediction, moderation, diffusion modeling, and migration dynamics.
  • Researcher access and legal/ethical compliance: All major datasets are released under FAIR-compliant terms with open code, documented limitations, robust anonymization, and explicit adherence to Bluesky’s public data policy (Jeong et al., 24 Jul 2024, Failla et al., 29 Apr 2024, Smith et al., 16 May 2025, Dankulov et al., 22 Jan 2025).

The Bluesky Dataset ecosystem is thus central to current methodological advancement in network, content, and group-dynamics research on decentralized, open social platforms.
