Bluesky Dataset Overview
- Bluesky Dataset is a collection of decentralized social network data featuring detailed multilayer graphs and hypergraph representations for nuanced network analysis.
- It employs diverse data models including pairwise directed networks, temporal multi-networks, and hypergraphs to capture user behavior and dynamic interactions.
- Publicly accessible and FAIR-compliant, the dataset supports research in network analysis, machine learning, and cross-platform comparative studies.
The Bluesky Dataset refers collectively to a set of publicly accessible, large-scale, and multi-faceted social network data resources generated from the decentralized microblogging platform Bluesky. Enabled by Bluesky’s open AT Protocol and public data policies, these datasets support computational social science, network analysis, and machine learning, offering structured, longitudinal, and granular representations of user behavior, social ties, content creation, and higher-order interactions in a decentralized context. The most significant, high-coverage Bluesky datasets include BlueTempNet, “A Blue Start”, Bluesky scholarly communication corpora, cross-platform comparative datasets, and population-scale interaction or content archives.
1. Dataset Architectures and Data Models
Bluesky datasets employ both standard graph and advanced multi-network/hypergraph representations, leveraging the platform’s feature set:
- Pairwise directed networks: The canonical representation is $G = (V, E)$, where $V$ is the set of users (nodes) and $E$ is the set of directed “follow” relations. Edge attributes (e.g., timestamps for temporal networks, or a sign distinguishing follows from blocks) are integrated as needed (Smith et al., 16 May 2025, Jeong et al., 24 Jul 2024).
- Higher-order interactions: Bluesky enables group-mediated relations through “starter packs,” which are user-curated collections of accounts (and sometimes feeds). These are formalized as a hypergraph $H = (V_H, E_H)$, with $V_H$ as users/feeds and $E_H$ a set of hyperedges, each hyperedge corresponding to a starter pack (Smith et al., 16 May 2025).
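- Temporal multi-networks: Datasets like BlueTempNet provide a temporal, multilayer graph $\mathcal{G} = (V, E_{\text{follow}}, E_{\text{block}}, E_{\text{feed}})$, where $E_{\text{follow}}$ (follows), $E_{\text{block}}$ (blocks), and $E_{\text{feed}}$ (user–feed actions) are time-stamped edge sets over the node set $V$, analyzed jointly or per layer (Jeong et al., 24 Jul 2024).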
- Content graphs and event logs: Large-scale text and interaction corpora comprise per-post/message records structured by event type (post, reply, repost, like), including rich metadata, engagement counts, and sometimes full-thread context (Failla et al., 29 Apr 2024, Dankulov et al., 22 Jan 2025, White et al., 21 Nov 2025).
- Structural and attribute-rich schemas: Datasets standardize fields such as anonymized user IDs, content text, timestamps (ranging from minute- to millisecond-level granularity), metadata on feeds/communities, and derived features (e.g., sentiment scores, stance labels) (Failla et al., 29 Apr 2024, Salloum et al., 3 Jun 2025, Balduf et al., 20 Jan 2025, Rostami et al., 9 Jun 2025).
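The data models above can be illustrated with a small in-memory sketch. It is not the schema of any released dataset: the `TimedEdge` class, the layer names, and the helper functions are hypothetical, chosen only to show how time-stamped layers and a starter-pack hypergraph might be held side by side.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedEdge:
    """One time-stamped directed edge in a named layer (hypothetical schema)."""
    layer: str    # illustrative layer names: "follow", "block", "feed"
    source: str   # anonymized ID of the acting user
    target: str   # anonymized ID of the followed/blocked user or feed
    ts_ms: int    # event time in milliseconds since the Unix epoch


def split_layers(edges):
    """Group edges by layer and sort each layer chronologically."""
    layers = defaultdict(list)
    for edge in edges:
        layers[edge.layer].append(edge)
    return {name: sorted(es, key=lambda e: e.ts_ms) for name, es in layers.items()}


def build_hypergraph(starter_packs):
    """Return (hyperedges, memberships) from a {pack_id: member_ids} mapping."""
    hyperedges = {pack: frozenset(members) for pack, members in starter_packs.items()}
    memberships = defaultdict(set)  # node -> set of packs that contain it
    for pack, members in hyperedges.items():
        for node in members:
            memberships[node].add(pack)
    return hyperedges, dict(memberships)


edges = [
    TimedEdge("follow", "u1", "u2", 1_700_000_000_500),
    TimedEdge("block", "u3", "u1", 1_700_000_000_900),
    TimedEdge("follow", "u2", "u3", 1_700_000_000_100),
]
layers = split_layers(edges)
hyperedges, memberships = build_hypergraph(
    {"pack_a": ["u1", "u2"], "pack_b": ["u2", "u3"]}
)
print([e.source for e in layers["follow"]])  # follow layer in time order
print(sorted(memberships["u2"]))             # packs that contain u2
```

Keeping the layers separate but sharing node IDs allows per-layer analysis as well as joint queries, such as finding follows that were later followed by a block.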
2. Data Acquisition, Anonymization, and Metadata
Bluesky datasets are built by directly querying official APIs or streaming Firehose endpoints:
- Node, edge, and message extraction: Using endpoints such as app.bsky.graph.getFollows and app.bsky.graph.getActorStarterPacks, together with app.bsky.feed.post and app.bsky.feed.repost records, data collectors reconstruct network structure, group memberships, and complete content histories (Smith et al., 16 May 2025, Kleppmann et al., 5 Feb 2024, Failla et al., 29 Apr 2024).
- Temporal precision: Millisecond-level precision is routinely available for edges (follows, blocks, join events), supporting inter-event time analysis and high-fidelity dynamic studies (Jeong et al., 24 Jul 2024).
- Anonymization: Public dumps map all user and content IDs to irreversibly anonymized or randomized integers or UUIDs, and PII is removed in accordance with platform policy (Dankulov et al., 22 Jan 2025, Failla et al., 29 Apr 2024, Smith et al., 16 May 2025).
- Comprehensive metadata: Rich auxiliary files detail user activity, account type, number of followers, posts, starter-pack affiliations, feed creation or joining actions, language, and (when relevant) OpenAlex or NewsGuard article-level metadata (Zheng et al., 24 Jul 2025, Reddy et al., 17 Jan 2025, Jeong et al., 24 Jul 2024).
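As a rough illustration of the anonymization step, the sketch below maps raw identifiers to random integers. It is an assumption about how such a mapping could be done, not the procedure used by any cited dataset; `IdAnonymizer` and `anonymize_edges` are hypothetical names. The mapping is irreversible only if the in-memory table is discarded after export.

```python
import secrets


class IdAnonymizer:
    """Map raw identifiers to random integers (hypothetical helper)."""

    def __init__(self):
        self._mapping = {}   # raw ID -> random integer; never written to disk
        self._used = set()   # integers already handed out

    def anonymize(self, raw_id: str) -> int:
        if raw_id not in self._mapping:
            new_id = secrets.randbits(63)
            while new_id in self._used:  # re-draw on the unlikely collision
                new_id = secrets.randbits(63)
            self._used.add(new_id)
            self._mapping[raw_id] = new_id
        return self._mapping[raw_id]


def anonymize_edges(edges, anonymizer):
    """Replace both endpoints of (source, target, timestamp) rows."""
    return [
        (anonymizer.anonymize(s), anonymizer.anonymize(t), ts)
        for s, t, ts in edges
    ]


anonymizer = IdAnonymizer()
rows = anonymize_edges(
    [("did:example:alice", "did:example:bob", 1_700_000_000_000),
     ("did:example:bob", "did:example:alice", 1_700_000_000_250)],
    anonymizer,
)
print(rows[0][0] == rows[1][1])  # the same account keeps one pseudonym
```

Random integers, unlike unsalted hashes of the original identifier, cannot be recomputed by someone who already knows the raw ID.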
| Dataset | Users (n) | Edges/Interactions (m) | Groupings | Content Volume | Distinctive Features |
|---|---|---|---|---|---|
| BlueTempNet (Jeong et al., 24 Jul 2024) | 147k members | 5.7M follows, 0.5M blocks | 39,968 feeds | N/A | Temporal, multi-network, ms-level |
| A Blue Start (Smith et al., 16 May 2025) | 26.7M users | 1.6B follows | 301k starter packs | N/A | Hypergraphs for higher-order ties |
| "I'm in the Bluesky…" (Failla et al., 29 Apr 2024) | 4.1M users | 145M follows, 23M replies, 63M reposts | 11 feeds | 237M posts | Population-level, complete posts |
| PolitiSky24 (Rostami et al., 9 Jun 2025) | 8.5k users | 16k user-target stance pairs | N/A | 18M posts | User-level stance w/ rationale |
3. Temporal, Structural, and Higher-Order Data Properties
The Bluesky datasets collectively cover a wide range of user interaction scenarios and address both pairwise and community/group-level network science:
- Degree distributions: All major interaction networks display heavy-tailed degree distributions, often characterized by fitted power-law exponents (Quelle et al., 27 May 2024).
- Clustering and connected components: High normalized clustering coefficients (10 to 200 times the configuration-model baseline) are observed in the follow, reply, and repost layers; giant strongly connected components (SCCs) dominate, and smaller SCCs exist in both user and starter-pack hypergraphs (Smith et al., 16 May 2025, Failla et al., 29 Apr 2024).
- Higher-order metrics: Starter-pack hypergraphs support the computation of s-line-graph densities, hypergraph k-cores, user-pair co-occurrence rates across lists, and community entropy (Smith et al., 16 May 2025).
- Temporal coverage: Datasets range from Feb 2023 (platform launch) to May 2025, with rolling updates and sub-daily event precision in some corpora, supporting fine-grained longitudinal research (Jeong et al., 24 Jul 2024, Failla et al., 29 Apr 2024).
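Two of the properties listed above can be sketched in a few lines. The first function uses the standard continuous-approximation maximum-likelihood estimator for a power-law exponent, which is biased for small discrete degrees. The second uses one common definition of the s-line graph, where two hyperedges are linked if they share at least s nodes. Function names and toy inputs are illustrative and not taken from the cited papers.

```python
import math
from itertools import combinations


def powerlaw_alpha_mle(degrees, x_min=1):
    """Continuous-approximation MLE: alpha = 1 + n / sum(ln(x / x_min))."""
    tail = [d for d in degrees if d >= x_min]
    log_sum = sum(math.log(d / x_min) for d in tail)
    if log_sum == 0:
        raise ValueError("need at least one degree strictly above x_min")
    return 1.0 + len(tail) / log_sum


def s_line_graph_density(hyperedges, s=1):
    """Fraction of hyperedge pairs that share at least s nodes."""
    packs = [set(members) for members in hyperedges.values()]
    n = len(packs)
    if n < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(packs, 2) if len(a & b) >= s)
    return linked / (n * (n - 1) / 2)


toy_degrees = [1, 2, 4, 8]
toy_packs = {
    "pack_a": {"u1", "u2", "u3"},
    "pack_b": {"u2", "u3", "u4"},
    "pack_c": {"u5"},
}
print(powerlaw_alpha_mle(toy_degrees, x_min=1))
print(s_line_graph_density(toy_packs, s=2))
```

On real data the lower cutoff `x_min` would normally be selected by a goodness-of-fit procedure rather than fixed by hand.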
4. Specialized Content, Behavior, and Population Subsets
Several datasets capture distinctive behavioral, topical, or population subsets within Bluesky:
- Scholarly dissemination: The scholarly communication dataset identifies 87,470 posts referencing DOIs, joined with OpenAlex for bibliometric content/discipline analysis, with derived originality scores based on post/title cosine similarity (Zheng et al., 24 Jul 2025).
- Political collections and stance: PolitiSky24 and “Politics and polarization on Bluesky” provide both post-level and user-level stance detection for U.S. politics, including millions of stance-annotated posts, interaction graphs, and full pseudonymized posting histories (Rostami et al., 9 Jun 2025, Salloum et al., 3 Jun 2025).
- Cross-platform records: MADOC integrates standardized Bluesky subsets with data from Reddit, Koo, and Voat across 12 communities, and is designed for research on toxic behavior and moderation (Dankulov et al., 22 Jan 2025).
- Persona/thread-level behavior prediction: The SocialSim challenge dataset includes 6.4M Bluesky conversation threads, 12 action classes (with rare-action focus), and 25 persona clusters (White et al., 21 Nov 2025).
- News reliability: The MurkySky dataset resolves news-sharing posts to NewsGuard-rated domains, labeling reliability and extracting hashtag/topic/audience segmentation networks, with observed unreliable content prevalence ≈2% (Reddy et al., 17 Jan 2025).
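The originality scores mentioned for the scholarly communication dataset are described only as being based on post/title cosine similarity. The sketch below is one possible reading, assuming bag-of-words vectors and defining originality as one minus the similarity. Both choices are assumptions for illustration, and the cited work may use a different text representation or formula.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def originality(post_text, article_title):
    """Assumed definition: 1 - cosine similarity of post and title."""
    return 1.0 - cosine_similarity(post_text, article_title)


title = "Temporal networks on decentralized platforms"
print(originality("Temporal networks on decentralized platforms", title))
print(originality("Great read, highly recommended!", title))
```

Under this reading, a post that merely repeats the article title scores near 0, while a post with no shared vocabulary scores 1.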
5. Data Access, File Formats, and Tooling
Consistent with Bluesky’s open-data orientation, most core datasets are distributed under academic or open licenses (e.g., CC-BY 4.0) (Failla et al., 29 Apr 2024, Smith et al., 16 May 2025, Jeong et al., 24 Jul 2024). Key distribution and access conventions:
- File formats: CSV for edgelists/metadata; JSON and JSONL for full post/event records; GEXF for graph multilayers; Parquet for massive tabular content (Failla et al., 29 Apr 2024, Jeong et al., 24 Jul 2024, Dankulov et al., 22 Jan 2025).
- Dataset registries: Zenodo (e.g., https://doi.org/10.5281/zenodo.11082878), SOMAR (https://socialmediaarchive.org/record/78), IEEE DataPort (https://ieee-dataport.org/documents/bluetempnet-temporal-multi-network-dataset-social-interactions-bluesky-social).
- Code libraries: Datasets such as MADOC provide Python (pyMADOC) and R (rMADOC) clients. Scripts for data retrieval and cleaning are routinely released alongside data (Dankulov et al., 22 Jan 2025, Jeong et al., 24 Jul 2024).
- API coverage: Most data can be reconstructed/updated from the public Bluesky AT Protocol APIs (PDS xRPC, AppView, Firehose), with no private key required (Kleppmann et al., 5 Feb 2024, Jeong et al., 24 Jul 2024, Failla et al., 29 Apr 2024).
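Reconstructing a follow edgelist from the public API amounts to cursor-based pagination. The sketch below assumes the public AppView host `https://public.api.bsky.app` and the `follows` and `cursor` response fields of `app.bsky.graph.getFollows`; verify both against the current lexicon before relying on them. `iter_follows`, `http_get_json`, and `fake_fetch` are hypothetical helpers, and the demonstration uses an offline stub so that no network request is made.

```python
import json
import urllib.parse
import urllib.request

# Assumed public AppView base URL; adjust to the host you are permitted to query.
BASE_URL = "https://public.api.bsky.app/xrpc"


def http_get_json(method, params):
    """Issue one unauthenticated XRPC GET request and decode the JSON body."""
    url = f"{BASE_URL}/{method}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_follows(actor, fetch=http_get_json, page_size=100):
    """Yield (actor, followed_did) pairs, following the cursor until exhausted."""
    cursor = None
    while True:
        params = {"actor": actor, "limit": page_size}
        if cursor:
            params["cursor"] = cursor
        page = fetch("app.bsky.graph.getFollows", params)
        for profile in page.get("follows", []):
            yield actor, profile["did"]
        cursor = page.get("cursor")
        if not cursor:
            break


def fake_fetch(method, params):
    """Offline stand-in for the API: two pages linked by a cursor."""
    pages = {
        None: {"follows": [{"did": "did:example:bob"}], "cursor": "page2"},
        "page2": {"follows": [{"did": "did:example:carol"}]},
    }
    return pages[params.get("cursor")]


print(list(iter_follows("did:example:alice", fetch=fake_fetch)))
```

Passing the fetch function as a parameter keeps the pagination logic testable and makes it easy to add rate limiting or retries in one place.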
6. Research Applications and Limitations
Bluesky datasets serve as testbeds for:
- Dynamic network analysis: Structural balance, signed link prediction, community detection, contagion spread, and higher-order diffusion on groups (Jeong et al., 24 Jul 2024, Smith et al., 16 May 2025, Quelle et al., 27 May 2024).
- Content and engagement modeling: Altmetrics, originality metrics, text-based stance classification using dense retrieval and LLMs, engagement analysis (likes, reposts, reply rates) (Zheng et al., 24 Jul 2025, Rostami et al., 9 Jun 2025).
- Comparative and migration studies: Analysis of cross-platform user migration, peer contagion, and behavioral transplantation (e.g., Twitter→Bluesky) (Quelle et al., 30 May 2025, Dankulov et al., 22 Jan 2025).
- Algorithmic curation and moderation: Custom feed dynamics, user engagement with algorithmic rankings, and the impact of third-party labelers for decentralized moderation architectures (Quelle et al., 27 May 2024, Kleppmann et al., 5 Feb 2024).
- Limitations: Datasets are limited by incomplete coverage of the period before custom feeds were introduced (before May 2023), the absence of private actions, niche platform adoption (with consequent cultural and coverage biases), and evolving API and sampling schemas (Jeong et al., 24 Jul 2024, Smith et al., 16 May 2025, Failla et al., 29 Apr 2024).
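As a concrete instance of the structural-balance analyses listed above, the sketch below counts balanced and unbalanced triangles in a signed graph, treating follows as +1 and blocks as -1. It simplifies the directed data to undirected signed edges and enumerates node triples by brute force, so it suits small samples only. The function name and toy edges are illustrative.

```python
from itertools import combinations


def triangle_balance(signed_edges):
    """Return (balanced, unbalanced) triangle counts for an undirected signed graph."""
    sign = {frozenset((u, v)): s for u, v, s in signed_edges if u != v}
    nodes = sorted({node for pair in sign for node in pair})
    balanced = unbalanced = 0
    for a, b, c in combinations(nodes, 3):
        pairs = (frozenset((a, b)), frozenset((b, c)), frozenset((a, c)))
        if all(p in sign for p in pairs):
            # A triangle is balanced when the product of its edge signs is positive.
            product = sign[pairs[0]] * sign[pairs[1]] * sign[pairs[2]]
            if product > 0:
                balanced += 1
            else:
                unbalanced += 1
    return balanced, unbalanced


toy_edges = [
    ("a", "b", +1), ("b", "c", +1), ("a", "c", +1),  # all-positive triangle
    ("a", "d", -1), ("b", "d", -1),                  # a and b both block d
    ("c", "d", +1),                                  # c follows d
]
print(triangle_balance(toy_edges))
```

In the toy graph, the triangles a-b-c and a-b-d are balanced, while a-c-d and b-c-d each contain exactly one negative edge and are unbalanced.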
7. Significance and Prospects
The Bluesky Dataset family constitutes a unique resource for the network science, computational social science, and machine learning communities:
- Open architecture: Bluesky’s separation of identity, hosting, indexing, content feeds, and moderation exposes rich, composable data structures, facilitating novel research in decentralized systems (Kleppmann et al., 5 Feb 2024, Balduf et al., 22 Aug 2024).
- Granularity, scale, and diversity: Millisecond-resolved multilayer graphs, higher-order hypergraph structures, population-scale activity streams, and integration with external knowledge bases make possible analyses previously infeasible on traditional, siloed platforms.
- Use in benchmarking: These datasets allow development and evaluation of new methods in signed-temporal network analysis, hypergraph learning, stance detection (user- and post-level), rare-action prediction, moderation, diffusion modeling, and migration dynamics.
- Researcher access and legal/ethical compliance: All major datasets are released under FAIR-compliant terms with open code, documented limitations, robust anonymization, and explicit adherence to Bluesky’s public data policy (Jeong et al., 24 Jul 2024, Failla et al., 29 Apr 2024, Smith et al., 16 May 2025, Dankulov et al., 22 Jan 2025).
The Bluesky Dataset ecosystem is thus central to current methodological advancement in network, content, and group-dynamics research on decentralized, open social platforms.