Content Behavior Corpus (CBC)

Updated 3 June 2026

Content Behavior Corpus (CBC) is a large-scale, annotated dataset linking communicators, their content, and receiver behaviors from YouTube and Twitter.
It employs automated techniques such as ASR transcription, BLIP-2 captioning, and sentiment analysis to systematically annotate and normalize multimedia content.
CBC supports research on content-behavior interactions by enabling simulation, predictive modeling, and analysis of key metrics like views, likes, and replay values.

The Content Behavior Corpus (CBC) is a large-scale, public repository designed to support research on the intersection of content understanding and behavioral modeling. CBC provides structured, richly annotated data comprising communicators, their messages, and the corresponding receiver behaviors, targeting simulation, prediction, and optimization of communication effectiveness. CBC underpins the development of Large Content and Behavior Models (LCBMs), which extend beyond traditional language modeling by incorporating behavioral feedback such as likes, replays, and comment sentiment, thereby directly addressing receiver effects (Khandelwal et al., 2023).

1. Composition and Structure

CBC consists of two major corpora sourced from YouTube and Twitter. The YouTube subset contains approximately 40,000 public-domain videos spanning diverse categories, while the Twitter subset encompasses approximately 168 million tweets from 10,135 enterprise accounts covering the years 2007–2023. Each data point links (1) a communicator (YouTube channel or Twitter account), (2) the crafted content (video or tweet with associated metadata), and (3) behavior tokens indicating receiver response.

Corpus Data Schema Overview

Domain	Communicator	Message Content	Behavior Tokens
YouTube	Channel metadata	Video asset, title, desc.	Views, likes, like/view ratio, per-scene replay, comment sentiment summary
Twitter	Account metadata	Tweet text, media	Like_count, predicted engagement ('high'/'low')
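
As a rough illustration of the communicator / message / behavior triple in the table above, the following sketch models one record with Python dataclasses. All class and field names here are hypothetical and chosen for readability; they are not the corpus's actual field names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Communicator:
    """Who sent the message: a YouTube channel or a Twitter account."""
    communicator_id: str
    domain: str  # "youtube" or "twitter"
    name: str


@dataclass
class Message:
    """The crafted content: a video or a tweet plus its metadata."""
    message_id: str
    text: str  # title/description for videos, tweet text for tweets
    media_captions: List[str] = field(default_factory=list)


@dataclass
class BehaviorTokens:
    """Receiver response; fields that do not apply to a domain stay empty."""
    likes: int
    views: Optional[int] = None
    scene_replay: List[float] = field(default_factory=list)
    engagement_label: Optional[str] = None  # "high" / "low" (Twitter)

    @property
    def like_view_ratio(self) -> Optional[float]:
        # Only defined when a positive view count is available.
        if not self.views:
            return None
        return self.likes / self.views


@dataclass
class CBCRecord:
    """One data point: communicator, content, and receiver behavior."""
    communicator: Communicator
    message: Message
    behavior: BehaviorTokens


# Made-up values, for illustration only.
example = CBCRecord(
    communicator=Communicator("ch_001", "youtube", "Example Channel"),
    message=Message("vid_001", "An example title"),
    behavior=BehaviorTokens(likes=230, views=10_000, scene_replay=[55.0, 41.0]),
)
print(example.behavior.like_view_ratio)
```

Keeping the behavior fields optional lets one record type cover both domains, since tweets carry no views or per-scene replay values in the schema above.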

On YouTube, videos are segmented into scenes, each annotated with automatic speech recognition (ASR) transcripts and vision model (BLIP-2) frame captions. For Twitter, tweets are coupled with media objects carrying BLIP-2 captions, keywords, and optionally colors/tones.

CBC uses a modular directory structure with dedicated JSONL and CSV files for messages and communicator metadata. Video-level and scene-level records are stored separately, facilitating multi-level analysis.

2. Data Collection and Preprocessing

YouTube data originates from the YouTube Data API v3, providing access to video metadata, engagement statistics (view and like counts), retention graphs (for replay analysis), and comments. Twitter data is sourced from the Twitter API, incorporating tweet text, timestamps, and account information.

The corpus construction employs several filtering and normalization steps:

- YouTube videos lacking a retention graph or shorter than 15 seconds are excluded. Uniform sampling is applied to ensure broad category coverage.
- Twitter data is filtered to enterprise accounts; tweets with more than four media attachments or non-English content are discarded.
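
The filtering rules above can be sketched as two predicates. This is a minimal illustration, not the corpus's actual pipeline: the dictionary keys (`retention_graph`, `duration_seconds`, `is_enterprise_account`, `lang`, `media`) are assumed names, and only the two thresholds come from the text.

```python
from typing import Any, Dict

MIN_VIDEO_SECONDS = 15  # from the text: shorter videos are excluded
MAX_TWEET_MEDIA = 4     # from the text: more than four attachments are discarded


def keep_video(video: Dict[str, Any]) -> bool:
    """Keep a video only if it has a retention graph and lasts at least 15 s."""
    has_graph = bool(video.get("retention_graph"))
    long_enough = video.get("duration_seconds", 0) >= MIN_VIDEO_SECONDS
    return has_graph and long_enough


def keep_tweet(tweet: Dict[str, Any]) -> bool:
    """Keep English tweets from enterprise accounts with at most four media items."""
    return bool(
        tweet.get("is_enterprise_account", False)
        and tweet.get("lang") == "en"
        and len(tweet.get("media", [])) <= MAX_TWEET_MEDIA
    )


# Made-up records, for illustration only.
videos = [
    {"video_id": "v1", "duration_seconds": 120, "retention_graph": [80, 60, 40]},
    {"video_id": "v2", "duration_seconds": 10, "retention_graph": [90, 70]},
    {"video_id": "v3", "duration_seconds": 300, "retention_graph": []},
]
kept_ids = [v["video_id"] for v in videos if keep_video(v)]
print(kept_ids)
```

The uniform sampling across categories mentioned above is a separate step and is not covered by this sketch.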

Automated annotation includes:

- Direct extraction of replay, view, and like statistics from APIs.
- Comment sentiment labeling through an off-the-shelf sentiment analysis model with bucketed output (positive, neutral, negative).
- Scene segmentation of videos via sampled retention-graph points, typically yielding $m \approx 10$–$20$ scenes per video.
- ASR transcription for each scene using Whisper v2.
- Two randomly sampled video frames per scene receive BLIP-2-generated captions.
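
One way to picture how a retention graph turns into per-scene replay values is to cut the graph into contiguous segments and average each one. The sketch below does exactly that with equal-width segments; the equal-width split and the function name `scene_replay_values` are assumptions made for illustration, and the corpus's own segmentation rule may differ.

```python
from typing import List


def scene_replay_values(retention: List[float], num_scenes: int) -> List[float]:
    """Split a retention graph into `num_scenes` contiguous chunks and average each.

    Assumption: scenes are equal-width slices of the retention graph.
    """
    n = len(retention)
    if num_scenes <= 0 or num_scenes > n:
        raise ValueError("num_scenes must be between 1 and len(retention)")
    values = []
    for i in range(num_scenes):
        start = i * n // num_scenes
        end = (i + 1) * n // num_scenes
        chunk = retention[start:end]  # never empty because num_scenes <= n
        values.append(sum(chunk) / len(chunk))
    return values


# Made-up retention graph with eight sampled points, split into four scenes.
graph = [90.0, 80.0, 70.0, 50.0, 40.0, 40.0, 30.0, 10.0]
print(scene_replay_values(graph, 4))  # [85.0, 60.0, 40.0, 20.0]
```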

All textual content undergoes normalization, with vocabulary standardized through byte-pair encoding to approximately 50,000 types.

3. Data Formats and Access

CBC’s directory organization and data formats support both large-scale and targeted analysis. The core files in each domain include:

For YouTube:
- videos.jsonl: one JSON record per video (identifiers, channel metadata, engagement statistics, sentiment summary).
- scenes.jsonl: one JSON per scene, cross-referenced by video_id (scene boundaries, ASR, frame_caption, replay_value).
- metadata.csv: channel identifiers and names.
For Twitter:
- tweets.jsonl: tweet-level records (identifiers, text, media object array, like_count).
- accounts.csv: account identifiers, names, follower counts.

Records include fields such as ISO timestamps, string and integer identifiers, and floats (e.g., like/view ratio). All JSON examples adhere to field naming and formats as disclosed, which facilitates automated downstream processing.
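
Because video-level and scene-level records live in separate JSONL files linked by `video_id`, a typical first step is to join them. The following standard-library sketch reads two small in-memory stand-ins for videos.jsonl and scenes.jsonl; apart from `video_id` and `replay_value`, which appear in the file descriptions above, the key names (`total_views`, `likes`, `scene_index`) and all values are assumptions.

```python
import io
import json
from collections import defaultdict

# Toy stand-ins for videos.jsonl and scenes.jsonl; values are made up.
videos_file = io.StringIO(
    '{"video_id": "v1", "total_views": 1000, "likes": 20}\n'
    '{"video_id": "v2", "total_views": 500, "likes": 5}\n'
)
scenes_file = io.StringIO(
    '{"video_id": "v1", "scene_index": 1, "replay_value": 40.0}\n'
    '{"video_id": "v1", "scene_index": 0, "replay_value": 60.0}\n'
    '{"video_id": "v2", "scene_index": 0, "replay_value": 52.0}\n'
)


def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


videos = read_jsonl(videos_file)
scenes = read_jsonl(scenes_file)

# Group scenes under their parent video.
scenes_by_video = defaultdict(list)
for scene in scenes:
    scenes_by_video[scene["video_id"]].append(scene)

# Attach ordered scenes and a derived like/view ratio to each video.
for video in videos:
    own_scenes = scenes_by_video.get(video["video_id"], [])
    video["scenes"] = sorted(own_scenes, key=lambda s: s["scene_index"])
    video["like_view_ratio"] = video["likes"] / video["total_views"]

for video in videos:
    replay = [s["replay_value"] for s in video["scenes"]]
    print(video["video_id"], video["like_view_ratio"], replay)
```

With real files, the `io.StringIO` objects would be replaced by `open(path, encoding="utf-8")` handles.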

Access and usage are streamlined via supplied Python loader scripts within cbc_loaders/, which return appropriately structured pandas.DataFrame objects for both YouTube and Twitter datasets. Command-line tools (cbc-stats, cbc-query) enable metric computation and subset queries from the shell.

4. Statistical Properties and Behavior Analysis

CBC documents explicit behavior distributions and content statistics for both YouTube and Twitter:

YouTube subset:
- Mean replay_value ≈ 48, σ ≈ 18, reflecting slight left skew (early scenes more heavily watched).
- Median total_views ≈ 250,000; 90th percentile ≈ 5 million.
- Mean like/view ratio ≈ 2.3%, σ ≈ 1.1%.
- ASR transcript length per scene: mean ≈ 12 words; unique word piece vocabulary ≈ 25,000.
Twitter subset:
- Mean like_count ≈ 45; median ≈ 3.
- “High”/“low” like classes are thresholded per account (top 20% by account) and are reported as roughly balanced (≈50/50).
- Tweet length: mean ≈ 22 words; vocabulary ≈ 18,000 types.
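
Per-account thresholding of like counts can be sketched as follows. The cutoff is left as a parameter (`top_fraction`) because the exact rule is not pinned down in the text; the function name, the key names `account_id` and `like_count` as used here, and the tie-handling choice are all assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, List


def label_engagement(tweets: List[Dict[str, Any]], top_fraction: float = 0.2) -> List[str]:
    """Label each tweet "high" or "low" relative to its own account's like counts.

    A tweet is "high" if its like_count reaches the account's k-th largest value,
    where k is about top_fraction of that account's tweets (at least 1).
    Ties at the threshold are all labeled "high".
    """
    likes_by_account = defaultdict(list)
    for tweet in tweets:
        likes_by_account[tweet["account_id"]].append(tweet["like_count"])

    thresholds = {}
    for account, counts in likes_by_account.items():
        ranked = sorted(counts, reverse=True)
        k = max(1, round(len(ranked) * top_fraction))
        thresholds[account] = ranked[k - 1]

    return [
        "high" if t["like_count"] >= thresholds[t["account_id"]] else "low"
        for t in tweets
    ]


# Made-up tweets: account "a" is far more popular than account "b".
toy_tweets = [
    {"account_id": "a", "like_count": 100},
    {"account_id": "a", "like_count": 400},
    {"account_id": "a", "like_count": 250},
    {"account_id": "a", "like_count": 900},
    {"account_id": "a", "like_count": 150},
    {"account_id": "b", "like_count": 3},
    {"account_id": "b", "like_count": 7},
]
print(label_engagement(toy_tweets))
```

Thresholding within each account means a tweet with 7 likes can be "high" for a small account while a tweet with 400 likes is "low" for a large one, which is the point of normalizing by communicator.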

CBC supports analysis of content/behavior interaction via observed correlations, including:

- Pearson $r(\text{views}, \text{likes}) = 0.87$ in YouTube.
- $P(\text{like} \mid \text{contains\_hyperlink}) \approx 0.12$ vs $P(\text{like} \mid \text{no\_link}) \approx 0.08$ for tweets.
- $r(\text{replay\_value}, \text{comment\_sentiment\_positive}) = 0.31$ for videos.

A plausible implication is that replay_value may serve as a proxy for positive sentiment engagement in video content.
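
For readers who want to recompute statistics of this kind on their own subsets, the sketch below implements the Pearson correlation coefficient and a simple conditional rate in plain Python. The toy numbers are invented and do not reproduce the corpus figures, and treating "like" as a binary outcome per tweet is an assumption about how the conditional probabilities above were defined.

```python
import math
from typing import List, Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


def conditional_rate(rows: List[Tuple[bool, bool]], condition: bool) -> float:
    """Estimate P(outcome | flag == condition) from (flag, outcome) pairs."""
    outcomes = [outcome for flag, outcome in rows if flag == condition]
    if not outcomes:
        raise ValueError("no rows match the condition")
    return sum(outcomes) / len(outcomes)


# Invented toy data, not corpus values.
views = [100, 200, 300, 400]
likes = [3, 5, 8, 9]
print(round(pearson_r(views, likes), 3))

# (contains_hyperlink, was_liked) pairs.
rows = [(True, True), (True, False), (False, False), (False, False),
        (False, True), (False, False)]
print(conditional_rate(rows, True), conditional_rate(rows, False))
```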

5. Applications and Usage

CBC supports multiple research tasks, including content and behavior simulation, cross-domain adaptation, and behavior prediction, all central to the emerging field of LCBMs (Khandelwal et al., 2023). Typical applications encompass:

- Assessing the effect of content modifications on predicted behavioral metrics (e.g., like/view ratio, replay profile).
- Learning joint representations that capture both semantic and behavioral attributes.
- Generalization to unseen behavioral patterns and adaptation across platforms.

Data access is facilitated by example notebooks, JSON/CSV dumps, and in-repository Python scripts. The quick-start usage involves:

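```python
from cbc_loaders import load_youtube, load_twitter

# Video-level records (one row per video).
videos = load_youtube("CBC/youtube/videos.jsonl")
# Scene-level records (one row per scene, keyed by video_id).
scenes = load_youtube("CBC/youtube/scenes.jsonl", level="scenes")
# Tweet-level records.
tweets = load_twitter("CBC/twitter/tweets.jsonl")
```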

Command-line querying is enabled for both statistics and advanced filtering, e.g.:

```shell
cbc-stats --subset youtube --metric "replay_value>70" --count
cbc-query --source twitter --filter "like_count>1000" --limit 5
```

6. Licensing and Citation

CBC is distributed under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license, permitting both commercial and non-commercial use with attribution to the original creators. The canonical citation is: Ashmit Khandelwal et al., “Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior,” ICLR 2024 (Khandelwal et al., 2023).

CBC’s release includes comprehensive documentation, example notebooks, and robust code interfaces, establishing it as a foundational asset for research at the intersection of content understanding and behavioral modeling.

References

Khandelwal, A., et al. (2023). Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior. ICLR 2024.
