Rasa TOFU-R Chatbot Dataset
- Rasa TOFU-R is a dataset capturing a detailed snapshot of open-source chatbot development with extensive annotations on intents, actions, and language diversity.
- The dataset was systematically compiled through automated GitHub API queries, YAML parsing, and rigorous deduplication using code similarity metrics.
- Empirical findings reveal diverse dialogue and functional complexity across the corpus, enabling reproducible benchmarking and empirical evaluation in both academic and industrial contexts.
Rasa task-based chatbots from GitHub, codified as the TOFU-R dataset, represent a comprehensive, empirically-grounded snapshot of open-source Rasa chatbot development. TOFU-R consists of thousands of distinct chatbots extracted from public GitHub repositories, systematically identified, filtered, and annotated. By capturing a detailed cross-section of dialogue and functional complexity, language diversity, and versioning, TOFU-R enables reproducible research and benchmarking in chatbot quality assurance and empirical evaluation of both academic and industrial systems.
1. Dataset Construction Methodology
The TOFU-R dataset was assembled through a multi-stage, automated pipeline designed to reflect the state of open-source Rasa chatbot development as of January 2025 (Masserini et al., 21 Aug 2025).
- Repository Discovery: The GitHub REST API was queried for repositories containing both "Rasa" and "chatbot" in metadata fields (title, README, description, or topics), resulting in 8,436 non-empty repositories.
- Classification Criteria: Repositories were programmatically inspected for YAML domain files containing the "intents" field, a necessary component of all authentic Rasa chatbot configurations, yielding a first-level filter of "real" chatbot candidates (both the discovery query and this check are sketched just after this list).
- Artifact Extraction: Because repositories frequently host multiple chatbots, chatbots were extracted as atomic units: each unique domain file (or group of domain files within a folder) delineates a distinct chatbot entity. This process identified 6,819 initial chatbot artifacts.
- Parameter and Language Extraction: For each chatbot, the domain files were parsed to extract the sets of intents, entities, slots, actions, and, if provided, the Rasa version. Language identification was performed over the training phrases and responses using a Detect Language API, ensuring that chatbots lacking any human language data were excluded.
- Deduplication: Duplicates were removed through a combination of configuration comparison and code-similarity (difflib) filters with a 95% code-similarity threshold. Within each duplicate group, the instance with the most recent Rasa version, stronger community metrics (stars, forks), and earliest creation date was retained (a simplified sketch of this filter appears below).
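As a concrete illustration, the discovery and classification steps above can be sketched in Python. The endpoint and search qualifiers follow the public GitHub REST API, but the helper functions are hypothetical and not the paper's actual pipeline code:

```python
import requests
import yaml

GITHUB_SEARCH = "https://api.github.com/search/repositories"

def discover_repositories(token: str) -> list[dict]:
    """Query the GitHub search API for repositories mentioning both
    'Rasa' and 'chatbot' in name, description, README, or topics."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    params = {"q": "rasa chatbot in:name,description,readme,topics",
              "per_page": 100}
    response = requests.get(GITHUB_SEARCH, headers=headers, params=params)
    response.raise_for_status()
    return response.json()["items"]

def is_rasa_domain(file_text: str) -> bool:
    """First-level filter: a YAML file counts as a Rasa domain file
    only if it parses cleanly and declares an 'intents' field."""
    try:
        parsed = yaml.safe_load(file_text)
    except yaml.YAMLError:
        return False
    return isinstance(parsed, dict) and "intents" in parsed
```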
This methodology yields a dataset that is both large-scale and systematically vetted, providing a suitable basis for representative empirical analysis and replicable experimentation.
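The deduplication step lends itself to a similar sketch. difflib and the 95% threshold are named in the construction methodology; the record fields and the quadratic pairwise loop below are simplifying assumptions for illustration:

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.95  # 95% code similarity, as in TOFU-R

def code_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] over two chatbots' concatenated source."""
    return SequenceMatcher(None, a, b).ratio()

def deduplicate(chatbots: list[dict]) -> list[dict]:
    """Keep one representative per near-duplicate group, preferring the
    most recent Rasa version, then stars/forks, then earliest creation."""
    # Note: comparing version strings lexicographically is a
    # simplification; a real pipeline would parse them numerically.
    ranked = sorted(
        chatbots,
        key=lambda c: (c["rasa_version"], c["stars"], c["forks"],
                       -c["created_at"].timestamp()),
        reverse=True,
    )
    kept: list[dict] = []
    for candidate in ranked:
        if all(code_similarity(candidate["source"], kept_bot["source"])
               < SIMILARITY_THRESHOLD for kept_bot in kept):
            kept.append(candidate)
    return kept
```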
2. Key Extracted Metrics and Complexity Characterization
TOFU-R captures both dialogue and functional complexity through explicit, quantifiable metrics:
| Metric Type | Key Extraction | Range/Significance |
|---|---|---|
| Dialogue Complexity | # of intents, entities, slots | 1 to >1,500 intents; indicative of interpretation diversity and domain generality |
| Functional Complexity | # of actions, custom actions, API/database integrations | Presence of custom actions and API integrations signals capacity for business logic and external service invocation |
| Language Support | Training-phrase and response languages | English dominates, but multilingual support is explicit in the metadata |
| Rasa Versioning | Extracted version fields | Used for filtering and up-to-date benchmarking |
| Code Similarity | difflib, 95% threshold | Ensures unique, non-duplicative entries |
Dialogue complexity reflects the breadth of user input types a chatbot can interpret, while functional complexity is anchored in the bot's action space: the count and types of custom actions and API/database integrations, as confirmed by LLM-assisted static analysis of code and README files.
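These counts can be read directly off a parsed domain file. The keys below ("intents", "entities", "slots", "actions") are standard Rasa domain fields, but the function itself is a minimal sketch rather than the dataset's published extraction code, and the "action_" prefix check relies only on the usual naming convention for custom actions:

```python
import yaml

def extract_complexity_metrics(domain_path: str) -> dict[str, int]:
    """Count the dialogue- and functional-complexity indicators that
    TOFU-R records for each chatbot."""
    with open(domain_path, encoding="utf-8") as fh:
        domain = yaml.safe_load(fh) or {}
    actions = domain.get("actions", [])
    return {
        "intents": len(domain.get("intents", [])),
        "entities": len(domain.get("entities", [])),
        "slots": len(domain.get("slots", {})),   # slots form a mapping
        "actions": len(actions),
        # Custom actions conventionally carry an 'action_' prefix.
        "custom_actions": sum(
            1 for a in actions
            if isinstance(a, str) and a.startswith("action_")
        ),
    }
```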
3. Technical Validation and Quantitative Evaluation
To support the reliability of the extraction and annotation process, the system includes both language and service integration validation components:
- External Service Extraction: Utilizes ChatGPT-4 with tuned prompts (temperature=1, top-p=0.15) to analyze code and README files for external service calls. Precision and recall were measured on a manually annotated sample, achieving 0.90 precision and 1.00 recall. The resulting F-score, calculated as F = 2 · (precision · recall) / (precision + recall) = 2 · (0.90 · 1.00) / (0.90 + 1.00) ≈ 0.95, demonstrates that the service extraction methodology performs with high accuracy in practice.
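The arithmetic behind that figure is easy to verify; the helper below is a generic F-score computation, not code from the paper:

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    return 2 * precision * recall / (precision + recall)

print(round(f_score(0.90, 1.00), 3))  # 0.947, consistent with ~0.95
```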
This rigorous validation ensures that the resulting metrics, annotations, and labels are robust and suitable for downstream research or automated assessment workflows.
4. Empirical Findings: Diversity and Distribution
TOFU-R reveals key empirical findings about Rasa chatbot development practices:
- Heterogeneity: The dataset includes both trivial (single-intent) and highly complex (multi-hundred-intent, multi-action) chatbots, confirming a wide spectrum of usage from toy to production-grade systems.
- Action Diversity: While many chatbots implement only basic responses, a substantial subset features custom actions and integrations with external APIs and databases, reflecting real-world application ambitions.
- Language and Version Skew: The overwhelming majority of chatbots are English-language, with varying Rasa version adoption, highlighting both global reach and fragmentation across framework releases.
This diversity establishes TOFU-R as an essential empirical resource for any research on chatbot robustness, dialog complexity, and real-world adoption trends.
5. From TOFU-R to the BRASATO Curated Subset
Recognizing that not all chatbots in TOFU-R represent best practices or meaningful functional complexity, the authors curated a secondary dataset, BRASATO, by applying further criteria (sketched after this list):
- Custom Actions: Only chatbots with implemented custom actions are retained to focus on bots capable of executing nontrivial business logic or external service calls.
- Language and Versioning: Filters for English and up-to-date Rasa versions ensure relevance to international users and the current ecosystem.
- Utility and Documentation: Preference for chatbots with higher community metrics and maintained documentation.
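A minimal sketch of applying these curation filters over TOFU-R records follows; the record field names and the version cutoff are hypothetical placeholders, not the dataset's published schema:

```python
MIN_RASA_VERSION = (3, 0)  # assumed cutoff for an "up-to-date" version

def parse_version(version: str) -> tuple[int, ...]:
    """'3.1.0' -> (3, 1, 0); non-numeric fragments are dropped."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())

def curate_brasato(tofu_r: list[dict]) -> list[dict]:
    """Apply the BRASATO selection criteria listed above."""
    selected = [
        bot for bot in tofu_r
        if bot["custom_actions"] > 0                 # nontrivial logic
        and bot["language"] == "en"                  # English-language
        and parse_version(bot["rasa_version"]) >= MIN_RASA_VERSION
    ]
    # Prefer better-maintained bots via community metrics.
    return sorted(selected, key=lambda b: (b["stars"], b["forks"]),
                  reverse=True)
```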
BRASATO is intended for in-depth reproducible experiments, benchmarking, and as a testbed for emerging automated quality assessment techniques.
6. Impact and Research Implications
TOFU-R and its curated BRASATO subset establish novel empirical benchmarks and have significant implications for the assessment and advancement of task-based chatbot research:
- Quality Assessment: The availability of large-scale, systematically constructed datasets supersedes the use of limited or outdated "toy" examples, enabling meaningful evaluation of reliability, security, and robustness.
- Benchmarking: Researchers can assess NLU, dialogue management, and service-orchestration methods against a diverse, real-world corpus.
- Meta-Analysis: The structured extraction of complexity and functional parameters supports scalable meta-studies on trends, common practices, and emerging design patterns in open-source chatbot development.
A plausible implication is that, with TOFU-R and BRASATO, the field now possesses the means to perform genuinely representative and repeatable empirical analyses, closing a critical gap in the empirical foundations of conversational AI.
7. Limitations and Future Curation
While comprehensive within its scope, TOFU-R reflects the state of GitHub-accessible, open-source Rasa chatbot development as of a fixed date (January 2025), so it may miss newer practices that remain unpublished or confined to private repositories. Ongoing maintenance and periodic updates are necessary to ensure longitudinal relevance. The methodology, rooted in reproducible API queries, rigorous parsing, and validated NLP/LLM components, is generalizable to future software snapshots and other framework ecosystems, supporting the evolving needs of chatbot quality assurance and empirical AI research.