Bot Rasa Collection (BRASATO)
- BRASATO is a curated dataset of task-based Rasa chatbots selected for dialogue complexity, custom actions, and real-world adoption.
- It is constructed by rigorously filtering over 8,000 GitHub repositories based on pragmatic utility, conversational capabilities, and community validation.
- The dataset facilitates reproducible research on chatbot reliability, security, and empirical evaluation of conversational agents.
The Bot Rasa Collection (BRASATO) is a curated dataset of operational, task-based Rasa chatbots selected to serve as a benchmark for research on chatbot reliability, dialogue complexity, functional sophistication, and empirical evaluation of automated quality assessment techniques. BRASATO was constructed as a filtered subset from the TOFU-R snapshot—an initial pool of over 8,000 Rasa chatbot repositories harvested from GitHub—by applying stringent criteria based on pragmatic utility, conversational capability, and real-world adoption. Its primary objective is to facilitate reproducibility in chatbot evaluation, supply the community with a diverse pool of relevant subjects, and support empirical investigations into the robustness and security of open-source conversational agents (Masserini et al., 21 Aug 2025).
1. Rationale and Purpose
BRASATO was designed to address the lack of large-scale, contemporary, and high-quality datasets appropriate for empirical work on chatbot reliability and security. Many prior evaluation efforts depended on toy examples or outdated conversational agents, limiting their generalizability and relevance. BRASATO remedies this by assembling a meaningful, nontrivial set of real Rasa chatbots that have demonstrable functionality, actual user-facing capabilities, and community validation via popularity metrics (GitHub stars). The dataset aims to be a central resource for researchers seeking empirically grounded, reproducible studies on task-based chatbot systems, including aspects such as dialogue complexity, functional integration (custom actions), and multi-topic coverage.
2. Selection Criteria
BRASATO employs a rigorously defined multi-step filtering process applied to the TOFU-R superset. Three principal axes drive selection:
- Dialogue Complexity: Only chatbots with at least one intent and evidence of information retrieval (at least one entity or conversational slot) are included. This eliminates mere template responders or bots lacking genuine conversational understanding, ensuring all subjects process and extract meaningful user input.
- Functional Complexity: At least one custom action (backend script beyond simple response actions) must be present. This guarantees the chatbot transcends static dialogue, interacting directly with business logic or external systems.
- Utility Metrics: Each chatbot must be configured to support English, use an up-to-date Rasa 3.x framework, and have attained a minimum level of GitHub popularity (at least one star), providing a proxy for community relevance.
Application of these criteria reduced TOFU-R’s several thousand repositories to a targeted subset of 198 chatbots comprising BRASATO.
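The three selection axes can be sketched as a single predicate. The following is a minimal illustration, assuming each candidate bot has already been summarized as a dict of counts and metadata parsed from its domain file and GitHub record; the field names are hypothetical, not the dataset's actual schema.

```python
def passes_brasato_filter(bot: dict) -> bool:
    """Apply the three BRASATO selection axes to one candidate bot.

    Field names (n_intents, n_entities, ...) are illustrative stand-ins
    for values parsed from the Rasa domain file and GitHub metadata.
    """
    # Dialogue complexity: at least one intent and one entity/slot.
    dialogue_ok = bot["n_intents"] >= 1 and (bot["n_entities"] + bot["n_slots"]) >= 1
    # Functional complexity: at least one custom backend action.
    functional_ok = bot["n_custom_actions"] >= 1
    # Utility: English support, Rasa 3.x, at least one GitHub star.
    utility_ok = (
        "en" in bot["languages"]
        and bot["rasa_version"].startswith("3.")
        and bot["github_stars"] >= 1
    )
    return dialogue_ok and functional_ok and utility_ok
```

Running the full TOFU-R pool through a predicate of this shape is what reduces the several thousand candidates to the 198-bot subset.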
3. Methodology and Automated Processing
The construction of BRASATO occurs in two phases:
- TOFU-R Dataset Compilation
  1. Automated GitHub repository search (keywords “Rasa” and “chatbot”).
  2. Classification via YAML domain file analysis to detect “intents.”
  3. Extraction of individual chatbot definitions (recognizing multi-bot repositories).
  4. Domain file parsing for conversation parameters (intents, entities, slots, actions, Rasa version).
  5. Language extraction leveraging the Detect Language API, supported by manual verification.
  6. Deduplication implemented with Python’s difflib library (95% code similarity threshold), favoring newer, more popular projects.
- BRASATO Curation
  1. Filtering for the dialogue, functional, and utility criteria detailed above.
  2. External Service Annotation: chatbot code and README files are statically analyzed and processed with LLMs (ChatGPT-4) using bespoke prompt templates (see the prompt-template figure in Masserini et al., 21 Aug 2025). Each extraction query is issued 10 times; an additional “merge” prompt synthesizes consensus results. The LLM configuration (temperature=1, top-p=0.15) was experimentally determined to provide optimal coverage.
  3. Topic Extraction: ChatGPT-4 classifies subject domains (e.g., Finance, Medical, Sports), referencing repository descriptors and conversational parameters and cross-referencing against a category schema (such as Google Play Store categories). Manual inspection assures fidelity.
The entire toolchain—including scripts for API queries, data extraction, prompt management, and automated analysis—is provided for ongoing dataset maintenance and future extension.
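Two of the pipeline steps above are concrete enough to sketch: parsing conversation parameters out of a Rasa domain file, and near-duplicate detection with difflib at the 95% threshold. This is an illustrative sketch, not the paper's actual scripts; it assumes PyYAML for domain parsing, and note that the domain's `version` key is the domain format version, which is only a proxy for the Rasa release in use.

```python
import difflib

import yaml  # PyYAML, assumed available


def parse_domain(domain_text: str) -> dict:
    """Extract conversation parameters from a Rasa domain YAML file.

    Keys follow the Rasa 3.x domain schema; missing sections yield
    empty lists so downstream filters can count them safely.
    """
    domain = yaml.safe_load(domain_text) or {}
    return {
        "intents": domain.get("intents", []),
        "entities": domain.get("entities", []),
        "slots": list((domain.get("slots") or {}).keys()),
        "actions": domain.get("actions", []),
        "version": domain.get("version"),
    }


def is_duplicate(code_a: str, code_b: str, threshold: float = 0.95) -> bool:
    """Flag near-duplicate bots via difflib similarity (the 95% cut)."""
    return difflib.SequenceMatcher(None, code_a, code_b).ratio() >= threshold
```

When `is_duplicate` fires, the pipeline keeps the newer, more popular of the two projects, as described in step 6 above.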
4. Dataset Composition and Structure
BRASATO comprises 198 English-supporting Rasa 3.x chatbots, each characterized by:
- At least one intent and one entity/slot for user information processing.
- At least one custom backend action.
- Annotation of external services invoked (e.g., databases, APIs), integrated through LLM-assisted code and documentation analysis.
- Topic categorization to facilitate cross-domain research.
This dataset’s structure supports analysis across topics, backend usage patterns, and external integration, furnishing a robust empirical substrate for system reliability, security, and dialogue complexity studies.
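The per-bot attributes listed above suggest a simple record layout. The dataclass below is a hypothetical sketch of one BRASATO entry, useful for seeing how the annotations compose; the field names are illustrative and not the dataset's published schema.

```python
from dataclasses import dataclass, field


@dataclass
class BrasatoEntry:
    """One annotated chatbot record (illustrative field names)."""

    repo_url: str
    intents: list[str]          # >= 1 by construction
    entities: list[str]         # entities + slots >= 1 by construction
    slots: list[str]
    custom_actions: list[str]   # >= 1 custom backend action
    external_services: list[str] = field(default_factory=list)  # LLM-annotated
    topic: str = "Unclassified"  # e.g., "Finance", "Medical", "Sports"
```

A layout like this makes the cross-domain queries mentioned above (grouping by topic, counting external integrations) straightforward list-and-filter operations.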
| Selection Axis | Requirement | Justification |
|---|---|---|
| Dialogue Complexity | ≥1 intent; ≥1 entity/slot | Verifies true NLU capability |
| Functional Complexity | ≥1 custom backend action | Confirms dynamic business logic |
| Utility | English support; Rasa 3.x; ≥1 GitHub star | Ensures currency and adoption |
5. Annotation and Topic Classification via LLMs
A distinctive element of BRASATO is its use of LLMs for enhanced annotation:
- External Service Extraction: ChatGPT-4, prompted via templated queries, reviews both code and documentation to list technical integrations (e.g., database queries, third-party API calls). Ten instantiations plus a merge step provide robust coverage.
- Topic Assignment: The model analyzes repository metadata and conversational schema to classify the core subject domain, harmonizing output to a predefined category list. Manual sampling verifies consistency and accuracy.
This methodology enables rich, reproducible metadata for each included chatbot, informing downstream research into domain adaptation, robustness, and functional diversity.
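The repeated-query-plus-merge pattern can be sketched as follows. This is a simplified stand-in, assuming a simple majority vote in place of the paper's LLM-driven "merge" prompt; `query_llm` is a hypothetical placeholder for a real ChatGPT-4 call (temperature=1, top-p=0.15), and the 10-run count matches the paper.

```python
from collections import Counter


def extract_with_consensus(query_llm, prompt: str, n_runs: int = 10) -> list[str]:
    """Issue the same extraction prompt n_runs times and keep only
    services named in at least half of the runs.

    query_llm is a caller-supplied function (hypothetical here) that
    sends the prompt to the model and returns a list of service names.
    """
    counts = Counter()
    for _ in range(n_runs):
        for service in query_llm(prompt):
            counts[service] += 1
    # Majority vote stands in for the paper's LLM "merge" prompt.
    return sorted(s for s, c in counts.items() if c >= n_runs / 2)
```

The consensus step is what makes the annotation robust to run-to-run variance in the model's output, which is substantial at temperature 1.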
6. Maintenance, Update, and Accessibility
BRASATO’s supporting toolchain and methodology are explicitly designed for continuous dataset maintenance:
- Scripts and code for repository analysis, LLM prompting, deduplication, and language detection are openly available.
- The dataset construction workflow enables regular updates, ensuring fidelity to evolving open-source chatbot development best practices and the emergence of new conversational domains or backend integrations.
- Selection and annotation procedures are documented to maximize reproducibility across the community.
7. Research Significance and Use Cases
By aggregating a representative, operational, and richly annotated set of Rasa chatbots, BRASATO provides a standardized empirical foundation for:
- Automated reliability and robustness assessment across dialogue systems.
- Security analysis focused on real-world backend interactions.
- Test automation benchmarks for conversational agent evaluation.
- Studies on dialogue and functional complexity, topic adaptation, and multi-domain coverage in open-source chatbots.
The dataset directly addresses limitations in prior research subject pools, facilitating methodologically sound experiments and inter-paper comparison, and enabling the broader Rasa and conversational AI research community to advance practical, reproducible results at both the system and component levels (Masserini et al., 21 Aug 2025).