
Semantic Clustering of Civic Proposals: A Case Study on Brazil's National Participation Platform (2509.21292v1)

Published 25 Sep 2025 in cs.SE

Abstract: Promoting participation on digital platforms such as Brasil Participativo has emerged as a top priority for governments worldwide. However, due to the sheer volume of contributions, much of this engagement goes underutilized, as organizing it presents significant challenges: (1) manual classification is unfeasible at scale; (2) expert involvement is required; and (3) alignment with official taxonomies is necessary. In this paper, we introduce an approach that combines BERTopic with seed words and automatic validation by LLMs. Initial results indicate that the generated topics are coherent and institutionally aligned, with minimal human effort. This methodology enables governments to transform large volumes of citizen input into actionable data for public policy.

Summary

  • The paper introduces a robust pipeline using BERTopic, seed words, and LLM validation to automate the classification of civic proposals.
  • It demonstrates that semi-supervised approaches with VCGE seed words significantly improve taxonomy alignment, as evidenced by ARI and NMI metrics.
  • The study offers practical insights for real-time policy responses by streamlining civic engagement on Brazil's national participation platform.


Introduction

The paper "Semantic Clustering of Civic Proposals: A Case Study on Brazil's National Participation Platform" addresses the challenge of organizing civic participation data on digital platforms. On platforms such as "Brasil Participativo" (BP), the volume of contributions is large enough that much of it goes underused. Three obstacles stand in the way: manual classification is impractical at scale, expert involvement is required, and the resulting categories must align with official taxonomies. The paper introduces a methodology that combines BERTopic, seed words, and automatic validation by LLMs to turn large volumes of citizen input into actionable data. The approach aims to produce topics that are semantically coherent and institutionally aligned while requiring minimal human effort.

Methodology

The pipeline covers data extraction and pre-processing, internal validation and hyperparameter optimization, topic modeling, and external validation. Its key stages are:

  • Pre-processing: Selects and categorizes civic proposals, then splits the data into a training set (80%) and a test set (20%).
  • Topic Modeling: Uses BERTopic with both unsupervised and semi-supervised strategies. The semi-supervised approach integrates seed words derived from the Vocabulário Controlado do Governo Eletrônico (VCGE) so that the generated topics align with official government categories.

    Figure 1: Thematic categorization pipeline with BERTopic.

  • External Validation: Uses LLMs such as Gemma 3 and GPT models to label proposals automatically, then checks alignment with metrics such as the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI).
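
The 80/20 split from the pre-processing step can be illustrated with a minimal sketch. It assumes proposals are plain strings and that a seeded random shuffle is acceptable; the summary does not say how the authors performed the split, and the function name `split_proposals` and the `seed` parameter are hypothetical.

```python
import random


def split_proposals(proposals, train_fraction=0.8, seed=42):
    """Shuffle a copy of the proposals and return (train, test).

    Assumption: a simple random split; the paper's exact procedure
    is not described in this summary.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(proposals)             # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy data standing in for real civic proposals.
proposals = [f"proposal {i}" for i in range(10)]
train, test = split_proposals(proposals)
print(len(train), len(test))
```

Fixing the seed keeps the split stable across runs, which matters when later stages compare configurations on the same held-out proposals.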

Results

Parameter Optimization

The authors searched for BERTopic configurations that maximized semantic coherence and thematic diversity. BERTimbau-large embeddings outperformed the alternatives across parameter settings because they balanced internal coherence and thematic diversity best.

Figure 2: Weighted score by model and number of topics.

The best configuration produced 56 topics and used a unigram model with tuned targets for the number of topics and the topic size. It balanced thematic diversity against cluster robustness.
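
To make the trade-off between coherence and diversity concrete, here is a minimal sketch of one way such a weighted score could be computed. It is an assumption, not the authors' formula: topic diversity is taken as the share of unique words among all topics' top words, the coherence value is supplied as a plain number, and the 0.5 weight and the function names are hypothetical.

```python
def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics."""
    all_words = [word for words in topics for word in words]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)


def weighted_score(coherence, diversity, coherence_weight=0.5):
    """Linear blend of coherence and diversity.

    Assumption: the weight is illustrative; the summary does not
    state how the paper's weighted score is defined.
    """
    return coherence_weight * coherence + (1.0 - coherence_weight) * diversity


# Two toy topics that share the word "health".
topics = [
    ["health", "hospital", "doctor"],
    ["school", "teacher", "health"],
]
diversity = topic_diversity(topics)  # 5 unique words out of 6
score = weighted_score(coherence=0.6, diversity=diversity)
print(round(diversity, 3), round(score, 3))
```

A score of this kind lets configurations with different embedding models and topic counts be ranked on a single axis, as in the comparison the figure above describes.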

Impact of Semi-supervision

Adding seed words significantly improved semantic alignment with VCGE categories. Semantic coherence and thematic consistency both increased, without a substantial loss of topic diversity.

Figure 3: Comparison between the unsupervised and semi-supervised approaches.

Comparing the unsupervised and semi-supervised approaches showed measurable improvements in ARI and NMI, which quantify how strongly the topics align with the official taxonomy at both the high level and the more detailed level.

Figure 4: External alignment metrics (ARI and NMI) at levels N1 and N2.
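
For readers unfamiliar with these two metrics, the sketch below computes ARI and NMI from scratch for a pair of labelings, for example reference categories versus topic assignments. It is illustrative only: the toy labels are invented, and normalizing NMI by the arithmetic mean of the two entropies is an assumption, since the summary does not say which normalization the paper used.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting agreement between two labelings, corrected for chance."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_a * same_b / total_pairs
    maximum = (same_a + same_b) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (same_both - expected) / (maximum - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(labels_a, labels_b):
    """Mutual information divided by the mean entropy (assumed normalization)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    mutual_info = sum(
        (c / n) * log(n * c / (count_a[a] * count_b[b]))
        for (a, b), c in Counter(zip(labels_a, labels_b)).items()
    )
    mean_entropy = (entropy(labels_a) + entropy(labels_b)) / 2
    if mean_entropy == 0:
        return 1.0
    return mutual_info / mean_entropy


# Toy example: reference categories versus two candidate topic assignments.
reference = ["health", "health", "education", "education"]
aligned = [0, 0, 1, 1]    # same grouping, different label names
scrambled = [0, 1, 0, 1]  # grouping unrelated to the reference

print(adjusted_rand_index(reference, aligned), normalized_mutual_info(reference, aligned))
print(adjusted_rand_index(reference, scrambled), normalized_mutual_info(reference, scrambled))
```

Both metrics ignore the label names and compare only the groupings, which is why topic IDs can be scored directly against taxonomy categories.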

Discussion

Strategic and Practical Implications

The paper describes an approach that can be integrated into civic engagement platforms to automate categorization. This reduces manual workload while keeping the resulting information reliable enough to inform policy formulation. The methodology also supports timely insight into emerging societal trends, such as shifts in public interest towards health or environmental concerns.

Enhancements and Future Directions

The research shows the value of combining linguistic knowledge, in the form of specialized embeddings, with institutional knowledge, in the form of official vocabularies, to capture semantic nuance and taxonomy alignment at scale. Future work should include periodic expert review of the seed words so that the model stays adaptable and relevant as democratic contexts change.

Conclusion

The paper presents a pipeline that turns civic proposals into actionable information for policymakers by combining BERTopic, LLM-based validation, and official taxonomies. Although LLMs automate much of the process, ongoing human validation remains necessary to refine and adapt the underlying vocabularies and parameters to different contexts. The result is a scalable approach to participatory governance that can improve public sector responsiveness to citizen input.
