
PNGT-26K: Persian Names, Gender & Transliteration

Updated 21 September 2025
  • PNGT-26K is a curated collection of ~26,000 Persian name tuples with gender labels and English transliterations, addressing transliteration ambiguities.
  • It underpins two frameworks—Open Gender Detection and Nominalist—enabling probabilistic gender inference and culturally informed username generation.
  • The dataset employs advanced normalization (using the Hazm library) and rigorous manual validation to preserve authentic naming conventions.

The PNGT-26K dataset is a systematically curated resource comprising approximately 26,000 tuples, each containing a Persian name, its commonly associated gender, and the corresponding English transliteration. Developed to address the transliteration inconsistencies and culturally specific naming conventions inherent to the Persian language, PNGT-26K underpins two production-ready frameworks—Open Gender Detection and Nominalist—that leverage the dataset to enable probabilistic gender detection and agentic username generation in digital environments (Bijary et al., 14 Sep 2025).

1. Composition and Structure

PNGT-26K aggregates data from diverse sources, primarily existing Kaggle and GitHub name collections, with rigorous manual validation and normalization to ensure integrity and consistency. Each tuple in the dataset consists of:

  • The original Persian name
  • The most probable gender association (across the dataset, roughly 65% of names are labeled male and 35% female)
  • The name’s English transliteration

A key challenge addressed is the lack of a one-to-one mapping between Persian script and Latin characters, leading to “one-to-many” transliteration scenarios. To mitigate this, the dataset includes either multiple or systematically selected transliterations per name. Cultural phenomena such as compound male naming and gendered character patterns are preserved, enabling analytics consistent with authentic Persian naming practices.

Preprocessing with the Hazm Python library handles character-level normalization. In particular, it standardizes letters that have distinct Arabic and Persian Unicode code points, such as “yeh” and “kaf”, which removes duplicate entries that differ only in encoding and harmonizes Unicode usage across the dataset.
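To make the normalization step concrete, the following is a minimal standard-library sketch of the kind of code point unification described above. It is not Hazm's implementation: the function name `normalize_name` and the two-entry mapping table are assumptions chosen for illustration, and Hazm's `Normalizer` performs considerably more cleanup.

```python
import unicodedata

# Assumed mapping for illustration: Arabic code points -> Persian equivalents.
_ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_name(name: str) -> str:
    """Trim whitespace, apply NFC, and unify yeh/kaf variants (hypothetical helper)."""
    text = unicodedata.normalize("NFC", name.strip())
    return text.translate(_ARABIC_TO_PERSIAN)


# Two encodings of the same name that render almost identically.
arabic_yeh_variant = "\u0639\u0644\u064A"   # ends with Arabic yeh
persian_yeh_variant = "\u0639\u0644\u06CC"  # ends with Persian yeh

# After normalization both spellings collapse to one dictionary key.
same = normalize_name(arabic_yeh_variant) == normalize_name(persian_yeh_variant)
print(same)  # True
```

Without this step, the two spellings would be stored as separate names, which is the duplication the preprocessing is meant to remove.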

| Field | Example | Role |
|---|---|---|
| Persian Name | سعید | Canonical input |
| Gender | Male | Gold label |
| Transliteration | Saeed | For Latin-script tasks |
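As an illustration of how one tuple might be represented in code, here is a small sketch using a frozen dataclass. The class name `NameRecord`, its field names, and the lowercase label value `"male"` are hypothetical; the published dataset's actual column names may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    """One name tuple; field names are assumptions, not the dataset's schema."""
    persian_name: str     # Persian-script name (canonical input)
    gender: str           # gold label, e.g. "male" or "female"
    transliteration: str  # English transliteration for Latin-script tasks


record = NameRecord(persian_name="سعید", gender="male", transliteration="Saeed")

# Index records by Persian name for exact-match lookups.
index = {record.persian_name: record}
print(index["سعید"].transliteration)  # Saeed
```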

2. Principal Applications and Framework Integration

The dataset is foundational to two frameworks:

Open Gender Detection Framework

A multimodal, production-grade tool for inferring user gender on digital platforms, Open Gender Detection fuses textual and visual modalities:

  • Textual inference: Name matching computes a length-normalized Levenshtein distance between normalized name strings, retrieves the top-K closest matches, and infers gender probabilistically from their aggregate scores.
  • Visual inference: Profile photo analysis leverages OpenCLIP embeddings and a support vector machine (SVM) classifier trained on approximately 160,000 images for gender labeling.
  • Fusion strategy: Outputs from both modalities are integrated through a mediator function implementing a weighted voting scheme, yielding a final probabilistic gender prediction.
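The fusion step can be sketched as a weighted average of the per-modality probabilities. This is only an illustration of weighted voting in general: the function name `fuse_gender_probabilities`, the default weights of 0.5, and the handling of a missing modality are assumptions, not the framework's actual mediator function.

```python
from typing import Optional


def fuse_gender_probabilities(
    p_male_text: Optional[float],
    p_male_image: Optional[float],
    w_text: float = 0.5,   # assumed weight for the name-based prediction
    w_image: float = 0.5,  # assumed weight for the image-based prediction
) -> Optional[float]:
    """Weighted vote over the available modalities; returns P(male) or None."""
    votes = [
        (p, w)
        for p, w in ((p_male_text, w_text), (p_male_image, w_image))
        if p is not None
    ]
    total_weight = sum(w for _, w in votes)
    if total_weight == 0:
        return None  # no usable modality
    return sum(p * w for p, w in votes) / total_weight


# Both modalities present, image weighted slightly higher.
print(round(fuse_gender_probabilities(0.9, 0.6, w_text=0.4, w_image=0.6), 2))  # 0.72
# Only the name is available: the result falls back to the text prediction.
print(round(fuse_gender_probabilities(0.8, None), 2))  # 0.8
```

Renormalizing by the weights of the modalities that are present keeps the output a valid probability when a profile has no photo or no recognizable name.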

This dual-modality approach supports robust demographic analysis and individualized user experiences in environments where either modality alone may be unreliable.

Nominalist Framework

Nominalist is a multi-agent system for username proposal, tightly integrating PNGT-26K for culturally aware and linguistically appropriate identity suggestions:

  • Rule-based generation: Deterministic transformations (e.g., underscores, numeric suffixes, dot notations) produce baseline usernames derived from name-transliterated data.
  • Creative generation: An OpenAI-like API, with prompt engineering, generates diverse alternatives beyond deterministic variants.
  • Evaluation mechanism: A ReviewerAgent assesses uniqueness, memorability, and suitability, synthesizing AI-derived rankings (weighted at 60%) with heuristic metrics (40%).
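A rough sketch of the rule-based generation and the 60/40 score blend follows. Only the transformation types (underscores, dots, numeric suffixes) and the 60%/40% weighting come from the description above; the function names, the example surname, the suffix values, and the length-based heuristic are hypothetical stand-ins.

```python
from typing import Iterable, List, Set


def rule_based_usernames(first: str, last: str,
                         suffixes: Iterable[int] = (1, 7, 99)) -> List[str]:
    """Deterministic variants built from transliterated name parts."""
    f, l = first.lower(), last.lower()
    candidates = [f"{f}_{l}", f"{f}.{l}", f"{f}{l}"]   # underscore, dot, plain
    candidates += [f"{f}{n}" for n in suffixes]         # numeric suffixes
    return candidates


def heuristic_score(username: str, taken: Set[str]) -> float:
    """Assumed heuristic: taken names score 0; shorter names score higher."""
    if username in taken:
        return 0.0
    return max(0.0, 1.0 - len(username) / 30)


def combined_score(ai_score: float, heuristic: float) -> float:
    """Blend an AI-derived ranking score (60%) with heuristic metrics (40%)."""
    return 0.6 * ai_score + 0.4 * heuristic


names = rule_based_usernames("Saeed", "Ahmadi")  # illustrative name parts
print(names[:3])  # ['saeed_ahmadi', 'saeed.ahmadi', 'saeedahmadi']
```

In a full pipeline the `ai_score` argument would come from the reviewing agent; here it is left as a plain number so the blend itself stays easy to inspect.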

Nominalist is designed for seamless embedding in website backends, facilitating digital identity creation with minimal integration overhead.

3. Technical and Methodological Foundations

PNGT-26K’s construction and exploitation involved several technical steps:

  • Data Curation: Aggregated via random sampling from open repositories, with native speakers engaged in manual review and consolidation.
  • Text Normalization: The Hazm Python library standardizes Unicode representations, eliminating duplication attributable to minor codepoint discrepancies.
  • Quality Validation: A local LLM (DeepSeek-R1-Distill-Qwen-32B) flags potentially anomalous transliterations, which are subsequently vetted by human annotators.
  • Name Matching Algorithm: The name-based gender detection component uses normalized Levenshtein distance:

$$d_{\text{lev}}(a, b) = \frac{D(a, b)}{\max(|a|, |b|)}$$

where $D(a, b) = D(|a|, |b|)$ is the Levenshtein edit distance, computed recursively over prefix lengths $i$ and $j$:

$$D(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0 \\ \min \{ D(i-1, j) + 1,\ D(i, j-1) + 1,\ D(i-1, j-1) + 1_{[a_i \ne b_j]} \} & \text{otherwise} \end{cases}$$

  • System Architecture: Both Open Gender Detection and Nominalist frameworks employ a modular design, permitting substitution of the underlying dataset for other languages or orthographies with comparable preprocessing and normalization.
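The name matching step can be sketched as follows. The edit distance is computed with the standard iterative dynamic program, which yields the same value as the recursive definition, and is then divided by the longer string's length. The top-K aggregation in `infer_gender` (similarity-weighted voting) is an assumption about how aggregate scores might be formed, and the sample names are illustrative rather than drawn from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance D(|a|, |b|), computed row by row."""
    prev = list(range(len(b) + 1))            # D(0, j) = j
    for i, ca in enumerate(a, start=1):
        curr = [i]                             # D(i, 0) = i
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                   # deletion
                curr[j - 1] + 1,               # insertion
                prev[j - 1] + (ca != cb),      # substitution, free if equal
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """D(a, b) / max(|a|, |b|); defined as 0.0 for two empty strings."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def infer_gender(query: str, records: Iterable[Tuple[str, str]],
                 k: int = 3) -> Dict[str, float]:
    """Assumed aggregation: similarity-weighted vote over the k nearest names."""
    nearest = sorted(records, key=lambda r: normalized_distance(query, r[0]))[:k]
    weights: Dict[str, float] = defaultdict(float)
    for name, gender in nearest:
        weights[gender] += 1.0 - normalized_distance(query, name)
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()} if total else {}


sample = [("saeed", "male"), ("saeid", "male"),
          ("saeedeh", "female"), ("maryam", "female")]
probs = infer_gender("saeed", sample, k=3)
print(max(probs, key=probs.get))  # male
```

Dividing by the longer length keeps the distance in the range 0 to 1, so names of different lengths can be ranked on a common scale.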

4. Accessibility and Deployment

PNGT-26K and its associated frameworks are publicly available:

  • Dataset availability: Hosted on Hugging Face with direct links provided in the publication.
  • Framework distribution: Open Gender Detection and Nominalist are open-source (GitHub), supporting transparency and reproducibility.
  • Deployment flexibility: Nominalist is containerized using Docker, ensuring robust cross-platform adoption and minimal integration friction.
  • Extensibility: The modularity of both frameworks supports adaptation to other languages or use-cases with similar data formats and normalization requirements.

The publication advises users to consult repository documentation for definitive licensing terms, as explicit licenses are not specified.

5. Research Impact and Continuing Directions

The PNGT-26K dataset directly addresses the scarcity of comprehensive resources for Persian name processing, particularly in tasks requiring accurate gender annotation and transliteration. Its systematic approach to normalization and validation enables:

  • Improved gender detection for Persian and other non-Western name contexts, overcoming accuracy degradations exhibited by Western-centric tools.
  • Multimodal and culturally sensitive NLP applications, reducing systemic bias in online platforms.
  • Facilitated user onboarding and digital identity verification through agentic username proposal systems.

Potential research extensions include:

  • Expanding coverage to encompass regional name variants and richer linguistic metadata.
  • Refining transliteration practices to further reduce variability and ambiguity.
  • Integrating semantic and contextual clues for more nuanced gender inference.
  • Domain-specific adaptations of username generation, incorporating additional user profile dimensions.
  • Cross-linguistic generalization using equivalent datasets and preprocessing regimes for other scripts and cultural naming patterns.

6. Significance within Computational Linguistics

PNGT-26K contributes to equitable, culturally informed natural language processing. By providing validated name–gender–transliteration tuples together with two production-ready frameworks, the work offers a reference point for future multilingual resource development and fair digital identity management. The dataset’s public availability and methodological transparency make it useful for both academic research and practical system deployment in the Persian linguistic context, and potentially for similarly underserved languages.
