Clustering-Based Curation Methodology
- Clustering-based curation is a technique that uses unsupervised and semi-supervised methods, such as standard, seeded, and pairwise constrained KMeans, to organize large, unstructured datasets.
- It leverages domain-specific seed points and pairwise constraints to integrate human annotations, overcoming issues like noisy OCR data and subjective category boundaries.
- The methodology scales through platforms like BODHI, employing iterative, crowd-driven feedback to enhance digital curation, search, and semantic retrieval in vast archives.
Clustering-based curation methodology is an approach that employs unsupervised or semi-supervised clustering algorithms, often guided or informed by human expertise, to organize, annotate, or refine large, unstructured corpora—especially when manual categorization is infeasible due to data volume or ambiguity of categories. In settings such as historical newspaper archives, where original labels are coarse (“editorial”) and OCR-transcribed text is noisy, clustering serves as a mechanism to generate fine-grained, semantically meaningful categories. Importantly, this methodology adapts to the subjectivity and inconsistency of human annotation by encoding domain knowledge in the form of seeds or constraints and leveraging iterative, crowdsourced refinement through systems like BODHI. The following sections expound on methodological principles, algorithms, the interplay of subjective human input with clustering, evaluation metrics, system architecture, and implications for large-scale digital curation (Dutta et al., 2012).
1. Core Clustering Algorithms and Mathematical Formulation
Three principal clustering paradigms underpin this methodology:
a. Standard KMeans
Given articles represented by bag-of-words tf–idf vectors $x_1, \dots, x_N \in \mathbb{R}^d$, KMeans seeks $K$ clusters with centroids $\mu_1, \dots, \mu_K$ that minimize the potential function

$$\phi = \sum_{i=1}^{N} \min_{1 \le k \le K} d(x_i, \mu_k),$$

where $d(x_i, \mu_k)$ is typically the squared Euclidean norm $\|x_i - \mu_k\|^2$. The iterative algorithm alternates between an assignment step (each $x_i$ to its nearest centroid) and an update step (each centroid set to the mean of its cluster), converging to a local minimum of $\phi$.
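The assignment/update alternation can be sketched in a few lines of numpy. This is a generic illustration, not the paper's implementation; the synthetic two-blob data stands in for tf–idf vectors, and the farthest-point initialization is an assumption chosen for determinism:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal Lloyd's algorithm with deterministic farthest-point initialization."""
    # Initialization: start from X[0], then repeatedly add the point
    # farthest from the current centroid set.
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        # Assignment step: each point to its nearest centroid (squared Euclidean).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (kept unchanged if its cluster is empty).
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two well-separated blobs standing in for tf-idf article vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 5)), rng.normal(3.0, 0.1, (20, 5))])
labels, _ = kmeans(X, k=2)
```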
b. Seeded KMeans (Semi-Supervised)
To inject domain knowledge and overcome limitations of pure unsupervised learning, “seed sets” with user-provided cluster assignments initialize centroids. These seed points guide the search space of subsequent assignments, leading to clusters more aligned with user-conceived distinctions. Empirical mutual information (see below) quantifies the alignment between seed-informed clusters and human “ground truth.”
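Seeded KMeans differs from the standard algorithm only in initialization: each centroid starts at the mean of the user-labeled seed points for that cluster, after which the usual assignment/update iterations run. A minimal sketch (the seed format and synthetic data are assumptions for illustration):

```python
import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, k, n_iter=100):
    """Seeded KMeans sketch: initialize centroid j as the mean of the seed
    points the user assigned to cluster j, then run standard Lloyd steps."""
    centroids = np.array([X[seed_idx[seed_labels == j]].mean(axis=0)
                          for j in range(k)])
    for _ in range(n_iter):
        # Assignment step under squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step; empty clusters keep their previous centroid.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels

# Two blobs; one seed article labeled per cluster guides the clustering.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 5)), rng.normal(3.0, 0.1, (20, 5))])
labels = seeded_kmeans(X, np.array([0, 20]), np.array([0, 1]), k=2)
```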
c. Pairwise Constrained Clustering (PCKMeans)
This method incorporates pairwise constraints obtained from human annotation:
- Must-link: $c_=(a, b)$ stipulates that $a$ and $b$ belong to the same cluster.
- Cannot-link: $c_\neq(a, b)$ stipulates that $a$ and $b$ be placed in different clusters.
PCKMeans modifies the assignment cost for point $x_i$ and candidate cluster $k$:

$$\mathrm{cost}(x_i, k) = \|x_i - \mu_k\|^2 \;+\; w \sum_{(x_i, x_j) \in \mathcal{M}} \mathbb{1}\!\left[\ell(x_j) \neq k\right] \;+\; w \sum_{(x_i, x_j) \in \mathcal{C}} \mathbb{1}\!\left[\ell(x_j) = k\right],$$

where $\mathcal{M}$ and $\mathcal{C}$ encode the must-link and cannot-link sets, $w$ is the constraint weight, and $\ell(x_j)$ denotes the cluster of $x_j$. Transitive closure is computed for must-links before the standard KMeans iterations proceed with the penalty-augmented assignment.
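The penalty-augmented assignment can be sketched as follows. This is an illustrative simplification, not the paper's implementation: constraints are scanned greedily in point order, a single weight `w` is shared by all constraints, and the transitive-closure preprocessing step is omitted:

```python
import numpy as np

def pckmeans(X, k, must=(), cannot=(), w=10.0, n_iter=50):
    """PCKMeans sketch: sequential assignment that adds a penalty w for every
    must-link or cannot-link constraint the candidate cluster would violate."""
    # Farthest-point initialization, then an unconstrained first assignment.
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    labels = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    for _ in range(n_iter):
        old = labels.copy()
        for i in range(len(X)):
            cost = ((centroids - X[i]) ** 2).sum(axis=1)
            for j in range(k):
                # Must-link violated if the partner sits in a different cluster;
                # a + b - i picks out the constraint partner of point i.
                cost[j] += sum(w for (a, b) in must
                               if i in (a, b) and labels[a + b - i] != j)
                # Cannot-link violated if the partner sits in the same cluster.
                cost[j] += sum(w for (a, b) in cannot
                               if i in (a, b) and labels[a + b - i] == j)
            labels[i] = cost.argmin()
        # Standard centroid update on the penalized assignment.
        centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
        if (labels == old).all():
            break
    return labels

# A cannot-link with a large weight forces two nearby points apart.
X = np.array([[0.0], [0.1], [5.0]])
labels = pckmeans(X, k=2, cannot=((0, 1),), w=100.0)
```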
2. Workflow: From Noisy OCR to Curated Categories
The curation workflow is anchored by the following stages:
- Text Preprocessing: OCR outputs are converted to tf–idf vectors with stopword removal.
- Pilot Human Annotation: A subset of articles (e.g., 25) is categorized by multiple – often inconsistent – human annotators into subcategories (e.g., “politics,” “human interest,” “death”).
- Seed and Constraint Generation: The produced subjective labels become seeds for Seeded KMeans and instance-level (must/cannot-link) constraints for PCKMeans.
- Algorithmic Clustering: KMeans (unsupervised), Seeded KMeans, and PCKMeans are applied to the annotated sample; empirical mutual information (detailed below) is used to compare results against the human-inferred “ground truth.”
- Constraint Informativeness and Coherence: Informativeness measures how much new information constraints add beyond unconstrained runs. Constraint coherence checks for consistency (i.e., lack of contradictions between must/cannot-link sets).
- Scaling via BODHI: Broader collection of corrections, tags, and constraints from end users to support scaling to the full corpus.
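The preprocessing stage above can be sketched as a toy tf–idf pipeline. The stopword list, whitespace tokenization, and the plain `tf * log(N/df)` weighting are simplifying assumptions for illustration; the paper's exact weighting scheme is not reproduced here:

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def tfidf_vectors(docs):
    """Toy tf-idf pipeline: tokenize, drop stopwords, weight by tf * idf."""
    tokenized = [[w for w in d.lower().split() if w not in STOPWORDS]
                 for d in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(w for toks in tokenized for w in set(toks))
    vocab = sorted(df)
    n = len(docs)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vecs

docs = ["the senate passed a bill",
        "the senate debated the bill",
        "local fair draws crowds"]
vocab, vecs = tfidf_vectors(docs)
```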
A tabular summary of roles of annotation in the process:
| Function | Use in Clustering | Impact on Algorithm |
|---|---|---|
| Sub-categories | Seed points | Seeded KMeans |
| Must/cannot-link | Clustering constraints | PCKMeans |
| Informativeness/coherence scores | Evaluation | Annotator benchmarking |
3. Handling the Challenge of Subjectivity in Annotation
The underlying data revealed substantial subjectivity:
- Disparate numbers of categories (8–14) chosen for the same sample by different users.
- Some articles assigned to multiple categories, e.g., both “politics” and “sports.”
- Inconsistent boundaries between human-inferred topics.
To address this, the methodology:
- Refrains from imposing a rigid “ground truth,” instead quantifying agreement via empirical mutual information and analyzing the informativeness of constraints.
- Recognizes and incorporates the “wisdom of crowds,” using aggregate (not individual) annotations as a soft standard for comparison.
- Assesses the agreement between clustering results and human-labeled categories not in an absolute way, but via information-theoretic metrics accounting for subjective diversity.
4. Mutual Information and Constraint Informativeness Metrics
Cluster-validity measure

Clusters are evaluated against (possibly aggregated) human annotations by empirical mutual information. Let

$$\hat{p}(i, j) = \frac{n_{ij}}{N}, \qquad \hat{p}(i) = \frac{n_{i\cdot}}{N}, \qquad \hat{p}(j) = \frac{n_{\cdot j}}{N},$$

where $n_{ij}$ is the count of articles with human class $i$ in cluster $j$, $n_{i\cdot} = \sum_j n_{ij}$, $n_{\cdot j} = \sum_i n_{ij}$, and $N$ is the total number of articles. Mutual information is then

$$\hat{I}(H; C) = \sum_{i, j} \hat{p}(i, j) \log \frac{\hat{p}(i, j)}{\hat{p}(i)\,\hat{p}(j)}.$$
Informativeness of constraints is measured by the increase in MI compared to an unconstrained baseline, and coherence by checking for constraint conflicts.
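Computing the empirical mutual information from co-occurrence counts is straightforward; a minimal sketch (the function name is an assumption, and informativeness would be the difference in this score between a constrained and an unconstrained run):

```python
import math
from collections import Counter

def empirical_mi(human, cluster):
    """Empirical mutual information between human class labels and cluster
    labels, estimated from joint and marginal co-occurrence counts."""
    n = len(human)
    joint = Counter(zip(human, cluster))   # n_ij counts
    ph = Counter(human)                    # n_i. counts
    pc = Counter(cluster)                  # n_.j counts
    mi = 0.0
    for (h, c), nij in joint.items():
        pij = nij / n
        mi += pij * math.log(pij / ((ph[h] / n) * (pc[c] / n)))
    return mi
```

For a perfectly aligned two-class labeling the score equals $\log 2$; for independent labelings it is 0.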
Empirical results showed that with a sufficient volume of constraints (e.g., over 200 in the paper), clustering performance (as measured by MI) improves, indicating the value of collecting diverse but informative human input despite inherent subjectivity.
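Constraint coherence, mentioned above, amounts to a conflict test: take the transitive closure of the must-link set and flag any cannot-link whose endpoints land in the same must-link component. A minimal union-find sketch (function names are assumptions):

```python
def find(parent, x):
    """Union-find root lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def incoherent_constraints(n, must, cannot):
    """Return the cannot-links that contradict the transitive closure of
    the must-links over n points."""
    parent = list(range(n))
    for a, b in must:
        parent[find(parent, a)] = find(parent, b)
    return [(a, b) for a, b in cannot if find(parent, a) == find(parent, b)]
```

For example, with must-links (0,1) and (1,2), a cannot-link (0,2) is incoherent because 0 and 2 are transitively must-linked.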
5. System Architecture and Role of BODHI
The BODHI system is a crowdsourcing platform engineered to enable large-scale, ongoing refinement of the newspaper archive:
- User-facing interface: Library patrons interact with high-res scans, correct OCR mistakes, and tag articles using a web interface (Ruby-on-Rails).
- Backend and Data Management: All metadata (corrections, tags, custom segmentations) are stored in a PostgreSQL database, indexed for retrieval by Apache Lucene.
- Pipeline: ETL process ingests OCR outputs and images, updates the database, and ensures seamless back-end to front-end integration.
- Scaling human input: By logging keyword corrections, article segmentations, and users’ own topic tags, BODHI supplies the growing repository of constraints and tags needed for scalable, semi-supervised clustering.
- Integration Objective: BODHI is partially integrated with the Chronicling America infrastructure, and is designed to further enhance search, retrieval, and semantic grouping by iterative refinement based on actual user interactions.
6. Implications and Generalization
Clustering-based curation, augmented by subjective human annotation and scalable annotation infrastructure, directly addresses the limitations of both fully unsupervised (uninformative) and fully manual (impractical) approaches for massive, noisy corpora. The methodology:
- Enables pragmatic, application-driven curation (e.g., better search, more nuanced topical organization in historic archives).
- Demonstrates empirically that constraint-augmented clustering methods (Seeded KMeans, PCKMeans) benefit from even subjective, inconsistent human input, provided enough data is available to extract consensus.
- Establishes fully-unlabeled, coarsely-structured digital libraries as promising candidates for “interactive” curation, where iterative machine- and human-guided feedback loops systematically refine both clustering and metadata quality.
This approach is especially relevant for digital humanities, library informatics, and archives involving noisy, ambiguous, or highly contextual data, offering a robust computational model for evolving curated digital collections using a blend of statistical methods and crowd wisdom.