Open-Ended Feature Discovery

Updated 26 November 2025

Open-ended feature discovery is an unsupervised process that automatically identifies novel, unforeseen patterns in data without relying on predefined labels.
It leverages techniques like clustering, sparse autoencoders, and evolutionary search to continuously uncover and update latent feature representations across various domains.
This approach enhances applications in scientific exploration, generative language and vision models, and adaptive real-time systems by enabling dynamic discovery beyond static taxonomies.

Open-ended feature discovery refers to the automatic, unsupervised or weakly supervised identification and characterization of novel, often unforeseen features or patterns in data, with no prior enumeration of the target set. Unlike conventional supervised approaches—which rely on fixed label sets and known categories—open-ended discovery aims for unbounded, continual surfacing of new structures, behaviors, or semantic entities as data and exploration proceed. This paradigm has emerged across computational science, vision, language modeling, robotics, and remote sensing as a response to the limitations of static, pre-closed taxonomies in dynamic and complex environments.

1. Theoretical Foundations and General Algorithms

Open-ended feature discovery is algorithmically instantiated through recurring principles: unbounded search or sampling over candidate entities, unsupervised or weakly supervised extraction and clustering of candidate features, continual metric-based evaluation for novelty and informativeness, and adaptive expansion of the feature (or behavior) repertoire.

In LLMs, for example, open-ended feature discovery is addressed through the Bias Association Discovery Framework (BADF), which generalizes to any entity-feature association mapping (Pan et al., 2 Aug 2025). BADF considers a set of entities $E=\{e_1,\dots,e_n\}$ and an extremely large, implicitly defined feature space $F$ of natural-language descriptors. A generative function $G(e;P)$, typically an LLM prompted with instructions $P$, elicits free-form textual descriptions or lists. Candidate features are extracted via LLM-based parsing and iterative refinement, then embedded (e.g., via sentence transformers) and clustered with a similarity threshold $\delta$. Each entity-feature pair $(e,f)$ is then scored on frequency and distinctiveness, e.g., $s(e,f) = \alpha s_1(e,f) + (1-\alpha) s_2(e,f)$, with $s_1$ as empirical salience, $s_2$ as distinctiveness relative to other entities, and $\alpha$ weighting the two terms. The pipeline is specified with step-wise pseudocode, hyperparameter guidance, and coverage/precision/novelty metrics.
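The scoring rule $s(e,f) = \alpha s_1(e,f) + (1-\alpha) s_2(e,f)$ can be illustrated with a minimal sketch. The concrete definitions used here are assumptions made for illustration, not taken from the BADF paper: $s_1$ is the fraction of an entity's sampled generations that mention a feature, and $s_2$ is how far that fraction exceeds the average over the other entities. The function names and input layout are likewise hypothetical.

```python
from collections import Counter

def salience(mentions, n_samples):
    """Assumed s1: fraction of an entity's generations that mention each feature."""
    return {f: c / n_samples for f, c in Counter(mentions).items()}

def score_features(mentions_by_entity, n_samples, alpha=0.5):
    """Return {(entity, feature): alpha * s1 + (1 - alpha) * s2}."""
    s1 = {e: salience(m, n_samples) for e, m in mentions_by_entity.items()}
    scores = {}
    for e, feats in s1.items():
        others = [o for o in s1 if o != e]
        for f, s1_ef in feats.items():
            # Assumed s2: excess salience of f for e over the mean of other entities
            if others:
                mean_other = sum(s1[o].get(f, 0.0) for o in others) / len(others)
            else:
                mean_other = 0.0
            s2_ef = max(0.0, s1_ef - mean_other)
            scores[(e, f)] = alpha * s1_ef + (1 - alpha) * s2_ef
    return scores
```

Each list in `mentions_by_entity` is expected to hold canonical feature identifiers, with a feature counted at most once per generation. A feature that is frequent for one entity but equally frequent for the others keeps its salience term and loses its distinctiveness term, which is the behavior the weighted combination is meant to capture.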

In visual domains, open-ended discovery can leverage unsupervised models (e.g., sparse autoencoders (Stevens et al., 21 Nov 2025)), evolutionary search (e.g., Quality-Diversity in Lenia (Faldor et al., 6 Jun 2024)), or advanced object detection architectures with unconstrained query generation (Lin et al., 25 May 2025). These methods share the goal of uncovering a potentially unbounded set of latent features or classes that were not foreseen during (pre-)training.

2. Open-ended Feature Discovery in Generative Language and Multimodal Models

In natural language processing and generative vision-language models, open-ended discovery exploits the richness of large pretrained models to surface novel feature associations and previously unseen concepts.

BADF operates by systematically prompting LLMs with open-ended requests for entity characterization (“List 10 distinctive features of ENTITY”), extracting and canonicalizing salient phrases, embedding and clustering these, and then scoring features both by frequency and by distinctiveness (the degree to which they are unique to an entity versus others).

Pivotal aspects:

Prompt engineering: Prompts are varied among explicit listing, narrative generation, and sentiment/context conditioning to elicit wide feature coverage.
Feature aggregation: Semantic clustering (threshold $\delta$) groups synonymous or similar features.
Scoring and ranking: Combines per-entity salience with distinctiveness to prioritize features that define an entity.
Quantitative evaluation: Measures include coverage, precision, and novelty relative to gold sets or previously known features.

This framework is general and applicable to domains such as product feature mining, character trait discovery in literature, or social bias analysis in LLMs.
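The feature aggregation step can be sketched as a greedy, single-pass clustering over phrase embeddings. This is an illustrative stand-in, not the clustering procedure from the BADF paper: the embeddings are assumed to be precomputed vectors (for example from a sentence-embedding model), similarity is assumed to be cosine similarity, and each cluster is represented by the embedding of its first member.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_by_threshold(phrases, embeddings, delta=0.8):
    """Assign each phrase to the most similar existing cluster whose
    representative is at least delta-similar; otherwise open a new cluster."""
    clusters = []  # each entry: {"rep": vector, "members": [phrases]}
    for phrase, vec in zip(phrases, embeddings):
        best, best_sim = None, delta
        for c in clusters:
            sim = cosine(vec, c["rep"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"rep": vec, "members": [phrase]})
        else:
            best["members"].append(phrase)
    return [c["members"] for c in clusters]
```

A higher $\delta$ yields more, finer clusters; a lower $\delta$ merges near-synonyms more aggressively at the risk of conflating distinct features. The result of a single greedy pass depends on input order, so a production pipeline would likely use an order-independent method.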

Open-Set and Open-Ended Multimodal Perception

VL-SAM-V2 (Lin et al., 25 May 2025) and OSDA (Chen et al., 23 Sep 2025) exemplify class-agnostic, label-free discovery in high-dimensional visual (and vision-language) input.

OSDA employs a three-stage pipeline: (i) promptable segmentation for geometric, class-agnostic region discovery; (ii) a two-phase, LoRA-fine-tuned multimodal LLM for semantic attribution and description without label exposure; (iii) closed-loop automated and human evaluation for semantic and pixel-level quality. SAM2 is used in a fully label-free mode to discover any visually distinct region, and multimodal LLMs generate human-readable descriptions from regions at inference time.
VL-SAM-V2 fuses queries from open-set (predefined-category) and open-ended (category-free) streams using a transformer-based query fusion module, coupled with ranked learnable queries and denoising-point strategies. In open-ended detection, only general (category-free) queries are used, enabling the surfacing of objects and concepts absent from any training vocabulary.

Both frameworks decouple geometric (region/mask/box) discovery from semantic attribution, enforcing open-endedness both at the pixel/object and semantic levels.

3. Unsupervised Feature Learning and Scientific Discovery

In scientific data domains, open-ended discovery requires scalable, purely unsupervised algorithms capable of extracting interpretable, monosemantic features from foundation model representations.

Sparse autoencoders (SAEs) decompose the activations $x\in\mathbb{R}^d$ of a foundation model (e.g., ViT, DINOv3) into nonnegative, sparse codes $z\in\mathbb{R}^n$ via $z = \mathrm{ReLU}(W_{\rm enc}(x-b_{\rm dec}) + b_{\rm enc})$, trained with the objective $L(\theta; x) = \|x - \hat x\|_2^2 + \lambda\|z\|_1$, where $\hat x$ is the reconstruction of $x$ decoded from $z$. The resulting features correspond to dictionary elements, each "explaining" only a small subset of input patches.
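The encoder and objective above can be written out for a single activation vector. The sketch below assumes a linear decoder $\hat x = W_{\rm dec} z + b_{\rm dec}$, which the text does not state explicitly but which is the usual counterpart of an encoder that subtracts $b_{\rm dec}$. It uses plain Python lists for clarity; a real implementation would use a tensor library and batched training.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """z = ReLU(W_enc (x - b_dec) + b_enc); x_hat = W_dec z + b_dec (assumed decoder)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    z = [max(0.0, p) for p in pre]
    x_hat = [r + b for r, b in zip(matvec(W_dec, z), b_dec)]
    return z, x_hat

def sae_loss(x, x_hat, z, lam):
    """Squared reconstruction error plus an L1 sparsity penalty weighted by lam."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(zi) for zi in z)
    return recon + lam * sparsity
```

Here `W_enc` has shape $n \times d$ and `W_dec` has shape $d \times n$; each column of `W_dec` plays the role of one dictionary element. Increasing `lam` pushes more entries of `z` to zero at the cost of reconstruction accuracy.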

Core properties enabling open-endedness:

No label supervision: Training proceeds over unlabeled data, extracting monosemantic, human-interpretable directions.
Scalability: The method is demonstrated on 100M patches with 16,384 features.
Rediscovery and coverage metrics: In controlled settings (e.g., ADE20K segmentation, FishVista anatomical parts), SAE latents rediscover a large fraction (up to 90% coverage at AP≥0.3 for FishVista) of ground-truth concepts, outperforming PCA and k-means in concept alignment.
Generalization across domains: The approach is agnostic to the foundation model and modality, with extensions suggested for protein language models and genomics.

The Matryoshka SAE variant further avoids feature splitting by learning nested dictionaries.

4. Evolutionary and Algorithmic Approaches in Complex Dynamical Systems

Evolutionary search and quality-diversity optimize for sustained novelty and diversity in agent behaviors or system dynamics, without predefining what constitutes a salient pattern.

In Lenia, a continuous cellular automaton, open-ended feature discovery is realized through “Leniabreeder”—a framework coupling dynamical pattern evolution with both manually specified and unsupervised diversity criteria.

Essential components:

Descriptor spaces: Handcrafted features (e.g., mass, center of mass, velocity, color) and learned descriptors via a variational autoencoder over cropped pattern trajectories.
Quality-Diversity pipelines: Both MAP-Elites (manual descriptors, fixed niches) and AURORA (emergent descriptors, co-evolution of archive and descriptor space).
Fitness functions: Tailored for properties such as persistent motion, structural stability, or homeostasis in latent space.
Evidence for open-endedness: Empirical metrics demonstrate sustained growth in archive entropy, pixel variance, and cumulative novelty intake, without plateauing after millions of evaluations. Coverage of continuous color and behavior spaces approaches 50% depending on descriptors.

This sustains unbounded diversity in a regime analogous to biological evolution.
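The MAP-Elites loop referred to above can be sketched on a toy problem. This is a generic illustration of the algorithm, not the Leniabreeder implementation: genomes are assumed to be short vectors in $[0,1]$, the `fitness` and `descriptor` callables are placeholders for a Lenia simulation and its handcrafted descriptors, and all parameter names and defaults are hypothetical.

```python
import random

def map_elites(fitness, descriptor, n_dims=2, bins=10,
               iterations=5000, sigma=0.1, seed=0):
    """Keep the best genome found so far in each cell of a descriptor grid."""
    rng = random.Random(seed)
    archive = {}  # niche index tuple -> (fitness, genome)

    def try_insert(genome):
        # Descriptor values are assumed to lie in [0, 1]; map them to grid cells
        niche = tuple(min(bins - 1, max(0, int(v * bins)))
                      for v in descriptor(genome))
        f = fitness(genome)
        if niche not in archive or f > archive[niche][0]:
            archive[niche] = (f, genome)

    for _ in range(100):  # random initialization
        try_insert([rng.random() for _ in range(n_dims)])
    for _ in range(iterations):
        # Select a random elite, mutate it with clipped Gaussian noise, reinsert
        _, parent = rng.choice(list(archive.values()))
        child = [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in parent]
        try_insert(child)
    return archive
```

Coverage can be read off as `len(archive) / bins ** k`, where `k` is the descriptor dimension. In this picture, AURORA differs mainly in that `descriptor` is a learned encoder retrained as the archive grows, rather than a fixed handcrafted function.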

5. Clustering and Dictionary Approaches in Open-Ended Object Recognition

Unsupervised clustering and dictionary-learning methods address open-ended recognition by incrementally structuring new observations—instances, features, or categories—as they arise.

A two-stage approach combines open-set detection (objectness-based) and unsupervised clustering:

Stage 1: An open-set Faster R-CNN detects known and unknown RoIs, using energy-based criteria to flag "unknown" instances.
Stage 2: RoI features are embedded via a self-supervised, contrastive mechanism and clustered using constrained $k$-means. The number of clusters (novel categories) is estimated online; assignments to known categories are fixed.
Performance metrics: Detection is evaluated via mAP and unknown/known recall/precision; discovery via normalized mutual information, accuracy, and purity. The approach discovers meaningful, novel categories directly from unlabeled detections.
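One common form of an energy-based criterion is the free-energy score over classifier logits, $E(x) = -T \log \sum_k \exp(f_k(x)/T)$. Whether Stage 1 uses exactly this score is an assumption; the sketch below only illustrates the general idea of thresholding energy to flag unknown RoIs, and the threshold value would have to be tuned on held-out data.

```python
import math

def energy_score(logits, temperature=1.0):
    """E(x) = -T * log(sum_k exp(logit_k / T)), computed in a numerically stable way."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))

def flag_unknown(logits, threshold, temperature=1.0):
    """Higher energy means weaker evidence for every known class."""
    return energy_score(logits, temperature) > threshold
```

A confident detection such as `[10.0, 0.0, 0.0]` has an energy near -10, while flat logits such as `[0.1, 0.0, 0.1]` have an energy near -1.2, so a threshold between the two separates them.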

In open-ended 3D object recognition, a hybrid model combines Latent Dirichlet Allocation (LDA) with incremental dictionary learning:

Shared topics via LDA: Object views are encoded as topic distributions over a general topic dictionary, updated via collapsed Gibbs sampling.
Category-specific dictionaries: For each class, a dedicated visual dictionary is incrementally updated with new object views using online k-means.
Open-ended learning protocol: New samples are classified using object-category distances in concatenated topic+dictionary space, and teaching/correction operations expand the category set and update the corresponding dictionaries in real time.
Empirical results: The approach achieves high open-ended accuracy (up to 94% on standard benchmarks) and scales to 40+ categories under continual learning.
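The teach, classify, and correct operations of such a protocol can be sketched with a deliberately simplified model. Instead of the topic-plus-dictionary representation described above, each category here is summarized by a single running-mean prototype (the one-centroid case of online k-means), and object views are assumed to arrive as fixed-length feature vectors. The class and method names are hypothetical.

```python
import math

class OpenEndedClassifier:
    """Nearest-prototype classifier whose category set grows at run time."""

    def __init__(self):
        self.prototypes = {}  # category -> (mean vector, number of views seen)

    def teach(self, category, vec):
        """Add a labeled view; creates the category if it is new."""
        if category not in self.prototypes:
            self.prototypes[category] = (list(vec), 1)
            return
        mean, n = self.prototypes[category]
        n += 1
        # Incremental mean update: move the prototype toward the new view
        mean = [m + (v - m) / n for m, v in zip(mean, vec)]
        self.prototypes[category] = (mean, n)

    def classify(self, vec):
        """Return the category with the closest prototype, or None if untrained."""
        if not self.prototypes:
            return None
        return min(self.prototypes,
                   key=lambda c: math.dist(self.prototypes[c][0], vec))

    def correct(self, vec, true_category):
        """Teacher feedback after a misclassification: update the true category."""
        self.teach(true_category, vec)
```

The point of the sketch is the protocol rather than the representation: no category list is fixed in advance, and a correction for an unseen label both introduces the category and supplies its first example.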

6. Evaluation Metrics and Limitations

Systematic evaluation of open-ended feature discovery relies on adapted coverage, precision, and novelty metrics as in BADF (Pan et al., 2 Aug 2025), controlled rediscovery (alignment to known semantic concepts) as in Stevens et al. (21 Nov 2025), unbounded entropy and diversity statistics (Faldor et al., 6 Jun 2024), and clustering metrics in object discovery (Zheng et al., 2022). A major limitation across domains is the reliance on rediscovery or surrogate metrics: in genuinely unannotated regimes no ground truth exists against which truly novel features can be checked. Long-tail and low-prevalence features also remain harder to recover, which motivates improved sample mining and adaptive dictionary strategies.
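Set-based versions of coverage, precision, and novelty can be computed as follows. The exact definitions are assumptions chosen for illustration: coverage is the share of gold features that were discovered, precision is the share of discovered features that are in the gold set, and novelty is the share of discovered features absent from a previously known set. Features are assumed to be already canonicalized so that exact matching is meaningful.

```python
def discovery_metrics(discovered, gold, known):
    """Return coverage, precision, and novelty for a set of discovered features."""
    discovered, gold, known = set(discovered), set(gold), set(known)
    hits = discovered & gold
    return {
        # Fraction of gold features that were recovered
        "coverage": len(hits) / len(gold) if gold else 0.0,
        # Fraction of discovered features confirmed by the gold set
        "precision": len(hits) / len(discovered) if discovered else 0.0,
        # Fraction of discovered features not previously known
        "novelty": len(discovered - known) / len(discovered) if discovered else 0.0,
    }
```

Under these definitions precision penalizes every discovery outside the gold set, including genuinely new ones, which is one concrete form of the surrogate-metric limitation.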

Intrinsic open-endedness also faces unresolved theoretical issues: the precise measurement of unbounded novelty, continuous niche creation without capacity limits, and appropriate regularization to avoid degenerate solutions or redundancy (Faldor et al., 6 Jun 2024).

7. Future Directions and Open Challenges

Advances in architecture-agnostic segmentation and label-free semantic attribution (Chen et al., 23 Sep 2025), co-evolutionary descriptors (Faldor et al., 6 Jun 2024), monosemantic decomposition in foundation models (Stevens et al., 21 Nov 2025), and robust query fusion strategies for visual grounding (Lin et al., 25 May 2025) indicate converging trajectories toward scalable, extensible open-ended discovery systems.

Open questions persist regarding:

Formally proving or quantifying unboundedness in novel feature discovery.
Handling invariances (e.g., rotation, scale) and long-tail phenomena in unsupervised settings.
Integrating open-ended approaches with active learning or human-in-the-loop curation.
Systematic transference across modalities and domains, particularly for scientific archiving, protein design, and remote environmental monitoring.

Continual progress on these axes promises further emancipation from static taxonomies and supervised bottlenecks, advancing the capacity for autonomous discovery in artificial and scientific systems.