AptaFind: Automated Aptamer Screening

Updated 19 January 2026

AptaFind is an open-source system that automates aptamer sequence curation using a three-tiered, deterministic pipeline enhanced by semantic language models.
It employs local Llama3.2 execution combined with regex-based extraction to ensure high precision and minimize risks of language model hallucination.
Its geometric deep learning component, the FAFormer model, enables zero-shot protein–aptamer binding prediction, with reported gains of up to +50% in top-10 screening precision over prior methods.

AptaFind is an open-source, locally deployable system for automated curation and high-throughput screening of aptamer sequences from the scientific literature and for zero-shot in silico protein–aptamer binding prediction. Its architecture addresses the fragmented state of aptamer reporting, which is spread across journal bodies, supplements, and databases, and pairs a language model for semantic understanding with a transformer model for geometric deep learning. The platform explicitly reframes literature mining and screening as multi-tiered, deterministic processes, optimizing both precision and speed while remaining free from cloud dependencies or subscription barriers (Taghon, 12 Jan 2026; Huang et al., 2024).

1. System Design: Three-Tier Intelligence and Minimum Agentic Flow

AptaFind's literature curation rests on two guiding principles. First, it formalizes literature mining as a graded range of outcomes, from extracted sequences down to reference lists, implemented as a three-tier architecture:

Tier 1: Direct sequence extraction from local PDF corpora using deterministic regex and validation, augmented by context-aware LLMs for semantic filtering and functional annotation.
Tier 2: Curated research leads presented when direct extraction fails, utilizing targeted PubMed and PMC queries, with extracted metadata to facilitate manual resolution.
Tier 3: Exhaustive literature coverage for maximal recall, generating comprehensive references via broad online searches even when no direct sequence can be found.

The second core principle, termed "Minimum Agentic Flow" (MAF), combines local execution of a 1-billion-parameter Llama3.2 LM (used for semantic verification and for extracting target and binding information) with rule-based, regex, and deterministic transformation pipelines for sequence handling. This minimizes hallucination risk by confining the LM strictly to semantic inference tasks, while all data formatting, deduplication, and output steps are deterministic (Taghon, 12 Jan 2026).

2. Algorithmic Workflow and Modular Pipeline

AptaFind's algorithmic flow comprises three automated stages:

Tier	Core Function	Outcome Type
Tier 1	PDF mining, regex sequence finding, LM semantic check	Aptamer sequences
Tier 2	PubMed/PMC/supplement mining, meta extraction	Curated references
Tier 3	Broad queries, deduplicated literature list	Exhaustive awareness
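
The tiered routing summarized in the table can be sketched as a simple cascade. This is a minimal illustration, not AptaFind's actual code: the `TierResult` type, the `run_cascade` function, and the three handler callables are hypothetical names, and each handler is assumed to return a (possibly empty) list of items for a target.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TierResult:
    tier: int          # tier that produced the result (1, 2, or 3)
    outcome: str       # "sequences", "leads", or "references"
    items: List[str] = field(default_factory=list)

# Hypothetical handler signature: target name in, list of items out.
TierHandler = Callable[[str], List[str]]

def run_cascade(target: str, tier1: TierHandler, tier2: TierHandler,
                tier3: TierHandler) -> TierResult:
    """Try each tier in order; fall through only when the previous tier is empty."""
    handlers = [
        (1, "sequences", tier1),
        (2, "leads", tier2),
        (3, "references", tier3),
    ]
    for tier, outcome, handler in handlers:
        items = handler(target)
        if items:  # a non-empty result stops the cascade
            return TierResult(tier, outcome, items)
    # Nothing found anywhere: report an empty Tier 3 result.
    return TierResult(3, "references", [])

# Usage with stub handlers (placeholder data only).
result = run_cascade(
    "example-target",
    tier1=lambda t: [],            # no sequence extracted
    tier2=lambda t: ["lead-A"],    # a curated lead is available
    tier3=lambda t: ["ref-1", "ref-2"],
)
print(result.tier, result.outcome, result.items)
```

Keeping the routing in plain control flow, with no model call deciding which tier to run, matches the deterministic emphasis described above.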

In Tier 1, user-supplied PDFs are converted to Markdown, ingested, and indexed via ColBERT (through RAGatouille), and a deterministic regex ([ACGTU]{20,100}) surfaces candidate sequences. The Llama3.2 LM, running locally (~5 GB), evaluates each candidate's context (distinguishing genuine aptamers from primers, controls, or artifacts), harmonizes target nomenclature, and extracts affinity metrics (e.g., $K_d$, $K_i$).
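
The regex step can be illustrated with a short sketch. Only the pattern [ACGTU]{20,100} comes from the text; the function name `find_candidates`, the `window` parameter, and the returned dictionary layout are assumptions made for illustration. The surrounding context is captured because the text describes the LM judging each candidate by its context.

```python
import re

# Pattern quoted in the text: runs of 20 to 100 nucleotide letters.
CANDIDATE_RE = re.compile(r"[ACGTU]{20,100}")

def find_candidates(text: str, window: int = 80):
    """Return candidate sequences with their offset and surrounding context.

    `window` (hypothetical) is the number of characters kept on each side
    of a match so a downstream semantic check can inspect the context.
    """
    hits = []
    for match in CANDIDATE_RE.finditer(text):
        lo = max(0, match.start() - window)
        hi = min(len(text), match.end() + window)
        hits.append({
            "sequence": match.group(0),
            "start": match.start(),
            "context": text[lo:hi],
        })
    return hits

# Usage on a synthetic 24-nt sequence (not a real aptamer).
sample = "seq: " + "ACGT" * 6 + " done"
for hit in find_candidates(sample):
    print(hit["start"], hit["sequence"])
```

Note that the pattern as quoted is case-sensitive and would split a run longer than 100 letters into several matches; whether AptaFind handles those cases differently is not stated in the source.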

Sequences that pass the semantic check are further validated for length (20–100 nt), nucleotide alphabet, and GC content (20–80%). Orientation is standardized (5′–3′), chemical modifications are parsed, and deduplication is enforced at 100% sequence identity, so sequences that differ by even a single base are kept as distinct entries. Tier 2 is triggered if no validated sequence is extracted. It orchestrates two layers of programmatic PubMed/PMC queries (one sequence-targeted, one broad), supplement and full-text retrieval, and Section-ID–based filtering of aptamer-relevant literature. Results are presented as structured, clickable references with full citation metadata.
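
A minimal sketch of the deterministic validation and deduplication described above follows. The thresholds (20 to 100 nt, GC content between 20% and 80%, exact-identity deduplication) come from the text; the function names and the choice to upper-case input before comparison are assumptions. Orientation standardization and modification parsing are not covered here.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C letters in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def is_valid(seq: str, min_len: int = 20, max_len: int = 100,
             gc_min: float = 0.20, gc_max: float = 0.80) -> bool:
    """Apply the length, alphabet, and GC-content checks in that order."""
    if not (min_len <= len(seq) <= max_len):
        return False  # also rejects empty strings before GC is computed
    if set(seq) - set("ACGTU"):
        return False  # contains a non-nucleotide character
    return gc_min <= gc_fraction(seq) <= gc_max

def validate_and_dedupe(candidates):
    """Keep valid sequences, dropping only exact (100% identity) duplicates."""
    seen = set()
    kept = []
    for raw in candidates:
        seq = raw.strip().upper()  # assumption: case is not significant
        if is_valid(seq) and seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept

# Usage with synthetic sequences: `b` differs from `a` by one base and is kept.
a = "ACGT" * 6
b = "ACGT" * 5 + "ACGA"
print(validate_and_dedupe([a, a.lower(), b, "A" * 30, "ACGT"]))
```

Because deduplication uses exact string identity, single-base variants remain separate entries, as the text requires.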

If neither prior tier produces results, Tier 3 systematically issues relaxed, high-recall database and preprint searches (including bioRxiv), returning all unique matching records so that no relevant literature is overlooked. All online queries are rate-limited (≤3 requests/s) with exponential backoff. Routing logic is strictly hierarchical: a tier runs only when the previous tier fails. Deterministic handling dominates all non-semantic tasks, in line with MAF (Taghon, 12 Jan 2026).
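
Rate limiting with exponential backoff can be sketched as below. Only the ≤3 requests/s limit and the use of exponential backoff come from the text; the `RateLimiter` class, the `fetch_with_backoff` function, the retry count, and the base delay are illustrative assumptions, and `fetch` stands in for whatever function actually performs the online query.

```python
import time

class RateLimiter:
    """Enforce a minimum interval between calls (default: 3 calls per second)."""

    def __init__(self, max_per_second: float = 3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time of the previous permitted call

    def wait(self) -> None:
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now

def fetch_with_backoff(fetch, query, limiter, max_retries: int = 4,
                       base_delay: float = 1.0, sleep=time.sleep):
    """Call fetch(query), retrying on OSError with delays of 1x, 2x, 4x, ..."""
    for attempt in range(max_retries + 1):
        limiter.wait()
        try:
            return fetch(query)
        except OSError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error
            sleep(base_delay * 2 ** attempt)

# Usage with a stand-in fetch function that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_fetch(q):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise OSError("temporary failure")
    return "ok:" + q

print(fetch_with_backoff(flaky_fetch, "aptamer", RateLimiter(), base_delay=0.01))
```

The clock and sleep functions are injectable so the behavior can be checked without real waiting.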

3. Underlying Geometric Deep Learning: Unsupervised Protein–Aptamer Screening

AptaFind integrates a geometric deep learning pipeline for zero-shot in silico aptamer binding prediction, based on unsupervised residue–nucleotide contact map estimation. The central model, FAFormer (Frame-Averaging Transformer), operates as a 3-layer equivariant transformer encoding spatially augmented node and edge features for protein–nucleic acid complexes.

Contact Maps: Proteins $S = \{S_i\}$ and aptamers $S' = \{S'_j\}$ are represented by their 3D atomic coordinates; a contact between residue $i$ and nucleotide $j$ is defined as $y_{ij} = 1$ if $\min_{a \in S_i,\,b \in S'_j} \|x_a - x_b\| \leq 6\ \text{\AA}$.
Latent Prediction: Protein and aptamer embeddings ($z_i$, $z'_j$) from FAFormer are concatenated and scored via an MLP: $p_{ij} = \sigma(W_1[z_i \, \| \, z'_j] + b_1)$.
Loss: Training uses only crystal and cryo-EM complex data, optimizing a weighted binary cross-entropy over all pairings:

$L_{\rm contact} = - \sum_{i=1}^N \sum_{j=1}^{N'} \left[\alpha\,y_{ij}\,\log p_{ij} + (1-y_{ij})\,\log(1-p_{ij}) \right],\ \alpha \approx 4.$

Node Features: ESM2 is used for protein sequence embeddings, RNA-FM for RNA, and one-hot with position for DNA.
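
The contact definition and the weighted loss above can be made concrete with a small pure-Python sketch. The 6 Å cutoff, the summed (unnormalized) form of the loss, and the weight of about 4 on positive pairs follow the formulas in the text; the function names, the nested-list data layout, and the `eps` clamp that avoids taking the log of zero are assumptions for illustration.

```python
import math

def min_distance(atoms_a, atoms_b) -> float:
    """Smallest Euclidean distance between any atom pair of two groups."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)

def contact_map(protein, aptamer, cutoff: float = 6.0):
    """y[i][j] = 1 if residue i and nucleotide j have atoms within `cutoff` angstroms.

    `protein` and `aptamer` are lists of residues/nucleotides, each a list
    of (x, y, z) atom coordinates.
    """
    return [[1 if min_distance(res, nt) <= cutoff else 0 for nt in aptamer]
            for res in protein]

def weighted_bce(y, p, alpha: float = 4.0, eps: float = 1e-12) -> float:
    """Summed binary cross-entropy with positive pairs up-weighted by alpha."""
    total = 0.0
    for y_row, p_row in zip(y, p):
        for y_ij, p_ij in zip(y_row, p_row):
            p_ij = min(max(p_ij, eps), 1.0 - eps)  # keep log() finite
            total -= alpha * y_ij * math.log(p_ij) + (1 - y_ij) * math.log(1.0 - p_ij)
    return total

# Usage on toy coordinates: two single-atom residues, one two-atom nucleotide.
protein = [[(0.0, 0.0, 0.0)], [(20.0, 0.0, 0.0)]]
aptamer = [[(3.0, 0.0, 0.0), (30.0, 0.0, 0.0)]]
y = contact_map(protein, aptamer)
print(y, weighted_bce(y, [[0.5], [0.5]]))
```

With predictions of 0.5 everywhere, the positive pair contributes four times as much to the loss as the negative pair, which is the intended effect of the weight when contacts are rare.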

The FAFormer model enforces E(3)-equivariance at every step via frame-averaging (eight per-node PCA coordinate frames), and propagates invariant node and edge features with equivariant spatial updates. This ensures that learned contact probabilities $p_{ij}$ are rotation- and translation-invariant—a requirement for valid structure-based screening pipelines (Huang et al., 2024).
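
As background, the generic frame-averaging construction can be written out explicitly. This is a sketch of the general technique from the frame-averaging literature, not a statement of FAFormer's exact per-node parameterization. The symbols $X$ (coordinates, one row per point), $t$ (their centroid), $u_1, u_2, u_3$ (principal axes of the covariance of $X$, assumed to have distinct eigenvalues), and $\phi$, $\Phi$ (arbitrary backbone networks) are notation assumed here. Because each principal axis is defined only up to sign, there are $2^3 = 8$ candidate frames:

$\mathcal{F}(X) = \left\{ \left([\pm u_1,\ \pm u_2,\ \pm u_3],\ t\right) \right\}, \qquad |\mathcal{F}(X)| = 8.$

Averaging a backbone's output over the coordinates expressed in each frame gives an invariant function:

$\langle \phi \rangle_{\mathcal{F}}(X) = \frac{1}{8} \sum_{(R,\,t) \in \mathcal{F}(X)} \phi\!\left((X - \mathbf{1}t^{\top})R\right).$

Mapping each output back through its frame gives an equivariant one:

$\langle \Phi \rangle_{\mathcal{F}}(X) = \frac{1}{8} \sum_{(R,\,t) \in \mathcal{F}(X)} \Phi\!\left((X - \mathbf{1}t^{\top})R\right)R^{\top} + \mathbf{1}t^{\top}.$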

4. Screening and Curation Pipelines

For aptamer screening, the workflow is as follows:

Predict 3D structures of the protein and all candidate aptamers with tools such as ESMFold and RoseTTAFoldNA (or FAFormer’s structure head).
Construct bipartite graphs: protein residues and aptamer nucleotides are treated as nodes, with their spatial relations populated as per the FAFormer design.
For each candidate, FAFormer derives embeddings and contact-map probabilities; a binding score is calculated as $\max_{i,j} p_{ij}$ (strongest single-pair contact).
Candidates are ranked by their binding score; top-K or threshold-based selection is supported.
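
The scoring and ranking steps above can be sketched as follows. The score (the maximum contact probability over all residue–nucleotide pairs) and the top-K or threshold selection come from the text; the function names and the dictionary input mapping candidate names to probability maps are assumptions, and the probabilities shown are placeholder values rather than model output.

```python
def binding_score(prob_map) -> float:
    """Maximum contact probability over all residue-nucleotide pairs."""
    return max(p for row in prob_map for p in row)

def rank_candidates(contact_probs, top_k=None, threshold=None):
    """Rank candidates by binding score, highest first.

    `contact_probs` maps a candidate name to its matrix of p_ij values.
    Optional `threshold` keeps scores >= threshold; optional `top_k`
    then truncates the ranked list.
    """
    scored = [(name, binding_score(pm)) for name, pm in contact_probs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        scored = [pair for pair in scored if pair[1] >= threshold]
    if top_k is not None:
        scored = scored[:top_k]
    return scored

# Usage with placeholder probability maps.
probs = {
    "apt_a": [[0.1, 0.2], [0.3, 0.05]],
    "apt_b": [[0.9, 0.1]],
    "apt_c": [[0.4]],
}
print(rank_candidates(probs, top_k=2))
```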

This process is fully "zero-shot": FAFormer is trained only on protein–nucleic acid complex structures, with no aptamer screening labels or experimental affinity data required. Benchmarks include real-world aptamer–protein datasets with experimentally validated positives (e.g., GFP, NELF, HNRNPC, CHK2, UBLCP1), measuring Precision@10, Precision@50, and PR-AUC over ranked libraries. FAFormer exceeds previous methods (e.g., sequence-only transformers, SE(3)-Transformers, EGNN) by >10% relative F1/PR-AUC in contact prediction and up to +50% in top-10 screening precision (Huang et al., 2024).
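
For reference, Precision@K over a ranked library can be computed as in the sketch below. This uses the conventional definition (validated positives among the top K, divided by K); the exact evaluation protocol of the cited benchmark is not given in the text, so the function name and the choice to divide by K even when fewer than K candidates exist are assumptions.

```python
def precision_at_k(ranked_ids, positives, k: int) -> float:
    """Fraction of the top-k ranked candidates that are validated positives."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top = ranked_ids[:k]
    hits = sum(1 for candidate in top if candidate in positives)
    return hits / k

# Usage with placeholder identifiers.
ranked = ["a", "b", "c", "d"]
positives = {"a", "c"}
print(precision_at_k(ranked, positives, 1))
print(precision_at_k(ranked, positives, 2))
```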

5. Implementation, Benchmarking, and Performance

AptaFind is implemented in Python, exposing both command-line and GUI interfaces. Core dependencies are xmltodict, BeautifulSoup, Playwright, ColBERT, and the local Llama3.2 model. No cloud API keys or proprietary cloud compute are required; the pipeline executes locally and needs an Internet connection only for online queries. On a Mac Studio (M2 Max, 64 GB RAM), throughput is approximately 950 targets per hour ($\sim$4 s/target).

Evaluation using the University of Texas Aptamer Database (UTdb) snapshot (555 ligands) over three random 100-target subsets yields the following per-tier recoveries (mean ± std):

Metric	Value
Tier 1 Direct Extraction	$79.3\% \pm 0.6\%$
Tier 2 Curated Lead Coverage	$84.0\% \pm 3.5\%$
Tier 3 Exhaustive Discovery Coverage	$84.0\% \pm 3.5\%$
Throughput	$954 \pm 43$ targets/hr

Across all queries, AptaFind returns aptamer sequences for $\sim$80%, curated research leads for $\sim$84%, and exhaustive references when no sequence or lead is found. This degree of automation substantially reduces the manual burden of aptamer curation and literature navigation (Taghon, 12 Jan 2026).

6. Use Cases, Strengths, Limitations, and Future Directions

Typical applications include rapid systematic screening of candidate protein targets, augmentation of private PDF libraries with structured sequence databases, and navigation of paywalled literature by surfacing reference metadata. Strengths include near real-time throughput, fully local and cost-free operation, and a three-tier logic designed to return an actionable result for every query. The MAF principle, a strict separation of LM semantic checks from deterministic sequence handling, mitigates LLM hallucination risks.

Notable limitations are the inability to parse paywalled full texts (unless PDFs are manually provided), the lack of sequence extraction from images or non-text tables, and difficulty extracting nonstandard or qualitative affinity data. Planned future enhancements are multimodal pipelines for image/table sequence parsing, improved supplement linking, broader normalization of affinity units, and expansion of the three-tier/MAF paradigm to small molecules and synthetic biology components.

In summary, AptaFind offers a reproducible, privacy-preserving, and scalable approach to the dual challenges of aptamer literature mining and in silico protein–aptamer binding prediction. By blending semantic LMs with deterministic processing and E(3)-equivariant geometric modeling, it supports both curation and screening use cases with high throughput and rigor (Taghon, 12 Jan 2026; Huang et al., 2024).


References

Taghon (2026). AptaFind: A lightweight local interface for automated aptamer curation from scientific literature.

Huang et al. (2024). Protein-Nucleic Acid Complex Modeling with Frame Averaging Transformer.
