Proposal Similarity via Embedding Models
- The paper presents a method that maps proposal texts into fixed-length embeddings using transformer encoders and static models to compute cosine similarity.
- The approach standardizes heterogeneous, unstructured proposals through OCR text extraction, tokenization, and pooling, producing fixed-length, high-dimensional vectors.
- Empirical results demonstrate that thresholding and clustering of cosine similarities can effectively flag near-duplicates, reducing manual review efforts.
Proposal similarity via embedding models refers to the quantitative measurement of semantic overlap between research proposals by mapping variable-length proposal texts into fixed-dimensional vector representations and comparing these embeddings through geometrically grounded functions, most commonly cosine similarity. This methodology provides a fast, scalable, and internally consistent alternative to manual comparative reading, and has seen rapid adoption in large-scale review processes, scientific proposal selection, patent analysis, and related high-stakes document ranking domains (Ding et al., 11 Dec 2025, Ascione et al., 25 Mar 2024, Marjieh et al., 2022).
1. Embedding Model Architectures and Representation Choices
Modern proposal similarity pipelines center on transformer-based encoders, static embedding models, or hybrid concept representations:
- Transformer Encoders: Ding et al. ("LLMs Can Assist with Proposal Selection at Large User Facilities" (Ding et al., 11 Dec 2025)) utilize the Qwen3-embedding-8b model, an 8B-parameter transformer trained to map natural language proposals directly to 4096-dimensional dense vectors, with no additional task- or domain-specific fine-tuning. Sentence-Transformers (SBERT) and BERT derivatives are widely used for domain-adaptation via bi-encoder and Siamese architectures, supporting both generic and specialized domains (Ascione et al., 25 Mar 2024).
- Static Embeddings: word2vec and doc2vec offer 300-dimensional static representations by learning co-occurrence statistics over large text corpora (e.g., patent abstracts, funding proposals) (Ascione et al., 25 Mar 2024, Yang et al., 2017).
- Conceptual/Semantic Embedding: Approaches such as Semantic Concept Embeddings (CE) rely on deep syntactico-semantic parsing and graph-walk semantics to generate graph-informed embeddings that address word-sense disambiguation and semantic roles (Brück et al., 9 Jan 2024).
The choice of base representation impacts expressivity, computational cost, domain adaptation, and susceptibility to corpus artifacts.
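As a concrete, purely illustrative instance of the transformer-encoder route, the sketch below embeds a few proposal abstracts with the open-source sentence-transformers library; the checkpoint and example texts are placeholders, not the Qwen3-embedding-8b setup used by Ding et al.:

```python
# Minimal sketch: encode proposal texts into fixed-length vectors with an
# off-the-shelf SBERT-style bi-encoder (model choice is illustrative only).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any bi-encoder checkpoint works

proposals = [
    "Neutron scattering study of spin dynamics in frustrated magnets.",
    "Inelastic neutron measurements of magnon dispersion in a kagome lattice.",
    "High-throughput screening of battery electrolyte formulations.",
]

# normalize_embeddings=True yields unit-norm vectors, so plain dot products
# between rows are already cosine similarities.
embeddings = model.encode(proposals, normalize_embeddings=True)
print(embeddings.shape)  # (3, embedding_dim), e.g. (3, 384) for this checkpoint
```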
2. Preprocessing Pipeline: From Natural Document to Fixed-Length Embedding
The preprocessing workflow standardizes heterogeneous proposal formats and ensures compatibility with the chosen embedding architecture:
- Text Extraction: Conversion of native formats (e.g., PDF) to plain text, using OCR where needed, so that all proposals enter the pipeline in a uniform textual form (Ding et al., 11 Dec 2025).
- Tokenization: Subword or word-level tokenization (e.g., WordPiece for BERT, standard whitespace for word2vec) is applied for downstream embedding.
- Passage Treatment: For long documents exceeding window limits, chunk-wise or hierarchical strategies (sentence/paragraph pooling, windowed CNNs) are employed to generate global embeddings (Dimitrov, 2020).
- Pooling Mechanism: Mean pooling across token vectors, [CLS] representations, or section-specific pooling are commonly used. For concept-based approaches, graph node aggregations or tf–idf weighting over semantic nodes are performed (Brück et al., 9 Jan 2024).
This structured preprocessing supports the mapping of multi-page, variable-length proposals into compact representation vectors or densities.
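For proposals longer than the encoder's context window, one common chunk-and-pool pattern (assumed here as an illustration, not the specific strategy of any cited paper) is to embed fixed-size chunks and mean-pool the chunk vectors into a single document embedding:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint

def embed_long_proposal(text: str, chunk_words: int = 200) -> np.ndarray:
    """Chunk a long proposal by word count, embed each chunk, and mean-pool
    the chunk vectors into one document-level vector (illustrative strategy)."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)] or [""]
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    doc_vec = chunk_vecs.mean(axis=0)
    return doc_vec / np.linalg.norm(doc_vec)  # re-normalize after pooling
```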
3. Similarity Computation: Core Formulations
The canonical metric for proposal similarity in embedding space is the cosine similarity:

$$\mathrm{sim}(p_i, p_j) = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert}$$

where $\mathbf{v}_i$ and $\mathbf{v}_j$ are the embedding vectors of proposals $p_i$ and $p_j$. This formulation is standard across transformer-based, word2vec, doc2vec, and semantic concept models (Ding et al., 11 Dec 2025, Ascione et al., 25 Mar 2024, Marjieh et al., 2022, Brück et al., 9 Jan 2024). When embeddings are pre-normalized to unit norm (as with Qwen3-embedding-8b), the raw outputs may be compared without further rescaling (Ding et al., 11 Dec 2025).
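In code, when embeddings are stacked into an (N, d) matrix, the full pairwise similarity matrix reduces to one normalization step and a single matrix product; a minimal NumPy sketch (function name illustrative):

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities for an (N, d) matrix of proposal embeddings."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T  # entry (i, j) = cos(v_i, v_j)

# Example usage: sim = cosine_similarity_matrix(embeddings); sim[i, j] lies in [-1, 1]
```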
Alternative approaches may use Jensen–Shannon divergence (for density representations (Rushkin, 2020)) or kernel/fundamental matrix reconstruction errors (SLKE; (Kang et al., 2019)), but these remain less common in large-scale, production proposal review.
4. Thresholding, Clustering, and Interpretation
Post-similarity computation, the similarity matrix is interpreted for practical review tasks:
- Thresholding: No universal "magic" threshold exists. Ding et al. suggest practical investigative thresholds (e.g., cosine similarity > 0.75) to flag potentially duplicative proposals (Ding et al., 11 Dec 2025). Empirically, intra-cluster similarities concentrate around 0.6–0.8, while random pairs fall closer to 0.2–0.4.
- Clustering: Once proposals are embedded, unsupervised clustering algorithms (e.g., DBSCAN, hierarchical agglomerative clustering) are applied to the similarity (or distance) matrix to group proposals by topic, identify near-duplicates, and surface natural thematic structure (Ding et al., 11 Dec 2025); a sketch of this step follows this list.
- Statistical Analysis: Intra- and inter-cluster distributions, mean/variance analysis, and outlier detection (resubmissions, cross-team overlaps) are used to validate the method and guide human curation (Ding et al., 11 Dec 2025).
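A minimal sketch of the thresholding and clustering step, assuming a precomputed cosine-similarity matrix and using scikit-learn's agglomerative clustering on cosine distances; the threshold values mirror the illustrative ranges discussed above:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def flag_and_cluster(sim: np.ndarray, dup_threshold: float = 0.75,
                     cluster_distance: float = 0.4):
    """Flag near-duplicate pairs and group proposals by theme, given a
    precomputed (N, N) cosine-similarity matrix (illustrative settings)."""
    n = sim.shape[0]

    # 1) Flag candidate near-duplicates for manual review.
    iu, ju = np.triu_indices(n, k=1)
    flagged = [(int(i), int(j), float(sim[i, j]))
               for i, j in zip(iu, ju) if sim[i, j] > dup_threshold]

    # 2) Cluster on cosine distance = 1 - similarity.
    distance = 1.0 - sim
    labels = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=cluster_distance,
        metric="precomputed",
        linkage="average",
    ).fit_predict(distance)
    return flagged, labels
```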
5. Quantitative Effectiveness, Calibration, and Human Alignment
Embedding-based similarity pipelines offer robust empirical separation between true thematic overlaps and semantic noise:
- Case Study (SNS Proposals): Cosine similarities above 0.8 accurately identify actual resubmissions or near-duplicate proposals. Cycle-wide background means are typically ∼0.35 (std 0.15), underscoring the effectiveness of the approach in separating genuine thematic overlap from background variation (Ding et al., 11 Dec 2025).
- Calibration to Human Judgments: Procedures from Marjieh et al. (Marjieh et al., 2022) and others advocate collecting a small, manually judged calibration set (e.g., Likert-scaled similarity ratings on 200–500 proposal pairs), then fitting a ridge regression that maps raw cosine similarities onto the empirical human scale, improving alignment and interpretability (see the sketch after this list).
- Performance Metrics: Human-validated evaluations (Spearman ρ, R², precision@k, recall@k) confirm that embedding-driven approaches match or exceed traditional methods in top-k retrieval and pairwise discrimination, at orders-of-magnitude lower computational and personnel cost (Ding et al., 11 Dec 2025, Ascione et al., 25 Mar 2024, Ravfogel et al., 2023).
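A minimal sketch of the calibration procedure, assuming a small set of human Likert ratings paired with raw cosine similarities (all values and names below are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative calibration data: raw cosine similarities for human-rated
# proposal pairs and the corresponding mean Likert similarity judgments.
raw_cosine = np.array([[0.31], [0.52], [0.78], [0.66], [0.24]])  # shape (n_pairs, 1)
human_likert = np.array([1.5, 3.0, 6.5, 4.8, 1.2])               # e.g. 1-7 scale

calibrator = Ridge(alpha=1.0).fit(raw_cosine, human_likert)

# Map new raw similarities onto the human scale.
new_scores = calibrator.predict(np.array([[0.72], [0.40]]))
print(new_scores)
```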
Table: Empirical Ranges for Proposal Similarity Interpretation (Ding et al., 11 Dec 2025)
| Similarity Range | Interpretation | Typical Action |
|---|---|---|
| > 0.75 | Potential near duplicates | Flag for manual review |
| 0.6–0.8 | Thematically similar | Candidate for cluster co-membership |
| 0.35 ± 0.15 (mean ± sd) | Background/typical baseline | No action |
| 0.2–0.4 | Semantically distinct | Not similar |
6. Scalability, Computational Efficiency, and Operationalization
Scalability is among the principal advantages of embedding-based similarity:
- Computational Cost: Embedding generation scales linearly in the number of proposals N (a single forward pass through the encoder per proposal), while the full N × N similarity matrix is computed with highly optimized batched matrix multiplications, incurring negligible overhead at typical review-cycle scales (Ding et al., 11 Dec 2025).
- Throughput: Ding et al. report end-to-end vectorization of 50–100 proposals in under one minute using modern inference-optimized GPUs, with matrix similarity evaluation in seconds (Ding et al., 11 Dec 2025).
- Human Labor Cost Comparison: The embedding approach replaces exhaustive pairwise manual reading, whose cost grows prohibitively as the number of proposals N increases, with a consistent, unbiased, and reproducible similarity assessment.
Operational deployment typically consists of (i) an offline indexing phase (embedding all proposals into a search index such as FAISS), (ii) on-demand similarity computation, and (iii) built-in threshold/cluster logic for flagging, reporting, and integration with human review workflows (Ascione et al., 25 Mar 2024, Ding et al., 11 Dec 2025).
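A minimal sketch of the offline indexing and on-demand lookup phases using FAISS, assuming unit-normalized embeddings so that inner product equals cosine similarity (index type and helper names are illustrative):

```python
import faiss
import numpy as np

def build_index(embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """Offline phase: index unit-normalized (N, d) proposal embeddings.
    Inner product on unit vectors equals cosine similarity."""
    emb = np.ascontiguousarray(embeddings.astype("float32"))
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    return index

def most_similar(index: faiss.IndexFlatIP, query_vec: np.ndarray, k: int = 5):
    """On-demand phase: return (proposal_id, similarity) for the top-k matches."""
    q = np.ascontiguousarray(query_vec.astype("float32").reshape(1, -1))
    scores, ids = index.search(q, k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```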
7. Methodological Extensions and Domain Adaptation
Proposal similarity via embeddings is highly extensible:
- Domain Adaptation: For specialized domains (e.g., patent law, grant funding), domain-adapted SBERT models fine-tuned on large, weakly labeled or triplet-constructed same-topic/different-topic proposal pairs show significant gains, particularly when a suitable triplet margin and negative-mining strategy are adopted (Ascione et al., 25 Mar 2024); see the sketch after this list.
- Hybrid and Conceptual Embeddings: Random-walk concept embeddings and fusion with word2vec centroids offer improved disambiguation in high-polysemy corpora and better capture abstract semantics in proposals (Brück et al., 9 Jan 2024).
- Probabilistic and Kernel-Preserving Methods: Probabilistic frameworks (model-comparison via AIC (Vargas et al., 2019)) and kernel-preserving embeddings (SLKE (Kang et al., 2019)) provide alternatives for groupwise and manifold-aware similarity, though with higher computational cost and limited adoption in live proposal workflows.
- Calibration and Expansion: Human-in-the-loop calibration, use of description-based or concept-based aggregation, and advanced cluster-matching (e.g., NNGS for multi-scale graph-structural analysis (Tavares et al., 13 Nov 2024)) extend the utility and interpretability of similarity metrics.
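A sketch of the triplet-based domain-adaptation recipe referenced above, using the classic sentence-transformers training interface; the base checkpoint, example triplet, margin, and training settings are illustrative assumptions rather than the exact configuration of Ascione et al.:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative base checkpoint

# Weakly labeled triplets: (anchor, same-topic positive, different-topic negative).
train_examples = [
    InputExample(texts=[
        "Neutron diffraction of hydrogen storage alloys.",
        "In-situ diffraction study of metal hydride phase transitions.",
        "Protein crystallography beamline upgrade proposal.",
    ]),
    # ... additional mined triplets
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model, triplet_margin=0.3)  # margin is a tunable choice

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```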
In summary, embedding-based proposal similarity models have become central to scalable, reproducible, and semantically consistent assessment in review-heavy domains. Transformer-based encoders, careful preprocessing, and robust clustering and calibration pipelines enable rapid mapping of large, heterogeneous proposal sets into interpretable similarity spaces, substantially augmenting both automated and human-driven selection processes (Ding et al., 11 Dec 2025, Ascione et al., 25 Mar 2024, Marjieh et al., 2022).