Textual Spatial Cosine Similarity
- TSCS is a document similarity measure that blends cosine similarity with position-based metrics to capture both lexical overlap and word-order fidelity.
- It computes normalized spatial differences between term occurrences and integrates these with tf–idf vectors using a convex combination controlled by parameter α.
- Empirical evaluations show TSCS enhances paraphrase detection and stability in dynamic corpora, making it well suited to enterprise search, plagiarism checks, and content recommendation.
Textual Spatial Cosine Similarity (TSCS) is a document similarity measure designed to interpolate between conventional cosine similarity and a spatially aware metric that captures the order and placement of words within documents. TSCS provides a framework for balancing purely lexical overlap with position-based resemblance, enabling robust detection of paraphrases and near-duplicate content in large, real-time search systems using only linear computational resources (Crocetti, 2015).
1. Mathematical Foundation
The TSCS framework operates on two pre-processed documents, $D_1$ and $D_2$, which undergo tokenization, stop-word removal, and stemming. For each term $t$ present in both documents, the position of its $k$-th occurrence is denoted $p_{1,t,k}$ (in $D_1$) and $p_{2,t,k}$ (in $D_2$); $m_t$ is the minimal count of $t$ across $D_1$ and $D_2$. Only the first $m_t$ occurrences of each shared term are matched, yielding $M = \sum_t m_t$ total matched term-instances.
For each matched pair $(p_{1,t,k},\, p_{2,t,k})$, the normalized spatial difference is computed as:
$$d_{t,k} = \frac{\lvert p_{1,t,k} - p_{2,t,k} \rvert}{\max(\lvert D_1 \rvert, \lvert D_2 \rvert)}$$
with $0 \le d_{t,k} \le 1$.
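As a concrete illustration (a toy sketch, not an example from the paper), the normalized difference for a single matched occurrence can be computed directly; the function name is hypothetical:

```python
def normalized_spatial_difference(p1: int, p2: int, len1: int, len2: int) -> float:
    """Normalized positional gap between matched occurrences of a shared term.

    p1 and p2 are token positions of the k-th occurrence of the term in each
    document; dividing by the longer document keeps the value in [0, 1].
    """
    return abs(p1 - p2) / max(len1, len2)

# A term at position 2 in a 10-token document and position 5 in an 8-token one:
print(normalized_spatial_difference(2, 5, 10, 8))  # 0.3
```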
The spatial similarity, or Textual Space Similarity (TSS), is then defined:
$$\mathrm{TSS}(D_1, D_2) = 1 - \frac{1}{M} \sum_{t} \sum_{k=1}^{m_t} d_{t,k}$$
where $\mathrm{TSS} = 1$ indicates perfect positional alignment for all matched terms. In parallel, standard cosine similarity is given by
$$\cos(D_1, D_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lVert \mathbf{v}_1 \rVert \, \lVert \mathbf{v}_2 \rVert}$$
where $\mathbf{v}_1$ and $\mathbf{v}_2$ are the tf–idf vectors of the documents.
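The cosine component is the standard vector-space formula; a minimal sketch over plain weight vectors (any tf–idf weighting can be plugged in; the function name is an assumption of this sketch):

```python
import math

def cosine_similarity(v1: list[float], v2: list[float]) -> float:
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0 or n2 == 0:   # a zero vector has no direction
        return 0.0
    return dot / (n1 * n2)

# Parallel vectors score ~1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```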
TSCS forms a convex combination of these scores:
$$\mathrm{TSCS}(D_1, D_2) = \alpha \cdot \cos(D_1, D_2) + (1 - \alpha) \cdot \mathrm{TSS}(D_1, D_2)$$
with $\alpha \in [0, 1]$ controlling the balance. Two degenerate regimes emerge: $\alpha = 1$ yields pure cosine similarity, and $\alpha = 0$ yields a purely spatial (paraphrase-oriented) score.
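The blend itself is a one-line computation; the sketch below (hypothetical function name, hand-picked component scores) makes the two degenerate regimes explicit:

```python
def tscs(cos_sim: float, tss: float, alpha: float) -> float:
    """Convex combination of cosine and spatial similarity; alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cos_sim + (1.0 - alpha) * tss

# alpha = 1 recovers pure cosine; alpha = 0 recovers pure TSS.
print(tscs(0.8, 0.4, 1.0))  # 0.8
print(tscs(0.8, 0.4, 0.0))  # 0.4
print(tscs(0.8, 0.4, 0.5))  # ~0.6, an even blend
```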
2. Algorithmic Implementation and Complexity
The TSCS workflow comprises standard text pre-processing, tf–idf vectorization, and positional indexing. Algorithmic steps include:
- Generate term-position lists for all terms $t$ in both documents.
- For each $t$ present in both, align its first $m_t$ pairwise occurrences, summing the normalized positional differences.
- Calculate the TSS term as described above. If the documents share no terms, TSS defaults to 0.
- Combine the cosine and TSS scores via the $\alpha$-weighted convex combination to yield the final TSCS.
These steps translate directly into a short, linear-time routine.
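A self-contained Python sketch of the full pipeline, under two simplifying assumptions: documents arrive pre-tokenized, and raw term frequencies stand in for tf–idf weights (all function names are this sketch's, not the paper's):

```python
import math
from collections import defaultdict

def term_positions(tokens):
    """Map each term to the sorted list of its token positions."""
    pos = defaultdict(list)
    for i, tok in enumerate(tokens):
        pos[tok].append(i)
    return pos

def tss(tokens1, tokens2):
    """Textual Space Similarity: 1 minus the mean normalized positional
    difference over matched occurrences of shared terms."""
    p1, p2 = term_positions(tokens1), term_positions(tokens2)
    longest = max(len(tokens1), len(tokens2))
    diffs = []
    for term in p1.keys() & p2.keys():
        for a, b in zip(p1[term], p2[term]):  # zip matches the first m_t occurrences
            diffs.append(abs(a - b) / longest)
    if not diffs:          # no shared terms: no positional evidence
        return 0.0
    return 1.0 - sum(diffs) / len(diffs)

def cosine(tokens1, tokens2):
    """Cosine similarity over raw term-frequency vectors
    (tf-idf omitted for brevity; an assumption of this sketch)."""
    c1, c2 = defaultdict(int), defaultdict(int)
    for t in tokens1:
        c1[t] += 1
    for t in tokens2:
        c2[t] += 1
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def tscs(tokens1, tokens2, alpha=0.5):
    """Convex blend of cosine and TSS controlled by alpha."""
    return alpha * cosine(tokens1, tokens2) + (1 - alpha) * tss(tokens1, tokens2)

doc = "the cat sat on the mat".split()
print(tscs(doc, doc, alpha=0.5))  # ~1.0 for identical documents
```

Each pass over a document is a single loop, consistent with the linear complexity claimed below.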
Complexity is linear: $O(n)$ for pre-processing and vectorization, where $n$ is the total token count; $O(u)$ for the cosine computation, where $u$ is the number of unique terms; and $O(M)$ for the spatial matching over the $M \le n$ matched term-instances. TSCS thus incurs only a small constant overhead beyond cosine alone, making it viable for enterprise-scale document retrieval.
3. Parameterization and Configuration
The key TSCS control parameter is the blending weight $\alpha$, which determines the contribution of lexical versus spatial factors:
- For typical "clean" text, an intermediate value of $\alpha$ that weights both components is recommended.
- When input is noisy or positional data is unreliable (e.g., after OCR), an $\alpha$ closer to 1 increases reliance on tf–idf matching.
- In paraphrase detection, setting $\alpha = 0$ (pure TSS) empirically maximizes recall and accuracy.
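A simple sweep over $\alpha$ makes the configuration trade-off explicit. The component scores below are hypothetical, chosen to mimic a paraphrase-like pair (low lexical overlap, high positional agreement on shared terms):

```python
def tscs_blend(cos_sim: float, tss: float, alpha: float) -> float:
    """Convex combination of the two component scores."""
    return alpha * cos_sim + (1.0 - alpha) * tss

# Hypothetical paraphrase pair: weak tf-idf overlap, strong spatial agreement.
cos_sim, tss = 0.30, 0.85

# Lower alpha shifts weight toward the spatial evidence, raising the score.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"alpha={alpha:.2f}  TSCS={tscs_blend(cos_sim, tss, alpha):.3f}")
```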
TSCS has no further tunable hyperparameters, simplifying deployment and operational maintenance (Crocetti, 2015).
4. Empirical Evaluation
Empirical analysis establishes the effectiveness and robustness of TSCS across document similarity tasks.
Corpus Growth Stability
Experiments with growing background corpora found that TSCS (with a fixed blending weight $\alpha$) shows lower sensitivity to corpus expansion than cosine similarity alone. For two seed-pairs, TSCS similarity varied by only about 0.03–0.07 across seven corpus sizes, versus roughly 0.12 for cosine. This property makes TSCS stable in environments with dynamic, incrementally growing corpora.
| | TSCS Set #1 | TSCS Set #2 | Cosine Set #1 | Cosine Set #2 |
|---|---|---|---|---|
| Score (smallest → largest corpus) | 0.89→0.92 | 0.52→0.59 | 0.85→0.91 | 0.48→0.60 |
| Variation range | ~0.03–0.07 | ~0.03–0.07 | ~0.12 | ~0.12 |
Paraphrase Detection
TSCS substantially improves paraphrase recall over cosine. On textbook paraphrase examples, cosine scored below the paraphrase threshold, whereas TSCS with $\alpha = 0$ correctly surpassed it. On the SemEval-2012 paraphrase dataset (734 pairs), TSCS in its purely spatial setting recalled a markedly larger share of true paraphrases than thresholded cosine. Parameter sweeps confirm that paraphrase-detection performance is maximal at $\alpha = 0$.
5. Applications
TSCS is intended for high-throughput environments where semantic sensitivity is required without the cost of full semantic modeling.
- Enterprise search engines can use TSCS to combine the lexical scope of tf–idf with word-order awareness while preserving real-time response guarantees.
- Plagiarism and paraphrase detection tasks benefit from the high recall of TSCS in scenarios involving reordered or minimally reworded content.
- Content recommendation and thematic discovery are augmented by TSCS's ability to group documents based on both structural and topical similarity.
Practical deployments can accelerate TSCS by caching term positions at indexing, pruning low-frequency tokens, or leveraging locality-sensitive hashing for comparison acceleration.
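The first of these accelerations, caching term positions at indexing time, can be sketched as follows (the `PositionalIndex` class and its methods are hypothetical, a design sketch rather than an API from the paper):

```python
from collections import defaultdict

class PositionalIndex:
    """Caches per-document term-position lists at indexing time so that
    repeated TSCS comparisons avoid re-tokenizing documents."""

    def __init__(self):
        self._positions = {}   # doc_id -> {term: [token positions]}
        self._lengths = {}     # doc_id -> token count

    def add(self, doc_id: str, tokens: list[str]) -> None:
        """Index a pre-processed document once, at ingestion time."""
        pos = defaultdict(list)
        for i, tok in enumerate(tokens):
            pos[tok].append(i)
        self._positions[doc_id] = dict(pos)
        self._lengths[doc_id] = len(tokens)

    def tss(self, id1: str, id2: str) -> float:
        """Spatial similarity computed from cached positions only."""
        p1, p2 = self._positions[id1], self._positions[id2]
        longest = max(self._lengths[id1], self._lengths[id2])
        diffs = [abs(a - b) / longest
                 for term in p1.keys() & p2.keys()
                 for a, b in zip(p1[term], p2[term])]
        return 1.0 - sum(diffs) / len(diffs) if diffs else 0.0

index = PositionalIndex()
index.add("d1", "the cat sat on the mat".split())
index.add("d2", "the cat sat on the mat".split())
print(index.tss("d1", "d2"))  # 1.0
```

Because the position maps are built once per document, each pairwise comparison touches only the cached structures.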
6. Limitations and Prospective Directions
TSCS addresses only positional (word-order) similarity, without modeling deep semantics, world knowledge, or anaphoric references. Its performance may degrade in noisy data scenarios (e.g., OCR noise or embedded HTML), necessitating higher $\alpha$ values to mitigate positional uncertainty.
Further investigation is warranted comparing TSCS to richer semantic metrics, including WordNet-based approaches and neural embeddings, especially for document clustering and classification tasks beyond cosine baselines (Crocetti, 2015). A plausible implication is that integrating TSCS with such measures may capture complementary modes of document similarity.
By uniting a simple position-based penalty with industry-standard tf–idf metrics, TSCS yields a continuum of similarity functions suitable for scalable, real-time applications demanding both lexical and structural fidelity.