Textual Spatial Cosine Similarity

Updated 10 February 2026
  • TSCS is a document similarity measure that blends cosine similarity with position-based metrics to capture both lexical overlap and word-order fidelity.
  • It computes normalized spatial differences between term occurrences and integrates these with tf–idf vectors using a convex combination controlled by parameter α.
  • Empirical evaluations show TSCS enhances paraphrase detection and stability in dynamic corpora, making it ideal for enterprise search, plagiarism checks, and content recommendation.

Textual Spatial Cosine Similarity (TSCS) is a document similarity measure designed to interpolate between conventional cosine similarity and a spatially aware metric that captures the order and placement of words within documents. TSCS provides a framework for balancing purely lexical overlap with position-based resemblance, enabling robust detection of paraphrased and similar content in large, real-time search systems using only linear computational resources (Crocetti, 2015).

1. Mathematical Foundation

The TSCS framework operates on two pre-processed documents, $d_i$ and $d_j$, which undergo tokenization, stop-word removal, and stemming. For each term $t$ present in both documents, the position of its $k$-th occurrence is denoted $p_i^{t,k}$ (in $d_i$) and $p_j^{t,k}$ (in $d_j$); $m_t$ is the minimum count of $t$ across $d_i$ and $d_j$. Only the matched occurrences $k = 1, \dots, m_t$ are considered, yielding $M = \sum_t m_t$ total matched term-instances.

For each matched pair $(t, k)$, the normalized spatial difference is computed as:

$$\delta_{t,k} = \left| \frac{p_i^{t,k}}{|d_i|} - \frac{p_j^{t,k}}{|d_j|} \right|,$$

where $|d_i|$ and $|d_j|$ are the token counts of the two documents, with $\delta_{t,k} \in [0, 1]$.

The spatial similarity, or Textual Space Similarity (TSS), is then defined:

$$\mathrm{TSS}(d_i, d_j) = 1 - \frac{1}{M} \sum_{t} \sum_{k=1}^{m_t} \delta_{t,k},$$

where $\mathrm{TSS} = 1$ indicates perfect positional alignment for all matched terms. In parallel, standard cosine similarity is given by

$$\mathrm{CS}(d_i, d_j) = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert},$$

where $\mathbf{v}_i$ and $\mathbf{v}_j$ are the tf–idf vectors of the documents.
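To make the TSS computation concrete, here is a minimal Python sketch; tokenization, stop-word removal, and stemming are assumed already done, and the function and variable names are illustrative, not from the paper:

```python
from collections import defaultdict

def term_positions(tokens):
    """Map each term to the list of its 0-based positions."""
    pos = defaultdict(list)
    for i, tok in enumerate(tokens):
        pos[tok].append(i)
    return pos

def tss(tokens_i, tokens_j):
    """Textual Space Similarity: 1 minus the mean normalized
    positional difference over matched term occurrences."""
    pos_i, pos_j = term_positions(tokens_i), term_positions(tokens_j)
    n_i, n_j = len(tokens_i), len(tokens_j)
    deltas = []
    for term in pos_i.keys() & pos_j.keys():
        m_t = min(len(pos_i[term]), len(pos_j[term]))  # matched count
        for k in range(m_t):
            # positions are normalized by document length, so each
            # difference lies in [0, 1)
            deltas.append(abs(pos_i[term][k] / n_i - pos_j[term][k] / n_j))
    if not deltas:  # no shared terms: TSS defaults to 1
        return 1.0
    return 1.0 - sum(deltas) / len(deltas)
```

For identical token sequences every matched difference is zero and the function returns 1.0; for two documents sharing no terms it falls back to the default of 1.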

TSCS forms a convex combination of these scores:

$$\mathrm{TSCS}(d_i, d_j) = \alpha \, \mathrm{CS}(d_i, d_j) + (1 - \alpha) \, \mathrm{TSS}(d_i, d_j)$$

with $\alpha \in [0, 1]$ controlling the balance. Two degenerate regimes emerge: $\alpha = 1$ yields pure cosine similarity, and $\alpha = 0$ yields a purely spatial (paraphrase-oriented) score.
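A quick worked example of the blend, with numbers chosen purely for illustration: consider a paraphrase-like pair with low lexical overlap but strong positional agreement, say $\mathrm{CS}(d_i, d_j) = 0.30$ and $\mathrm{TSS}(d_i, d_j) = 0.90$. Then $\alpha = 0.5$ gives

$$\mathrm{TSCS} = 0.5 \cdot 0.30 + 0.5 \cdot 0.90 = 0.60,$$

while $\alpha = 1$ recovers the cosine score of $0.30$ and $\alpha = 0$ the spatial score of $0.90$.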

2. Algorithmic Implementation and Complexity

The TSCS workflow comprises standard text pre-processing, tf–idf vectorization, and positional indexing. Algorithmic steps include:

  • Generate term-position lists for all terms $t$ in both documents.
  • For each $t$ present in both, align the $m_t$ pairwise occurrences, summing normalized positional differences.
  • Calculate the TSS term as described above. If there are no shared terms, TSS defaults to 1.
  • Combine the cosine and TSS scores according to $\alpha$ to yield the final TSCS.

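The steps above can be sketched in Python. This is an illustrative reconstruction from the description, not the paper's original pseudocode; in particular, the idf here is smoothed over just the two-document pair so the example stays self-contained:

```python
import math
from collections import Counter, defaultdict

def term_positions(tokens):
    """Map each term to the list of its 0-based positions."""
    pos = defaultdict(list)
    for i, tok in enumerate(tokens):
        pos[tok].append(i)
    return pos

def tss(tokens_i, tokens_j):
    """Textual Space Similarity over matched term occurrences."""
    pos_i, pos_j = term_positions(tokens_i), term_positions(tokens_j)
    n_i, n_j = len(tokens_i), len(tokens_j)
    deltas = [abs(pos_i[t][k] / n_i - pos_j[t][k] / n_j)
              for t in pos_i.keys() & pos_j.keys()
              for k in range(min(len(pos_i[t]), len(pos_j[t])))]
    # no shared terms: TSS defaults to 1
    return (1.0 - sum(deltas) / len(deltas)) if deltas else 1.0

def cosine_tfidf(tokens_i, tokens_j):
    """Cosine similarity of tf-idf vectors; idf is smoothed over the
    two-document pair itself (a simplification for this sketch)."""
    tf_i, tf_j = Counter(tokens_i), Counter(tokens_j)
    def idf(t):
        df = (t in tf_i) + (t in tf_j)
        return math.log(3.0 / (1 + df)) + 1.0  # smoothed, N = 2 docs
    vocab = set(tf_i) | set(tf_j)
    v_i = {t: tf_i[t] * idf(t) for t in vocab}
    v_j = {t: tf_j[t] * idf(t) for t in vocab}
    dot = sum(v_i[t] * v_j[t] for t in vocab)
    norm = (math.sqrt(sum(x * x for x in v_i.values()))
            * math.sqrt(sum(x * x for x in v_j.values())))
    return dot / norm if norm else 0.0

def tscs(tokens_i, tokens_j, alpha=0.5):
    """Convex combination: alpha * cosine + (1 - alpha) * TSS."""
    return (alpha * cosine_tfidf(tokens_i, tokens_j)
            + (1 - alpha) * tss(tokens_i, tokens_j))
```

Note how the two regimes behave on reordered text: for `["a", "b"]` versus `["b", "a"]` the tf–idf vectors are identical, so `tscs(..., alpha=1.0)` returns 1.0, while `tscs(..., alpha=0.0)` penalizes the swapped positions and returns 0.5.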

Complexity is linear: $O(n)$ for pre-processing and vectorization, where $n$ is the document length in tokens; $O(|V|)$ for the cosine computation, where $|V|$ is the number of unique terms; and $O(M)$ for the spatial matching, where $M = \sum_t m_t$ is the number of matched term-instances. TSCS thus incurs only a small constant overhead beyond cosine alone, making it viable for enterprise-scale document retrieval.

3. Parameterization and Configuration

The key TSCS control parameter is the blending weight $\alpha$, which determines the contribution of lexical versus spatial factors:

  • For typical "clean" text, an intermediate value of $\alpha$, weighting both components, is recommended.
  • When input is noisy or positional data is unreliable (e.g., after OCR), a higher $\alpha$ closer to 1 increases reliance on tf–idf matching.
  • In paraphrase detection, setting $\alpha = 0$ (pure TSS) empirically maximizes recall and accuracy.

TSCS has no further tunable hyperparameters, simplifying deployment and operational maintenance (Crocetti, 2015).

4. Empirical Evaluation

Empirical analysis establishes the effectiveness and robustness of TSCS across document similarity tasks.

Corpus Growth Stability

Experiments with growing background corpora found that TSCS shows lower sensitivity to corpus expansion than cosine similarity alone. For two seed-pairs, TSCS similarity varied by roughly 0.03–0.07 across seven corpus sizes, versus roughly 0.12 for cosine. This property makes TSCS stable for environments with dynamic, incrementally growing corpora.

Corpus size       | TSCS Set #1 | TSCS Set #2 | Cosine Set #1 | Cosine Set #2
min → max         | 0.89 → 0.92 | 0.52 → 0.59 | 0.85 → 0.91   | 0.48 → 0.60
Variation range   | ~0.03–0.07  | ~0.03–0.07  | ~0.12         | ~0.12

Paraphrase Detection

TSCS substantially improves paraphrase recall over cosine. On textbook paraphrase examples, cosine scored below the paraphrase threshold, whereas pure TSS ($\alpha = 0$) correctly exceeded it. On the SemEval-2012 paraphrase dataset (734 pairs), TSCS with $\alpha = 0$ recalled a substantially larger fraction of true paraphrases than cosine at its operating threshold. Parameter sweeps confirm that paraphrase-detection performance is maximized at $\alpha = 0$.

5. Applications

TSCS is intended for high-throughput environments where semantic sensitivity is required without the cost of full semantic modeling.

  • Enterprise search engines can use TSCS to combine the lexical scope of tf–idf with word-order awareness while preserving real-time response guarantees.
  • Plagiarism and paraphrase detection tasks benefit from the high recall of TSCS in scenarios involving reordered or minimally reworded content.
  • Content recommendation and thematic discovery are augmented by TSCS's ability to group documents based on both structural and topical similarity.

Practical deployments can accelerate TSCS by caching term positions at indexing, pruning low-frequency tokens, or leveraging locality-sensitive hashing for comparison acceleration.
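The first of these optimizations, caching term positions at indexing time, can be sketched as follows (the class and its interface are my own illustration, not part of the paper):

```python
from collections import defaultdict

class PositionalIndex:
    """Caches per-document term-position lists at indexing time, so
    repeated pairwise TSS comparisons avoid re-scanning documents."""

    def __init__(self):
        self._positions = {}  # doc_id -> {term: [positions]}
        self._lengths = {}    # doc_id -> token count

    def add(self, doc_id, tokens):
        """Index a pre-processed document once."""
        pos = defaultdict(list)
        for i, tok in enumerate(tokens):
            pos[tok].append(i)
        self._positions[doc_id] = dict(pos)
        self._lengths[doc_id] = len(tokens)

    def tss(self, id_i, id_j):
        """TSS between two already-indexed documents."""
        pos_i, pos_j = self._positions[id_i], self._positions[id_j]
        n_i, n_j = self._lengths[id_i], self._lengths[id_j]
        deltas = [abs(pos_i[t][k] / n_i - pos_j[t][k] / n_j)
                  for t in pos_i.keys() & pos_j.keys()
                  for k in range(min(len(pos_i[t]), len(pos_j[t])))]
        return (1.0 - sum(deltas) / len(deltas)) if deltas else 1.0
```

With positions precomputed, each comparison touches only the cached dictionaries, which is what keeps per-pair cost linear in the number of matched occurrences.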

6. Limitations and Prospective Directions

TSCS addresses only positional (word-order) similarity, without modeling deep semantics, world knowledge, or anaphoric references. Its performance may degrade in noisy data scenarios (e.g., OCR noise or embedded HTML), necessitating higher $\alpha$ values to mitigate positional uncertainty.

Further investigation is warranted comparing TSCS to richer semantic metrics, including WordNet-based approaches and neural embeddings, especially for document clustering and classification tasks beyond cosine baselines (Crocetti, 2015). A plausible implication is that integrating TSCS with such measures may capture complementary modes of document similarity.

By uniting a simple position-based penalty with industry-standard tf–idf metrics, TSCS yields a continuum of similarity functions suitable for scalable, real-time applications demanding both lexical and structural fidelity.

References

Crocetti, G. (2015). Textual Spatial Cosine Similarity.
