Witness-Based Similarity Systems (REWA)

Updated 6 December 2025
  • Witness-Based Similarity Systems (REWA) is a framework that formalizes similarity by encoding concepts as structured sets of witnesses, where overlap defines semantic relationships.
  • The methodology leverages binary encodings and channel capacity bounds to achieve provably optimal bit complexities while ensuring rigorous ranking guarantees.
  • The framework unifies classical methods such as Bloom filters, MinHash, and LSH through a composable design that emphasizes ranking preservation over absolute metric distances.

Witness-Based Similarity Systems (REWA) formalize similarity by encoding concepts as structured sets of “witnesses” whose overlaps define semantic relationships. This framework precisely links similarity search to information-theoretic principles, notably mutual information and channel capacity, yielding provably optimal bit complexities for large-scale retrieval. The witness-centric perspective enables modular construction, rigorous ranking guarantees, and unification of classical and modern similarity methods under a single foundational theory (Phadke, 29 Nov 2025, Phadke, 25 Nov 2025).

1. Formal Framework and Core Definitions

A witness-based similarity system is defined as a tuple $(\mathcal V, \Omega, \{W(v)\}_{v \in \mathcal V}, \Delta)$, where $\mathcal V$ is a finite universe of concepts and $\Omega$ is the universe of witnesses. Each concept $v \in \mathcal V$ has a finite witness set $W(v) \subseteq \Omega$. Similarity between any pair $(u, v)$ is encoded by the raw overlap $\Delta(u, v) = |W(u) \cap W(v)|$ (Phadke, 29 Nov 2025, Phadke, 25 Nov 2025).

The associated random variable $W_v$ takes values in $\Omega$ with probability mass function $p_v(w) = \frac{1}{|W(v)|}$ in the Boolean variant. Binary encodings are constructed as $B: \mathcal V \to \{0,1\}^m$, mapping each concept to an $m$-bit vector, typically through semi-random bit assignments per witness. The expected Hamming inner product between codes, $\mathbb{E}[\langle B(u), B(v)\rangle]$, is affine in $\Delta(u, v)$.
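
A minimal sketch of such an encoding in Python, assuming a single hash function per witness (the $K = 1$ case of the complexity discussion in Section 6); the function names `encode` and `inner` are illustrative, not taken from the source papers:

```python
import hashlib

def encode(witnesses, m, seed=0):
    """Map a finite witness set W(v) to an m-bit vector by hashing each
    witness to a bit position and setting that position to 1."""
    code = [0] * m
    for w in witnesses:
        digest = hashlib.sha256(f"{seed}:{w}".encode()).digest()
        code[int.from_bytes(digest[:8], "big") % m] = 1
    return code

def inner(bu, bv):
    """Hamming inner product; its expectation is affine in Delta(u, v),
    since shared witnesses collide deterministically while distinct
    witnesses collide only accidentally, with probability ~1/m."""
    return sum(a & b for a, b in zip(bu, bv))

# Two concepts sharing 2 of their 3 witnesses: Delta(u, v) = 2.
W_u = {"pet", "mammal", "feline"}
W_v = {"pet", "mammal", "canine"}
m = 1024
print(inner(encode(W_u, m), encode(W_v, m)))  # concentrates near 2 for large m
```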

2. Information-Theoretic Equivalence

A central result is the explicit isomorphism between witness overlap and mutual information. For concepts $x, y$ with uniform witness distributions,

$$I(W_x; W_y) = \log \frac{|W(x)| \cdot |W(y)|}{|W(x) \cup W(y)|}$$

In the symmetric case $|W(x)| = |W(y)| = L$,

$$I(W_x; W_y) = \log \frac{L^2}{2L - \Delta(x, y)}$$

Thus semantic similarity, as quantified by witness overlap, acquires a physical unit: bits of mutual information. The correspondence is strictly monotonic in the normalized overlap, and observing $x$ reveals $I(W_x; W_y)$ bits about $y$ (Phadke, 29 Nov 2025).
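
Taking logarithms base 2 so the result is in bits, a short numerical check of the symmetric formula (the values of $L$ and $\Delta$ below are chosen purely for illustration):

```python
import math

def mutual_info_bits(L, delta):
    """I(W_x; W_y) = log2(L^2 / (2L - delta)) for |W(x)| = |W(y)| = L."""
    return math.log2(L**2 / (2 * L - delta))

L = 100
for delta in (0, 50, 100):
    print(delta, round(mutual_info_bits(L, delta), 3))
# Output rises monotonically with overlap; at delta = L (identical witness
# sets) it reaches log2(L), the full entropy of one uniform witness draw.
```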

3. Encoding Complexity: Channel Capacity and Lower Bounds

Binary encodings of witnesses constitute uses of a binary asymmetric channel:

$$P[Z=1 \mid X=1] = 1, \qquad P[Z=1 \mid X=0] = p = \frac{1}{m}$$

where $X$ indicates a true witness collision and $Z$ observes an accidental hash collision. The channel capacity $C(\Delta) = \Theta(\Delta^2)$ governs how reliably similarity can be encoded and recovered.

The REWA capacity bound states:

$$m \geq \frac{\log(N/\delta)}{C(\Delta)} = \Theta(\Delta^{-2} \log(N/\delta))$$

for distinguishing neighbors (overlap $\Delta$) from non-neighbors with failure probability $\delta$ in a database of size $N$. No encoding can achieve ranking preservation with fewer bits, as given by the matching lower bound $m = \Omega(\Delta^{-2} \log(N/\delta))$ (Phadke, 29 Nov 2025).
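
A back-of-the-envelope calculator for this bound, treating the hidden constant in $C(\Delta) = \Theta(\Delta^2)$ as 1 and $\Delta$ as a normalized overlap gap; both are simplifying assumptions for illustration, not values fixed by the papers:

```python
import math

def min_bits(N, gap, fail_prob, c=1.0):
    """REWA capacity bound m >= log(N / delta) / C(Delta), with
    C(Delta) = c * Delta^2 assumed up to the constant c."""
    return math.log(N / fail_prob) / (c * gap**2)

# Separating neighbors at overlap gap 0.1 among N = 10^6 concepts
# with failure probability 10^-3:
print(f"{min_bits(1e6, 0.1, 1e-3):,.0f} bits")  # ~2,072 bits
```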

4. Ranking Preservation and Rate-Distortion

Beyond binary predicates, REWA encodings can be constructed to preserve top-$k$ neighbor rankings under a distortion metric $d_k$:

$$d_k(\mathcal R_q, \widehat{\mathcal R}_q) = \mathbf{1}[\operatorname{Top}_k(\widehat{\mathcal R}_q) \neq \operatorname{Top}_k(\mathcal R_q)]$$

Encoding length must obey rate-distortion constraints:

$$m \geq R(\epsilon) \approx \log \binom{N}{k} - H_2(\epsilon)$$

where $H_2(\epsilon)$ is the binary entropy and $R(\epsilon) \approx k \log(N/k)$ for small $\epsilon$. Distributing this budget across witness encodings recovers the $\Delta^{-2} \log N$ complexity scaling (Phadke, 29 Nov 2025). REWA does not preserve metric distances, only ranking orderings.
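
A sketch evaluating the rate bound numerically, with the binary entropy in bits and $\log\binom{N}{k}$ computed via log-gamma (the helper names are hypothetical):

```python
import math

def binary_entropy(eps):
    """H2(eps) in bits."""
    if eps in (0.0, 1.0):
        return 0.0
    return -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)

def rate_lower_bound(N, k, eps):
    """R(eps) ~ log2 C(N, k) - H2(eps): the bits needed to name a
    top-k set, minus the allowance from tolerating distortion eps."""
    log2_binom = (math.lgamma(N + 1) - math.lgamma(k + 1)
                  - math.lgamma(N - k + 1)) / math.log(2)
    return log2_binom - binary_entropy(eps)

print(round(rate_lower_bound(10**6, 10, 0.01)))  # ~177 bits
# Compare the small-eps approximation k * log2(N / k) ~ 166 bits.
```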

5. Compositionality and Unified Reconstruction

REWA’s compositional design admits arbitrary data transformation pipelines terminating in finite witness sets. Any sequence of structural, causal, temporal, topological, or semantic transformations is admissible. The framework reconstructs several classical data structures as special cases:

| Classical System | REWA Witness Definition | Encoding Mechanism |
|---|---|---|
| Bloom filter | $W(v) = \{v\}$ | Hash positions set to 1 |
| MinHash (Jaccard) | $W(v)$ = feature set | Permutation × witness tokens |
| SimHash / LSH bitmap | Hyperplane tokens | Side of hyperplane → bit position |
| Hierarchical filter | Multi-level cluster representatives | Aggregate over all levels |

Millions of composable similarity definitions inherit logarithmic encoding complexity via the witness-set abstraction (Phadke, 25 Nov 2025).
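
As a concrete instance of this reconstruction, a minimal MinHash sketch viewed through the REWA lens: $W(v)$ is a feature set, and each "permutation" (simulated here with a seeded hash) contributes one encoded witness token. This is an illustrative sketch, not the papers' reference implementation:

```python
import hashlib

def minhash_signature(witnesses, num_perms=64):
    """One min-wise hash value per simulated permutation; the fraction of
    matching signature entries estimates Jaccard similarity, i.e. the
    normalized witness overlap."""
    return [
        min(int.from_bytes(hashlib.sha256(f"{i}:{w}".encode()).digest()[:8], "big")
            for w in witnesses)
        for i in range(num_perms)
    ]

A = {"red", "round", "fruit", "sweet"}
B = {"red", "round", "vegetable"}
sig_a, sig_b = minhash_signature(A), minhash_signature(B)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(estimate)  # approximates |A ∩ B| / |A ∪ B| = 2/5
```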

6. Computational and Storage Complexity

Typical bit complexity is

$$m = O\!\left(\frac{L}{K \Delta^2}\bigl(\log |\mathcal V| + \log(1/\delta)\bigr)\right)$$

where $L$ is the witness set size and $K$ is the number of hash functions per witness. Encoding time per element is $O(K \cdot |W(v)|)$. Naïve query time scales as $O(|\mathcal V| \cdot m)$, but inverted-index methods often yield substantial practical acceleration, as sketched below. Trade-offs include the choice of $K$ for variance concentration and the management of code density.
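
A sketch of that inverted-index acceleration, indexing concepts by the bit positions their codes set so a query touches only colliding candidates (the helper names are hypothetical, not an API from the papers):

```python
from collections import defaultdict

def build_index(codes):
    """Map each set bit position to the concepts whose codes set it.
    `codes` is a dict: concept -> m-bit list."""
    index = defaultdict(set)
    for concept, code in codes.items():
        for pos, bit in enumerate(code):
            if bit:
                index[pos].add(concept)
    return index

def query(index, query_code):
    """Score candidates by shared set bits. Only concepts colliding with
    the query on at least one position are touched, avoiding the naive
    O(|V| * m) scan over every stored code."""
    scores = defaultdict(int)
    for pos, bit in enumerate(query_code):
        if bit:
            for concept in index.get(pos, ()):
                scores[concept] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])
```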

7. Limitations and Open Directions

REWA requires an overlap-gap condition: the top-$k$ relevant concepts must be separated from non-neighbors by a gap $\Delta > 0$ in witness overlap, and the guarantees degrade without such separation. Hash functions must approximately realize min-entropy; adversarial or correlated data can introduce errors. The theory is ranking-centric and assumes offline witness sets. Engineering aspects such as cache locality and distributed computation are orthogonal to REWA’s axioms.

Active research directions include multi-bit/weighted witnesses, learned witness selection, streaming/online REWA, differentiable encoding via Gumbel-Softmax relaxations, entropy-aware allocation, and extensions to non-set witness representations (multisets, structures, soft overlaps) (Phadke, 25 Nov 2025).


REWA systems embody a mathematical equivalence between witness-based similarity and Shannon information theory. Witness overlap realizes mutual information; encoding and retrieval correspond to communication over noisy channels; and top-$k$ ranking preservation is governed by rate-distortion theory. The REWA formalism generalizes and subsumes major similarity search paradigms, yields provably optimal encoding bounds, and suggests principled modularity for composable relational retrieval.
