Witness-Based Similarity Systems (REWA)

Updated 6 December 2025
  • Witness-Based Similarity Systems (REWA) is a framework that formalizes similarity by encoding concepts as structured sets of witnesses, where overlap defines semantic relationships.
  • The methodology leverages binary encodings and channel capacity bounds to achieve provably optimal bit complexities while ensuring rigorous ranking guarantees.
  • The framework unifies classical methods such as Bloom filters, MinHash, and LSH through a composable design that emphasizes ranking preservation over absolute metric distances.

Witness-Based Similarity Systems (REWA) formalize similarity by encoding concepts as structured sets of “witnesses” whose overlaps define semantic relationships. This framework precisely links similarity search to information-theoretic principles, notably mutual information and channel capacity, yielding provably optimal bit complexities for large-scale retrieval. The witness-centric perspective enables modular construction, rigorous ranking guarantees, and unification of classical and modern similarity methods under a single foundational theory (Phadke, 29 Nov 2025, Phadke, 25 Nov 2025).

1. Formal Framework and Core Definitions

A witness-based similarity system is defined as a tuple $(\mathcal V, \Omega, \{W(v)\}_{v \in \mathcal V}, \Delta)$, where $\mathcal V$ is a finite universe of concepts and $\Omega$ is the universe of witnesses. Each concept $v \in \mathcal V$ has a finite witness set $W(v) \subseteq \Omega$. Similarity between any pair $(u, v)$ is encoded by the raw overlap $\Delta(u, v) = |W(u) \cap W(v)|$ (Phadke, 29 Nov 2025, Phadke, 25 Nov 2025).

The associated random variable $W_v$ takes values in $\Omega$ with probability mass function $p_v(w) = \frac{1}{|W(v)|}$ in the Boolean variant. Binary encodings are constructed as $B: \mathcal V \to \{0,1\}^m$, mapping each concept to an $m$-bit vector, typically through semi-random bit assignments per witness. The expected Hamming inner product between codes, $\mathbb{E}[\langle B(u), B(v)\rangle]$, is affine in $\Delta(u, v)$.
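
A minimal sketch of such an encoding in Python, assuming a single hash function per witness (the $K = 1$ case of the complexity discussion in Section 6); the function names `encode` and `inner` are illustrative, not taken from the source papers:

```python
import hashlib

def encode(witnesses, m, seed=0):
    """Map a finite witness set W(v) to an m-bit vector by hashing each
    witness to a bit position and setting that position to 1."""
    code = [0] * m
    for w in witnesses:
        digest = hashlib.sha256(f"{seed}:{w}".encode()).digest()
        code[int.from_bytes(digest[:8], "big") % m] = 1
    return code

def inner(bu, bv):
    """Hamming inner product; its expectation is affine in Delta(u, v),
    since shared witnesses collide deterministically while distinct
    witnesses collide only accidentally, with probability ~1/m."""
    return sum(a & b for a, b in zip(bu, bv))

# Two concepts sharing 2 of their 3 witnesses: Delta(u, v) = 2.
W_u = {"pet", "mammal", "feline"}
W_v = {"pet", "mammal", "canine"}
m = 1024
print(inner(encode(W_u, m), encode(W_v, m)))  # concentrates near 2 for large m
```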

2. Information-Theoretic Equivalence

A central result is the explicit isomorphism between witness overlap and mutual information. For concepts $x, y$ with uniform witness distributions,

$$I(W_x; W_y) = \log \frac{|W(x)| \cdot |W(y)|}{|W(x) \cup W(y)|}$$

In the symmetric case $|W(x)| = |W(y)| = L$,

$$I(W_x; W_y) = \log \frac{L^2}{2L - \Delta(x, y)}$$

Thus semantic similarity, as quantified by witness overlap, acquires a physical unit: bits of mutual information. The correspondence is strictly monotonic in the normalized overlap, and observing $x$ reveals $I(W_x; W_y)$ bits about $y$ (Phadke, 29 Nov 2025).
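
Taking logarithms base 2 so the result is in bits, a short numerical check of the symmetric formula (the values of $L$ and $\Delta$ below are chosen purely for illustration):

```python
import math

def mutual_info_bits(L, delta):
    """I(W_x; W_y) = log2(L^2 / (2L - delta)) for |W(x)| = |W(y)| = L."""
    return math.log2(L**2 / (2 * L - delta))

L = 100
for delta in (0, 50, 100):
    print(delta, round(mutual_info_bits(L, delta), 3))
# Output rises monotonically with overlap; at delta = L (identical witness
# sets) it reaches log2(L), the full entropy of one uniform witness draw.
```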

3. Encoding Complexity: Channel Capacity and Lower Bounds

Binary encodings of witnesses constitute uses of a binary asymmetric channel:

$$P[Z=1 \mid X=1] = 1, \qquad P[Z=1 \mid X=0] = p = \frac{1}{m}$$

where $X$ indicates a true witness collision and $Z$ observes an accidental hash collision. The channel capacity $C(\Delta) = \Theta(\Delta^2)$ governs how reliably similarity can be encoded and recovered.

The REWA capacity bound states:

$$m \geq \frac{\log(N/\delta)}{C(\Delta)} = \Theta(\Delta^{-2} \log(N/\delta))$$

for distinguishing neighbors (overlap $\Delta$) from non-neighbors with failure probability $\delta$ in a database of size $N$. No encoding can achieve ranking preservation with fewer bits, as given by the matching lower bound $m = \Omega(\Delta^{-2} \log(N/\delta))$ (Phadke, 29 Nov 2025).
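
A back-of-the-envelope calculator for this bound, treating the hidden constant in $C(\Delta) = \Theta(\Delta^2)$ as 1 and $\Delta$ as a normalized overlap gap; both are simplifying assumptions for illustration, not values fixed by the papers:

```python
import math

def min_bits(N, gap, fail_prob, c=1.0):
    """REWA capacity bound m >= log(N / delta) / C(Delta), with
    C(Delta) = c * Delta^2 assumed up to the constant c."""
    return math.log(N / fail_prob) / (c * gap**2)

# Separating neighbors at overlap gap 0.1 among N = 10^6 concepts
# with failure probability 10^-3:
print(f"{min_bits(1e6, 0.1, 1e-3):,.0f} bits")  # ~2,072 bits
```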

4. Ranking Preservation and Rate-Distortion

Beyond binary predicates, REWA encodings can be constructed to preserve top-$k$ neighbor rankings under a distortion metric $d_k$:

$$d_k(\mathcal R_q, \widehat{\mathcal R}_q) = \mathbf{1}[\operatorname{Top}_k(\widehat{\mathcal R}_q) \neq \operatorname{Top}_k(\mathcal R_q)]$$

Encoding length must obey rate-distortion constraints:

$$m \geq R(\epsilon) \approx \log \binom{N}{k} - H_2(\epsilon)$$

where $H_2(\epsilon)$ is the binary entropy and $R(\epsilon) \approx k \log(N/k)$ for small $\epsilon$. Distributing this budget across witness encodings recovers the $\Delta^{-2} \log N$ complexity scaling (Phadke, 29 Nov 2025). REWA does not preserve metric distances, only ranking orderings.
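
A sketch evaluating the rate bound numerically, with the binary entropy in bits and $\log\binom{N}{k}$ computed via log-gamma (the helper names are hypothetical):

```python
import math

def binary_entropy(eps):
    """H2(eps) in bits."""
    if eps in (0.0, 1.0):
        return 0.0
    return -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)

def rate_lower_bound(N, k, eps):
    """R(eps) ~ log2 C(N, k) - H2(eps): the bits needed to name a
    top-k set, minus the allowance from tolerating distortion eps."""
    log2_binom = (math.lgamma(N + 1) - math.lgamma(k + 1)
                  - math.lgamma(N - k + 1)) / math.log(2)
    return log2_binom - binary_entropy(eps)

print(round(rate_lower_bound(10**6, 10, 0.01)))  # ~177 bits
# Compare the small-eps approximation k * log2(N / k) ~ 166 bits.
```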

5. Compositionality and Unified Reconstruction

REWA’s compositional design admits arbitrary data transformation pipelines terminating in finite witness sets. Any sequence of structural, causal, temporal, topological, or semantic transformations is admissible. The framework reconstructs several classical data structures as special cases:

| Classical System | REWA Witness Definition | Encoding Mechanism |
|---|---|---|
| Bloom filter | $W(v) = \{v\}$ | Hash positions set to 1 |
| MinHash (Jaccard) | $W(v)$ = feature set | Permutation × witness tokens |
| SimHash / LSH bitmap | Hyperplane tokens | Side of hyperplane → bit position |
| Hierarchical filter | Multi-level cluster representatives | Aggregate over all levels |

Millions of composable similarity definitions inherit logarithmic encoding complexity via the witness-set abstraction (Phadke, 25 Nov 2025).
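
As a concrete instance of this reconstruction, a minimal MinHash sketch viewed through the REWA lens: $W(v)$ is a feature set, and each "permutation" (simulated here with a seeded hash) contributes one encoded witness token. This is an illustrative sketch, not the papers' reference implementation:

```python
import hashlib

def minhash_signature(witnesses, num_perms=64):
    """One min-wise hash value per simulated permutation; the fraction of
    matching signature entries estimates Jaccard similarity, i.e. the
    normalized witness overlap."""
    return [
        min(int.from_bytes(hashlib.sha256(f"{i}:{w}".encode()).digest()[:8], "big")
            for w in witnesses)
        for i in range(num_perms)
    ]

A = {"red", "round", "fruit", "sweet"}
B = {"red", "round", "vegetable"}
sig_a, sig_b = minhash_signature(A), minhash_signature(B)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(estimate)  # approximates |A ∩ B| / |A ∪ B| = 2/5
```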

6. Computational and Storage Complexity

Typical bit complexity is

$$m = O\!\left(\frac{L}{K \Delta^2}\bigl(\log |\mathcal V| + \log(1/\delta)\bigr)\right)$$

where $L$ is the witness set size and $K$ is the number of hash functions per witness. Encoding time per element is $O(K \cdot |W(v)|)$. Naïve query time scales as $O(|\mathcal V| \cdot m)$, but inverted-index methods often yield substantial practical acceleration, as sketched below. Trade-offs include the choice of $K$ for variance concentration and the management of code density.
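
A sketch of that inverted-index acceleration, indexing concepts by the bit positions their codes set so a query touches only colliding candidates (the helper names are hypothetical, not an API from the papers):

```python
from collections import defaultdict

def build_index(codes):
    """Map each set bit position to the concepts whose codes set it.
    `codes` is a dict: concept -> m-bit list."""
    index = defaultdict(set)
    for concept, code in codes.items():
        for pos, bit in enumerate(code):
            if bit:
                index[pos].add(concept)
    return index

def query(index, query_code):
    """Score candidates by shared set bits. Only concepts colliding with
    the query on at least one position are touched, avoiding the naive
    O(|V| * m) scan over every stored code."""
    scores = defaultdict(int)
    for pos, bit in enumerate(query_code):
        if bit:
            for concept in index.get(pos, ()):
                scores[concept] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])
```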

7. Limitations and Open Directions

REWA requires an overlap-gap condition: the top-$k$ relevant concepts must be separated from non-neighbors by a gap $\Delta > 0$ in witness overlap, and the guarantees degrade without such separation. Hash functions must approximately realize min-entropy; adversarial or correlated data can introduce errors. The theory is ranking-centric and assumes offline witness sets. Engineering aspects such as cache locality and distributed computation are orthogonal to REWA’s axioms.

Active research directions include multi-bit/weighted witnesses, learned witness selection, streaming/online REWA, differentiable encoding via Gumbel-Softmax relaxations, entropy-aware allocation, and extensions to non-set witness representations (multisets, structures, soft overlaps) (Phadke, 25 Nov 2025).


REWA systems embody a mathematical equivalence between witness-based similarity and Shannon information theory. Witness overlap realizes mutual information; encoding and retrieval correspond to communication over noisy channels; and top-$k$ ranking preservation is governed by rate-distortion theory. The REWA formalism generalizes and subsumes major similarity search paradigms, yields provably optimal encoding bounds, and suggests principled modularity for composable relational retrieval.
