Witness-Based Similarity Systems (REWA)
- Witness-Based Similarity Systems (REWA) form a framework that formalizes similarity by encoding concepts as structured sets of witnesses, where overlap defines semantic relationships.
- The methodology leverages binary encodings and channel capacity bounds to achieve provably optimal bit complexities while ensuring rigorous ranking guarantees.
- The framework unifies classical methods such as Bloom filters, MinHash, and LSH through a composable design that emphasizes ranking preservation over absolute metric distances.
Witness-Based Similarity Systems (REWA) formalize similarity by encoding concepts as structured sets of “witnesses” whose overlaps define semantic relationships. This framework precisely links similarity search to information-theoretic principles, notably mutual information and channel capacity, yielding provably optimal bit complexities for large-scale retrieval. The witness-centric perspective enables modular construction, rigorous ranking guarantees, and unification of classical and modern similarity methods under a single foundational theory (Phadke, 29 Nov 2025, Phadke, 25 Nov 2025).
1. Formal Framework and Core Definitions
A witness-based similarity system is defined as a tuple $(\mathcal{C}, \mathcal{W}, w)$, where $\mathcal{C}$ is a finite universe of concepts and $\mathcal{W}$ is the universe of witnesses. Each concept $c \in \mathcal{C}$ has a finite witness set $w(c) \subseteq \mathcal{W}$. Similarity between any pair $(a, b)$ is encoded by the raw overlap $\mathrm{ov}(a, b) = |w(a) \cap w(b)|$ (Phadke, 29 Nov 2025, Phadke, 25 Nov 2025).
Each concept induces an associated random variable over witnesses, whose probability mass function in the Boolean variant is determined by membership in $w(c)$. Binary encodings are constructed as maps $E : \mathcal{C} \to \{0,1\}^m$, sending each concept to an $m$-bit vector, typically through semi-random bit assignments per witness. The expected Hamming inner product between the codes $E(a)$ and $E(b)$ is affine in the overlap $\mathrm{ov}(a, b)$.
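The following sketch makes these definitions concrete under one plausible realization: witness sets as Python sets, overlap as set intersection, and the semi-random bit assignment implemented with a salted SHA-256 hash. The function names (`witness_overlap`, `encode`, `hamming_inner`), the number of bit positions per witness, and the code length are illustrative choices, not prescribed by the framework.

```python
import hashlib

def witness_overlap(wa: set, wb: set) -> int:
    """Raw overlap ov(a, b) = |w(a) ∩ w(b)|."""
    return len(wa & wb)

def encode(witnesses: set, m: int, seeds=(0, 1, 2)) -> list:
    """Map a witness set to an m-bit vector by setting, for each witness,
    a few pseudo-randomly chosen bit positions (one per seed)."""
    bits = [0] * m
    for w in witnesses:
        for s in seeds:
            h = hashlib.sha256(f"{s}:{w}".encode()).digest()
            bits[int.from_bytes(h[:8], "big") % m] = 1
    return bits

def hamming_inner(x: list, y: list) -> int:
    """Number of positions where both codes are 1."""
    return sum(a & b for a, b in zip(x, y))

# Toy usage: two concepts sharing 2 of 4 witnesses.
wa = {"paris", "france", "capital", "city"}
wb = {"lyon", "france", "city", "river"}
print(witness_overlap(wa, wb))                          # 2
print(hamming_inner(encode(wa, 256), encode(wb, 256)))  # grows with the overlap
```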
2. Information-Theoretic Equivalence
A central result is an explicit isomorphism between witness overlap and mutual information. For concepts with uniform witness distributions, the mutual information between the associated witness variables is a strictly increasing function of the normalized overlap; in the symmetric case of equal witness-set sizes $|w(a)| = |w(b)|$, it depends on the overlap only through $\mathrm{ov}(a, b)/|w(a)|$. Semantic similarity, as quantified by witness overlap, thereby acquires a physical unit: bits of mutual information. Observing the witness variable of one concept reveals exactly this many bits about that of the other (Phadke, 29 Nov 2025).
3. Encoding Complexity: Channel Capacity and Lower Bounds
Binary encodings of witnesses constitute repeated uses of a binary asymmetric channel, whose input indicates a true witness collision and whose output may additionally register accidental hash collisions. The capacity of this channel governs how reliably similarity can be encoded and recovered.
The REWA capacity bound lower-bounds the code length needed to distinguish neighbors (concepts whose overlap exceeds a threshold) from non-neighbors with failure probability $\delta$ in a database of size $n$: the required number of bits scales logarithmically in $n/\delta$ and inversely with the channel capacity. No encoding can achieve ranking preservation with fewer bits, as established by the matching optimality lower bound (Phadke, 29 Nov 2025).
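The sketch below computes the capacity of a binary asymmetric channel by grid search over the input distribution and converts it into an illustrative bit budget of the form $\log_2(n/\delta)/C$; the flip probabilities `p01`, `p10` and the exact form of the budget are assumptions chosen to demonstrate the stated scaling, not values from the source.

```python
from math import log2

def h2(p: float) -> float:
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bac_capacity(p01: float, p10: float, grid: int = 10_000) -> float:
    """Capacity of a binary asymmetric channel with P(Y=1|X=0)=p01 and
    P(Y=0|X=1)=p10, found by grid search over the input law P(X=1)=q."""
    best = 0.0
    for i in range(1, grid):
        q = i / grid
        py1 = q * (1 - p10) + (1 - q) * p01
        mi = h2(py1) - q * h2(p10) - (1 - q) * h2(p01)
        best = max(best, mi)
    return best

# Illustrative bit budget: roughly log2(n / delta) / C channel uses (bits)
# to separate neighbors from non-neighbors among n items with failure delta.
C = bac_capacity(p01=0.05, p10=0.20)
n, delta = 10**6, 1e-3
print(round(C, 4), round(log2(n / delta) / C))
```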
4. Ranking Preservation and Rate-Distortion
Beyond binary predicates, REWA encodings can be constructed to preserve top-$k$ neighbor rankings under a suitable ranking-distortion metric.
Encoding length must obey rate-distortion constraints expressed through the binary entropy function $H_2(\cdot)$, which satisfies $H_2(D) \approx D \log_2(1/D)$ for small distortion $D$. Distributing this rate budget across witness encodings recovers the overall bit-complexity scaling (Phadke, 29 Nov 2025). REWA preserves only ranking orderings, not metric distances.
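A short numerical check of the small-distortion behavior of the binary entropy term appearing in the rate-distortion constraint; the distortion values are arbitrary illustrations.

```python
from math import log2

def h2(d: float) -> float:
    """Binary entropy H2(d) in bits."""
    return 0.0 if d in (0.0, 1.0) else -d * log2(d) - (1 - d) * log2(1 - d)

# Leading-order behavior: H2(D) ~ D * log2(1/D) as the tolerated distortion D -> 0.
for D in (0.1, 0.01, 0.001):
    print(D, round(h2(D), 5), round(D * log2(1 / D), 5))
```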
5. Compositionality and Unified Reconstruction
REWA’s compositional design admits arbitrary data transformation pipelines terminating in finite witness sets. Any sequence of structural, causal, temporal, topological, or semantic transformations is admissible. The framework reconstructs several classical data structures as special cases:
| Classical System | REWA Witness Definition | Encoding Mechanism |
|---|---|---|
| Bloom Filter | Elements inserted into the set | Hash positions set to 1 |
| MinHash (Jaccard) | Feature set of the concept | Permutation × witness tokens |
| SimHash/LSH Bitmap | Hyperplane tokens | Side of hyperplane → bit position |
| Hierarchical Filter | Multi-level cluster representatives | Aggregate all levels |
Millions of composable similarity definitions inherit logarithmic encoding complexity via the witness-set abstraction (Phadke, 25 Nov 2025).
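The correspondence in the table can be seen directly in code: below, a textbook Bloom filter and a textbook MinHash signature are written as operations on witness sets, so that set bit positions (respectively minimum hash values) play the role of the encoded witnesses. The helper names and parameter values are illustrative, not taken from the source.

```python
import hashlib

def _hash(token: str, seed: int, mod: int) -> int:
    """Deterministic salted hash of a witness token into [0, mod)."""
    h = hashlib.sha256(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") % mod

def bloom_encode(witnesses: set, m: int, num_hashes: int = 3) -> list:
    """Bloom filter as a witness-set encoding: each witness sets a few bits."""
    bits = [0] * m
    for w in witnesses:
        for s in range(num_hashes):
            bits[_hash(w, s, m)] = 1
    return bits

def minhash_signature(witnesses: set, num_perms: int = 64) -> list:
    """MinHash as a witness-set encoding: one minimum hash per simulated
    permutation; matching coordinates estimate Jaccard similarity."""
    return [min(_hash(w, s, 2**32) for w in witnesses) for s in range(num_perms)]

a = {"france", "capital", "city", "seine"}
b = {"france", "city", "rhone", "lyon"}
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
est = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(est, len(a & b) / len(a | b))  # MinHash estimate vs. true Jaccard (1/3)
```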
6. Computational and Storage Complexity
Typical bit complexity grows with the witness-set size $k$ and the number of hash functions per witness $h$, and only logarithmically with the database size. Encoding time per element is proportional to the number of hash evaluations, $k \cdot h$. Naïve query time scales linearly with the number of stored concepts, but inverted-index methods often yield practical acceleration. Trade-offs include the selection of $h$ for variance concentration and the management of code density.
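A sketch of the inverted-index acceleration mentioned above, under the assumption that the index maps each witness to the identifiers of concepts containing it, so that only candidates sharing at least one witness with the query are scored; the function names and toy data are illustrative.

```python
from collections import defaultdict

def build_inverted_index(db: dict) -> dict:
    """Map each witness to the ids of the concepts that contain it."""
    index = defaultdict(set)
    for cid, witnesses in db.items():
        for w in witnesses:
            index[w].add(cid)
    return index

def query(index: dict, witnesses: set) -> list:
    """Rank stored concepts by raw witness overlap, touching only candidates
    that share at least one witness with the query."""
    scores = defaultdict(int)
    for w in witnesses:
        for cid in index.get(w, ()):
            scores[cid] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

db = {"a": {"x", "y", "z"}, "b": {"y", "z", "w"}, "c": {"p", "q"}}
print(query(build_inverted_index(db), {"y", "z"}))
# concepts 'a' and 'b' each share 2 witnesses with the query; 'c' is never touched
```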
7. Limitations and Open Directions
REWA requires an overlap-gap condition: the top-$k$ relevant concepts must be separated from non-neighbors by a gap in witness overlap, and guarantees degrade without such separation. Hash functions must supply approximately the assumed min-entropy; adversarial or correlated data can introduce errors. The theory is ranking-centric and assumes static, offline witness sets. Engineering aspects such as cache locality and distributed computation are orthogonal to REWA's axioms.
Active research directions include multi-bit/weighted witnesses, learned witness selection, streaming/online REWA, differentiable encoding via Gumbel-Softmax relaxations, entropy-aware allocation, and extensions to non-set witness representations (multisets, structures, soft overlaps) (Phadke, 25 Nov 2025).
REWA systems embody a mathematical equivalence between witness-based similarity and Shannon information theory. Witness overlap realizes mutual information; encoding and retrieval correspond to communication over noisy channels; and top-$k$ ranking preservation is governed by rate-distortion theory. The REWA formalism generalizes and subsumes major similarity search paradigms, yields provably optimal encoding bounds, and suggests principled modularity for composable relational retrieval.