Papers
Topics
Authors
Recent
Search
2000 character limit reached

Extreme-K categorical samples problem

Published 29 Jul 2020 in stat.AP, stat.ME, and stat.ML | (2007.15039v1)

Abstract: With histograms as its foundation, we develop Categorical Exploratory Data Analysis (CEDA) under the extreme-$K$ sample problem, and illustrate its universal applicability through four 1D categorical datasets. Given a sizable $K$, CEDA's ultimate goal amounts to discover by data's information content via carrying out two data-driven computational tasks: 1) establish a tree geometry upon $K$ populations as a platform for discovering a wide spectrum of patterns among populations; 2) evaluate each geometric pattern's reliability. In CEDA developments, each population gives rise to a row vector of categories proportions. Upon the data matrix's row-axis, we discuss the pros and cons of Euclidean distance against its weighted version for building a binary clustering tree geometry. The criterion of choice rests on degrees of uniformness in column-blocks framed by this binary clustering tree. Each tree-leaf (population) is then encoded with a binary code sequence, so is tree-based pattern. For evaluating reliability, we adopt row-wise multinomial randomness to generate an ensemble of matrix mimicries, so an ensemble of mimicked binary trees. Reliability of any observed pattern is its recurrence rate within the tree ensemble. A high reliability value means a deterministic pattern. Our four applications of CEDA illuminate four significant aspects of extreme-$K$ sample problems.

Citations (2)

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.