Estimating Subgraph Frequencies with or without Attributes from Egocentrically Sampled Data (1510.08119v1)
Abstract: In this paper we show how to efficiently produce unbiased estimates of subgraph frequencies from a probability sample of egocentric networks (i.e., focal nodes, their neighbors, and the induced subgraphs of ties among their neighbors). A key feature of our proposed method that differentiates it from prior methods is the use of egocentric data. Because of this, our method is suitable for estimation in large unknown graphs, is easily parallelizable, handles privacy sensitive network data (e.g. egonets with no neighbor labels), and supports counting of large subgraphs (e.g. maximal clique of size 205 in Section 6) by building on top of existing exact subgraph counting algorithms that may not support sampling. It gracefully handles a variety of sampling designs such as uniform or weighted independence or random walk sampling. Our method can be used for subgraphs that are: (i) undirected or directed; (ii) induced or non-induced; (iii) maximal or non-maximal; and (iv) potentially annotated with attributes. We compare our estimators on a variety of real-world graphs and sampling methods and provide suggestions for their use. Simulation shows that our method outperforms the state-of-the-art approach for relative subgraph frequencies by up to an order of magnitude for the same sample size. Finally, we apply our methodology to a rare sample of Facebook users across the social graph to estimate and interpret the clique size distribution and gender composition of cliques.