Prediction from compression for models with infinite memory, with applications to hidden Markov and renewal processes (2404.15454v1)
Abstract: Consider the problem of predicting the next symbol given a sample path of length n whose joint distribution belongs to a distribution class that may have long-term memory. The goal is to compete with the conditional predictor that knows the true model. For both hidden Markov models (HMMs) and renewal processes, we determine the optimal prediction risk in Kullback–Leibler divergence up to universal constant factors. Extending existing results for finite-order Markov models [HJW23] and drawing on ideas from universal compression, we propose an estimator whose prediction risk is bounded by the redundancy of the distribution class plus a memory term that accounts for the long-range dependence of the model. Notably, for HMMs with bounded state and observation spaces, a polynomial-time estimator based on dynamic programming is shown to achieve the optimal prediction risk Θ(log n / n); prior to this work, the only known result of this type was O(1/log n), obtained via Markov approximation [Sha+18]. Matching minimax lower bounds are obtained by connecting prediction risk to redundancy and mutual information via a reduction argument.
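To make the objective precise, the following is a standard minimax formulation of the prediction risk, consistent with the finite-order Markov setting of [HJW23]; the notation below is a reconstruction, not a quotation from the paper:

```latex
% Minimax next-symbol prediction risk over a class \mathcal{P} (reconstruction):
% the estimator \hat{Q} competes with the oracle conditional law P(\cdot \mid X^n).
\mathrm{Risk}_n(\mathcal{P})
  \;=\; \inf_{\hat{Q}} \; \sup_{P \in \mathcal{P}} \;
  \mathbb{E}_P\!\left[ D_{\mathrm{KL}}\!\left(
      P\big(X_{n+1} \in \cdot \,\big|\, X^n\big)
      \,\Big\|\,
      \hat{Q}\big(X_{n+1} \in \cdot \,\big|\, X^n\big)
  \right) \right]
```

The dynamic programming underlying the polynomial-time HMM estimator is, at its core, the forward recursion over hidden states. The sketch below (in Python, with illustrative names `T`, `B`, `pi`, and `next_symbol_predictor` that do not come from the paper) computes the oracle predictor P(X_{n+1} | X^n) for a known HMM; the paper's estimator must additionally handle unknown parameters, which this sketch does not attempt:

```python
import numpy as np

def next_symbol_predictor(T, B, pi, obs):
    """Oracle next-symbol distribution for a known HMM via the forward recursion.

    T  : (k, k) transition matrix, T[i, j] = P(next state j | state i)
    B  : (k, m) emission matrix,   B[i, x] = P(symbol x | state i)
    pi : (k,)   initial state distribution
    obs: observed symbols x_1, ..., x_n (integers in range(m)), n >= 1
    """
    alpha = pi * B[:, obs[0]]           # filter after seeing x_1 (unnormalized)
    alpha /= alpha.sum()                # normalize for numerical stability
    for x in obs[1:]:
        alpha = (alpha @ T) * B[:, x]   # propagate one step, condition on x
        alpha /= alpha.sum()
    return (alpha @ T) @ B              # law of X_{n+1} given x_1, ..., x_n
```

Each update costs O(k^2), so the full pass runs in O(n k^2) time, polynomial in n as the abstract requires of the final estimator.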
- Kweku Abraham, Elisabeth Gassiat and Zacharie Naulet “Fundamental limits for learning hidden Markov model parameters” In IEEE Transactions on Information Theory 69.3 IEEE, 2023, pp. 1777–1794
- Grigory Alexandrovich, Hajo Holzmann and Anna Leister “Nonparametric identification and maximum likelihood estimation for hidden Markov models” In Biometrika 103.2 Oxford University Press, 2016, pp. 423–434
- Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade and Matus Telgarsky “Tensor decompositions for learning latent variable models”, 2014 arXiv:1210.7559 [cs.LG]
- Aditya Bhaskara, Moses Charikar, Ankur Moitra and Aravindan Vijayaraghavan “Smoothed Analysis of Tensor Decompositions” In CoRR abs/1311.3651, 2013 arXiv: http://arxiv.org/abs/1311.3651
- John J. Birch “Approximations for the Entropy for Functions of Markov Chains” In The Annals of Mathematical Statistics 33.3 Institute of Mathematical Statistics, 1962, pp. 930–938
- Avrim Blum, Adam Kalai and Hal Wasserman “Noise-tolerant learning, the parity problem, and the statistical query model” In Journal of the ACM (JACM) 50.4 ACM New York, NY, USA, 2003, pp. 506–519
- Imre Csiszár and Paul C. Shields “Redundancy rates for renewal and other processes” In IEEE Transactions on Information Theory 42.6, 1996, pp. 2065–2072 DOI: 10.1109/18.556596
- Lee D. Davisson, Robert J. McEliece, Michael B. Pursley and Mark S. Wallace “Efficient universal noiseless source codes” In IEEE Transactions on Information Theory 27.3, 1981, pp. 269–279 DOI: 10.1109/TIT.1981.1056355
- L. Davisson “Universal noiseless coding” In IEEE Transactions on Information Theory 19.6, 1973, pp. 783–795 DOI: 10.1109/TIT.1973.1055092
- Yohann De Castro, Elisabeth Gassiat and Claire Lacour “Minimax Adaptive Estimation of Nonparametric Hidden Markov Models” In Journal of Machine Learning Research 17.111, 2016, pp. 1–43 URL: http://jmlr.org/papers/v17/15-381.html
- Yohann De Castro, Elisabeth Gassiat and Sylvain Le Corff “Consistent estimation of the filtering and marginal smoothing distributions in nonparametric hidden Markov models” In IEEE Transactions on Information Theory 63.8 IEEE, 2017, pp. 4758–4777
- “Learning Markov distributions: Does estimation trump compression?” In 2016 IEEE International Symposium on Information Theory (ISIT), 2016, pp. 2689–2693 DOI: 10.1109/ISIT.2016.7541787
- Meir Feder, Neri Merhav and Michael Gutman “Universal prediction of individual sequences” In IEEE Transactions on Information Theory 38.4 IEEE, 1992, pp. 1258–1270
- Vitaly Feldman, Will Perkins and Santosh Vempala “On the complexity of random satisfiability problems with planted solutions” In Proceedings of the 47th Annual ACM SIGACT Symposium on Theory of Computing, 2015, pp. 77–86
- Philippe Flajolet and Wojciech Szpankowski “Analytic variations on redundancy rates of renewal processes” In IEEE Transactions on Information Theory 48.11, 2002, pp. 2911–2921 DOI: 10.1109/TIT.2002.804115
- Élisabeth Gassiat “Universal Coding and Order Identification by Model Selection Methods” Springer, 2018
- Yanjun Han, Soham Jana and Yihong Wu “Optimal prediction of Markov chains with and without spectral gap” In Advances in Neural Information Processing Systems 34, 2021, pp. 11233–11246
- Yanjun Han, Soham Jana and Yihong Wu “Optimal prediction of Markov chains with and without spectral gap” In IEEE Transactions on Information Theory 69.6, 2023, pp. 3920–3959
- David Haussler, Jyrki Kivinen and Manfred K Warmuth “Sequential prediction of individual sequences under general loss functions” In IEEE Transactions on Information Theory 44.5 IEEE, 1998, pp. 1906–1925
- Yi Hao, Alon Orlitsky and Venkatadheeraj Pichapati “On learning Markov chains” In Advances in Neural Information Processing Systems, 2018, pp. 648–657
- Godfrey H. Hardy and Srinivasa Ramanujan “Asymptotic formulæ in combinatory analysis” In Proceedings of the London Mathematical Society s2-17.1 Wiley Online Library, 1918, pp. 75–115
- Qingqing Huang, Rong Ge, Sham Kakade and Munther Dahleh “Minimal realization problems for hidden Markov models” In IEEE Transactions on Signal Processing 64.7 IEEE, 2015, pp. 1896–1904
- Pravesh K. Kothari, Ryuhei Mori, Ryan O’Donnell and David Witmer “Sum of squares lower bounds for refuting any CSP” In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 2017, pp. 132–145
- Raphail E. Krichevsky and Victor K. Trofimov “The performance of universal encoding” In IEEE Transactions on Information Theory 27.2, 1981, pp. 199–207 DOI: 10.1109/TIT.1981.1056331
- Luc Lehéricy “Nonasymptotic control of the MLE for misspecified nonparametric hidden Markov models” In Electronic Journal of Statistics 15.2 The Institute of Mathematical Statistics and the Bernoulli Society, 2021, pp. 4916–4965
- David A. Levin and Yuval Peres “Markov chains and mixing times” American Mathematical Society, 2017
- L. Mirsky “Symmetric gauge functions and unitarily invariant norms” In Quarterly Journal of Mathematics 11, 1960, pp. 50–59 URL: https://api.semanticscholar.org/CorpusID:120585992
- Elchanan Mossel and Sébastien Roch “Learning nonsingular phylogenies and hidden Markov models” In Proceedings of the 37th Annual ACM SIGACT Symposium on Theory of Computing, 2005, pp. 366–375
- Yury Polyanskiy and Yihong Wu “Information Theory: From Coding to Learning” http://www.stat.yale.edu/~yw562/teaching/itbook-export.pdf Cambridge University Press, 2024
- J. Rissanen “Universal coding, information, prediction, and estimation” In IEEE Transactions on Information Theory 30.4, 1984, pp. 629–636 DOI: 10.1109/TIT.1984.1056936
- Vatsal Sharan, Sham Kakade, Percy Liang and Gregory Valiant “Learning Overcomplete HMMs” In Advances in Neural Information Processing Systems (NeurIPS), 2017
- Vatsal Sharan, Sham Kakade, Percy Liang and Gregory Valiant “Prediction with a short memory” In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018, pp. 1074–1087
- G.W. Stewart “On the Continuity of the Generalized Inverse” In SIAM Journal on Applied Mathematics 17.1, 1969, pp. 33–45 DOI: 10.1137/0117004
- Thom Wiggers and Simona Samardjiska “Practically Solving LPN” In 2021 IEEE International Symposium on Information Theory (ISIT), 2021, pp. 2399–2404 DOI: 10.1109/ISIT45174.2021.9518109
- Qun Xie and Andrew R Barron “Asymptotic minimax regret for data compression, gambling, and prediction” In IEEE Transactions on Information Theory 46.2 IEEE, 2000, pp. 431–445
- Yuhong Yang and Andrew Barron “Information-theoretic determination of minimax rates of convergence” In Annals of Statistics 27.5 JSTOR, 1999, pp. 1564–1599