
Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss (2402.05928v3)

Published 8 Feb 2024 in cs.LG and stat.ML

Abstract: In this work, we study statistical learning with dependent ($\beta$-mixing) data and square loss in a hypothesis class $\mathscr{F}\subset L_{\Psi_p}$ where $\Psi_p$ is the norm $\|f\|_{\Psi_p} \triangleq \sup_{m\geq 1} m^{-1/p} \|f\|_{L^m}$ for some $p\in [2,\infty]$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent data. Absent any realizability assumption, typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively by the mixing time of the underlying covariates process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$ -- that is, $\mathscr{F}$ is a weakly sub-Gaussian class: $\|f\|_{\Psi_p} \lesssim \|f\|_{L^2}^{\eta}$ for some $\eta\in (0,1]$ -- the empirical risk minimizer achieves a rate that only depends on the complexity of the class and second order statistics in its leading term. Our result holds whether the problem is realizable or not and we refer to this as a \emph{near mixing-free rate}, since direct dependence on mixing is relegated to an additive higher order term. We arrive at our result by combining the above notion of a weakly sub-Gaussian class with mixed tail generic chaining. This combination allows us to compute sharp, instance-optimal rates for a wide range of problems. Examples that satisfy our framework include sub-Gaussian linear regression, more general smoothly parameterized function classes, finite hypothesis classes, and bounded smoothness classes.
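For orientation, the two endpoints of the range $p \in [2,\infty]$ recover familiar regimes; the display below records standard facts about moment-growth norms on a probability space rather than statements from the paper.

```latex
% p = 2: up to absolute constants, the \Psi_2 norm is the usual
% sub-Gaussian (Orlicz \psi_2) norm.
\[
  \|f\|_{\Psi_2} = \sup_{m \ge 1} m^{-1/2}\,\|f\|_{L^m} \;\asymp\; \|f\|_{\psi_2},
\]
% p = \infty: the prefactor m^{-1/\infty} equals 1 and L^m norms increase to
% the essential supremum, so \Psi_\infty recovers the uniformly bounded case.
\[
  \|f\|_{\Psi_\infty} = \sup_{m \ge 1} \|f\|_{L^m} = \|f\|_{L^\infty}.
\]
```

In this sense, $p$ interpolates between sub-Gaussian ($p = 2$) and uniformly bounded ($p = \infty$) hypothesis classes.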

Citations (5)

Summary

  • The paper demonstrates that the empirical risk minimizer converges at near mixing-free rates despite dependent data, avoiding multiplicative deflation of effective sample sizes.
  • It combines the notion of a weakly sub-Gaussian class with a refined Bernstein inequality and mixed-tail generic chaining, so that the mixing time enters only through an additive higher-order term.
  • The results cover sub-Gaussian linear regression, smoothly parameterized function classes, finite hypothesis classes, and bounded smoothness classes, with practical relevance for forecasting and adaptive control.

Addressing Dependent Data in Statistical Learning: Achieving Near Mixing-Free Rates

Introduction

One of the persistent challenges in statistical learning theory involves the analysis and processing of dependent data. This challenge is particularly prevalent in scenarios where data exhibits temporal dependencies, common in forecasting and control systems. Traditionally, learning algorithms have been optimized for handling independent and identically distributed (i.i.d.) samples, an assumption that does not hold in many practical applications. The shift from i.i.d. to dependent data necessitates a reevaluation of the theoretical underpinnings of learning algorithms to ensure their efficacy in broader contexts.

Challenges with Dependent Data

A significant hurdle in extending i.i.d. learning theory to dependent settings is sample size deflation caused by the temporal dependence of the data. This deflation is typically a consequence of the blocking technique, which partitions the sequence into blocks that are treated as approximately independent, at the cost of reducing the effective sample size by roughly a factor of the mixing time. For the square loss, overcoming this deflation without imposing strong realizability assumptions has been notably challenging. The sketch below illustrates the blocking device.
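As a concrete illustration of the blocking device, the sketch below splits a dependent sequence into contiguous blocks whose length is on the order of a heuristic mixing time and keeps every other block, so that retained blocks are well separated and can be treated as approximately independent. The toy AR(1) process and the choice of block length are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def alternating_blocks(x, block_len):
    """Split `x` into contiguous blocks of length `block_len` and keep every
    other block, so retained blocks are separated by at least `block_len`
    steps (the classical blocking device for mixing sequences)."""
    n_blocks = len(x) // block_len
    blocks = [x[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]
    return blocks[::2]

# Toy dependent data: a scalar AR(1) process x_t = a * x_{t-1} + w_t (assumption).
rng = np.random.default_rng(0)
a, n = 0.9, 10_000
x = np.zeros(n)
for t in range(1, n):
    x[t] = a * x[t - 1] + rng.standard_normal()

# Heuristic block length on the order of the mixing time of the AR(1) process.
block_len = int(np.ceil(1.0 / (1.0 - a)))
kept_blocks = alternating_blocks(x, block_len)

print(f"raw samples: {n}; block length: {block_len}; "
      f"approximately independent blocks available to a blocking argument: {len(kept_blocks)}")
```

Classical analyses then apply i.i.d. tools at the level of these blocks, so the variance term in the resulting bound scales with the number of blocks rather than the number of raw samples. The paper's contribution is to show that, for the square loss on weakly sub-Gaussian classes, this multiplicative deflation can be confined to an additive higher-order term.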

Our Approach

In this work, we develop an analysis that mitigates sample size deflation for a broad array of hypothesis classes under the square loss. By combining the notion of a weakly sub-Gaussian class with a refined Bernstein inequality and mixed-tail generic chaining, we show that near mixing-free rates of convergence are achievable: the leading term of the rate depends only on the complexity of the class and its second-order statistics, while direct dependence on the mixing time is relegated to an additive higher-order term. The schematic below summarizes these two ingredients.
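The first display below restates the weakly sub-Gaussian condition from the abstract; the second is only a schematic of the shape of the resulting bound. The symbols $\mathrm{comp}(\mathscr{F})$, $\sigma^2$, $\tau_{\mathrm{mix}}$, and $\mathrm{HigherOrder}$ are placeholders introduced here for exposition; the precise complexity functional, variance proxy, burn-in, and higher-order term are given in the paper and are not reproduced.

```latex
% Weakly sub-Gaussian class: the \Psi_p and L^2 topologies are comparable on F.
\[
  \|f\|_{\Psi_p} \;\lesssim\; \|f\|_{L^2}^{\eta}
  \qquad \text{for all } f \in \mathscr{F}, \text{ for some } \eta \in (0,1].
\]
% Schematic (shape only, not the theorem statement): the leading term of the
% ERM's excess risk involves the complexity of F and second-order statistics,
% but no mixing time; mixing enters only through an additive higher-order term.
\[
  \text{excess risk}(\widehat{f}\,)
  \;\lesssim\;
  \underbrace{\frac{\mathrm{comp}(\mathscr{F})\,\sigma^{2}}{n}}_{\text{mixing-free leading term}}
  \;+\;
  \underbrace{\mathrm{HigherOrder}(\tau_{\mathrm{mix}}, n)}_{\text{additive, lower order in } n}
\]
```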

Results

Our main contribution is to show that the empirical risk minimizer (ERM) converges at a rate whose leading term is essentially independent of the data's mixing time once the sample size exceeds a burn-in period. This is a substantial departure from prior work, in which convergence rates degrade with the mixing time, amounting to a multiplicative deflation of the effective sample size. Our findings apply to several examples, including sub-Gaussian linear regression, smoothly parameterized function classes, finite hypothesis classes, and bounded smoothness classes; a toy instance of the linear-regression setting is sketched below.
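The following toy setup (not taken from the paper; the vector AR(1) covariate process, dimension, noise level, and horizon are illustrative assumptions) shows the kind of problem the sub-Gaussian linear regression example covers: dependent covariates, a linear hypothesis class, and the ERM for the square loss computed by ordinary least squares. It is meant only to exhibit the setting, not to verify the rates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumptions: dimension, AR(1) coefficient, noise level, horizon.
d, a, noise_std, n = 5, 0.9, 0.5, 5_000

# Dependent covariates: a vector AR(1) process x_t = a * x_{t-1} + w_t.
x = np.zeros((n, d))
for t in range(1, n):
    x[t] = a * x[t - 1] + rng.standard_normal(d)

# Responses from a fixed linear target plus independent noise.
theta_star = rng.standard_normal(d)
y = x @ theta_star + noise_std * rng.standard_normal(n)

# The ERM for the square loss over the linear class {x -> <theta, x>} is OLS.
theta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)

# Parameter error and in-sample excess square loss of the ERM.
param_err = np.linalg.norm(theta_hat - theta_star)
excess_risk = np.mean((x @ (theta_hat - theta_star)) ** 2)
print(f"||theta_hat - theta_star|| = {param_err:.4f}, "
      f"in-sample excess risk = {excess_risk:.5f}")
```

In a setting like this, the paper's framework suggests that the leading term of the ERM's excess risk is governed by the complexity of the class and second-order statistics (roughly, the dimension and the noise level here), with the mixing induced by the AR dynamics surfacing only through the burn-in and the additive higher-order term.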

Implications and Future Directions

The theoretical advancements presented in this paper have both practical and theoretical ramifications. Practically, near mixing-free rates mean that guarantees for learning from temporally dependent data need not pay a multiplicative price in the mixing time in their leading term, which matters for applications such as predictive modeling and adaptive control, where temporal dependencies are pervasive.

Theoretically, our work contributes to the ongoing effort to understand and mitigate the challenges posed by dependent data in statistical learning. By expanding the class of problems for which near mixing-free rates can be achieved, we provide a foundation for further exploration of learning algorithms that are robust to data dependencies.

Conclusion

This paper marks a significant step towards overcoming the limitations imposed by dependent data in statistical learning. By achieving near mixing-free rates, we pave the way for the development of learning algorithms that are both theoretically sound and practically applicable in settings where data does not adhere to the traditional i.i.d. assumption. Future work will likely explore extensions of these results to other loss functions and learning models, further broadening the scope and applicability of learning from dependent data.
