
Balanced Subsampling for Big Data with Categorical Covariates (2212.12595v2)

Published 23 Dec 2022 in stat.ME

Abstract: Supervised learning under measurement constraints is a common challenge in statistical and machine learning. In many applications, despite extensive design points, acquiring responses for all points is often impractical due to resource limitations. Subsampling algorithms offer a solution by selecting a subset from the design points for observing the response. Existing subsampling methods primarily assume numerical predictors, neglecting the prevalent occurrence of big data with categorical predictors across various disciplines. This paper proposes a novel balanced subsampling approach tailored for data with categorical predictors. A balanced subsample significantly reduces the cost of observing the response and possesses three desired merits. First, it is nonsingular and, therefore, allows linear regression with all dummy variables encoded from categorical predictors. Second, it offers optimal parameter estimation by minimizing the generalized variance of the estimated parameters. Third, it allows robust prediction in the sense of minimizing the worst-case prediction error. We demonstrate the superiority of balanced subsampling over existing methods through extensive simulation studies and a real-world application.
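The first two merits can be illustrated with a toy calculation. The sketch below is not the paper's algorithm. It assumes a main-effects linear model with an intercept and treatment-coded dummy variables, and it reads "balanced" as "every combination of levels appears equally often", which is an assumption made here for illustration. All function and variable names are hypothetical. It compares the determinant of the information matrix X^T X across three subsamples of the same size. A larger determinant corresponds to a smaller generalized variance of the least-squares estimates, and a zero determinant means the dummy-variable regression cannot be fitted.

```python
from itertools import product

def dummy_row(levels, n_levels):
    """Intercept plus treatment-coded dummies (level 0 of each factor is the baseline)."""
    row = [1.0]
    for lev, q in zip(levels, n_levels):
        row.extend(1.0 if lev == k else 0.0 for k in range(1, q))
    return row

def information_matrix(points, n_levels):
    """Return X^T X for the dummy-coded model matrix X of the given points."""
    rows = [dummy_row(p, n_levels) for p in points]
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular: some parameter is not estimable
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

# Two categorical covariates with 2 and 3 levels; every subsample has 12 points.
n_levels = (2, 3)
balanced = list(product(range(2), range(3))) * 2       # each level combination twice
skewed = [(0, 0)] * 9 + [(1, 1), (1, 2), (0, 1)]       # dominated by one combination
degenerate = [(0, 0)] * 6 + [(0, 1)] * 6               # some levels never observed

det_balanced = determinant(information_matrix(balanced, n_levels))
det_skewed = determinant(information_matrix(skewed, n_levels))
det_degenerate = determinant(information_matrix(degenerate, n_levels))

print("balanced:  ", det_balanced)
print("skewed:    ", det_skewed)
print("degenerate:", det_degenerate)
```

In this toy setting the balanced subsample gives the largest determinant, the skewed subsample gives a much smaller but still positive one, and the subsample that misses some levels is singular. The paper's actual selection procedure and optimality results are in the full text, not in this sketch.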

Citations (2)


Authors (1)