Hardness of Learning Boolean Functions from Label Proportions (2403.19401v1)

Published 28 Mar 2024 in cs.CC, cs.DS, and cs.LG

Abstract: In recent years, the framework of learning from label proportions (LLP) has been gaining importance in machine learning. In this setting, the training examples are aggregated into subsets or bags and only the average label per bag is available for learning an example-level predictor. This generalizes traditional PAC learning which is the special case of unit-sized bags. The computational learning aspects of LLP were studied in recent works (Saket, NeurIPS'21; Saket, NeurIPS'22) which showed algorithms and hardness for learning halfspaces in the LLP setting. In this work we focus on the intractability of LLP learning Boolean functions. Our first result shows that given a collection of bags of size at most $2$ which are consistent with an OR function, it is NP-hard to find a CNF of constantly many clauses which satisfies any constant fraction of the bags. This is in contrast with the work of (Saket, NeurIPS'21) which gave a $(2/5)$-approximation for learning ORs using a halfspace. Thus, our result provides a separation between constant clause CNFs and halfspaces as hypotheses for LLP learning ORs. Next, we prove the hardness of satisfying more than $1/2 + o(1)$ fraction of such bags using a $t$-DNF (i.e. DNF where each term has $\leq t$ literals) for any constant $t$. In usual PAC learning such a hardness was known (Khot-Saket, FOCS'08) only for learning noisy ORs. We also study the learnability of parities and show that it is NP-hard to satisfy more than $(q/2^{q-1} + o(1))$-fraction of $q$-sized bags which are consistent with a parity using a parity, while a random parity based algorithm achieves a $(1/2^{q-2})$-approximation.
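The LLP setting described in the abstract can be made concrete with a small sketch. The code below is purely illustrative and not from the paper: the helper names (`or_label`, `bag_average`, `satisfies`) are hypothetical. It shows bags of size at most 2 whose average labels are consistent with an OR function, and what it means for a hypothesis to "satisfy" a bag.

```python
# Illustrative sketch of the LLP setting on Boolean examples (not from the paper).
# A "bag" is a collection of examples; the learner only observes the average
# label per bag, not individual labels. Unit-sized bags recover ordinary PAC
# learning. All helper names here are hypothetical.

def or_label(x, relevant):
    """OR of the coordinates of x indexed by `relevant`."""
    return int(any(x[i] for i in relevant))

def bag_average(bag, f):
    """Average label that the function f assigns to a bag."""
    return sum(f(x) for x in bag) / len(bag)

def satisfies(bag, observed_avg, h):
    """A hypothesis h satisfies a bag if its average label matches the observed one."""
    return abs(bag_average(bag, h) - observed_avg) < 1e-9

# Bags of size at most 2, labeled by the target OR over coordinates {0, 1}.
target = lambda x: or_label(x, [0, 1])
bags = [
    [(0, 0, 1), (1, 0, 0)],   # labels 0 and 1 -> average 1/2
    [(1, 1, 0)],              # unit-sized bag -> average 1 (a PAC example)
    [(0, 0, 0), (0, 1, 1)],   # labels 0 and 1 -> average 1/2
]
observed = [bag_average(b, target) for b in bags]

# The target OR satisfies every bag by construction. The paper's first result
# says that, given such OR-consistent bags, it is NP-hard to find a CNF with
# constantly many clauses satisfying even a constant fraction of them.
print(all(satisfies(b, a, target) for b, a in zip(bags, observed)))  # True
print(observed)  # [0.5, 1.0, 0.5]
```

Note how the bag of size 1 coincides with a standard labeled example, matching the abstract's remark that PAC learning is the special case of unit-sized bags.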

References (25)
  1. The hardness of approximate optima in lattices, codes, and systems of linear equations. J. Comput. Syst. Sci., 54(2):317–331, 1997.
  2. Proof verification and the hardness of approximation problems. J. ACM, 45(3):501–555, 1998.
  3. S. Arora and S. Safra. Probabilistic checking of proofs: A new characterization of NP. J. ACM, 45(1):70–122, 1998.
  4. D. Barucic and J. Kybic. Fast learning from label proportions with small bags. CoRR, abs/2110.03426, 2021.
  5. Deep learning from label proportions for emphysema quantification. In MICCAI, volume 11071 of Lecture Notes in Computer Science, pages 768–776. Springer, 2018.
  6. Easy learning from label proportions. arXiv, 2023.
  7. Learning from aggregated data: Curated bags versus random bags. arXiv, 2023.
  8. Cost-based labeling of groups of mass spectra. In Proc. ACM SIGMOD International Conference on Management of Data, pages 167–178, 2004.
  9. Weakly supervised classification in high energy physics. Journal of High Energy Physics, 2017(5):1–11, 2017.
  10. Agnostic learning of monomials by halfspaces is hard. SIAM J. Comput., 41(6):1558–1590, 2012.
  11. S. Ghoshal and R. Saket. Hardness of learning DNFs using halfspaces. In Proc. STOC, pages 467–480, 2021.
  12. Bypassing UGC from some optimal geometric inapproximability results. ACM Trans. Algorithms, 12(1):6:1–6:25, 2016.
  13. J. Håstad. Some optimal inapproximability results. J. ACM, 48(4):798–859, 2001.
  14. Fitting the data from embryo implantation prediction: Learning from label proportions. Statistical methods in medical research, 27(4):1056–1066, 2018.
  15. S. Khot and R. Saket. Hardness of minimizing and learning DNF expressions. In Proc. FOCS, pages 231–240, 2008.
  16. Challenges and approaches to privacy preserving post-click conversion prediction. CoRR, abs/2201.12666, 2022.
  17. Quantifying emphysema extent from weakly labeled CT scans of the lungs using label proportions learning. In The Sixth International Workshop on Pulmonary Image Analysis, pages 31–42, 2016.
  18. R. O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.
  19. R. Raz. A parallel repetition theorem. SIAM J. Comput., 27(3):763–803, 1998.
  20. S. Rueping. SVM classifier estimation from group probabilities. In Proc. ICML, pages 911–918, 2010.
  21. R. Saket. Learnability of linear thresholds from label proportions. In Proc. NeurIPS, 2021.
  22. R. Saket. Algorithms and hardness for learning linear thresholds from label proportions. In Proc. NeurIPS, 2022.
  23. L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, 1984.
  24. Using published medical results and non-homogenous data in rule learning. In Proc. International Conference on Machine Learning and Applications and Workshops, volume 2, pages 84–89. IEEE, 2011.
  25. On learning from label proportions. CoRR, abs/1402.5902, 2014.
Authors (2)
  1. Venkatesan Guruswami (128 papers)
  2. Rishi Saket (20 papers)
