FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables (2403.06367v1)

Published 11 Mar 2024 in cs.LG and cs.DB

Abstract: Feature augmentation from one-to-many relationship tables is a critical but challenging problem in ML model development. To augment effective features, data scientists need to write SQL queries manually, which is time-consuming. Featuretools [1] is a tool widely used by the data science community to automatically augment training data by extracting new features from relevant tables. It represents each feature as a group-by aggregation SQL query on relevant tables and can automatically generate these SQL queries. However, it does not include predicates in these queries, which significantly limits its application in many real-world scenarios. To overcome this limitation, we propose FeatAug, a new feature augmentation framework that automatically extracts predicate-aware SQL queries from one-to-many relationship tables. This extension is not trivial because considering predicates exponentially increases the number of candidate queries. As a result, the original Featuretools framework, which materializes all candidate queries, does not work and needs to be redesigned. We formally define the problem and model it as a hyperparameter optimization problem. We discuss how Bayesian Optimization can be applied here and propose a novel warm-up strategy to optimize it. To make our algorithm more practical, we also study how to identify promising attribute combinations for predicates. We show how the beam search idea can partially solve the problem and propose several techniques to further optimize it. Our experiments on four real-world datasets demonstrate that FeatAug extracts more effective features than Featuretools and other baselines. The code is open-sourced at https://github.com/sfu-db/FeatAug
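
To make the predicate-aware idea concrete, here is a minimal sketch in Python/pandas. It is an illustration under assumed data, not FeatAug's actual implementation: the "orders" table and its "user_id", "price", and "status" columns are hypothetical.

    # A minimal sketch (hypothetical data, not FeatAug's implementation).
    import pandas as pd

    # One-to-many relationship table: each user (one) has many orders (many).
    orders = pd.DataFrame({
        "user_id": [1, 1, 1, 2, 2],
        "price":   [10.0, 25.0, 5.0, 40.0, 8.0],
        "status":  ["paid", "paid", "refunded", "paid", "refunded"],
    })

    # Featuretools-style feature: a predicate-free group-by aggregation, i.e.,
    #   SELECT user_id, AVG(price) FROM orders GROUP BY user_id
    plain = orders.groupby("user_id")["price"].mean().rename("avg_price")

    # Predicate-aware feature of the kind FeatAug searches for, i.e.,
    #   SELECT user_id, AVG(price) FROM orders WHERE status = 'paid' GROUP BY user_id
    predicated = (
        orders[orders["status"] == "paid"]
        .groupby("user_id")["price"].mean().rename("avg_paid_price")
    )

    # Either feature can be joined back to the training table on user_id.
    print(pd.concat([plain, predicated], axis=1))

Each predicate-aware candidate is fixed by a handful of discrete choices (the predicate attribute, the comparison value, and the aggregation function), which is what lets the paper treat query generation as hyperparameter optimization and explore the space with Bayesian Optimization instead of materializing every candidate query.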

References (55)
  1. J. M. Kanter and K. Veeramachaneni, “Deep feature synthesis: Towards automating data science endeavors,” in 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015, Campus des Cordeliers, Paris, France, October 19-21, 2015.   IEEE, 2015, pp. 1–10. [Online]. Available: https://doi.org/10.1109/DSAA.2015.7344858
  2. T. Vafeiadis, K. I. Diamantaras, G. Sarigiannidis, and K. C. Chatzisavvas, “A comparison of machine learning techniques for customer churn prediction,” Simulation Modelling Practice and Theory, vol. 55, pp. 1–9, 2015.
  3. G. Liu, T. T. Nguyen, G. Zhao, W. Zha, J. Yang, J. Cao, M. Wu, P. Zhao, and W. Chen, “Repeat buyer prediction for e-commerce,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 155–164.
  4. R. Malhotra and D. K. Malhotra, “Evaluating consumer loans using neural networks,” Omega, vol. 31, no. 2, pp. 83–96, 2003.
  5. P. Domingos, “A few useful things to know about machine learning,” Communications of the ACM, vol. 55, no. 10, pp. 78–87, 2012.
  6. G. Katz, E. C. R. Shin, and D. Song, “Explorekit: Automatic feature generation and selection,” in IEEE 16th International Conference on Data Mining, ICDM 2016, December 12-15, 2016, Barcelona, Spain, F. Bonchi, J. Domingo-Ferrer, R. Baeza-Yates, Z. Zhou, and X. Wu, Eds.   IEEE Computer Society, 2016, pp. 979–984. [Online]. Available: https://doi.org/10.1109/ICDM.2016.0123
  7. Y. Luo, M. Wang, H. Zhou, Q. Yao, W. Tu, Y. Chen, W. Dai, and Q. Yang, “Autocross: Automatic feature crossing for tabular data in real-world applications,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, A. Teredesai, V. Kumar, Y. Li, R. Rosales, E. Terzi, and G. Karypis, Eds.   ACM, 2019, pp. 1936–1945. [Online]. Available: https://doi.org/10.1145/3292500.3330679
  8. F. Horn, R. Pack, and M. Rieger, “The autofeat Python library for automated feature engineering and selection,” in Machine Learning and Knowledge Discovery in Databases - International Workshops of ECML PKDD 2019, Würzburg, Germany, September 16-20, 2019, Proceedings, Part I, ser. Communications in Computer and Information Science, P. Cellier and K. Driessens, Eds., vol. 1167.   Springer, 2019, pp. 111–120. [Online]. Available: https://doi.org/10.1007/978-3-030-43823-4_10
  9. T. Zhang, Z. Zhang, Z. Fan, H. Luo, F. Liu, W. Cao, and J. Li, “Openfe: Automated feature generation beyond expert-level performance,” CoRR, vol. abs/2211.12507, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2211.12507
  10. L. Li, H. Wang, L. Zha, Q. Huang, S. Wu, G. Chen, and J. Zhao, “Learning a data-driven policy network for pre-training automated feature engineering,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.   OpenReview.net, 2023. [Online]. Available: https://openreview.net/pdf?id=688hNNMigVX
  11. W. Fan, E. Zhong, J. Peng, O. Verscheure, K. Zhang, J. Ren, R. Yan, and Q. Yang, “Generalized and heuristic-free feature construction for improved accuracy,” in Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29 - May 1, 2010, Columbus, Ohio, USA.   SIAM, 2010, pp. 629–640. [Online]. Available: https://doi.org/10.1137/1.9781611972801.55
  12. Q. Shi, Y. Zhang, L. Li, X. Yang, M. Li, and J. Zhou, “SAFE: scalable automatic feature engineering framework for industrial tasks,” in 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020.   IEEE, 2020, pp. 1645–1656. [Online]. Available: https://doi.org/10.1109/ICDE48307.2020.00146
  13. F. Nargesian, H. Samulowitz, U. Khurana, E. B. Khalil, and D. S. Turaga, “Learning feature engineering for classification,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, C. Sierra, Ed.   ijcai.org, 2017, pp. 2529–2535. [Online]. Available: https://doi.org/10.24963/ijcai.2017/352
  14. N. Chepurko, R. Marcus, E. Zgraggen, R. C. Fernandez, T. Kraska, and D. R. Karger, “ARDA: automatic relational data augmentation for machine learning,” Proc. VLDB Endow., vol. 13, no. 9, pp. 1373–1387, 2020. [Online]. Available: http://www.vldb.org/pvldb/vol13/p1373-chepurko.pdf
  15. J. Liu, C. Chai, Y. Luo, Y. Lou, J. Feng, and N. Tang, “Feature augmentation with reinforcement learning,” in 38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9-12, 2022.   IEEE, 2022, pp. 3360–3372. [Online]. Available: https://doi.org/10.1109/ICDE53745.2022.00317
  16. M. Christ, N. Braun, J. Neuffer, and A. W. Kempa-Liehr, “Time series feature extraction on basis of scalable hypothesis tests (tsfresh - A Python package),” Neurocomputing, vol. 307, pp. 72–77, 2018. [Online]. Available: https://doi.org/10.1016/j.neucom.2018.03.067
  17. D. Qi, J. Peng, Y. He, and J. Wang, “Auto-fp: An experimental study of automated feature preprocessing for tabular data,” in Proceedings 27th International Conference on Extending Database Technology, EDBT 2024, Paestum, Italy, March 25 - March 28, L. Tanca, Q. Luo, G. Polese, L. Caruccio, X. Oriol, and D. Firmani, Eds.   OpenProceedings.org, 2024, pp. 129–142. [Online]. Available: https://doi.org/10.48786/edbt.2024.12
  18. G. Chandrashekar and F. Sahin, “A survey on feature selection methods,” Computers & Electrical Engineering, vol. 40, no. 1, pp. 16–28, 2014.
  19. I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
  20. M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri, “Infogather: entity augmentation and attribute discovery by holistic matching with web tables,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012, pp. 97–108.
  21. M. J. Cafarella, A. Halevy, and N. Khoussainova, “Data integration for the relational web,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 1090–1101, 2009.
  22. J. Fan, M. Lu, B. C. Ooi, W.-C. Tan, and M. Zhang, “A hybrid machine-crowdsourcing system for matching web tables,” in 2014 IEEE 30th International Conference on Data Engineering.   IEEE, 2014, pp. 976–987.
  23. P. Wang, R. Shea, J. Wang, and E. Wu, “Progressive deep web crawling through keyword queries for data enrichment,” in Proceedings of the 2019 International Conference on Management of Data, 2019, pp. 229–246.
  24. L. Zhao, Q. Li, P. Wang, J. Wang, and E. Wu, “Activedeeper: a model-based active data enrichment system,” Proceedings of the VLDB Endowment, vol. 13, no. 12, pp. 2885–2888, 2020.
  25. P. Wang, Y. He, R. Shea, J. Wang, and E. Wu, “Deeper: A data enrichment system powered by deep web,” in Proceedings of the 2018 International Conference on Management of Data, 2018, pp. 1801–1804.
  26. J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13, no. 10, pp. 281–305, 2012. [Online]. Available: http://jmlr.org/papers/v13/bergstra12a.html
  27. F. Hutter, H. H. Hoos, K. Leyton-Brown, and K. Murphy, “Time-bounded sequential parameter optimization,” in International Conference on Learning and Intelligent Optimization.   Springer, 2010, pp. 281–298.
  28. M. Schonlau, W. J. Welch, and D. R. Jones, “Global versus local search in constrained optimization of computer models,” Lecture Notes-Monograph Series, pp. 11–25, 1998. [Online]. Available: http://www.jstor.org/stable/4356058
  29. J. Bergstra, D. Yamins, and D. Cox, “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures,” in Proceedings of the 30th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, S. Dasgupta and D. McAllester, Eds., vol. 28, no. 1.   Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 115–123. [Online]. Available: http://proceedings.mlr.press/v28/bergstra13.html
  30. J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” in Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, Eds., vol. 24.   Curran Associates, Inc., 2011. [Online]. Available: https://proceedings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf
  31. F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for general algorithm configuration,” in LION’05 Proceedings of the 5th international conference on Learning and Intelligent Optimization, 2011, pp. 507–523.
  32. L. Li, K. G. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar, “Hyperband: A novel bandit-based approach to hyperparameter optimization,” J. Mach. Learn. Res., vol. 18, pp. 185:1–185:52, 2017. [Online]. Available: http://jmlr.org/papers/v18/16-558.html
  33. S. Falkner, A. Klein, and F. Hutter, “BOHB: robust and efficient hyperparameter optimization at scale,” in Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, ser. Proceedings of Machine Learning Research, J. G. Dy and A. Krause, Eds., vol. 80.   PMLR, 2018, pp. 1436–1445. [Online]. Available: http://proceedings.mlr.press/v80/falkner18a.html
  34. L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar, “Hyperband: A novel bandit-based approach to hyperparameter optimization,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6765–6816, 2017.
  35. F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for general algorithm configuration,” in International conference on learning and intelligent optimization.   Springer, 2011, pp. 507–523.
  36. M. F. Medress, F. S. Cooper, J. W. Forgie, C. Green, D. H. Klatt, M. H. O’Malley, E. P. Neuburg, A. Newell, D. Reddy, B. Ritea et al., “Speech understanding systems: Report of a steering committee,” Artificial Intelligence, vol. 9, no. 3, pp. 307–316, 1977.
  37. H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
  38. G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, “Conditional likelihood maximisation: a unifying framework for information theoretic feature selection,” Journal of Machine Learning Research, vol. 13, pp. 27–66, 2012.
  39. J. Blackard, “Covtype Dataset,” UCI Machine Learning Repository, 1998, DOI: https://doi.org/10.24432/C50K5N.
  40. “Household dataset,” https://www.kaggle.com/competitions/costa-rican-household-poverty-prediction/overview, 2018.
  41. “Ijcai-15 repeat buyers prediction dataset,” https://tianchi.aliyun.com/dataset/dataDetail?dataId=42, 2015.
  42. “Instacart market basket analysis,” https://www.kaggle.com/c/instacart-market-basket-analysis, 2017.
  43. “Student dataset,” https://www.kaggle.com/competitions/predict-student-performance-from-game-play, 2023.
  44. “Elo merchant category recommendation,” https://www.kaggle.com/competitions/elo-merchant-category-recommendation, 2018.
  45. “FeatAug: Automatic feature augmentation from one-to-many relationship tables (technical report),” https://github.com/sfu-db/FeatAug/blob/main/FeatAug(Technical_Report).pdf, 2023.
  46. H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, “Deepfm: A factorization-machine based neural network for CTR prediction,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, C. Sierra, Ed.   ijcai.org, 2017, pp. 1725–1731. [Online]. Available: https://doi.org/10.24963/ijcai.2017/239
  47. “State of Data Science and Machine Learning 2021,” https://www.kaggle.com/kaggle-survey-2021, 2021.
  48. J. Bergstra, D. Yamins, D. D. Cox et al., “Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms,” in Proceedings of the 12th Python in Science Conference, vol. 13.   Citeseer, 2013, p. 20.
  49. “Scikit-learn: Machine learning in Python,” https://scikit-learn.org/stable/, 2023.
  50. M. Kelly, R. Longjohn, and K. Nottingham, “The UCI machine learning repository,” https://archive.ics.uci.edu, 2024.
  51. “Demos of Featuretools,” https://www.featuretools.com/demos/, 2018.
  52. “Demo of Featuretools on household dataset,” https://github.com/alteryx/predict-household-poverty, 2018.
  53. C. Chai, J. Liu, N. Tang, J. Fan, D. Miao, J. Wang, Y. Luo, and G. Li, “Goodcore: Data-effective and data-efficient machine learning through coreset selection over incomplete data,” Proc. ACM Manag. Data, vol. 1, no. 2, pp. 157:1–157:27, 2023. [Online]. Available: https://doi.org/10.1145/3589302
  54. J. Wang, C. Chai, N. Tang, J. Liu, and G. Li, “Coresets over multiple tables for feature-rich and data-efficient machine learning,” Proc. VLDB Endow., vol. 16, no. 1, pp. 64–76, 2022. [Online]. Available: https://www.vldb.org/pvldb/vol16/p64-wang.pdf
  55. C. Chai, J. Wang, N. Tang, Y. Yuan, J. Liu, Y. Deng, and G. Wang, “Efficient coreset selection with cluster-based methods,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023, A. K. Singh, Y. Sun, L. Akoglu, D. Gunopulos, X. Yan, R. Kumar, F. Ozcan, and J. Ye, Eds.   ACM, 2023, pp. 167–178. [Online]. Available: https://doi.org/10.1145/3580305.3599326