Polar Encoding: A Simple Baseline Approach for Classification with Missing Values
Abstract: We propose polar encoding, a representation of categorical and numerical $[0,1]$-valued attributes with missing values to be used in a classification context. We argue that this is a good baseline approach, because it can be used with any classification algorithm, preserves missingness information, is very simple to apply and offers good performance. In particular, unlike the existing missing-indicator approach, it does not require imputation, ensures that missing values are equidistant from non-missing values, and lets decision tree algorithms choose how to split missing values, thereby providing a practical realisation of the "missingness incorporated in attributes" (MIA) proposal. Furthermore, we show that categorical and $[0,1]$-valued attributes can be viewed as special cases of a single attribute type, corresponding to the classical concept of barycentric coordinates, and that this offers a natural interpretation of polar encoding as a fuzzified form of one-hot encoding. With an experiment based on twenty real-life datasets with missing values, we show that, in terms of the resulting classification performance, polar encoding performs better than the state-of-the-art strategies "multiple imputation by chained equations" (MICE) and "multiple imputation with denoising autoencoders" (MIDAS) and -- depending on the classifier -- about as well or better than mean/mode imputation with missing-indicators.
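The abstract states that polar encoding needs no imputation and makes missing values equidistant from all non-missing values. A minimal sketch of one realisation consistent with that description is given below; the specific mapping $x \mapsto (x, 1-x)$ with missing values sent to the zero vector, the use of the city-block distance, and the function names are illustrative assumptions, not details taken from the abstract itself.

```python
import numpy as np

def polar_encode_numeric(x):
    """Encode a [0,1]-valued attribute as the pair (x, 1 - x).
    A missing value (None or NaN) becomes the zero vector (0, 0)."""
    if x is None or np.isnan(x):
        return np.array([0.0, 0.0])
    return np.array([x, 1.0 - x])

def polar_encode_categorical(value, categories):
    """One-hot encode a categorical attribute; a missing value
    becomes the all-zero vector, mirroring the numeric case."""
    enc = np.zeros(len(categories))
    if value is not None:
        enc[categories.index(value)] = 1.0
    return enc

# Under the city-block distance, the missing-value encoding (0, 0)
# is at distance |x| + |1 - x| = 1 from every non-missing (x, 1 - x),
# so missingness is preserved without privileging any imputed value.
a = polar_encode_numeric(0.3)
b = polar_encode_numeric(0.9)
m = polar_encode_numeric(float("nan"))
print(np.abs(m - a).sum())  # 1.0
print(np.abs(m - b).sum())  # 1.0
```

Because the two encoded columns carry the attribute's value and its complement separately, a decision tree can split on either column and thereby route missing values (encoded as zeros) to whichever branch is best, which is how the abstract's MIA interpretation arises.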
- C. Garcia, D. Leite, and I. Škrjanc, “Incremental missing-data imputation for evolving fuzzy granular prediction,” IEEE Trans. Fuzzy Syst., vol. 28, no. 10, pp. 2348–2362, 2020.
- W. Zhang, Z. Deng, T. Zhang, K.-S. Choi, J. Wang, and S. Wang, “Incomplete multi-view fuzzy inference system with missing view imputation and cooperative learning,” IEEE Trans. Fuzzy Syst., vol. 30, no. 8, pp. 3038–3051, 2022.
- D. Li, H. Zhang, T. Li, A. Bouras, X. Yu, and T. Wang, “Hybrid missing value imputation algorithms using fuzzy c-means and vaguely quantified rough set,” IEEE Trans. Fuzzy Syst., vol. 30, no. 5, pp. 1396–1408, 2022.
- D. B. Rubin, “Multiple imputations in sample surveys — a phenomenological Bayesian approach to nonresponse,” in Proc. Surv. Res. Methods Sect. Am. Statist. Assoc. American Statistical Association, 1978, pp. 20–34.
- S. Van Buuren and K. Oudshoorn, “Flexible multivariate imputation by MICE,” TNO Prevention and Health, Leiden, Tech. Rep. PG/VGZ/99.054, 1999.
- R. Lall and T. Robinson, “The MIDAS touch: Accurate and scalable missing-data imputation with deep learning,” Political Anal., vol. 30, no. 2, pp. 179–196, 2022.
- J. Cohen, “Multiple regression as a general data-analytic system,” Psychol. Bull., vol. 70, no. 6, pp. 426–443, 1968.
- M. P. Jones, “Indicator and stratification methods for missing explanatory variables in multiple linear regression,” J. Am. Statist. Assoc., vol. 91, no. 433, pp. 222–230, 1996.
- O. U. Lenz, D. Peralta, and C. Cornelis, “No imputation without representation,” arXiv preprint 2206.14254, 2022.
- B. E. Twala, M. Jones, and D. J. Hand, “Good methods for coping with missing data in decision trees,” Pattern Recognit. Lett., vol. 29, no. 7, pp. 950–956, 2008.
- J. Josse, N. Prost, E. Scornet, and G. Varoquaux, “On the consistency of supervised learning with missing values,” arXiv preprint 1902.06931, 2020.
- D. B. Suits, “Use of dummy variables in regression equations,” J. Am. Statist. Assoc., vol. 52, no. 280, pp. 548–551, 1957.
- A. Perez-Lebel, G. Varoquaux, M. Le Morvan, J. Josse, and J.-B. Poline, “Benchmarking missing-values approaches for predictive models on health databases,” GigaScience, vol. 11, no. 1, 2022, art. no. giac013.
- R. J. Boscovich, “De litteraria expeditione per pontificiam ditionem,” De bononiensi scientarium et artium instituto atque academia commentarii, vol. 4, pp. 353–396 (opuscula), 1757.
- ——, “De recentissimis graduum dimensionibus, et figura, ac magnitudine terræ inde derivanda,” in Philosophiæ recentioris, B. Stay. Rome: Nicolaus et Marcus Palearini, 1760, vol. 2, pp. 406–426.
- C. Eisenhart, “Boscovich and the combination of observations,” in Roger Joseph Boscovich, S.J., F.R.S., 1711–1787: Studies of his Life and Work on the 250th Anniversary of his Birth, L. L. Whyte, Ed. London: George Allen & Unwin, 1961, ch. 9, pp. 200–212.
- J. Dai, “Rough set approach to incomplete numerical data,” Inf. Sci., vol. 241, pp. 43–57, 2013.
- O. U. Lenz, D. Peralta, and C. Cornelis, “Adapting fuzzy rough sets for classification with missing values,” in Proc. Int. Joint Conf. Rough Sets. Springer, 2021, pp. 192–200.
- R. Jensen and Q. Shen, “Interval-valued fuzzy-rough feature selection in datasets with missing values,” in Proc. 18th IEEE Int. Conf. Fuzzy Syst. IEEE, 2009, pp. 610–615.
- D. Shelupsky, “A generalization of the trigonometric functions,” Am. Math. Monthly, vol. 66, no. 10, pp. 879–884, 1959.
- P. Lindqvist and J. Peetre, “p-arclength of the q-circle,” Lund University, Centre for Mathematical Sciences, Tech. Rep. Preprint 2000:21 LUNFMA-5014-2000, 2000.
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, no. 85, pp. 2825–2830, 2011.
- G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “LightGBM: A highly efficient gradient boosting decision tree,” in Proc. 31st Conf. Neural Inf. Process. Syst. NIPS Foundation, 2017, pp. 3146–3154.
- T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 785–794.
- B. Cestnik, I. Kononenko, and I. Bratko, “ASSISTANT 86: A knowledge-elicitation tool for sophisticated users,” in Proc. 2nd Eur. Work. Session Learn. Sigma Press, 1987, pp. 31–45.
- B. Twala, “An empirical comparison of techniques for handling incomplete data using decision trees,” Appl. Artif. Intell., vol. 23, no. 5, pp. 373–405, 2009.
- A. Kapelner and J. Bleich, “Prediction with missing data via Bayesian additive regression trees,” Can. J. Statist., vol. 43, no. 2, pp. 224–239, 2015.
- J. R. Quinlan, “Induction of decision trees,” Mach. Learn., vol. 1, no. 1, pp. 81–106, 1986.
- CIA World Factbook, “GDP — composition, by sector of origin,” 2022. [Online]. Available: https://www.cia.gov/the-world-factbook/field/gdp-composition-by-sector-of-origin/
- R. E. Allardice, “The barycentric calculus of Möbius,” Proc. Edinburgh Math. Soc., vol. 10, pp. 2–21, 1891.
- C. Huang, D. R. Rice, and J. H. Steffen, “MAGRATHEA: an open-source spherical symmetric planet interior structure code,” Monthly Notices Roy. Astron. Soc., vol. 513, no. 4, pp. 5256–5269, 2022.
- M. G. MacDonald, L. Feil, T. Quinn, and D. Rice, “Confirming the 3:2 resonance chain of K2-138,” Astron. J., vol. 163, no. 4, 2022, art. no. 162.
- J. Haldemann, V. Ksoll, D. Walter, Y. Alibert, R. S. Klessen, W. Benz, U. Koethe, L. Ardizzone, and C. Rother, “Exoplanet characterization using conditional invertible neural networks,” Astron. & Astroph., vol. 672, 2023, art. no. A180.
- F. Wang, J. Yu, Z. Liu, M. Kong, and Y. Wu, “Study on offshore seabed sediment classification based on particle size parameters using XGBoost algorithm,” Comput. Geosci., vol. 149, 2021, art. no. 104713.
- S. Stemplinger, S. Prévost, T. Zemb, D. Horinek, and J.-F. Dufrêche, “Theory of ternary fluids under centrifugal fields,” J. Phys. Chem. B, vol. 125, no. 43, pp. 12054–12062, 2021.
- M. Tönsmann, D. T. Ewald, P. Scharfer, and W. Schabel, “Surface tension of binary and ternary polymer solutions: Experimental data of poly(vinyl acetate), poly(vinyl alcohol) and polyethylene glycol solutions and mixing rule evaluation over the entire concentration range,” Surf. Interface, vol. 26, 2021, art. no. 101352.
- W.-C. Chen, J. N. Schmidt, D. Yan, Y. K. Vohra, and C.-C. Chen, “Machine learning and evolutionary prediction of superhard B–C–N compounds,” npj Comput. Mater., vol. 7, 2021, art. no. 114.
- A. M. Nolan, E. D. Wachsman, and Y. Mo, “Computation-guided discovery of coating materials to stabilize the interface between lithium garnet solid electrolyte and high-energy cathodes for all-solid-state lithium batteries,” Energy Storage Mater., vol. 41, pp. 571–580, 2021.
- M. Kim, J.-K. Choi, and S. K. Baek, “Win-stay-lose-shift as a self-confirming equilibrium in the iterated prisoner’s dilemma,” Proc. Roy. Soc. B, vol. 288, no. 1953, 2021, art. no. 20211021.
- F. Molter, A. W. Thomas, S. A. Huettel, H. R. Heekeren, and P. N. Mohr, “Gaze-dependent evidence accumulation predicts multi-alternative risky choice behaviour,” PLoS Comput. Biol., vol. 18, no. 7, 2022, art. no. e1010283.
- R. Zhao and K. Mao, “Fuzzy bag-of-words model for document representation,” IEEE Trans. Fuzzy Syst., vol. 26, no. 2, pp. 794–804, 2018.
- Z.-P. Tian, R.-X. Nie, J.-Q. Wang, and R.-Y. Long, “Adaptive consensus-based model for heterogeneous large-scale group decision-making: Detecting and managing noncooperative behaviors,” IEEE Trans. Fuzzy Syst., vol. 29, no. 8, pp. 2209–2223, 2021.
- J. W. Sangma, Y. Rani, V. Pal, N. Kumar, and R. Kushwaha, “FHC-NDS: Fuzzy hierarchical clustering of multiple nominal data streams,” IEEE Trans. Fuzzy Syst., vol. 31, no. 3, pp. 786–798, 2023.
- E. H. Ruspini, “A new approach to clustering,” Inf. Control, vol. 15, no. 1, pp. 22–32, 1969.
- J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” J. Cybern., vol. 3, no. 3, pp. 32–57, 1974.
- J. C. Bezdek and J. D. Harris, “Fuzzy partitions and relations; an axiomatic basis for clustering,” Fuzzy Sets Syst., vol. 1, no. 2, pp. 111–127, 1978.
- E. Fix and J. Hodges, Jr, “Discriminatory analysis — nonparametric discrimination: Consistency properties,” USAF School of Aviation Medicine, Randolph Field, Texas, Tech. Rep. 21-49-004, 1951.
- S. A. Dudani, “The distance-weighted k-nearest-neighbor rule,” IEEE Trans. Syst., Man, Cybern., vol. 6, no. 4, pp. 325–327, 1976.
- R. Jensen and C. Cornelis, “A new approach to fuzzy-rough nearest neighbour classification,” in Proc. 6th Int. Conf. Rough Sets Current Trends Comput. Springer, 2008, pp. 310–319.
- C. Cornelis, N. Verbiest, and R. Jensen, “Ordered weighted average based fuzzy rough sets,” in Proc. 5th Int. Conf. Rough Set Knowl. Technol. Springer, 2010, pp. 78–85.
- C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
- L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
- P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Mach. Learn., vol. 63, no. 1, pp. 3–42, 2006.
- Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” in Proc. 2nd Eur. Conf. Comput. Learn. Theory. Springer, 1995, pp. 23–37.
- J. Zhu, H. Zou, S. Rosset, and T. Hastie, “Multi-class AdaBoost,” Statist. Its Interface, vol. 2, no. 3, pp. 349–360, 2009.
- J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Ann. Statist., vol. 29, no. 5, pp. 1189–1232, 2001.
- S. V. Wilson, “miceforest: Fast, memory efficient imputation with LightGBM,” 2020. [Online]. Available: https://github.com/AnotherSamWilson/miceforest
- R. Lall and T. Robinson, “Efficient multiple imputation for diverse data in Python and R: MIDASpy and rMIDAS,” J. Statist. Softw., vol. 107, 2023, art. no. 9.
- R. Kohavi, “Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid,” in Proc. 2nd Int. Conf. Knowl. Discovery Data Mining. AAAI Press, 1996, pp. 202–207.
- J. C. Schlimmer, “Concept acquisition through representational adjustment,” Ph.D. dissertation, University of California, Irvine, 1987.
- C. Ferreira Costa and M. A. Nascimento, “IDA 2016 industrial challenge: Using machine learning for predicting failures,” in Proc. 15th Int. Symp. Intell. Data Anal. Springer, 2016, pp. 381–386.
- H. A. Güvenir, B. Acar, G. Demiröz, and A. Çekin, “A supervised machine learning algorithm for arrhythmia analysis,” in Proc. 24th Annu. Meeting Comput. Cardiol. IEEE, 1997, pp. 433–436.
- B. Evans and D. Fisher, “Overcoming process delays with decision tree induction,” IEEE Expert, vol. 9, no. 1, pp. 60–66, 1994.
- L. J. Rubini and P. Eswaran, “Generating comparative analysis of early stage prediction of chronic kidney disease,” Int. J. Modern Eng. Res., vol. 5, no. 7, pp. 49–55, 2015.
- J. R. Quinlan, “Simplifying decision trees,” Int. J. Man-Mach. Stud., vol. 27, no. 3, pp. 221–234, 1987.
- P. Soltani Zarrin, N. Röckendorf, and C. Wenger, “In-vitro classification of saliva samples of COPD patients and healthy controls using machine learning tools,” IEEE Access, vol. 8, pp. 168053–168060, 2020.
- M. S. Santos, P. H. Abreu, P. J. García-Laencina, A. Simão, and A. Carvalho, “A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients,” J. Biomed. Inform., vol. 58, pp. 49–59, 2015.
- R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J.-J. Schmid, S. Sandhu, K. H. Guppy, S. Lee, and V. Froelicher, “International application of a new probability algorithm for the diagnosis of coronary artery disease,” Am. J. Cardiol., vol. 64, no. 5, pp. 304–310, 1989.
- B. Efron and G. Gong, “Statistical theory and the computer,” in Comput. Sci. Statist.: Proc. 13th Symp. Interface. Springer, 1981, pp. 3–7.
- M. McLeish and M. Cecile, “Enhancing medical expert systems with knowledge obtained from statistical data,” Ann. Math. Artif. Intell., vol. 2, no. 1–4, pp. 261–276, 1990.
- M. Elter, R. Schulz-Wendtland, and T. Wittenberg, “The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process,” Med. Phys., vol. 34, no. 11, pp. 4164–4172, 2007.
- S. E. Golovenkin, J. Bac, A. Chervov, E. M. Mirkes, Y. V. Orlova, E. Barillot, A. N. Gorban, and A. Zinovyev, “Trajectories, bifurcations, and pseudo-time in large clinical datasets: applications to myocardial infarction and diabetes data,” GigaScience, vol. 9, no. 11, 2020, art. no. giaa128.
- L. Candillier and V. Lemaire, “Design and analysis of the nomao challenge: Active learning in the real-world,” in Act. Learn. Real-world Appl. Workshop, ECML-PKDD 2012.
- M. McCann, Y. Li, L. Maguire, and A. Johnston, “Causality challenge: benchmarking relevant signal components for effective monitoring and process control,” in Proc. NIPS 2008 Workshop Causality. JMLR Workshop and Conference Proceedings, 2008, pp. 277–288.
- R. S. Michalski and R. L. Chilausky, “Learning by being told and learning from examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis,” Int. J. Policy Anal. Inf. Syst., vol. 4, no. 2, pp. 125–161, 1980.
- J. R. Quinlan, P. J. Compton, K. A. Horn, and L. Lazarus, “Inductive knowledge acquisition: a case study,” in Proc. 2nd Aust. Conf. Appl. Expert Syst. Turing Institute Press, 1986, pp. 157–173.
- D. Dua and C. Graff, “UCI machine learning repository,” 2019. [Online]. Available: http://archive.ics.uci.edu/ml
- D. J. Hand and R. J. Till, “A simple generalisation of the area under the ROC curve for multiple class classification problems,” Mach. Learn., vol. 45, no. 2, pp. 171–186, 2001.
- F. Wilcoxon, “Individual comparisons by ranking methods,” Biomed. Bull., vol. 1, no. 6, pp. 80–83, 1945.
- O. U. Lenz, C. Cornelis, and D. Peralta, “fuzzy-rough-learn 0.2: a Python library for fuzzy rough set algorithms and one-class classification,” in Proc. 2022 IEEE Int. Conf. Fuzzy Syst. IEEE.
- B. Rosner, R. J. Glynn, and M.-L. T. Lee, “The Wilcoxon signed rank test for paired comparisons of clustered data,” Biometrics, vol. 62, no. 1, pp. 185–192, 2006.