Bias and Error Mitigation in Software-Generated Data: An Advanced Search and Optimization Framework Leveraging Generative Code Models (2310.11546v1)
Abstract: Data generation and analysis are fundamental to many industries and disciplines, from strategic decision making in business to research in the physical and social sciences. However, data generated by software and algorithms can be subject to biases and errors, whether from defects in the original software, from default settings that do not match the needs of the situation, or from deeper problems in the underlying theories and models. This paper proposes an advanced search and optimization framework for generating and selecting source code that corrects the errors and biases of previous versions, addressing typical problems in software systems specializing in data analysis and generation, particularly in the corporate and data science world. Applying the framework repeatedly to the same software system incrementally improves the quality of its output. The framework takes Solomonoff Induction as a sound theoretical basis and extends it with a novel adaptation based on Kolmogorov Conditional Complexity to evaluate a set of candidate programs. We propose generative models for creating this set of programs, with special emphasis on the capability of LLMs to generate high-quality code.
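The selection step described above can be illustrated with a minimal sketch. Conditional Kolmogorov complexity is incomputable, so a common practical proxy replaces it with a real compressor: the cost of a candidate given the previous version is approximated by the extra compressed bytes the candidate adds on top of the previous version. The function names and the choice of zlib below are illustrative assumptions, not the paper's implementation.

```python
import zlib

def approx_conditional_complexity(candidate: str, previous: str) -> int:
    """Compression-based proxy for K(candidate | previous):
    extra compressed bytes needed for the candidate once the
    previous program version is already available."""
    base = len(zlib.compress(previous.encode()))
    joint = len(zlib.compress((previous + candidate).encode()))
    return joint - base

def select_candidate(candidates: list[str], previous: str) -> str:
    """Among LLM-generated candidate programs, pick the one with the
    lowest approximate conditional complexity w.r.t. the previous version."""
    return min(candidates, key=lambda c: approx_conditional_complexity(c, previous))
```

Under this proxy, a candidate that shares most of its structure with the previous version scores lower than an unrelated program, which matches the intuition that corrections should be small, targeted edits rather than wholesale rewrites.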
- Claude E. Shannon “A Mathematical Theory of Communication” In The Bell System Technical Journal 27.3 AT&T, 1948, pp. 379–423
- Ray J. Solomonoff “A formal theory of inductive inference. Part I” In Information and Control 7.1, 1964, pp. 1–22
- Christopher M. Bishop “Neural Networks for Pattern Recognition” Oxford University Press, 1995
- George Casella and Roger L. Berger “Statistical Inference” Duxbury Pacific Grove, CA, 2002
- Christopher M. Bishop “Pattern Recognition and Machine Learning” Springer, 2006
- Marcus Hutter “Universal algorithmic intelligence: A mathematical top-down approach” In arXiv preprint arXiv:0706.0557, 2007
- Ming Li and Paul M.B. Vitányi “An Introduction to Kolmogorov Complexity and Its Applications” Springer, 2008
- Trevor Hastie, Robert Tibshirani and Jerome Friedman “The Elements of Statistical Learning” Springer, 2009
- Samuel Rathmanner and Marcus Hutter “A philosophical treatise of universal induction” In Entropy 13.6 Multidisciplinary Digital Publishing Institute, 2011, pp. 1076–1136
- Marcus Hutter and David L. Dowe “Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence” 7070, Lecture Notes in Computer Science Springer, 2013
- Scott Chacon and Ben Straub “Pro Git” Apress, 2014 URL: https://git-scm.com/book/en/v2
- Fouad B. Chedid “Kolmogorov Complexity and Information Content” In CoRR abs/1710.06846, 2017 arXiv: http://arxiv.org/abs/1710.06846
- Tom B. Brown et al. “Language Models are Few-Shot Learners” In arXiv preprint arXiv:2005.14165, 2020
- OpenAI “Generative Pre-trained Transformer 3 (GPT-3)”, https://openai.com/research/gpt-3, 2020
- Paul M.B. Vitányi “How incomputable is Kolmogorov complexity?” In CoRR abs/2002.07674, 2020 arXiv: https://arxiv.org/abs/2002.07674
- GitHub “GitHub Copilot: Your AI pair programmer”, https://github.com/github/copilot, 2022
- Meta AI “Llama 2”, https://ai.meta.com/research/llama-2, 2023
- OpenAI “GPT-4”, https://openai.com/research/gpt-4, 2023